CN111242003B - Video salient object detection method based on multi-scale constrained self-attention mechanism - Google Patents

Video salient object detection method based on multi-scale constrained self-attention mechanism

Info

Publication number
CN111242003B
Authority
CN
China
Prior art keywords
video
feature
features
training
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010024556.XA
Other languages
Chinese (zh)
Other versions
CN111242003A (en)
Inventor
程明明 (Ming-Ming Cheng)
顾宇超 (Yu-Chao Gu)
卢少平 (Shao-Ping Lu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN202010024556.XA
Publication of CN111242003A
Application granted
Publication of CN111242003B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a video salient object detection method based on a multi-scale constrained self-attention mechanism. The method adopts a constrained attention mechanism: the similarity between a query element and multiple frames of a video clip is measured within a constrained surrounding region to generate an attention map, and the attention map is then used as weights to aggregate the feature information of those frames within the constrained region, thereby enhancing the features of the query element. In addition, a multi-branch design lets each constrained attention branch sample at a different scale range, so the resulting multi-scale constrained self-attention mechanism adapts to inputs of different scales. The method exploits the visual connection between inter-frame elements within a video clip and thereby addresses the problem of modeling the motion of salient objects across frames. A video salient object detection system built on this method achieves both a very high detection speed and high detection accuracy.

Description

Video salient object detection method based on multi-scale constrained self-attention mechanism
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a video salient object detection method based on a multi-scale constrained self-attention mechanism.
Background
Video salient object detection is used to segment the most attention-grabbing objects in a video clip. The technique commonly serves as a pre-processing step for numerous real-time applications, such as video tracking, video segmentation and human-computer interaction. Therefore, both efficiency and accuracy are important in video saliency model design.
Laurent Itti et al. showed in 1998 that, in dynamic scenes, the motion of objects attracts attention, so temporal cues play a decisive role in video saliency. Among early methods, Tao Xi et al. proposed learning salient object information in video in "Salient Object Detection With Spatiotemporal Background Priors for Video" (2016). With the rise of deep learning, early hand-crafted features gave way to different network structures that encode spatio-temporal information. Among them, Trung-Nghia Le et al. proposed extracting spatio-temporal features with 3D convolution in "Video Salient Object Detection Using Spatiotemporal Deep Features" and then constructing a spatio-temporal map to determine temporal consistency. Hongmei Song et al., in "Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection", encode temporal information mainly with convolutional long short-term memory (ConvLSTM) networks. Guanbin Li et al., in "Flow Guided Recurrent Neural Encoder for Video Salient Object Detection", use optical flow to acquire motion information. 3D convolution, optical-flow extraction and long short-term memory networks all incur a large computational overhead. In addition, these methods can only process adjacent frames one pair at a time and extract motion information step by step; they cannot acquire motion information simultaneously and directly from multiple frames of a video clip, so they are relatively slow.
Recently, Wang et al. proposed the "non-local neural network", which extends the self-attention mechanism from machine translation to video classification. It models long-range temporal dependencies by measuring the pairwise similarity between elements and aggregating information from multiple frames according to that similarity. The non-local approach, however, carries a large computation and memory overhead, and subsequent work has explored how to mitigate it. Kaiyu Yue et al., in "Compact Generalized Non-local Network", jointly model the relations between spatial and channel elements and propose a compact kernel form that reduces the computational cost. Computing pixel-by-pixel relations is particularly difficult in pixel-level dense prediction tasks, because dense prediction requires maintaining a high-resolution feature map, which makes the computation enormous. Zilong Huang et al., in "CCNet: Criss-Cross Attention for Semantic Segmentation", propose criss-cross sampling, learning pairwise relations between sparse elements for semantic segmentation and reducing the computation. In video salient object detection, however, no prior work acquires motion information by analyzing the pairwise relations between inter-frame elements, and applying these methods directly to video saliency detection still incurs a large computational overhead.
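For comparison only, the following Python/PyTorch-style sketch (an editorial illustration, not part of the invention) shows the dense non-local attention described above; its similarity matrix has (T·H·W)² entries, which is the computational burden that the constrained mechanism of the invention avoids.

```python
import torch

# Dense non-local self-attention over a clip (illustration of the baseline above):
# every element attends to every other element, so the attention matrix is
# (T*H*W) x (T*H*W) -- prohibitively large for dense prediction.
def dense_non_local(x):                  # x: (B, C, T, H, W)
    b, c, t, h, w = x.shape
    flat = x.view(b, c, t * h * w)                                # all clip elements
    att = torch.softmax(flat.transpose(1, 2) @ flat, dim=-1)      # (B, THW, THW)
    return (flat @ att.transpose(1, 2)).view(b, c, t, h, w)
```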
Disclosure of Invention
Aiming at the problems of current video salient object detection, such as low speed and the inability to model temporal relations across multiple frames, the invention provides a video salient object detection method based on a multi-scale constrained self-attention mechanism. The method groups the spatial features of multiple input frames and, within each group, measures the relations between inter-frame elements using sampling windows of different scales. By exploiting the prior that motion is spatio-temporally continuous across frames, the method reduces the overhead of dense relation sampling, quickly acquires motion information across multiple frames, and achieves fast and accurate video saliency detection.
Technical scheme of the invention
A video salient object detection method based on a multi-scale constrained self-attention mechanism comprises three stages: spatial feature training, temporal feature training and model deployment. The specific steps are as follows:
step 1, spatial feature training;
step 1.1, collect an image saliency data set;
step 1.2, preprocess the image saliency data set, including random flipping and scale transformation;
step 1.3, train a backbone network on the image saliency data set with the BP algorithm to obtain an image saliency feature extraction network;
step 2, temporal feature training;
step 2.1, collect a video saliency data set;
step 2.2, perform data enhancement on the video saliency data set, including random flipping and extracting training frames at different interval lengths;
step 2.3, during training, extract a clip of video frames from the video saliency data set and extract spatial features frame by frame with the neural network;
step 2.4, group the extracted spatial features and set a window of a different scale for each group;
step 2.5, for each group of spatial features, exploiting the fact that an object does not undergo large displacement between frames, generate an attention map for each position by measuring the similarity between the feature at that position on the feature map and the features in the spatially neighboring region of that position on the adjacent frames, where the size of the neighboring region is given by the scale window preset in step 2.4;
step 2.6, for each group of spatial features and for each position on the feature map, aggregate the temporal information of the surrounding frames using the attention map generated in step 2.5 as weights, obtaining the spatio-temporal feature of that position;
step 2.7, linearly fuse the spatio-temporal features obtained from the different groups, and pass the fused spatio-temporal features through a decoder to obtain the predicted saliency result;
step 2.8, repeat the training process until convergence to obtain the trained neural network parameters;
step 3, model deployment;
step 3.1, acquire the video to be detected;
step 3.2, split the video to be detected into frames and group the frames into mini-batches of a given size;
step 3.3, initialize the neural network and load the parameters trained in step 2.8;
step 3.4, perform video saliency prediction on each mini-batch formed in step 3.2 and assemble the detection results into a result video.
The advantages and beneficial effects of the invention are as follows: the method acquires motion information by measuring element-level relations across multiple frames of a video clip. By constraining the reference window, the query range is restricted to the positions where the object may have moved between frames, which greatly reduces the cost of modeling element-wise relations across video frames. By introducing different window sizes in different branches, the method adapts to different object scales and to the large inter-frame displacements caused by different motion speeds. The method offers a new solution for the temporal modeling of video salient object detection and achieves extremely high detection speed and accuracy.
Drawings
Fig. 1 is a flow diagram of video salient object detection based on a multi-scale constrained self-attention mechanism.
Fig. 2 is the specific architecture of the image saliency feature extraction network, which is based on MobileNetV3 with the fully connected layer removed and the convolution strides of the 3rd and 5th stages set to 1.
Fig. 3 is a schematic diagram of a sampling window setting.
Fig. 4 is a schematic diagram of video salient object detection based on the multi-scale constrained self-attention mechanism, including a schematic of the overall framework and a schematic of the constrained self-attention module.
Fig. 5 is a graph of the results of the generated attention maps.
FIG. 6 shows saliency map generation results on public data sets. The first row shows the input video frames, the second row the ground-truth annotations, and the third row the results generated by the method of the invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but the present invention is not limited to the following.
Referring to fig. 1, the flow of video salient object detection based on the multi-scale constrained self-attention mechanism includes three stages: spatial feature training, temporal feature training and model deployment:
Firstly, spatial feature training comprises the following steps:
S1. Collect an image saliency data set. Specifically, we use the training-set portion of the public image saliency detection data set DUTS.
S2. Preprocess the image saliency data set: first scale each image to the fixed size 224 x 224, then apply scale transformation to obtain images that are (0.5, 0.75, 1, 1.25, 1.5) times the original size. Each input image is normalized by subtracting the mean (0.485, 0.456, 0.406) and dividing by the standard deviation (0.229, 0.224, 0.225).
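As an illustration only (the patent does not prescribe a specific framework; torchvision is an assumed choice), the preprocessing above could be sketched as follows.

```python
from PIL import Image
from torchvision import transforms

MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
SCALES = (0.5, 0.75, 1.0, 1.25, 1.5)

def preprocess(img: Image.Image, scale: float = 1.0):
    # scale transform around the fixed 224 x 224 base size, random flip, normalization
    size = int(224 * scale)
    pipeline = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(MEAN, STD),
    ])
    return pipeline(img)
```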
S3. Train the backbone network with the image saliency data set and the BP algorithm. The detailed structure is shown in fig. 2: we use MobileNetV3 as the backbone network and remove its fully connected layer. To preserve the spatial information of the feature map, we set the convolution stride of the last two stages of MobileNetV3 to 1, so the spatial feature map output by the network has 1/8 of the original resolution. Training yields the image saliency feature extraction backbone.
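For illustration only, a sketch of this backbone modification is given below; the exact stage indices of torchvision's MobileNetV3 are an assumption, and the code only illustrates the idea of dropping the classifier head and keeping 1/8 resolution.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

def build_spatial_backbone():
    net = mobilenet_v3_large(pretrained=True).features   # drop avgpool and FC head
    # force the stride-2 convolutions of the later stages (roughly children 7+ in
    # torchvision's layout) to stride 1 so the output stays at 1/8 resolution
    for stage in list(net.children())[7:]:
        for m in stage.modules():
            if isinstance(m, nn.Conv2d) and m.stride == (2, 2):
                m.stride = (1, 1)
    return net
```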
The backbone trained on the image saliency data set is able to extract spatially salient objects and provides a good initialization for the temporal saliency training.
Secondly, temporal feature training comprises the following steps:
S4. Collect video saliency data sets. Specifically, we use the training-set portions of the two public video saliency data sets DAVIS and DAVSOD.
S5. Perform data enhancement on the video saliency data sets: k frames form one batch of video clips as training data; the training clips are randomly flipped horizontally, and the k frames are selected at random intervals (1 to 5) to form the training data.
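A minimal sketch of this clip sampling (the helper name and frame-list format are assumptions, not from the patent):

```python
import random

def sample_training_clip(frame_paths, k=5):
    # cap the interval so that k frames always fit inside the video
    max_interval = max(1, min(5, (len(frame_paths) - 1) // (k - 1)))
    interval = random.randint(1, max_interval)
    start = random.randint(0, len(frame_paths) - (k - 1) * interval - 1)
    clip = [frame_paths[start + i * interval] for i in range(k)]
    flip = random.random() < 0.5      # one flip decision shared by the whole clip
    return clip, flip
```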
S6. During training, a group of video clips is input; spatial features are extracted frame by frame with the pre-trained image saliency feature extraction network and concatenated along the temporal dimension.
S7. Group the spatial features of the input video clip and set windows of different scales.
This step comprises the following sub-steps. For the spatial features X of the video clip, the size of X is T x H x W x C, where T, H, W and C denote the number of frames, the height, the width and the number of feature channels of the video spatial features, respectively. Three convolutions with 1 x 1 kernels are applied to X to linearly project it into three subspaces Q, K and V, which represent the query features, the key (metric) features and the value features, respectively. Q, K and V are split along the feature channel into g feature groups, each with C/g-dimensional features Qi, Ki and Vi, where i denotes the i-th group. For each feature group a sampling window of a given scale is set. Referring to FIG. 3, we specify a window radius parameter r and a window dilation parameter d; for each query element position x_q (the gray element in the figure), the sampling window is a dilated square window on the given video clip centered on that element (the black elements in the figure are the sampled points; in the example the window radius r is 1 and the dilation d is 2). The sampling position function may be defined as:
S(x_q, K_i) = { K_i(h + j·d, w + k·d, t′) | j, k ∈ {−r, ..., r}, t′ ∈ {1, ..., T} }
where x_q = (h, w, t) denotes the position of the query element, K_i denotes the key (metric) features and T denotes the total number of frames. The sampling window is centered on the query element, at the same spatial location on successive frames. Since the motion of an object follows a continuous trajectory, this location prior lets the sampling window give a rough localization of the query element in the preceding and following frames.
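The sampling function can be made concrete with a small sketch (for illustration; the function name is not from the patent): for a query at spatial location (h, w), the sampled positions form a (2r+1) x (2r+1) dilated grid centered on (h, w) on every frame of the clip.

```python
def window_sample_positions(h, w, num_frames, r=1, d=2):
    return [(h + j * d, w + k * d, t)
            for t in range(num_frames)          # global in the temporal dimension
            for j in range(-r, r + 1)           # dilated square window in space
            for k in range(-r, r + 1)]

# e.g. with r=1, d=2 on a 4-frame clip, each query samples 4 * 9 = 36 positions
assert len(window_sample_positions(10, 10, num_frames=4)) == 36
```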
S8. For each group of spatial features, traverse all feature elements and generate an attention map by measuring the similarity between the query element's features and the elements of the surrounding frames inside the given sampling window.
This step comprises the following sub-steps. Referring to fig. 4, all elements of the grouped i-th set of query features Q_i are traversed. For each element x_q of Q_i, the key features of the elements within its sampling window are gathered from K_i, and an attention map is generated through a relation metric. For the relation metric function we use the dot product as the similarity measure. The attention map is generated as:
W_att = f(Q_i, K_i) = Q_i · S(Q_i, K_i)^T
where W_att denotes the generated attention map and S(Q_i, K_i) denotes the key features sampled within the window of each query element. We apply softmax normalization to the generated attention map. The sampling window defined in this step gives an approximate range of positions of the query element in the adjacent frames, and the similarity-based attention then locates the element in the adjacent frames more precisely. This window setting greatly reduces the number of element-similarity computations while still capturing object motion; compared with earlier methods that compute similarity over the whole feature map, the continuity of adjacent frames allows a large number of irrelevant regions to be skipped. Referring to fig. 5, which shows the window attention maps at the current frame and the fifth frame for queries made at two positions in the first frame, salient objects are attended to well across frames.
S9. For each group of spatial features, aggregate the information of the surrounding frames through attention-map weighting and update the features of the current frame.
Specifically, all feature elements of the grouped i-th set of query features Q_i are traversed. For each query element x_q of Q_i, the feature vectors of the elements inside its sampling window on the value features V_i are summed over spatial positions, weighted by the attention map W_att, and the resulting feature vector is taken as the new feature of x_q, expressed as:
Y_i = W_att · S(Q_i, V_i)
where Y_i denotes the spatio-temporal features of the i-th group after aggregating multi-frame information through this operation, and S(Q_i, V_i) denotes the value features sampled within the window of each query element. The new feature is a weighted aggregation over all input frames, with the similarity-based attention map as the weights; through this, the invention models the exchange of information between frames.
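Steps S7-S9 for a single branch can be sketched in PyTorch as follows; this is an editorial illustration under assumptions (framework, tensor layout, use of F.unfold for window gathering), not the patented implementation itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedSelfAttention(nn.Module):
    """One constrained self-attention branch: each query attends to a dilated
    (2r+1) x (2r+1) window at the same spatial location on all T frames."""

    def __init__(self, channels, radius=1, dilation=1):
        super().__init__()
        self.r, self.d = radius, dilation
        self.query = nn.Conv3d(channels, channels, kernel_size=1)
        self.key = nn.Conv3d(channels, channels, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)

    def _unfold(self, x):
        # (B, C, T, H, W) -> (B, T, C, K, H*W), K = (2r+1)^2 window samples per position
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        k = 2 * self.r + 1
        x = F.unfold(x, kernel_size=k, dilation=self.d, padding=self.r * self.d)
        return x.view(b, t, c, k * k, h * w)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        s = t * (2 * self.r + 1) ** 2          # samples per query: all frames x window
        q = self.query(x).permute(0, 2, 3, 4, 1).reshape(b, t, h * w, c)
        k_win = self._unfold(self.key(x)).permute(0, 4, 2, 1, 3).reshape(b, h * w, c, s)
        v_win = self._unfold(self.value(x)).permute(0, 4, 2, 1, 3).reshape(b, h * w, c, s)
        att = torch.einsum('btpc,bpcs->btps', q, k_win)    # dot-product similarity
        att = F.softmax(att, dim=-1)                       # softmax-normalized attention map
        y = torch.einsum('btps,bpcs->btpc', att, v_win)    # weighted aggregation of values
        return y.view(b, t, h, w, c).permute(0, 4, 1, 2, 3)
```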
S10. Linearly fuse the spatio-temporal features Y_i of the different scale groups, and pass the fused spatio-temporal features through the decoder to obtain the predicted saliency result.
Specifically, each feature group {X_i | i = 1, 2, ..., g} is given a window of a different scale, and the spatio-temporal features {Y_i | i = 1, 2, ..., g} that aggregate multi-frame information under the given scale window are obtained. The features Y_i from the different scale groups are linearly fused with a 1 x 1 convolution to obtain the multi-scale spatio-temporal features Y, and the fusion result is added to the original features X in residual form, i.e.:
X′ = X + Y
We feed the spatio-temporal features X′ into the neural network decoder to obtain the saliency prediction for this video clip.
S11. Compute the cross-entropy loss between the prediction and the data annotations, and update the parameters with an SGD optimizer until the network converges, obtaining the trained network parameters.
Thirdly, deploying the model;
the model is divided into two parts, one is an image saliency segmentation network, the specific architecture refers to step S3, the other is a time-series feature extraction module, namely a restricted self-attention module, the specific process is described in steps S7-S10, and referring to fig. 4, the restricted attention module (CSA in the figure) is involved between the feature encoder and decoder of the image saliency segmentation network. The training process comprises space significance training and time sequence significance training, wherein in the space significance training, the limited self-attention module is removed, an image data set is used for pre-training an image significance segmentation network, in the time sequence significance training, the limited attention module is added, and all modules are trained by a video data set. Through two-step training, the obtained model has the capability of extracting space-time significance, and referring to fig. 1, the following steps are provided for deploying the trained model for real application:
s12, acquiring a video to be detected
And S13, framing the video to be detected, and forming small-batch video clips by the obtained frames according to a given number to serve as the input of the network.
S14, constructing a limited video saliency detection model as shown in the figure 4, and loading model parameters obtained through training.
And S15, performing video significance prediction on the data of each video clip, and splicing detection results to form a video form for output. As shown in fig. 6, the input video segment is 5 frames, and the output is the segmentation result of the video segment. And splicing the output prediction results according to time to obtain the predicted video output.
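For completeness, a sketch of the deployment loop (S12-S15) is given below; the model interface, frame loading and video writing are placeholders, not APIs defined by the patent.

```python
import torch

@torch.no_grad()
def predict_video(model, frames, clip_len=5):
    # frames: list of preprocessed (C, H, W) tensors in temporal order
    model.eval()
    saliency_maps = []
    for i in range(0, len(frames), clip_len):
        clip = torch.stack(frames[i:i + clip_len], dim=1).unsqueeze(0)  # (1, C, T, H, W)
        pred = model(clip)                     # per-frame saliency, e.g. (1, 1, T, H, W)
        saliency_maps.extend(pred.squeeze(0).unbind(dim=1))
    return saliency_maps                       # stitch back in temporal order for output
```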

Claims (4)

1. A video salient object detection method based on a multi-scale constrained self-attention mechanism, characterized in that the method comprises three stages of spatial feature training, temporal feature training and model deployment, with the following specific steps:
step 1, spatial feature training;
step 1.1, collect an image saliency data set;
step 1.2, preprocess the image saliency data set, including random flipping and scale transformation;
step 1.3, train a backbone network on the image saliency data set with the BP algorithm to obtain an image saliency feature extraction network;
step 2, temporal feature training;
step 2.1, collect a video saliency data set;
step 2.2, perform data enhancement on the video saliency data set, including random flipping and extracting training frames at different interval lengths;
step 2.3, during training, extract a clip of video frames from the video saliency data set and extract spatial features frame by frame with the neural network;
step 2.4, group the extracted spatial features and set a window of a different scale for each group;
step 2.5, for each group of spatial features, exploiting the fact that an object does not undergo large displacement between frames, generate an attention map for each position by measuring the similarity between the feature at that position on the feature map and the features in the spatially neighboring region of that position on the adjacent frames, where the size of the neighboring region is given by the scale window preset in step 2.4;
step 2.6, for each group of spatial features and for each position on the feature map, aggregate the temporal information of the surrounding frames using the attention map generated in step 2.5 as weights, obtaining the spatio-temporal feature of that position;
step 2.7, linearly fuse the spatio-temporal features obtained from the different groups, and pass the fused spatio-temporal features through a decoder to obtain the predicted saliency result;
step 2.8, repeat the training process until convergence to obtain the trained neural network parameters;
step 3, model deployment;
step 3.1, acquire the video to be detected;
step 3.2, split the video to be detected into frames and group the frames into mini-batches of a given size;
step 3.3, initialize the neural network and load the parameters trained in step 2.8;
step 3.4, perform video saliency prediction on each mini-batch formed in step 3.2 and assemble the detection results into a result video.
2. The method for video salient object detection based on the multi-scale constrained self-attention mechanism as claimed in claim 1, wherein the step 2.4 comprises the following sub-steps: for the spatial features X of the input video clip, the size of X is T x H x W x C, where T, H, W and C respectively denote the number of frames, the height, the width and the number of feature channels of the video spatial features; three convolution kernels of size 1 x 1 are used to convolve the spatial features X of the video clip and linearly project X into three subspaces Q, K and V, where Q, K and V respectively represent the query features, the key (metric) features and the value features; Q, K and V are split along the feature channel into g feature groups, each with C/g-dimensional features, and for the features Qi, Ki and Vi of each feature group, different window radius parameters ri and window dilation parameters di are set to initialize windows of different sizes; the window is global in the temporal dimension and, in the spatial dimension, is a region centered on the query point, so the window localizes the query position well in the temporal context and facilitates the transfer of information between frames.
3. The method for video salient object detection based on the multi-scale constrained self-attention mechanism as claimed in claim 2, wherein the step 2.5 comprises the following sub-steps: for the grouped i-th group of query features Qi, all spatial positions are traversed and the feature vector at each position is extracted as the query feature vector; through the window initialized in step 2.4, the key features Ki of the elements inside the window are extracted and a dot product with the query feature Qi is computed to obtain the similarity, generating an attention map for each query position.
4. The method for video salient object detection based on the multi-scale constrained self-attention mechanism according to claim 2 or 3, wherein the step 2.6 comprises the following steps: for the grouped i-th group of query features Qi, all spatial positions are traversed, and the features of Vi within the window range are weighted and summed using the attention map generated in step 2.5.
CN202010024556.XA 2020-01-10 2020-01-10 Video salient object detection method based on multi-scale constrained self-attention mechanism Active CN111242003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010024556.XA CN111242003B (en) 2020-01-10 2020-01-10 Video salient object detection method based on multi-scale constrained self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010024556.XA CN111242003B (en) 2020-01-10 2020-01-10 Video salient object detection method based on multi-scale constrained self-attention mechanism

Publications (2)

Publication Number Publication Date
CN111242003A CN111242003A (en) 2020-06-05
CN111242003B (en) 2022-05-27

Family

ID=70872403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010024556.XA Active CN111242003B (en) 2020-01-10 2020-01-10 Video salient object detection method based on multi-scale constrained self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111242003B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN113591868B (en) * 2021-07-30 2023-09-01 南开大学 Video target segmentation method and system based on full duplex strategy

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9196053B1 (en) * 2007-10-04 2015-11-24 Hrl Laboratories, Llc Motion-seeded object based attention for dynamic visual imagery
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN110210278A (en) * 2018-11-21 2019-09-06 腾讯科技(深圳)有限公司 A kind of video object detection method, device and storage medium
CN109993774A (en) * 2019-03-29 2019-07-09 大连理工大学 Online Video method for tracking target based on depth intersection Similarity matching
CN110097115A (en) * 2019-04-28 2019-08-06 南开大学 A kind of saliency object detecting method based on attention metastasis
CN110288597A (en) * 2019-07-01 2019-09-27 哈尔滨工业大学 Wireless capsule endoscope saliency detection method based on attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Jingyu Lu et al., "An effective visual saliency detection method based on maximum entropy random walk", 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2016. *
Mohammad Shokri et al., "Salient Object Detection in Video using Deep Non-Local Neural Networks", arXiv preprint, 2018. *
Quan Rong et al., "Unsupervised Salient Object Detection via Inferring From Imperfect Saliency Models", IEEE Transactions on Multimedia, vol. 20, no. 5, 2018. *
Cong Runmin et al., "Research Progress of Video Saliency Detection" (视频显著性检测研究进展), Journal of Software (软件学报), vol. 29, no. 8, 2018. *
Wen Yahong, "Research on Image Retrieval Methods Based on Salient Object Detection" (基于显著性目标检测的图像检索方法研究), China Master's Theses Full-text Database, Information Science and Technology, 2019. *

Also Published As

Publication number Publication date
CN111242003A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN108764308B (en) Pedestrian re-identification method based on convolution cycle network
CN112184752A (en) Video target tracking method based on pyramid convolution
CN109446889B (en) Object tracking method and device based on twin matching network
CN113516012B (en) Pedestrian re-identification method and system based on multi-level feature fusion
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN106570893A (en) Rapid stable visual tracking method based on correlation filtering
CN110796057A (en) Pedestrian re-identification method and device and computer equipment
CN108399435B (en) Video classification method based on dynamic and static characteristics
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN107067410B (en) Manifold regularization related filtering target tracking method based on augmented samples
CN111369522B (en) Light field significance target detection method based on generation of deconvolution neural network
CN108154133B (en) Face portrait-photo recognition method based on asymmetric joint learning
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN111242003B (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
Zhu et al. A multi-scale and multi-level feature aggregation network for crowd counting
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN112991394B (en) KCF target tracking method based on cubic spline interpolation and Markov chain
CN113158904B (en) Twin network target tracking method and device based on double-mask template updating
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113780129A (en) Motion recognition method based on unsupervised graph sequence predictive coding and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant