CN111161306A - Video target segmentation method based on motion attention
- Publication number: CN111161306A (application CN201911402450.2A)
- Authority: CN (China)
- Prior art keywords: current frame, attention, feature map, frame, motion
- Prior art date: 2019-12-31
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/215—Motion-based segmentation
- G06T5/30—Erosion or dilatation, e.g. thinning
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention provides a video target segmentation method based on motion attention, in which the channel feature map output by a channel attention module and the position feature map output by a motion attention module are added to obtain the segmentation result of the current frame. The input of the channel attention module is the current frame feature map Ft and the appearance feature map F0 of the target object provided by the first frame; the channel attention module computes the correlation between the channels of the input feature maps Ft and F0, and the output channel feature map reflects the object in the current frame whose appearance is closest to the target object. The input of the motion attention module is the current frame feature map Ft and the position information Ht-1 of the target object predicted by the memory module in the previous frame's motion attention network; the motion attention module computes the correlation between the positions of the input feature maps Ft and Ht-1, and the output position feature map reflects the approximate position of the target object in the current frame. The invention combines the two factors of appearance and position to achieve more accurate segmentation of the video target.
Description
Technical Field
The invention belongs to the field of image processing and computer vision, and relates to a video target segmentation method, in particular to a video target segmentation method based on motion attention.
Background
Video object segmentation is a prerequisite for many video tasks and plays a significant role in fields such as object recognition and video compression. Video object segmentation can be defined as tracking the target object and segmenting it according to the object mask. Depending on whether an initial mask is given, video object segmentation can be divided into semi-supervised and unsupervised settings: in the semi-supervised setting, the segmentation mask is manually initialized in the first frame of the video and the target object is then tracked and segmented; unsupervised methods automatically segment the target objects in a given video according to some mechanism, without any prior information.
In a video scene, background clutter, object deformation and fast object motion all affect the segmentation result. Traditional video object segmentation techniques adopt a rigid background motion model combined with scene priors to segment the target object; because of these assumptions, they have clear limitations in practical applications. Most existing video object segmentation techniques adopt convolutional neural networks, but they also have various shortcomings: for example, many of them segment the moving objects in a video by relying on the optical flow between frames, so the segmentation result is easily affected by optical flow estimation errors. In addition, these methods do not fully exploit the temporal information in the video and do not memorize the relevant features of the target object in the scene.
To address these problems, the invention studies the segmentation of moving targets in the semi-supervised setting and proposes a video target segmentation method based on motion attention with a memory module.
Disclosure of Invention
The invention aims to solve the following problems: in video object segmentation, if the target object of the current frame is determined only by the segmentation result of the previous frame, the accurate position of the target object cannot be obtained, and over-reliance on the previous frame's segmentation result can even cause the target to drift; most existing video object segmentation methods based on motion information segment the object using the optical flow between the current frame and the previous frame, which not only requires a large amount of computation but also restricts the segmentation to specific motion patterns. A new video object segmentation method based on motion information is therefore needed to improve the segmentation quality.
To solve the above problems, the present invention provides a video target segmentation method based on motion attention, which fuses the motion and temporal information in a video sequence and performs video object segmentation based on an attention mechanism. The method comprises the following steps:
1) Construct a segmentation backbone network, and input the current frame It and the first frame I0 into the backbone network to obtain the corresponding feature maps Ft and F0;
2) Construct a motion attention network. Take the current frame feature map Ft, the first frame feature map F0 and the hidden state Ht-1 of the previous frame's memory module as the inputs of the motion attention network; the output Fout of the motion attention network is the segmentation result of the current frame;
3) Construct a loss function consisting of two parts: the first part is a pixel-level loss function and the second part is a structural similarity loss function.
Further, constructing the segmentation backbone network in step 1) to obtain the feature maps Ft and F0 specifically comprises:
1.1) Modify the Resnet-50 network and incorporate dilated (atrous) convolutions. First, set the dilation factor of conv_1 in Resnet-50 to 2; second, delete the pooling layer in Resnet-50; then set the stride of the conv_3 and conv_4 layers in Resnet-50 to 1; finally, use the modified Resnet-50 as the backbone network, whose output feature map is 1/8 of the original image size;
1.2) Input the current frame It into the backbone network to obtain the feature map Ft of the current frame;
1.3) Input the first frame I0 into the backbone network to obtain the feature map F0 of the first frame.
Further, step 2) constructs the motion attention network to obtain the segmentation result of the current frame.
The motion attention network consists of a channel attention module, a motion attention module and a memory module, which are constructed as follows:
2.1) Construct the channel attention module, with Ft and F0 as its inputs. F0 provides appearance information of the target object such as color and pose. First, Ft and F0 are combined by matrix multiplication and a softmax function to obtain the channel-weight attention map Xc of the target object; Xc describes the correlation between the channels of the current frame and those of the first frame: the higher the correlation, the higher the response value and the more similar the features. Then Xc is multiplied with Ft for feature enhancement, and the result is added to Ft as a residual to obtain the channel feature map;
2.2) Construct the motion attention module, with Ft and Ht-1 as its inputs. Ht-1 provides the position information of the target object in the current frame, predicted from the previous frame's segmentation result and the temporal information. First, the feature map Ft is passed through two convolution layers with 1×1 kernels to obtain two feature maps, denoted Fa and Fb; then Fa and Ht-1 are combined by matrix multiplication and a softmax function to obtain the position-weight attention map Xs of the target object; finally, Xs is multiplied with Fb for feature enhancement, and the result is added to Ft as a residual to obtain the position feature map;
2.3) Add the channel feature map and the position feature map to obtain the final segmentation result Fout of the current frame.
2.4) Construct the memory module convLSTM. The current frame segmentation result Fout, the memory cell Ct-1 output by the previous frame's memory module and the hidden state Ht-1 of the previous frame's memory module are the inputs of this module; its outputs are the memory cell Ct and the hidden state Ht.
The convLSTM consists of an input gate, a forget gate and an output gate.
Further, constructing the memory module convLSTM in step 2.4) specifically comprises:
2.4.1) First, the forget gate discards part of the state information in the memory cell Ct-1 of the previous frame's memory module; then the input gate stores the useful information of the current frame segmentation result Fout into the previous frame's memory cell Ct-1; finally, the updated memory cell Ct of the current frame is output;
2.4.2) First, the output gate filters the current frame segmentation result Fout and the hidden state Ht-1 of the previous frame's memory module with a sigmoid function to determine the information to be output; then the tanh activation function is applied to the memory cell Ct of the current frame; finally, the information to be output is multiplied element-wise with the activated memory cell Ct of the current frame to obtain and output the hidden state Ht of the current frame.
Advantageous effects
The invention provides a video target segmentation method based on motion attention. The feature maps of the first frame and the current frame are first obtained; then the current frame feature map Ft, the appearance feature map F0 of the target object provided by the first frame, and the position information Ht-1 of the target object predicted by the memory module in the previous frame's motion attention network are input into the current frame's motion attention network to obtain the segmentation result of the current frame. The method copes with the diversity of motion patterns that other segmentation methods cannot handle. It is suitable for video object segmentation and achieves good robustness and accurate segmentation.
The invention has the following characteristics: first, instead of relying only on the segmentation result of the previous frame, it segments the target object more accurately by means of the appearance information of the target object in the first frame and the temporal information of the target object in the video sequence; second, the motion attention network greatly suppresses useless features and improves the robustness of the model.
Drawings
FIG. 1 is a flow chart of the video target segmentation method based on motion attention according to the present invention;
FIG. 2 is a network architecture diagram of the video target segmentation method based on motion attention according to the present invention;
FIG. 3 is a structure diagram of Resnet-50;
FIG. 4 is a structure diagram of the modified Resnet-50 used in the video target segmentation method based on motion attention according to the present invention.
Detailed Description
The invention provides a video target segmentation method based on motion attention: the feature maps of the first frame and the current frame are first obtained, and then the feature map of the first frame, the feature map of the current frame and the target object position information predicted by the memory module in the previous frame's motion attention network are input into the motion attention network to obtain the segmentation result of the current frame. The method is suitable for video object segmentation and achieves good robustness and accurate segmentation results.
The invention is explained in more detail below with reference to specific examples and the accompanying drawings.
The invention comprises the following steps:
1) Acquire the YouTube and DAVIS datasets, used respectively as the training set and the test set of the model;
2) Preprocess the training data. Crop each training sample (video frame) and the first-frame mask of the video sequence, resize the images to 224×224 resolution, and perform data augmentation such as rotation;
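A minimal preprocessing sketch for this step, assuming torchvision transforms and an illustrative rotation range; in practice the same geometric transform must be applied to a frame and its first-frame mask:

```python
from torchvision import transforms

# Frames and first-frame masks are resized to 224 x 224; rotation is used for
# augmentation (the 15-degree range here is an illustrative assumption).
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
])
```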
3) Construct the segmentation backbone network; input the first frame segmentation mask of the video sequence and the current frame to obtain their segmentation feature maps;
3.1) First, set the dilation factor of conv_1 in Resnet-50 to 2; second, delete the pooling layer in Resnet-50; then set the stride of the conv_3 and conv_4 layers in Resnet-50 to 1; finally, use the modified Resnet-50 as the backbone network, whose output feature map is 1/8 of the original image size, as shown in FIG. 4;
3.2) The resolution of the first frame segmentation mask is 224×224; it is input into the backbone network to obtain the feature map F0 of the first frame segmentation mask, of size 2048×28×28;
3.3) The resolution of the current frame is 224×224; it is input into the backbone network to obtain the feature map Ft of the current frame, of size 2048×28×28.
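As an illustration of steps 3.1)-3.3), the following is a minimal PyTorch sketch of a dilated Resnet-50 backbone. It assumes torchvision's resnet50 with its replace_stride_with_dilation option; the mapping of the patent's conv_1/conv_3/conv_4 names onto torchvision layers is an assumption, and the sketch is written to reproduce the stated output size of 2048×28×28 for a 224×224 input (1/8 of the original resolution).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DilatedResNet50Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Dilated (atrous) convolutions keep the last stage at 1/8 resolution.
        net = resnet50(weights=None, replace_stride_with_dilation=[False, False, True])
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)  # the max-pool layer is deleted
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)    # 1/2 of the input resolution
        x = self.layer1(x)  # 1/2
        x = self.layer2(x)  # 1/4
        x = self.layer3(x)  # 1/8
        x = self.layer4(x)  # 1/8, 2048 channels
        return x

backbone = DilatedResNet50Backbone()
f_t = backbone(torch.randn(1, 3, 224, 224))  # feature map Ft of the current frame
print(f_t.shape)                             # torch.Size([1, 2048, 28, 28])
```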
4) Construct the motion attention network. Input the current frame feature map Ft, the hidden state Ht-1 of the previous frame's memory module and the first frame feature map F0 to obtain the segmentation result Fout of the current frame. The motion attention network consists of a channel attention module, a motion attention module and a memory module.
4.1) Construct the channel attention module. Input the current frame feature map Ft and the first frame feature map F0 to obtain the channel feature map Ec; the steps are as follows:
4.1.1) Call the reshape function in python to adjust the size of Ft, converting it to a feature map F't of size n×2048; call the reshape function in python to adjust the size of F0, converting it to a feature map F'0 of size 2048×n, where n represents the total number of pixels of the current frame;
4.1.2) Multiply F'0 and F't as matrices and apply the softmax function.
Matrix multiplication exploits and fuses global information, and its effect is similar to a fully connected operation. A fully connected operation can take the relations between all positions into account but destroys the spatial structure, so matrix multiplication is used instead of full connection, preserving the spatial information as much as possible while still using the global information;
4.1.3) This yields the channel-weight attention map Xc of size 2048×2048. The element x_ji in the j-th row and i-th column of Xc is:
x_ji = exp(F'0j · F'ti) / Σi exp(F'0j · F'ti), with the sum running over the C channels,
where F'ti denotes the i-th column of F't, i.e. the i-th channel of the current frame feature map Ft; F'0j denotes the j-th row of F'0, i.e. the j-th channel of the first frame feature map F0; x_ji measures the correlation between the i-th channel of the current frame feature map Ft and the j-th channel of the first frame feature map F0; and C denotes the number of channels of the current frame feature map Ft.
4.1.4) The channel-weight attention map Xc is multiplied with the feature map F't to enhance the feature map Ft of the current frame, and the result is added to the current frame feature map Ft as a residual to obtain the channel feature map Ec:
Ecj = β · Σi (x_ji · F'ti) + Ftj, with the sum running over the C channels,
where β denotes the channel attention weight, whose initial value is set to zero and to which the model assigns a larger, more reasonable weight through learning; F'ti denotes the i-th column of F't, i.e. the i-th channel of the current frame feature map Ft; Ftj denotes the j-th channel of the current frame feature map Ft; and C denotes the number of channels of the current frame feature map Ft.
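A minimal PyTorch sketch of the channel attention module described in steps 4.1.1)-4.1.4); class and variable names are illustrative assumptions, and the learnable weight beta is initialized to zero as stated above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        # beta starts at zero; the model learns a larger weight during training.
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, f_t, f_0):
        b, c, h, w = f_t.shape                      # e.g. 1 x 2048 x 28 x 28
        ft_flat = f_t.view(b, c, h * w)             # F't viewed as C x n per sample
        f0_flat = f_0.view(b, c, h * w)             # F'0 viewed as C x n per sample
        # Channel-weight attention map Xc (C x C): channel correlation between
        # the first frame feature map and the current frame feature map.
        x_c = torch.softmax(torch.bmm(f0_flat, ft_flat.transpose(1, 2)), dim=-1)
        # Enhance the current-frame features and add the residual connection.
        enhanced = torch.bmm(x_c, ft_flat).view(b, c, h, w)
        return self.beta * enhanced + f_t           # channel feature map Ec

e_c = ChannelAttention()(torch.randn(1, 2048, 28, 28), torch.randn(1, 2048, 28, 28))
```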
4.2) Construct the motion attention module. Input the current frame feature map Ft and the hidden state Ht-1 of the previous frame's memory module to obtain the position feature map Es; the steps are as follows:
4.2.1) Ft is passed through two convolution layers with 1×1 kernels to obtain two feature maps, denoted Fa and Fb, each of size 2048×28×28;
4.2.2) Call the reshape function in python to adjust the size of Fa, converting it to a feature map F'a of size 2048×n; call the reshape function in python to adjust the size of Fb, converting it to a feature map F'b of size 2048×n; call the reshape and transpose functions in python to adjust the size of Ht-1, converting it to a feature map H't-1 of size n×2048, where n represents the total number of pixels of the current frame;
4.2.3) H't-1 and F'a are multiplied as matrices and the softmax function is applied to obtain the position-weight attention map Xs of size n×n:
s_ji = exp(hj · F'ai) / Σi exp(hj · F'ai), with the sum running over the n positions,
where n denotes the number of pixels of the current frame; F'ai denotes the i-th column of F'a, i.e. the i-th position of Fa; hj denotes the j-th row of H't-1, i.e. the j-th position of Ht-1; s_ji is the element in the j-th row and i-th column of the position-weight attention map Xs, and measures the correlation between the j-th position of the hidden state Ht-1 and the i-th position of the current frame feature map Ft.
4.2.4) The position-weight attention map Xs is multiplied with the feature map F'b as matrices to enhance the feature map Ft of the current frame, and the result is added to Ft as a residual to obtain the fused position feature map Es:
Esj = α · Σi (s_ji · F'bi) + Ftj, with the sum running over the n positions,
where α denotes the position attention weight, whose initial value is set to zero and to which the model assigns a larger, more reasonable weight through learning; F'bi denotes the i-th column of F'b, i.e. the i-th position of the current frame feature map Ft; Ftj denotes the j-th position of the current frame feature map Ft; and n denotes the total number of pixels of the current frame.
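A minimal PyTorch sketch of the motion attention module described in steps 4.2.1)-4.2.4); it assumes that the hidden state Ht-1 has the same 2048-channel, 28×28 layout as Ft, and the learnable weight alpha is initialized to zero as stated above.

```python
import torch
import torch.nn as nn

class MotionAttention(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=1)  # produces Fa
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)  # produces Fb
        # alpha starts at zero; the model learns a larger weight during training.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, f_t, h_prev):
        b, c, h, w = f_t.shape
        n = h * w
        f_a = self.conv_a(f_t).view(b, c, n)           # F'a : C x n
        f_b = self.conv_b(f_t).view(b, c, n)           # F'b : C x n
        h_flat = h_prev.view(b, c, n).transpose(1, 2)  # H't-1 : n x C
        # Position-weight attention map Xs (n x n): correlation between the
        # predicted target positions in Ht-1 and the positions of the current frame.
        x_s = torch.softmax(torch.bmm(h_flat, f_a), dim=-1)
        # Enhance the current-frame features and add the residual connection.
        enhanced = torch.bmm(f_b, x_s.transpose(1, 2)).view(b, c, h, w)
        return self.alpha * enhanced + f_t             # position feature map Es

e_s = MotionAttention()(torch.randn(1, 2048, 28, 28), torch.randn(1, 2048, 28, 28))
```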
4.3) Add the position feature map Es and the channel feature map Ec to obtain the final segmentation result Fout of the current frame.
4.4) Construct the memory module. Input the current frame segmentation result Fout, the hidden state Ht-1 of the previous frame's memory module and the memory cell Ct-1 of the previous frame's memory module. The memory module convLSTM of the current frame consists of a forget gate ft, an input gate it and an output gate ot;
4.4.1) Each value of the tensor output by the forget gate lies between 0 and 1, where 0 means completely forgetting and 1 means completely retaining; the forget gate can therefore selectively discard information in the previous frame's memory cell Ct-1. Its formula is:
ft = σ(Wxf * Fout + Whf * Ht-1 + bf)
where * denotes the convolution operation, σ denotes the sigmoid function, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame's memory module; Wxf and Whf are weight parameters with values between 0 and 1; bf is the bias, whose initial value is set to 0.1 and to which the model assigns a more reasonable value through learning;
4.4.2) The input gate selects the content to be updated from the current frame segmentation result Fout; its formula is:
it = σ(Wxi * Fout + Whi * Ht-1 + bi)
where * denotes the convolution operation, σ denotes the sigmoid function, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame's memory module; Wxi and Whi are weight parameters with values between 0 and 1; bi is the bias, whose initial value is set to 0.1 and to which the model assigns a more reasonable value through learning;
4.4.3) The forget gate discards part of the previous frame's memory cell Ct-1, the useful information is stored into the previous frame's memory cell Ct-1, and the updated memory cell Ct of the current frame is output; its formula is:
Ct = ft ⊙ Ct-1 + it ⊙ tanh(Wxc * Fout + Whc * Ht-1 + bc)
where * denotes the convolution operation, ⊙ is the Hadamard product, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame's memory module; Wxc and Whc are weight parameters with values between 0 and 1; bc is the bias, whose initial value is set to 0.1 and to which the model assigns a more reasonable value through learning;
4.4.4) The hidden state Ht output by the current frame memory module is:
ot = σ(Wxo * Fout + Who * Ht-1 + bo)
Ht = ot ⊙ tanh(Ct)
where tanh is the activation function, ⊙ is the Hadamard product, ot denotes the output gate of the current frame memory module, * denotes the convolution operation, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame's memory module; Wxo and Who are weight parameters with values between 0 and 1; bo is the bias, whose initial value is set to 0.1 and to which the model assigns a more reasonable value through learning.
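A minimal PyTorch sketch of a convLSTM cell implementing the gate equations of steps 4.4.1)-4.4.4); the 3×3 kernel size and the fused four-gate convolution are implementation assumptions, and the gate biases are initialized to 0.1 as stated above.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution computes all four gates (forget, input, candidate, output).
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=pad)
        nn.init.constant_(self.gates.bias, 0.1)  # biases bf, bi, bc, bo start at 0.1

    def forward(self, f_out, h_prev, c_prev):
        z = self.gates(torch.cat([f_out, h_prev], dim=1))
        f, i, g, o = torch.chunk(z, 4, dim=1)
        f_t = torch.sigmoid(f)          # forget gate ft
        i_t = torch.sigmoid(i)          # input gate it
        g_t = torch.tanh(g)             # candidate memory
        o_t = torch.sigmoid(o)          # output gate ot
        c_t = f_t * c_prev + i_t * g_t  # memory cell Ct
        h_t = o_t * torch.tanh(c_t)     # hidden state Ht
        return h_t, c_t

# Reduced channel count for the example; the feature maps in the text have 2048 channels.
cell = ConvLSTMCell(in_channels=64, hidden_channels=64)
x = torch.randn(1, 64, 28, 28)
h_t, c_t = cell(x, torch.zeros_like(x), torch.zeros_like(x))
```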
6) The loss function used by the segmentation model consists of two parts: the first part is a pixel-level loss function and the second part is a structural similarity loss function, designed as follows:
l=lcross+lssim
6.1) lcross denotes the pixel-level cross-entropy loss function:
lcross = -Σ(r,c) [ T(r,c)·log S(r,c) + (1-T(r,c))·log(1-S(r,c)) ]
where T(r,c) denotes the pixel value at row r and column c of the target mask, and S(r,c) denotes the pixel value at row r and column c of the segmentation result;
6.2) lssim denotes the structural similarity loss function, which compares the difference between the target mask and the segmentation result in terms of luminance, contrast and structure:
lssim = 1 - ((2·μx·μy + C1)·(2·σxy + C2)) / ((μx² + μy² + C1)·(σx² + σy² + C2))
where Ax and Ay denote regions of the same size cropped from the segmentation map predicted by the model and from the target mask respectively; xi denotes the pixel value of the i-th pixel in region Ax and yi the pixel value of the i-th pixel in region Ay; N denotes the total number of pixels in the cropped regions; C1 and C2 are constants that prevent the denominator from being zero, with C1 set to 6.5025 and C2 set to 58.5225; μx denotes the average luminance of Ax and μy the average luminance of Ay; σx² denotes the variance of luminance in Ax and σy² the variance of luminance in Ay; and σxy denotes the covariance describing the structural correlation between Ax and Ay.
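A minimal PyTorch sketch of the two-part loss of step 6); the window size used to estimate the local SSIM statistics and the 1 - SSIM form of the loss are assumptions, while the constants C1 = 6.5025 and C2 = 58.5225 follow the text above.

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(pred, target):
    # pred and target are probability maps in [0, 1] of shape (B, 1, H, W).
    return F.binary_cross_entropy(pred, target)

def ssim_loss(pred, target, window=11, c1=6.5025, c2=58.5225):
    # Local statistics estimated with an average-pooling window (assumed size 11).
    mu_x = F.avg_pool2d(pred, window, stride=1)
    mu_y = F.avg_pool2d(target, window, stride=1)
    var_x = F.avg_pool2d(pred * pred, window, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(target * target, window, stride=1) - mu_y ** 2
    cov_xy = F.avg_pool2d(pred * target, window, stride=1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim.mean()

def total_loss(pred, target):
    # l = lcross + lssim
    return cross_entropy_loss(pred, target) + ssim_loss(pred, target)

pred = torch.rand(1, 1, 224, 224)
target = (torch.rand(1, 1, 224, 224) > 0.5).float()
print(total_loss(pred, target))
```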
7) Train the model. Select the YouTube dataset from step 1) as the training set, set the batch size to 4 and the learning rate to 1e-4; after the first 300,000 training iterations on YouTube, reduce the learning rate to 1e-5 and train for another 100,000 iterations on YouTube; set the weight decay to 0.0005; train the model with the loss function of step 6) until it converges.
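A minimal PyTorch sketch of the training schedule of step 7); the optimizer choice (Adam) is an assumption, while batch size 4, learning rate 1e-4 dropped to 1e-5 after 300,000 iterations and weight decay 0.0005 follow the text above.

```python
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=1)  # placeholder for the full segmentation network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0005)
# Drop the learning rate from 1e-4 to 1e-5 after 300,000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[300_000], gamma=0.1)

# Inside the training loop (batch size 4, one scheduler step per iteration):
#   loss = total_loss(model(frames), masks)   # loss from the sketch in step 6)
#   optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```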
The invention has wide application in video object segmentation and computer vision, for example in target tracking and image recognition. The invention is described in detail below with reference to the accompanying drawings.
(1) Construct the segmentation backbone network, and input the current frame It and the first frame I0 into the backbone network to obtain the corresponding feature maps Ft and F0;
(2) Construct the motion attention network. The current frame feature map Ft and the first frame feature map F0 are input to the channel attention module of the motion attention network to obtain the channel feature map; Ft and the hidden state Ht-1 of the previous frame's memory module are input to the motion attention module of the motion attention network to obtain the position feature map; the position feature map and the channel feature map are added to obtain the output Fout of the motion attention network, i.e. the segmentation result of the current frame. The segmentation result Fout of the current frame, the memory cell Ct-1 output by the previous frame's memory module and the hidden state Ht-1 of the previous frame's memory module are input to the memory module of the motion attention network to obtain the memory cell Ct and the hidden state Ht. The memory cell Ct stores and updates the temporal information of the target object based on the segmentation result of the current frame; Ht provides the position information of the target object in the next frame, predicted from the current segmentation result and the temporal information. The memory module convLSTM retains both the spatial information of the target object in the current frame and its temporal information, so the long-range position dependency of the target object can be obtained.
The method is implemented with the PyTorch framework and the Python language on a GTX 1080Ti GPU under the Ubuntu 14.04 64-bit operating system.
The invention provides a video target segmentation method based on motion attention which is suitable for segmenting moving objects in videos, with good robustness and accurate segmentation results. Experiments show that the method can effectively segment moving objects.
The above description is only an embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A video target segmentation method based on motion attention, which fuses temporal information in a video sequence to perform video target segmentation, characterized in that: the method adds the channel feature map Ec output by a channel attention module and the position feature map output by a motion attention module to obtain the segmentation result of the current frame; wherein the input of the channel attention module is the current frame feature map Ft and the appearance feature map F0 of the target object provided by the first frame, the channel attention module computes the correlation between the channels of the input feature maps Ft and F0, and the output channel feature map reflects the object in the current frame whose appearance is closest to the target object; the input of the motion attention module is the current frame feature map Ft and the position information Ht-1 of the target object predicted by the previous frame's memory module, the motion attention module computes the correlation between the positions of the input feature maps Ft and Ht-1, and the output position feature map reflects the approximate position of the target object in the current frame.
2. The video target segmentation method based on motion attention as claimed in claim 1, wherein the feature maps Ft and F0 are obtained as follows:
1.1) Construct a segmentation backbone network, specifically: modify the Resnet-50 network and incorporate dilated convolutions: first, set the dilation factor of conv_1 in Resnet-50 to 2; second, delete the pooling layer in Resnet-50; then set the stride of the conv_3 and conv_4 layers in Resnet-50 to 1; finally, use the modified Resnet-50 as the backbone network, whose output feature map is 1/8 of the original image size;
1.2) Input the current frame It into the segmentation backbone network to obtain the feature map Ft of the current frame;
1.3) Input the first frame I0 into the segmentation backbone network to obtain the feature map F0 of the first frame.
3. The video target segmentation method based on motion attention as claimed in claim 1, wherein the channel attention module works as follows: first, Ft and F0 are combined by matrix multiplication and a softmax function to obtain the channel-weight attention map Xc of the target object; then Xc is multiplied with Ft and the result is added to Ft as a residual to obtain the channel feature map Ec.
4. The video target segmentation method based on motion attention as claimed in claim 1, wherein the motion attention module works as follows: Ft and Ht-1 are the inputs of the module, where Ht-1 provides the position information of the target object in the current frame predicted from the previous frame's segmentation result and the temporal information; first, the feature map Ft is passed through two convolution layers with 1×1 kernels to obtain two feature maps, denoted Fa and Fb; then Fa and Ht-1 are combined by matrix multiplication and a softmax function to obtain the position-weight attention map Xs of the target object; finally, Xs is multiplied with Fb for feature enhancement, and the result is added to Ft as a residual to obtain the position feature map Es.
5. The video target segmentation method based on motion attention as claimed in claim 1, wherein the memory module convLSTM comprises a forget gate, an input gate and an output gate; the memory module takes the current frame segmentation result Fout, the memory cell Ct-1 output by the previous frame's memory module and the hidden state Ht-1 of the previous frame's memory module as inputs, and its outputs are the memory cell Ct and the hidden state Ht; it works as follows:
first, the forget gate discards part of the state information in the previous frame's memory cell Ct-1; then the input gate stores the useful information of the current frame segmentation result Fout into the previous frame's memory cell Ct-1; finally, the updated memory cell Ct of the current frame is output.
6. The video target segmentation method based on motion attention as claimed in claim 1, wherein the loss function constructed in step 3) consists of two parts: the first part is a pixel-level loss function; the second part is a structural similarity loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911402450.2A CN111161306B (en) | 2019-12-31 | 2019-12-31 | Video target segmentation method based on motion attention |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911402450.2A CN111161306B (en) | 2019-12-31 | 2019-12-31 | Video target segmentation method based on motion attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111161306A true CN111161306A (en) | 2020-05-15 |
CN111161306B CN111161306B (en) | 2023-06-02 |
Family
ID=70559471
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911402450.2A (CN111161306B, Active) | Video target segmentation method based on motion attention | 2019-12-31 | 2019-12-31
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111161306B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090245571A1 (en) * | 2008-03-31 | 2009-10-01 | National Taiwan University | Digital video target moving object segmentation method and system |
WO2018128741A1 (en) * | 2017-01-06 | 2018-07-12 | Board Of Regents, The University Of Texas System | Segmenting generic foreground objects in images and videos |
CN110321761A (en) * | 2018-03-29 | 2019-10-11 | 中国科学院深圳先进技术研究院 | A kind of Activity recognition method, terminal device and computer readable storage medium |
CN109784261A (en) * | 2019-01-09 | 2019-05-21 | 深圳市烨嘉为技术有限公司 | Pedestrian's segmentation and recognition methods based on machine vision |
CN109919044A (en) * | 2019-02-18 | 2019-06-21 | 清华大学 | The video semanteme dividing method and device of feature propagation are carried out based on prediction |
CN110059662A (en) * | 2019-04-26 | 2019-07-26 | 山东大学 | A kind of deep video Activity recognition method and system |
CN110532955A (en) * | 2019-08-30 | 2019-12-03 | 中国科学院宁波材料技术与工程研究所 | Example dividing method and device based on feature attention and son up-sampling |
Non-Patent Citations (1)
Title |
---|
ZHANG Jianxing et al.: "Attention-based image segmentation combining target color features", Computer Engineering and Applications *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021250811A1 (en) * | 2020-06-10 | 2021-12-16 | 日本電気株式会社 | Data processing device, data processing method, and recording medium |
JP7524946B2 (en) | 2020-06-10 | 2024-07-30 | 日本電気株式会社 | Data processing device, data processing method and recording medium |
CN111968123B (en) * | 2020-08-28 | 2024-02-02 | 北京交通大学 | Semi-supervised video target segmentation method |
CN111968123A (en) * | 2020-08-28 | 2020-11-20 | 北京交通大学 | Semi-supervised video target segmentation method |
CN112580473A (en) * | 2020-12-11 | 2021-03-30 | 北京工业大学 | Motion feature fused video super-resolution reconstruction method |
CN112580473B (en) * | 2020-12-11 | 2024-05-28 | 北京工业大学 | Video super-resolution reconstruction method integrating motion characteristics |
CN112669324B (en) * | 2020-12-31 | 2022-09-09 | 中国科学技术大学 | Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution |
CN112669324A (en) * | 2020-12-31 | 2021-04-16 | 中国科学技术大学 | Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution |
CN112784750B (en) * | 2021-01-22 | 2022-08-09 | 清华大学 | Fast video object segmentation method and device based on pixel and region feature matching |
CN112784750A (en) * | 2021-01-22 | 2021-05-11 | 清华大学 | Fast video object segmentation method and device based on pixel and region feature matching |
CN113570607A (en) * | 2021-06-30 | 2021-10-29 | 北京百度网讯科技有限公司 | Target segmentation method and device and electronic equipment |
CN113570607B (en) * | 2021-06-30 | 2024-02-06 | 北京百度网讯科技有限公司 | Target segmentation method and device and electronic equipment |
CN113436199B (en) * | 2021-07-23 | 2022-02-22 | 人民网股份有限公司 | Semi-supervised video target segmentation method and device |
CN113436199A (en) * | 2021-07-23 | 2021-09-24 | 人民网股份有限公司 | Semi-supervised video target segmentation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111161306B (en) | 2023-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111161306B (en) | Video target segmentation method based on motion attention | |
CN111476219B (en) | Image target detection method in intelligent home environment | |
CN110335290B (en) | Twin candidate region generation network target tracking method based on attention mechanism | |
WO2020238560A1 (en) | Video target tracking method and apparatus, computer device and storage medium | |
CN108932500B (en) | A kind of dynamic gesture identification method and system based on deep neural network | |
CN109949255B (en) | Image reconstruction method and device | |
CN111507993A (en) | Image segmentation method and device based on generation countermeasure network and storage medium | |
CN110910391A (en) | Video object segmentation method with dual-module neural network structure | |
CN112699958A (en) | Target detection model compression and acceleration method based on pruning and knowledge distillation | |
CN111861925A (en) | Image rain removing method based on attention mechanism and gate control circulation unit | |
CN112365514A (en) | Semantic segmentation method based on improved PSPNet | |
CN111368935B (en) | SAR time-sensitive target sample amplification method based on generation countermeasure network | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN112862792A (en) | Wheat powdery mildew spore segmentation method for small sample image data set | |
CN113298032A (en) | Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning | |
CN111046771A (en) | Training method of network model for recovering writing track | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN113095254A (en) | Method and system for positioning key points of human body part | |
CN116758104B (en) | Multi-instance portrait matting method based on improved GCNet | |
CN114882493A (en) | Three-dimensional hand posture estimation and recognition method based on image sequence | |
CN116030498A (en) | Virtual garment running and showing oriented three-dimensional human body posture estimation method | |
CN113421276A (en) | Image processing method, device and storage medium | |
CN113538527A (en) | Efficient lightweight optical flow estimation method | |
CN116246110A (en) | Image classification method based on improved capsule network | |
CN112183602A (en) | Multi-layer feature fusion fine-grained image classification method with parallel rolling blocks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |