CN113744306A - Video target segmentation method based on time sequence content perception attention mechanism - Google Patents
- Publication number
- CN113744306A (application number CN202110634977.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- feature
- video
- time sequence
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/215—Motion-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/269—Analysis of motion using gradient-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20004—Adaptive image processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20092—Interactive image processing based on input by user
- G06T2207/20104—Interactive definition of region of interest [ROI]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a video target segmentation method based on a time sequence content perception attention mechanism, addressing the heavy computation of a global matching attention mechanism. Under the assumption of temporal continuity, several groups of optical flow fields between the current frame and past frames are learned, and feature vectors related to the features to be matched (the current-frame features) are sampled from the past frames according to the positional mapping of the flow fields. Matching is then performed between the features to be matched and the sampled past-frame feature vectors, i.e., similar-region matching is carried out locally rather than over the whole image. This reduces the computation of matching; since the similar features are gathered with the current-frame features as reference, the method runs fast, and removing irrelevant noise features yields higher matching accuracy.
Description
Technical Field
The invention relates to machine learning technology, and in particular to a video object segmentation technique based on machine learning.
Background
Video object segmentation is a fundamental task in computer vision. It requires assigning a label to every pixel of every frame in a video, i.e., separating foreground objects from the background with binary labels, and draws on pattern recognition, machine learning, and related fields. Video object segmentation is important for a wide range of applications such as video editing, object tracking, and scene understanding. With advances in computer science, deep learning, and real-world demand, it has attracted many researchers in recent years and has seen substantial research progress. By degree of supervision, video object segmentation tasks fall into three major categories: unsupervised, semi-supervised, and interactive. Unsupervised video object segmentation must discover and segment the main objects in the video, meaning the algorithm decides on its own which object is primary. The semi-supervised task is given a first frame or key frame with mask information. In interactive video object segmentation, a person first draws a rough outline of the target object with a mouse, and a segmentation algorithm then propagates the segmentation through the video.
Current attention-based video target segmentation algorithms perform global matching in the temporal feature matching stage: the features of the current frame are matched one by one against all features of the past frames. This entails a large amount of computation and makes the model slow.
Disclosure of Invention
The invention aims to solve the technical problem of providing a video target segmentation method that adjusts the matching position and performs local matching based on time sequence content perception.
The invention adopts the technical scheme that the video target segmentation method based on the time sequence content perception attention mechanism comprises the following steps:
1) training a video target segmentation system:
1-1) receiving a training sample video and a target mask for each frame in the video;
1-2) splicing the i-th frame of the video, serving as the past frame, with its corresponding target mask along the channel dimension, then inputting the result into encoder B of a feature extraction network; encoder B outputs the i-th frame features;
1-3) taking the (i+1)-th frame as the current frame and inputting it into encoder A of the feature extraction network; encoder A outputs the (i+1)-th frame features;
1-4) sending the (i+1)-th frame features and the i-th frame features into a time sequence content perception attention module, which outputs the time sequence content perception features of the (i+1)-th frame;
1-5) inputting the time sequence content perception features of the (i+1)-th frame into a decoder, which outputs the target mask of the (i+1)-th frame;
1-6) splicing the (i+1)-th frame and its target-of-interest mask along the channel dimension and inputting the result into encoder B of the feature extraction network; encoder B outputs the updated (i+1)-th frame features;
1-7) judging whether target masks have been output for all frames of the training sample video, or whether the convergence condition of the loss function in the video target segmentation system is met; if so, the training of the video target segmentation system is complete; otherwise, update i to i+1, take the updated i-th frame features as the past-frame (i-th frame) features, and return to step 1-3);
the specific processing mode when the attention module for time sequence content perception in the step 1-4) receives the i +1 th frame feature and the i frame feature is as follows: firstly, performing optical flow prediction processing on the (i + 1) th frame feature and the ith frame feature to obtain an optical flow field between the current frame feature and the past frame feature, and extracting a feature vector in the ith frame feature by using bilinear interpolation by using the optical flow field between the current frame feature and the past frame feature; matching the extracted feature vector with the features of the (i + 1) th frame to obtain the time sequence content perception features of the (i + 1) th frame;
2) video target segmentation system testing: input the video to be processed into the trained video target segmentation system, which outputs the target region of interest in the video.
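The flow-guided sampling and local matching described in step 1-4) can be sketched in PyTorch as follows. The function names, the use of one candidate vector per past frame, and the tensor shapes are assumptions for illustration, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def sample_by_flow(past_feat, flow):
    """Bilinearly sample past_feat at positions displaced by flow.
    past_feat: (B, C, H, W); flow: (B, 2, H, W) in pixels (dx, dy)."""
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=0)              # (2, H, W), x first
    pos = base.unsqueeze(0).to(flow) + flow          # flow-displaced positions
    # Normalise pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * pos[:, 0] / max(W - 1, 1) - 1.0
    gy = 2.0 * pos[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)             # (B, H, W, 2)
    return F.grid_sample(past_feat, grid, mode="bilinear", align_corners=True)

def temporal_content_aware_attention(cur_feat, past_feats, flows):
    """Gather one candidate vector per past frame via its flow field, then
    match locally: dot-product similarity -> softmax -> weighted sum."""
    candidates = torch.stack([sample_by_flow(p, f)
                              for p, f in zip(past_feats, flows)])  # (K, B, C, H, W)
    sim = (cur_feat.unsqueeze(0) * candidates).sum(dim=2)           # (K, B, H, W)
    attn = F.softmax(sim, dim=0).unsqueeze(2)                       # over K candidates
    return (attn * candidates).sum(dim=0)                           # (B, C, H, W)
```

With a zero flow field, `sample_by_flow` reduces to the identity, which is a convenient sanity check of the coordinate normalisation.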
Aiming at the heavy computation of a global matching attention mechanism, the invention removes redundant computation from temporal feature matching: the target features of the current frame are matched only with the same target's features in the past frames, not with all past-frame features. Under the assumption of temporal continuity, the invention learns several groups of optical flow fields between the current frame and past frames and, following the positional mapping of these flow fields, samples the past frames to obtain feature vectors related to the features to be matched (the current-frame features). Matching is then performed between the features to be matched and the sampled past-frame feature vectors, i.e., similar-region matching is carried out locally rather than over the whole image. This reduces the computation of matching; since the similar features are gathered with the current-frame features as reference, the method runs fast, and removing irrelevant noise features yields higher matching accuracy.
The method has the advantage that, by adjusting the matching position based on time sequence content perception and performing local matching, it effectively increases the running speed of the temporal feature matching stage and improves matching accuracy.
Drawings
FIG. 1 is a schematic diagram of video object segmentation based on a temporal content aware attention mechanism.
Fig. 2 is a specific structure of a module for perceiving attention based on time series content.
Fig. 3 is a block specific structure of predicting an optical flow field between timing characteristics.
Detailed Description
The specific structure of the video object segmentation system used to carry out the method of the present invention is shown in Fig. 1: it comprises a feature extraction network, a time sequence content perception attention module, and a decoder. The feature extraction network includes an encoder A for the current frame and an encoder B for past frames.
The whole training process of the video target segmentation system is as follows:
1) collecting a video and providing a target mask for each frame in the video;
2) the first frame image, serving as the past frame, is spliced with its corresponding target mask along the channel dimension and sent into encoder B (which processes past frames) to obtain the first-frame features;
3) the second frame, as the current frame (the frame to be processed), is sent into encoder A (which processes the current frame) to obtain the second-frame features;
4) the second-frame features and the first-frame features are sent into the time sequence content perception attention module to obtain the time sequence content perception features of the second frame, which are then sent into the decoder to obtain the target-of-interest mask of the second frame;
5) the second frame image and its target-of-interest mask are spliced along the channel dimension and sent into encoder B to re-obtain the second-frame features;
6) the third frame is taken as the current frame, with the re-obtained second-frame features as the past frame; a predicted target mask for the third frame is obtained by the method of steps 2)-5), and the third-frame features are re-obtained as past-frame features by the method of step 5);
7) step 6) is repeated until target masks have been predicted for all frames of the video; a binary cross-entropy (BCE) loss function is used for convergence, completing the training of the video target segmentation system.
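A minimal sketch of the training pass in steps 2)-7), assuming simple stand-in module interfaces; `encoder_a`, `encoder_b`, `attention`, and `decoder` are hypothetical callables standing in for the ResNet50-based encoders and the modules described below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_on_video(encoder_a, encoder_b, attention, decoder,
                   frames, masks, optimizer):
    """One training pass over a video: encode frame 1 with its mask as the
    past frame; for each later frame predict a mask, accumulate BCE loss,
    and refresh the past-frame features from the prediction."""
    # Step 2): splice frame and mask along the channel dimension.
    past = encoder_b(torch.cat((frames[0], masks[0]), dim=1))
    loss = torch.zeros(())
    for t in range(1, len(frames)):
        cur = encoder_a(frames[t])                    # step 3)
        logits = decoder(attention(cur, past))        # step 4): mask logits
        loss = loss + F.binary_cross_entropy_with_logits(logits, masks[t])
        # Step 5): re-encode the frame with its predicted mask as a past frame.
        past = encoder_b(torch.cat((frames[t], logits.sigmoid().detach()), dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Detaching the predicted mask before re-encoding (an assumption here, not stated in the text) keeps the gradient graph from growing across the whole video.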
The current frame image passes through encoder A to obtain the current-frame features. Together with the past-frame features obtained from encoder B, the two feature maps are spliced along the channel dimension and sent into the time sequence content perception attention module to obtain the temporal features, which the decoder then maps to the output target mask. The current frame image is next spliced with its target mask and processed by encoder B (the frame thereby becomes a past frame), yielding new past-frame features for predicting the next frame. The feature extraction network consists of two encoders: encoder A extracts current-frame features, while encoder B extracts the features of past frames together with their prediction masks. The inputs of the time sequence content perception attention module are the current-frame features and the past-frame features produced by the feature extraction network, spliced along the channel dimension. The specific structure of the module is shown in Fig. 2: first, features on the past-frame features that are similar to the current-frame features are gathered according to the optical flow from an optical flow prediction module, forming a similar feature set.
As shown in Fig. 3, the optical flow prediction module comprises 2 channel splicing (concatenation) modules C and 6 convolution (Conv) modules of 3×3. The current-frame and past-frame features pass through the first channel splicing module into one 3×3 Conv, and are then split into 4 parallel 3×3 Conv branches, of which 3 are dilated (atrous) convolutions with dilation rates D of 2, 4, and 8 respectively; the outputs of the 4 parallel branches pass through the second channel splicing module to produce the optical flow field between the current-frame features and the past-frame features. This flow field is then used to extract feature vectors from the past-frame features by bilinear interpolation, and the extracted feature vectors (the sampled past-frame features) are matched with the current-frame features to obtain the time sequence content perception features of the current frame. The matching module matches the current-frame features against the similar feature set. Specifically, the time sequence content perception attention module predicts the optical flow fields between the current-frame features and the past-frame features in sequence, obtaining the position of each current-frame feature vector within the past-frame features. Since these positions are fractional, bilinear interpolation is used to extract the feature vector at each position. Performing this for all past-frame features yields a similar feature set for the current-frame features across all past frames.
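The optical flow prediction module described above might be sketched as follows. The channel widths (`mid_ch`) and the dilation rate 1 on the fourth branch are assumptions, since the text specifies only three dilation rates (2, 4, 8) for the four parallel branches.

```python
import torch
import torch.nn as nn

class FlowPredictionModule(nn.Module):
    """Sketch of the flow prediction block: concatenate current- and
    past-frame features, one 3x3 conv, four parallel 3x3 convs (three of
    them dilated with rates 2/4/8), concatenate, and a final 3x3 conv to
    a 2-channel flow field -- six 3x3 convs and two concatenations in all."""
    def __init__(self, feat_ch, mid_ch=64):
        super().__init__()
        self.stem = nn.Conv2d(2 * feat_ch, mid_ch, 3, padding=1)
        # Dilation enlarges the receptive field at no extra parameter cost;
        # padding = dilation keeps the spatial size for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(mid_ch, mid_ch, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 8)
        ])
        self.head = nn.Conv2d(4 * mid_ch, 2, 3, padding=1)  # (dx, dy) flow

    def forward(self, cur_feat, past_feat):
        x = torch.relu(self.stem(torch.cat((cur_feat, past_feat), dim=1)))
        x = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return self.head(x)  # (B, 2, H, W) optical flow field
```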
The size of the similar feature set is far smaller than the total number of past-frame features, so matching the current frame against this set reduces the computation of matching; since the similar features are gathered with the current-frame features as reference, the matching is adaptive, content-aware local matching — fast, with no loss of accuracy. In the feature matching process, the dot-product similarity between each feature to be matched and the sampled past-frame feature vectors is computed (implemented as matrix multiplication); the similarity values are normalized with softmax, and the resulting probabilities weight the corresponding sampled feature vectors to give the temporal feature at the position to be matched. Performing the same operation for every feature to be matched yields a temporal feature map of the same size as the current-frame features. These temporal features are input into the decoder, which predicts the mask of the target of interest in the current frame. The decoder takes the temporal features output by the attention module as input and outputs the target prediction mask of the current frame.
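The dot-product matching just described (matrix multiplication, softmax normalization, weighted aggregation) could look like this sketch, with K sampled candidate vectors per position; the tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def match_local(query, keys):
    """Per-position attention over K sampled candidate vectors, written
    with batched matrix multiplication as the text describes.
    query: (B, C, H, W); keys: (B, K, C, H, W) -> (B, C, H, W)."""
    B, K, C, H, W = keys.shape
    q = query.permute(0, 2, 3, 1).reshape(B * H * W, 1, C)    # one query/pos
    k = keys.permute(0, 3, 4, 1, 2).reshape(B * H * W, K, C)  # K candidates/pos
    sim = torch.bmm(q, k.transpose(1, 2))                     # (N, 1, K) dot products
    attn = F.softmax(sim, dim=-1)                             # normalise similarities
    out = torch.bmm(attn, k)                                  # weighted candidate sum
    return out.reshape(B, H, W, C).permute(0, 3, 1, 2)
```

With K = 1 the softmax weight is exactly 1, so the output equals the single sampled candidate — a quick way to verify the reshapes.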
Specifically, the two encoders in the feature extraction network are built from ResNet50, with encoder B taking 4-channel input data. The decoder is formed by stacking two 2× upsampling modules, each consisting of a 2× upsampling interpolation layer, a convolution layer, a BatchNorm layer, and a ReLU layer.
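A sketch of one decoder upsampling module and of the 4-channel adaptation of encoder B; exact channel counts are not given in the text and are illustrative here.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """One decoder stage as described: 2x bilinear upsampling followed by
    convolution, BatchNorm, and ReLU. Channel counts are illustrative."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Encoder B takes 4-channel input (RGB frame + 1-channel mask); with a
# torchvision ResNet-50 this would presumably mean replacing the stem conv:
#   resnet.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
#                            padding=3, bias=False)
```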
In the testing step, the video to be processed is input to the trained video target segmentation system, and the interested target area in the video can be obtained.
Claims (1)
1. The video target segmentation method based on the time sequence content perception attention mechanism comprises the following steps:
1) training a video target segmentation system:
1-1) receiving a training sample video and a target mask for each frame in the video;
1-2) splicing the i-th frame of the video, serving as the past frame, with its corresponding target mask along the channel dimension, then inputting the result into encoder B of a feature extraction network; encoder B outputs the i-th frame features;
1-3) taking the (i+1)-th frame as the current frame and inputting it into encoder A of the feature extraction network; encoder A outputs the (i+1)-th frame features;
1-4) sending the (i+1)-th frame features and the i-th frame features into a time sequence content perception attention module, which outputs the time sequence content perception features of the (i+1)-th frame;
1-5) inputting the time sequence content perception features of the (i+1)-th frame into a decoder, which outputs the target mask of the (i+1)-th frame;
1-6) splicing the (i+1)-th frame and its target-of-interest mask along the channel dimension and inputting the result into encoder B of the feature extraction network; encoder B outputs the updated (i+1)-th frame features;
1-7) judging whether target masks have been output for all frames of the training sample video, or whether the convergence condition of the loss function in the video target segmentation system is met; if so, the training of the video target segmentation system is complete; otherwise, update i to i+1, take the updated i-th frame features as the past-frame (i-th frame) features, and return to step 1-3);
the time sequence content perception attention module in step 1-4) processes the received (i+1)-th frame features and i-th frame features as follows: first, optical flow prediction is performed on the (i+1)-th frame features and the i-th frame features to obtain an optical flow field between the current-frame features and the past-frame features; this flow field is used to extract feature vectors from the i-th frame features by bilinear interpolation; the extracted feature vectors are then matched with the (i+1)-th frame features to obtain the time sequence content perception features of the (i+1)-th frame;
2) video target segmentation system testing: input the video to be processed into the trained video target segmentation system, which outputs the target region of interest in the video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110634977.9A CN113744306B (en) | 2021-06-08 | 2021-06-08 | Video target segmentation method based on time sequence content perception attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110634977.9A CN113744306B (en) | 2021-06-08 | 2021-06-08 | Video target segmentation method based on time sequence content perception attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113744306A true CN113744306A (en) | 2021-12-03 |
CN113744306B CN113744306B (en) | 2023-07-21 |
Family
ID=78728416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110634977.9A Active CN113744306B (en) | 2021-06-08 | 2021-06-08 | Video target segmentation method based on time sequence content perception attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113744306B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110868598A (en) * | 2019-10-17 | 2020-03-06 | Shanghai Jiao Tong University | Video content replacement method and system based on generative adversarial network |
CN111210446A (en) * | 2020-01-08 | 2020-05-29 | University of Science and Technology of China | Video target segmentation method, device and equipment |
CN111968123A (en) * | 2020-08-28 | 2020-11-20 | Beijing Jiaotong University | Semi-supervised video target segmentation method |
CN112085760A (en) * | 2020-09-04 | 2020-12-15 | Xiamen University | Foreground segmentation method for laparoscopic surgery video |
CN112529931A (en) * | 2020-12-23 | 2021-03-19 | Nanjing University of Aeronautics and Astronautics | Foreground segmentation method and system |
CN112749712A (en) * | 2021-01-22 | 2021-05-04 | Sichuan University | RGBD salient object detection method based on 3D convolutional neural network |
US20210150727A1 (en) * | 2019-11-19 | 2021-05-20 | Samsung Electronics Co., Ltd. | Method and apparatus with video segmentation |
Non-Patent Citations (5)
Title |
---|
YANG JIE et al.: "Context-aware deformable alignment for video object segmentation", 2022 26th International Conference on Pattern Recognition *
YAN Guangyu et al.: "Real-time semantic segmentation algorithm based on hybrid attention" (in Chinese), Modern Computer *
YANG Jie: "Research on video object segmentation algorithms based on spatio-temporal matching" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology *
TANG Yiming et al.: "A survey of visual single-object tracking algorithms" (in Chinese), Measurement & Control Technology *
WANG Ziyi et al.: "An improved DeeplabV3 network smoke segmentation algorithm" (in Chinese), Journal of Xidian University *
Also Published As
Publication number | Publication date |
---|---|
CN113744306B (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516536B (en) | Weak supervision video behavior detection method based on time sequence class activation graph complementation | |
CN111968150B (en) | Weak surveillance video target segmentation method based on full convolution neural network | |
CN112016682B (en) | Video characterization learning and pre-training method and device, electronic equipment and storage medium | |
CN110688927B (en) | Video action detection method based on time sequence convolution modeling | |
CN114494981B (en) | Action video classification method and system based on multi-level motion modeling | |
CN112364699A (en) | Remote sensing image segmentation method, device and medium based on weighted loss fusion network | |
CN111476133B (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN111526434A (en) | Converter-based video abstraction method | |
CN110852295A (en) | Video behavior identification method based on multitask supervised learning | |
CN115375737B (en) | Target tracking method and system based on adaptive time and serialized space-time characteristics | |
CN112163490A (en) | Target detection method based on scene picture | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN114996495A (en) | Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement | |
Wang et al. | Lightweight bilateral network for real-time semantic segmentation | |
CN113744306B (en) | Video target segmentation method based on time sequence content perception attention mechanism | |
CN116630850A (en) | Twin target tracking method based on multi-attention task fusion and bounding box coding | |
CN115797827A (en) | ViT human body behavior identification method based on double-current network architecture | |
CN113033283B (en) | Improved video classification system | |
CN112487927B (en) | Method and system for realizing indoor scene recognition based on object associated attention | |
CN113255493B (en) | Video target segmentation method integrating visual words and self-attention mechanism | |
CN114359786A (en) | Lip language identification method based on improved space-time convolutional network | |
CN111382761B (en) | CNN-based detector, image detection method and terminal | |
CN116170638B (en) | Self-attention video stream compression method and system for online action detection task | |
CN117115474A (en) | End-to-end single target tracking method based on multi-stage feature extraction | |
CN117558067A (en) | Action prediction method based on action recognition and sequence reasoning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||