CN116152710A - Video instance segmentation method based on cross-frame instance association - Google Patents

Video instance segmentation method based on cross-frame instance association

Info

Publication number
CN116152710A
CN116152710A
Authority
CN
China
Prior art keywords
features
scale
cross
video
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310083300.XA
Other languages
Chinese (zh)
Inventor
刘盛
陈俊皓
陈瑞祥
郭炳男
张峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310083300.XA priority Critical patent/CN116152710A/en
Publication of CN116152710A publication Critical patent/CN116152710A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video instance segmentation method based on cross-frame instance association. A video frame sequence to be segmented is input into a multi-scale feature extractor to obtain feature maps of different scales; spatio-temporal features are extracted by a transformer encoder and fused by a pixel decoder; a transformer decoder then produces the final embedded vectors, and a dot product between these embedded vectors and the high-resolution spatio-temporal features yields the instance segmentation result. The method learns the spatio-temporal correlation of dynamic instances with a multi-scale approach oriented to spatio-temporal features, builds more stable and reliable cross-frame instance associations, improves the accuracy of the video instance segmentation task, and achieves leading performance on two popular datasets compared with recent methods.

Description

Video instance segmentation method based on cross-frame instance association
Technical Field
The application belongs to the technical field of video instance segmentation, and particularly relates to a video instance segmentation method based on cross-frame instance association.
Background
Video instance segmentation aims to simultaneously detect, segment, and track target instances in video, which benefits many downstream tasks, including autonomous driving, video surveillance, and video understanding. Compared with image instance segmentation, video instance segmentation is more challenging because accurately segmenting and tracking object instances in video is hindered by factors such as appearance deformation, fast motion, and occlusion.
With the introduction of the DETR and Deformable DETR frameworks, Transformer-based end-to-end video instance segmentation has become the recent mainstream. Following the video-in, video-out paradigm, VisTR first applied a Transformer to the video instance segmentation problem and used instance queries to obtain instance sequences from the video; however, this approach learns one embedding for each instance of each frame, which makes it difficult to process variable-length or long video sequences. To reduce the explosive computational cost of VisTR and build cross-frame instance associations, subsequent studies exploited target queries and proposed novel variants: memory tokens that build contextual temporal dependencies, and a query-decoupling mechanism that builds cross-frame instance associations. These methods essentially detect instances from single-frame features and then perform cross-frame instance matching; this deliberately separates images from video and irreversibly discards the rich spatio-temporal context information present in video.
Furthermore, existing approaches focus mainly on network improvements but pay little attention to the datasets required for training and testing. Our study shows that, because of the insufficient amount of training data, current datasets easily lead to overfitting during training.
Disclosure of Invention
It is an object of the present application to provide a video instance segmentation method based on cross-frame instance association, referred to in this application as IAST, which overcomes the problems raised in the background above.
In order to achieve the above purpose, the technical scheme of the application is as follows:
a video instance segmentation method based on cross-frame instance correlation, comprising:
constructing and training a video instance segmentation network comprising a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder;
inputting a video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps of different scales, wherein the feature maps are, in order of scale, C2, C3, C4 and C5;
inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract a spatio-temporal feature E;
inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain the fused spatio-temporal features M4, M3 and M2, where M2 is the high-resolution fused feature;
inputting the multi-scale features E5, M4 and M3 produced in the pixel decoder into the transformer decoder to obtain the final embedded vectors;
performing a dot product operation between the embedded vectors and the high-resolution spatio-temporal feature M2 to obtain an instance segmentation result.
Further, training the video instance segmentation network includes preprocessing the collected video frame sequences to generate training sample data, comprising:
taking two frame sequences with the same number of frames from the collected video frame sequence dataset, one as a target set and the other as a source set;
establishing a one-to-one correspondence of image frames between the target set and the source set in temporal order;
copying and pasting the instances in the source set images onto the corresponding image frames in the target set to generate a new frame sequence, and adding the new frame sequence to the video frame sequence dataset.
Further, inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract the spatio-temporal feature E comprises:
performing position encoding on the feature maps C3, C4 and C5, then flattening the tensors and inputting them into a deformable attention module to generate a basic feature F;
performing position encoding on the feature maps C3, C4 and C5 and inputting them into an S2S attention module to generate a basic spatio-temporal feature Fs; the S2S attention module comprises an intra-scale temporal attention module and an inter-scale spatio-temporal attention module, wherein the intra-scale temporal attention module adopts a temporal attention mechanism and the inter-scale spatio-temporal attention module adopts a deformable attention mechanism;
fusing the two features F and Fs of the same dimension to finally obtain the spatio-temporal feature E.
Further, inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain the fused spatio-temporal features M4, M3 and M2, comprises:
separating the spatio-temporal feature E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5;
upsampling the feature E5, adjusting it to the same scale as the feature E4 by bilinear interpolation, and cross-fusing it with E4 to generate the fused spatio-temporal feature M4;
upsampling the fused spatio-temporal feature M4, adjusting it to the same scale as the feature E3 by bilinear interpolation, and cross-fusing it with E3 to generate the fused spatio-temporal feature M3;
upsampling the fused spatio-temporal feature M3, adjusting it to the same scale as the feature map C2 by bilinear interpolation, and cross-fusing it with C2 to generate the fused spatio-temporal feature M2.
Further, the transformer decoder comprises three decoder units corresponding to different scales and an MLP module connected in series, and inputting the features E5, M4 and M3 into the transformer decoder to obtain the final embedded vectors comprises:
inputting the features E5, M4 and M3 into the decoder units of the corresponding scales, the input features serving as the attention mask, keys and values of each decoder unit, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, first performing a cross attention operation and then a self attention operation;
passing the query features output by the last decoder unit through the MLP module to generate the final embedded vectors.
Further, the three decoder units of different scales are cycled for a preset number of iterations.
The video instance segmentation method based on cross-frame instance association addresses the neglect of spatio-temporal context information in videos by existing methods. Specifically, on the basis of the Deformable DETR framework, the present application proposes a multi-scale method oriented to spatio-temporal features to learn the spatio-temporal correlation of dynamic instances and construct more stable cross-frame instance associations. Notably, compared with previous methods, our method can establish reliable cross-frame instance associations without complex frame-by-frame processing. In addition, we propose a data augmentation method named sequential copy-paste, which effectively alleviates the overfitting caused by insufficient training data and improves the robustness of the model. The method improves the accuracy of the video instance segmentation task and achieves leading performance on two popular datasets compared with the latest methods.
Drawings
FIG. 1 is a flow chart of a video instance segmentation method based on cross-frame instance association in the present application;
FIG. 2 is a schematic diagram of a video example split network framework of the present application;
fig. 3 is a schematic diagram of a decoder unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a video instance segmentation method based on cross-frame instance association is proposed, including:
step S1, a video instance segmentation network is constructed and trained, wherein the video instance segmentation network comprises a multi-scale feature extractor, a transformer encoder, a pixel decoder and a transformer decoder.
As shown in fig. 2, the video instance segmentation network constructed in the present application includes a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder. Wherein the multi-scale feature extractor extracts multi-scale features from the input data; the transformer encoder is used for capturing multi-scale space-time characteristics; the pixel decoder performs cross fusion on the multi-scale features to generate high-resolution features for mask prediction; the transformer decoder is used to iteratively update the query feature.
Then, a video frame sequence is collected, training sample data is generated after preprocessing, and the constructed video instance segmentation network is trained. The training of the network model is already a relatively mature technology in the art, and will not be described in detail here.
In the training process, in order to enhance the robustness of the model and alleviate the overfitting problem generated during training, the method for preprocessing the collected video frame sequence to generate training sample data comprises the following steps:
taking two frame sequences with the same number of frames from the collected video frame sequence dataset, one as a target set and the other as a source set;
establishing a one-to-one correspondence of image frames between the target set and the source set in temporal order;
copying and pasting the instances in the source set images onto the corresponding image frames in the target set to generate a new frame sequence, and adding the new frame sequence to the video frame sequence dataset.
For example, a T-frame sequence is first taken as the target set Tgt = {Tgt_t}, and another T-frame video sequence is randomly selected from the dataset as the source set Src = {Src_t}. Next, a one-to-one correspondence is constructed between Src_t and Tgt_t in temporal order, and the instances in Src_t are copied and pasted onto Tgt_t, where t ∈ [1, T], thereby generating a new frame sequence.
Finally, the annotations (ground truth) of the new frame sequence are updated, including the masks, bounding boxes, and classes of partially occluded objects, while fully occluded objects are deleted.
It should be noted that the preprocessing of image frames in the present application further includes scaling the image frames to a certain size and normalizing their pixel values, which are common image preprocessing methods and are not described in detail here. The present application achieves data augmentation through sequential copy-paste, which can effectively enhance the robustness of the model and alleviate the overfitting problem that arises during training.
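As an illustration of the sequential copy-paste augmentation described above, the following Python sketch pastes the instances of each source frame onto the corresponding target frame and updates the target annotations. The annotation layout (per-frame lists of dicts with 'mask' and 'category' entries), the assumption that source and target frames share the same resolution, and the occlusion handling are simplifications made only for this example, not the exact data format used in this application.

```python
import numpy as np

def sequential_copy_paste(tgt_frames, tgt_anns, src_frames, src_anns):
    """Minimal sketch of sequential copy-paste over two T-frame sequences.

    tgt_frames / src_frames: lists of T HxWx3 uint8 arrays (same resolution).
    tgt_anns / src_anns: per-frame lists of dicts with a binary HxW 'mask'
    and a 'category' entry -- a simplified annotation layout assumed here.
    """
    assert len(tgt_frames) == len(src_frames)
    new_frames, new_anns = [], []
    for t, (tgt_img, src_img) in enumerate(zip(tgt_frames, src_frames)):
        img = tgt_img.copy()
        pasted = np.zeros(img.shape[:2], dtype=bool)
        # Paste every instance of source frame t onto target frame t.
        for ann in src_anns[t]:
            m = ann['mask'].astype(bool)
            img[m] = src_img[m]
            pasted |= m
        # Update target annotations: clip masks by the pasted region and
        # drop instances that become fully occluded.
        kept = []
        for ann in tgt_anns[t]:
            m = ann['mask'].astype(bool) & ~pasted
            if m.any():
                kept.append({**ann, 'mask': m.astype(np.uint8)})
        kept.extend(src_anns[t])  # pasted instances keep their own labels
        new_frames.append(img)
        new_anns.append(kept)
    return new_frames, new_anns
```

Bounding boxes of the partially occluded target instances would then be recomputed from the clipped masks, matching the annotation update described above.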
Step S2, inputting the video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps of different scales, denoted C2, C3, C4 and C5 in order of scale.
The application adopts ResNet-50, which is popular in the vision field, as the multi-scale feature extractor. The required input is a frame sequence of shape T×3×H×W, where T is the number of frames, 3 is the number of RGB channels, and H and W are the height and width of the pictures. The multi-scale feature extractor outputs a series of feature maps of different sizes, each of shape T×256×Hi×Wi, where 256 is the number of channels and Hi and Wi are the height and width of the feature map. In this embodiment, four feature maps of different scales are obtained, at 1/4, 1/8, 1/16 and 1/32 of the original video frame resolution, denoted C2, C3, C4 and C5 in turn.
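For concreteness, a minimal PyTorch sketch of such a multi-scale extractor is given below; it wraps torchvision's ResNet-50 and projects every stage output to 256 channels. The 1x1 projection convolutions and the example input size are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiScaleExtractor(nn.Module):
    """ResNet-50 backbone returning C2-C5 feature maps (1/4, 1/8, 1/16, 1/32
    of the input resolution), each projected to 256 channels (sketch)."""

    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2 = r.layer1, r.layer2
        self.layer3, self.layer4 = r.layer3, r.layer4
        # 1x1 projections so that every scale has 256 channels.
        self.proj = nn.ModuleList(
            nn.Conv2d(c, 256, kernel_size=1) for c in (256, 512, 1024, 2048))

    def forward(self, frames):                      # frames: (T, 3, H, W)
        x = self.stem(frames)
        c2 = self.layer1(x)                         # (T, 256,  H/4,  W/4)
        c3 = self.layer2(c2)                        # (T, 512,  H/8,  W/8)
        c4 = self.layer3(c3)                        # (T, 1024, H/16, W/16)
        c5 = self.layer4(c4)                        # (T, 2048, H/32, W/32)
        return [p(c) for p, c in zip(self.proj, (c2, c3, c4, c5))]

frames = torch.randn(5, 3, 384, 640)                # an illustrative 5-frame clip
C2, C3, C4, C5 = MultiScaleExtractor()(frames)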
Step S3, inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract the spatio-temporal feature E.
The present application captures rich multi-scale spatio-temporal features in the transformer encoder, including:
Step S301, performing position encoding on the feature maps C3, C4 and C5, then flattening the tensors and inputting them into a deformable attention module to generate a basic feature F.
In the transformer encoder, {C3, C4, C5} are position-encoded to obtain the features {P3, P4, P5}, each of shape T×Ai×256, where Ai = Hi×Wi. The features {P3, P4, P5} are flattened and input into a deformable attention module, and the basic feature F is generated by standard deformable attention computation. The deformable attention module is a relatively mature technique in transformer encoders and is not described in detail here.
Step S302, performing position encoding on the feature maps C3, C4 and C5 and inputting them into an S2S attention module to generate a basic spatio-temporal feature Fs.
The transformer encoder of this embodiment differs from a conventional transformer encoder in that an S2S attention module is added on top of the conventional encoder.
As shown in fig. 2, the S2S attention module is composed of an intra-scale temporal attention block and an inter-scale spatio-temporal attention block. The intra-scale temporal attention block uses a temporal attention mechanism (TA) to model the temporal correlation of cross-frame features using their interdependencies.
Let X^i of shape T×Ai×C denote the input of the intra-scale temporal attention block on each scale, where i ∈ [3, 5] indexes the scales, p ∈ [0, Ai−1] indexes the pixel positions on scale i, t ∈ [0, T−1] indexes the frames, and C is the feature dimension.
The intra-scale temporal attention is computed as follows: for each pixel position p on scale i, the weight matrices W_q, W_k and W_v generate queries, keys and values from the cross-frame features at that position, attention is computed across the T frames, and a residual connection followed by the normalization layer LN yields s^i, the feature with global temporal correlation on scale i.
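A rough PyTorch sketch of such a per-pixel temporal attention is given below; the use of nn.MultiheadAttention (which holds W_q, W_k and W_v internally), the head count, and the residual/LayerNorm placement are assumptions made for illustration rather than the exact configuration of this application.

```python
import torch
import torch.nn as nn

class IntraScaleTemporalAttention(nn.Module):
    """Per-pixel temporal attention: every spatial position attends across
    the T frames of the clip (a sketch of the TA mechanism)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                    # x: (T, A_i, C), A_i = H_i * W_i
        seq = x.permute(1, 0, 2)             # (A_i, T, C): one T-step sequence per pixel
        out, _ = self.attn(seq, seq, seq)    # W_q / W_k / W_v live inside MultiheadAttention
        s = self.norm(seq + out)             # residual connection followed by LN
        return s.permute(1, 0, 2)            # back to (T, A_i, C)

s3 = IntraScaleTemporalAttention()(torch.randn(5, 48 * 80, 256))   # e.g. scale i = 3
```

Because each pixel only attends over its own T time steps, the cost of this block grows linearly with the number of pixels rather than quadratically.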
Next, the inter-scale spatio-temporal attention block learns spatio-temporal features and cross-frame similarities in adjacent scale spaces using a deformable attention mechanism.
Specifically, the output s^i of the intra-scale temporal attention block serves as the input of the inter-scale spatio-temporal attention block. First, the lower-resolution scale feature is upsampled by bilinear interpolation, concatenated with the higher-resolution scale feature along the feature dimension, and projected back to C dimensions, giving a cross-scale feature z^i. The feature z^i is then reshaped for attention computation, and a flexible deformable attention mechanism (STDeformAttn) is chosen to reduce the expensive computational cost of computing the spatio-temporal features.
The inter-scale spatio-temporal attention is then computed as follows: queries are generated from z^i, and STDeformAttn samples keys and values by interpolation at reference positions obtained after applying learned position offsets, yielding a spatio-temporally correlated feature for each scale. The spatio-temporally correlated features of all scales are concatenated along the spatial dimension to generate the basic spatio-temporal feature Fs, which is then reshaped to the same dimension as the basic feature F.
In summary, in the S2S attention module, cross-frame instance correlations are implicitly constructed by modeling the spatio-temporal correlation of cross-frame pixel-by-pixel features.
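To make the deformable-attention idea concrete, the following is a heavily simplified, single-head, single-scale sketch in which each query predicts a few sampling offsets around its reference point and mixes the sampled values with learned weights; the multi-head and multi-scale machinery of the actual STDeformAttn module is omitted, and all layer names are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Reduced deformable-attention sketch: each query predicts K sampling
    offsets around its reference point, samples the value map there, and
    mixes the samples with learned attention weights."""

    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)   # (dx, dy) per sampling point
        self.weights = nn.Linear(dim, n_points)       # attention weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, value, h, w):
        # query, value: (N, L, C) with L = h * w; ref_points: (N, L, 2), (x, y) in [0, 1]
        N, L, C = query.shape
        v = self.value_proj(value).transpose(1, 2).reshape(N, C, h, w)
        off = self.offsets(query).reshape(N, L, self.n_points, 2)
        attn = self.weights(query).softmax(-1)                     # (N, L, K)
        # Offsets are predicted in pixels, normalized, and added to the reference points.
        loc = ref_points[:, :, None, :] + off / query.new_tensor([float(w), float(h)])
        grid = 2.0 * loc - 1.0                                     # grid_sample expects [-1, 1]
        sampled = F.grid_sample(v, grid, align_corners=False)      # (N, C, L, K)
        out = (sampled * attn[:, None, :, :]).sum(-1)              # weighted sum over K points
        return self.out_proj(out.transpose(1, 2))                  # (N, L, C)
```

Sampling only K points per query instead of attending over every pixel of every frame is what keeps the spatio-temporal attention tractable on high-resolution feature maps.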
Step S303, fusing the basic feature F and the basic spatio-temporal feature Fs to finally obtain the spatio-temporal feature E.
In this way, the basic feature F output by standard deformable attention and the basic spatio-temporal feature Fs output by the S2S attention module are generated in the transformer encoder. Finally, the two features F and Fs of the same dimension are fused to obtain the spatio-temporal feature E, which is sent to the pixel decoder.
Step S4, inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain the fused spatio-temporal features M4, M3 and M2.
The spatio-temporal features captured in the transformer encoder are input to the pixel decoder and gradually upsampled so that the low-resolution multi-scale spatio-temporal features are cross-fused into high-resolution features, which are used in the final mask prediction.
Specifically, the feature map C2 at 1/4 of the original scale obtained in step S2 and the spatio-temporal feature E captured in the transformer encoder are input to the pixel decoder, and the following operations are performed:
step S401, spatial-temporal characteristics
Figure BDA0004068141360000081
Separated into and characterized by C 3 、C 4 And C 5 Scale-corresponding features->
Figure BDA0004068141360000082
And
Figure BDA0004068141360000083
the present embodiment takes the resolution of the scale (i.e., width by height) as a condition to separate multi-scale spatiotemporal features corresponding to 1/8, 1/16 and 1/32 scale sizes of the image
Figure BDA0004068141360000084
And->
Figure BDA0004068141360000085
Step S402, pair of features
Figure BDA0004068141360000086
Upsampling, adjusting to the sum feature +.>
Figure BDA0004068141360000087
The same scale is then interpolated bilinear and is then used to fit +.>
Figure BDA0004068141360000088
Cross fusion to generate the spatiotemporal characteristics after fusion +.>
Figure BDA0004068141360000089
Step S403, for the fused space-time characteristics
Figure BDA00040681413600000810
Upsampling, adjusting to the sum feature +.>
Figure BDA00040681413600000811
The same scale is then interpolated by bilinear interpolation and compared with the features +.>
Figure BDA00040681413600000812
Cross fusion to generate the spatiotemporal characteristics after fusion +.>
Figure BDA00040681413600000813
Step S404, for the fused space-time characteristics
Figure BDA00040681413600000814
Upsampling and adjusting to the characteristic diagram C 2 The same scale is then interpolated by bilinear interpolation to the feature map C 2 Cross fusion to generate the spatiotemporal characteristics after fusion +.>
Figure BDA00040681413600000815
The embodiment fuses low-resolution multi-scale space-time features to generate high-resolution features, and finally generates a high-resolution feature map
Figure BDA00040681413600000816
(scale is the feature of original fig. 1/4) for final mask prediction.
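A compact sketch of this upsample-and-fuse path is shown below. The bilinear upsampling and same-scale cross fusion follow the steps above, while the concrete fusion operator used here (element-wise sum followed by a 3x3 convolution) is an assumption made for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoderSketch(nn.Module):
    """Progressive upsample-and-fuse path of the pixel decoder (sketch)."""

    def __init__(self, dim=256):
        super().__init__()
        # One fusion convolution per fusion step (an assumed fusion operator).
        self.fuse = nn.ModuleList(nn.Conv2d(dim, dim, 3, padding=1) for _ in range(3))

    @staticmethod
    def _up_to(x, ref):
        # Bilinear interpolation to the spatial size of the reference feature.
        return F.interpolate(x, size=ref.shape[-2:], mode='bilinear', align_corners=False)

    def forward(self, C2, E3, E4, E5):
        # C2: (T, 256, H/4, W/4); E3/E4/E5: spatio-temporal features at 1/8, 1/16, 1/32.
        M4 = self.fuse[0](self._up_to(E5, E4) + E4)   # fused feature at 1/16
        M3 = self.fuse[1](self._up_to(M4, E3) + E3)   # fused feature at 1/8
        M2 = self.fuse[2](self._up_to(M3, C2) + C2)   # high-resolution feature at 1/4
        return M2, (E5, M4, M3)                       # mask feature and decoder inputs
```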
Step S5, inputting the features E5, M4 and M3 into the transformer decoder to obtain the final embedded vectors.
The low-resolution multi-scale spatio-temporal features (at 1/32, 1/16 and 1/8 of the original scale) are sequentially input into decoder units of the corresponding scales, and the keys and values in each unit's attention computation are generated from these low-resolution multi-scale spatio-temporal features. The transformer decoder of this embodiment is composed of three decoder units corresponding to different scales; each decoder unit is shown in fig. 3.
In one embodiment, inputting the features E5, M4 and M3 into the transformer decoder to obtain the final embedded vectors comprises:
inputting the features E5, M4 and M3 into the decoder units of the corresponding scales, the input features serving as the attention mask, keys and values of each decoder unit, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, first performing a cross attention operation and then a self attention operation;
the last decoder unit outputs the final embedded vectors.
Specifically, the features at 1/32, 1/16 and 1/8 of the original scale from the pixel decoder are input into the decoder units of the respective scales in order from low to high resolution, and the input features serve as the attention mask, keys and values of each unit.
The attention of the transformer decoder is computed as follows:
X_l = softmax(M_{l-1} + Q_l K_l^T) V_l + X_{l-1}
where l is the layer index; Q_l is the C-dimensional query feature of layer l; K_l and V_l are the spatio-temporal features under the transformations f_K(·) and f_V(·), respectively, with T the number of frames and H_l and W_l the spatial resolution; f_Q, f_K and f_V are linear functions. In addition, the three-dimensional attention mask M_{l-1} at feature location (t, x, y) is 0 where the resized, binarized (threshold 0.5) mask prediction of the (l−1)-th transformer decoder layer is foreground, and −∞ elsewhere.
The first query of the transformer decoder is an initialized set of learnable query features (Init Query), which are iteratively updated as the decoder loops; the query features of each subsequent decoder unit are the features output by the previous decoder unit.
In each decoder unit, a cross attention operation is performed first, followed by a self attention operation; this embodiment introduces an attention mask into the cross attention computation, named masked attention in fig. 3. The query features output by the last decoder unit are passed through the MLP module to generate the final embedded vectors.
In one embodiment, the three decoder units of different scales iterate for a preset number of cycles.
That is, after the three decoder units have each run once, the loop repeats; the three decoder units go through 3 decoding cycles in total to generate the final query features. These query features then pass through the MLP module to generate a set of n query embedded vectors (embeddings); the MLP module consists, from input to output, of a linear layer, a batch-norm layer, and a ReLU activation function.
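The following PyTorch sketch shows one such decoder unit with masked cross attention followed by self attention and a feed-forward layer; the head count, the FFN width, the batch size of 1, and the exact normalization placement are assumptions made for illustration.

```python
import torch.nn as nn

class DecoderUnitSketch(nn.Module):
    """One decoder unit: masked cross attention over the flattened
    spatio-temporal features of one scale, then self attention among the
    queries, then a feed-forward layer (sketch)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 2048), nn.ReLU(), nn.Linear(2048, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, feats, bg_mask):
        # queries: (1, n, C); feats: (1, T*H_l*W_l, C) flattened spatio-temporal features;
        # bg_mask: (n, T*H_l*W_l) boolean, True where the previous layer's binarized
        # mask prediction is background (those positions are blocked, i.e. -inf attention).
        bg_mask = bg_mask & ~bg_mask.all(-1, keepdim=True)   # never block every position
        attn_mask = bg_mask.unsqueeze(0).repeat(self.cross.num_heads, 1, 1)
        x, _ = self.cross(queries, feats, feats, attn_mask=attn_mask)
        queries = self.n1(queries + x)                       # masked cross attention first
        x, _ = self.self_attn(queries, queries, queries)
        queries = self.n2(queries + x)                       # then self attention
        return self.n3(queries + self.ffn(queries))
```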
Step S6, performing a dot product operation between the embedded vectors and the high-resolution spatio-temporal feature M2 to obtain the instance segmentation result.
Finally, a simple dot product between this set of embedded vectors and the high-resolution feature obtained in the pixel decoder yields the n queried three-dimensional masks, i.e., a mask for each frame of the whole video; these n three-dimensional masks are the instance segmentation result.
Meanwhile, the embedded vectors are passed through a linear layer whose input size is the feature dimension and whose output size is the number of categories, giving the category prediction for every instance in the whole video.
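A minimal sketch of this prediction step is given below; the tensor sizes and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

n, C, T, H, W = 100, 256, 5, 96, 160           # illustrative sizes (1/4-scale H, W)
embed = torch.randn(n, C)                       # n query embedded vectors from the MLP module
M2 = torch.randn(T, C, H, W)                    # high-resolution fused feature at 1/4 scale
masks = torch.einsum('nc,tchw->nthw', embed, M2).sigmoid()   # (n, T, H, W) per-query video masks
cls_head = nn.Linear(C, 41)                     # e.g. 40 YouTube-VIS classes + background (assumed)
logits = cls_head(embed)                        # (n, 41) per-video category prediction
```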
Table 1 compares our method with previous state-of-the-art methods that use convolutional-neural-network backbones on the YouTube-VIS 2019 dataset. With ResNet-50, our approach achieves 47.4% mask AP, reaching leading performance. Without training on additional data, our method is 2.3% higher in mask AP than SeqFormer. Compared to the baseline (Mask2Former), we obtain an absolute gain of 1.0%. Similarly, with ResNet-101, our method is consistently better than all previous methods, with an overall mask AP of 49.5%. These results indicate that our method exploits multi-scale spatio-temporal features more effectively to build more stable cross-frame instance associations.
The experimental data in this application show that the method achieves better segmentation accuracy than other existing methods. The experimental data are shown in tables 1 and 2 below:
Method | Data | AP | AP50 | AP75 | AR1 | AR10
MaskTrack R-CNN | V | 30.3 | 51.1 | 32.6 | 31.0 | 35.5
CrossVIS | V | 36.3 | 56.8 | 38.9 | 35.6 | 40.7
VisTR | V | 35.6 | 56.8 | 37.0 | 35.2 | 40.2
IFC | V | 42.8 | 65.8 | 43.8 | 41.1 | 49.7
SeqFormer | V | 45.1 | 66.9 | 50.5 | 45.6 | 54.6
SeqFormer | V+C80k | 47.4 | 69.8 | 51.8 | 45.5 | 54.8
Mask2Former | V | 46.4 | 68.0 | 50.0 | - | -
The present application | V | 47.4 | 71.0 | 53.0 | 46.1 | 58.1
TABLE 1
Table 1 compares the performance of the method of the present application with other methods on the YouTube-VIS 2019 dataset. "V" means only the YouTube-VIS training set is used, and "V+C80k" means the overlapping categories of the MS-COCO synthetic videos are also used for joint training. AP is the average precision of the segmentation-mask predictions, AP50 is the precision at an IoU threshold of 0.5, and AP75 is the precision at an IoU threshold of 0.75. AR is the average recall; the subscript 1 means 1 detection per image and the subscript 10 means 10 detections per image.
Method | Data | AP | AP50 | AP75 | AR1 | AR10
MaskTrack R-CNN | V | 28.6 | 48.9 | 29.6 | 26.5 | 33.8
CrossVIS | V | 34.2 | 54.4 | 37.9 | 30.4 | 38.2
IFC | V | 36.6 | 57.9 | 39.3 | - | -
SeqFormer | V+C80k | 40.5 | 62.4 | 43.7 | 36.1 | 48.1
Mask2Former | V | 40.6 | 60.9 | 41.8 | - | -
The present application | V | 41.6 | 64.4 | 44.8 | 38.2 | 50.9
TABLE 2
As shown in Table 2, the method of the present application achieves 41.6% mask AP (precision of the segmentation-mask predictions) on the more recently introduced YouTube-VIS 2021 dataset, which is at least 1.0% better than the previous state-of-the-art offline video instance segmentation methods.
The foregoing examples merely illustrate embodiments of the invention in detail and are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, and these all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (6)

1. A video instance segmentation method based on cross-frame instance association, the video instance segmentation method based on cross-frame instance association comprising:
constructing and training a video instance segmentation network comprising a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder;
inputting a video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps of different scales, wherein the feature maps are, in order of scale, C2, C3, C4 and C5;
inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract a spatio-temporal feature E;
inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain fused spatio-temporal features M4, M3 and M2, where M2 is the high-resolution fused feature;
inputting the multi-scale features E5, M4 and M3 produced in the pixel decoder into the transformer decoder to obtain final embedded vectors;
performing a dot product operation between the embedded vectors and the high-resolution spatio-temporal feature M2 to obtain an instance segmentation result.
2. The cross-frame instance correlation based video instance segmentation method of claim 1 wherein training the video instance segmentation network comprises preprocessing an acquired sequence of video frames to generate training sample data, comprising:
taking two frame sequences with the same number of frames from the collected video frame sequence dataset, one as a target set and the other as a source set;
establishing a one-to-one correspondence of image frames between the target set and the source set in temporal order;
copying and pasting the instances in the source set images onto the corresponding image frames in the target set to generate a new frame sequence, and adding the new frame sequence to the video frame sequence dataset.
3. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract the spatio-temporal feature E comprises:
performing position encoding on the feature maps C3, C4 and C5, then flattening the tensors and inputting them into a deformable attention module to generate a basic feature F;
performing position encoding on the feature maps C3, C4 and C5 and inputting them into an S2S attention module to generate a basic spatio-temporal feature Fs; the S2S attention module comprises an intra-scale temporal attention module and an inter-scale spatio-temporal attention module, wherein the intra-scale temporal attention module adopts a temporal attention mechanism and the inter-scale spatio-temporal attention module adopts a deformable attention mechanism;
fusing the two features F and Fs of the same dimension to finally obtain the spatio-temporal feature E.
4. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain the fused spatio-temporal features M4, M3 and M2, comprises:
separating the spatio-temporal feature E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5;
upsampling the feature E5, adjusting it to the same scale as the feature E4 by bilinear interpolation, and cross-fusing it with E4 to generate the fused spatio-temporal feature M4;
upsampling the fused spatio-temporal feature M4, adjusting it to the same scale as the feature E3 by bilinear interpolation, and cross-fusing it with E3 to generate the fused spatio-temporal feature M3;
upsampling the fused spatio-temporal feature M3, adjusting it to the same scale as the feature map C2 by bilinear interpolation, and cross-fusing it with C2 to generate the fused spatio-temporal feature M2.
5. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein the transformer decoder comprises three decoder units corresponding to different scales and an MLP module connected in series, and inputting the features E5, M4 and M3 into the transformer decoder to obtain the final embedded vectors comprises:
inputting the features E5, M4 and M3 into the decoder units of the corresponding scales, the input features serving as the attention mask, keys and values of each decoder unit, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, first performing a cross attention operation and then a self attention operation;
passing the query features output by the last decoder unit through the MLP module to generate the final embedded vectors.
6. The method of video instance segmentation based on cross-frame instance correlation of claim 5, wherein the three decoder units of different scales iterate a preset number of times.
CN202310083300.XA 2023-02-08 2023-02-08 Video instance segmentation method based on cross-frame instance association Pending CN116152710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310083300.XA CN116152710A (en) 2023-02-08 2023-02-08 Video instance segmentation method based on cross-frame instance association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310083300.XA CN116152710A (en) 2023-02-08 2023-02-08 Video instance segmentation method based on cross-frame instance association

Publications (1)

Publication Number Publication Date
CN116152710A true CN116152710A (en) 2023-05-23

Family

ID=86340262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310083300.XA Pending CN116152710A (en) 2023-02-08 2023-02-08 Video instance segmentation method based on cross-frame instance association

Country Status (1)

Country Link
CN (1) CN116152710A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117830874A (en) * 2024-03-05 2024-04-05 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition
CN117830874B (en) * 2024-03-05 2024-05-07 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition

Similar Documents

Publication Publication Date Title
Liu et al. Video super-resolution based on deep learning: a comprehensive survey
CN110059772B (en) Remote sensing image semantic segmentation method based on multi-scale decoding network
CN107808389B (en) Unsupervised video segmentation method based on deep learning
CN111652899B (en) Video target segmentation method for space-time component diagram
CN107679462B (en) Depth multi-feature fusion classification method based on wavelets
CN113177882B (en) Single-frame image super-resolution processing method based on diffusion model
CN113837938B (en) Super-resolution method for reconstructing potential image based on dynamic vision sensor
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN108280804B (en) Multi-frame image super-resolution reconstruction method
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111276240B (en) Multi-label multi-mode holographic pulse condition identification method based on graph convolution network
CN111696035A (en) Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm
CN113158723A (en) End-to-end video motion detection positioning system
CN115311720B (en) Method for generating deepfake based on transducer
CN109949217B (en) Video super-resolution reconstruction method based on residual learning and implicit motion compensation
CN116664397B (en) TransSR-Net structured image super-resolution reconstruction method
CN111696038A (en) Image super-resolution method, device, equipment and computer-readable storage medium
CN114170088A (en) Relational reinforcement learning system and method based on graph structure data
CN114494297A (en) Adaptive video target segmentation method for processing multiple priori knowledge
CN116152710A (en) Video instance segmentation method based on cross-frame instance association
CN116071748A (en) Unsupervised video target segmentation method based on frequency domain global filtering
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
CN114581762A (en) Road extraction method based on multi-scale bar pooling and pyramid pooling
CN117058043A (en) Event-image deblurring method based on LSTM
Liu et al. Arbitrary-scale super-resolution via deep learning: A comprehensive survey

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination