CN116152710A - Video instance segmentation method based on cross-frame instance association - Google Patents

Video instance segmentation method based on cross-frame instance association

Info

Publication number
CN116152710A
CN116152710A
Authority
CN
China
Prior art keywords
features
scale
cross
video
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310083300.XA
Other languages
Chinese (zh)
Inventor
刘盛
陈俊皓
陈瑞祥
郭炳男
张峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310083300.XA priority Critical patent/CN116152710A/en
Publication of CN116152710A publication Critical patent/CN116152710A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video instance segmentation method based on cross-frame instance association. A video frame sequence to be segmented is input into a multi-scale feature extractor to obtain feature maps of different scales; spatio-temporal features are extracted by a transformer encoder and fused by a pixel decoder; a transformer decoder then produces the final embedded vectors, and a dot product between these embedded vectors and the high-resolution spatio-temporal features yields the instance segmentation result. The method learns the spatio-temporal correlation of dynamic instances with a multi-scale approach oriented to spatio-temporal features, builds more stable and reliable cross-frame instance associations, improves the accuracy of the video instance segmentation task, and achieves leading performance on two popular datasets compared with recent methods.

Description

Video instance segmentation method based on cross-frame instance association
Technical Field
The application belongs to the technical field of video instance segmentation, and particularly relates to a video instance segmentation method based on cross-frame instance association.
Background
Video instance segmentation aims to simultaneously detect, segment, and track target instances in video, which benefits many downstream tasks, including autonomous driving, video surveillance, and video understanding. Compared with image instance segmentation, video instance segmentation is more challenging because accurately segmenting and tracking object instances in video is hindered by factors such as appearance deformation, fast motion, and occlusion.
With the introduction of the DETR and Deformable DETR frameworks, Transformer-based end-to-end video instance segmentation has become the recent mainstream. Following the video-in, video-out paradigm, VisTR first applied a Transformer to the video instance segmentation problem and used instance queries to obtain instance sequences from the video; however, this approach learns one embedding for each instance of each frame, which makes it difficult to process variable-length or long video sequences. To reduce the explosive computational cost of VisTR and build cross-frame instance associations, subsequent studies exploited target queries and proposed novel variants: memory tokens that build contextual temporal dependencies, and a query-decoupling mechanism that builds cross-frame instance associations. These methods essentially detect instances from single-frame features and then perform cross-frame instance matching; this deliberately separates images from video and irreversibly discards the rich spatio-temporal context information present in video.
Furthermore, existing approaches focus mainly on network improvements but pay little attention to the datasets required for training and testing. Our study shows that, because of the insufficient amount of training data, current datasets easily lead to overfitting during training.
Disclosure of Invention
It is an object of the present application to provide a video instance segmentation method based on cross-frame instance association, referred to in this application as IAST, which overcomes the problems raised in the background above.
In order to achieve the above purpose, the technical scheme of the application is as follows:
a video instance segmentation method based on cross-frame instance correlation, comprising:
constructing and training a video instance segmentation network comprising a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder;
inputting a video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps of different scales, wherein the feature maps are, in order of scale, C2, C3, C4 and C5;
inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract a spatio-temporal feature E;
inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain the fused spatio-temporal features M4, M3 and M2, where M2 is the high-resolution fused feature;
inputting the multi-scale features E5, M4 and M3 produced in the pixel decoder into the transformer decoder to obtain the final embedded vectors;
performing a dot product operation between the embedded vectors and the high-resolution spatio-temporal feature M2 to obtain an instance segmentation result.
Further, training the video instance segmentation network includes preprocessing the collected video frame sequences to generate training sample data, comprising:
taking two frame sequences with the same number of frames from the collected video frame sequence dataset, one as a target set and the other as a source set;
establishing a one-to-one correspondence of image frames between the target set and the source set in temporal order;
copying and pasting the instances in the source set images onto the corresponding image frames in the target set to generate a new frame sequence, and adding the new frame sequence to the video frame sequence dataset.
Further, inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract the spatio-temporal feature E comprises:
performing position encoding on the feature maps C3, C4 and C5, then flattening the tensors and inputting them into a deformable attention module to generate a basic feature F;
performing position encoding on the feature maps C3, C4 and C5 and inputting them into an S2S attention module to generate a basic spatio-temporal feature Fs; the S2S attention module comprises an intra-scale temporal attention module and an inter-scale spatio-temporal attention module, wherein the intra-scale temporal attention module adopts a temporal attention mechanism and the inter-scale spatio-temporal attention module adopts a deformable attention mechanism;
fusing the two features F and Fs of the same dimension to finally obtain the spatio-temporal feature E.
Further, inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain the fused spatio-temporal features M4, M3 and M2, comprises:
separating the spatio-temporal feature E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5;
upsampling the feature E5, adjusting it to the same scale as the feature E4 by bilinear interpolation, and cross-fusing it with E4 to generate the fused spatio-temporal feature M4;
upsampling the fused spatio-temporal feature M4, adjusting it to the same scale as the feature E3 by bilinear interpolation, and cross-fusing it with E3 to generate the fused spatio-temporal feature M3;
upsampling the fused spatio-temporal feature M3, adjusting it to the same scale as the feature map C2 by bilinear interpolation, and cross-fusing it with C2 to generate the fused spatio-temporal feature M2.
Further, the transformer decoder comprises three decoder units corresponding to different scales and an MLP module connected in series, and inputting the features E5, M4 and M3 into the transformer decoder to obtain the final embedded vectors comprises:
inputting the features E5, M4 and M3 into the decoder units of the corresponding scales, the input features serving as the attention mask, keys and values of each decoder unit, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, first performing a cross attention operation and then a self attention operation;
passing the query features output by the last decoder unit through the MLP module to generate the final embedded vectors.
Further, the three decoder units of different scales are cycled for a preset number of iterations.
The video instance segmentation method based on cross-frame instance association addresses the neglect of spatio-temporal context information in videos by existing methods. Specifically, on the basis of the Deformable DETR framework, the present application proposes a multi-scale method oriented to spatio-temporal features to learn the spatio-temporal correlation of dynamic instances and construct more stable cross-frame instance associations. Notably, compared with previous methods, our method can establish reliable cross-frame instance associations without complex frame-by-frame processing. In addition, we propose a data augmentation method named sequential copy-paste, which effectively alleviates the overfitting caused by insufficient training data and improves the robustness of the model. The method improves the accuracy of the video instance segmentation task and achieves leading performance on two popular datasets compared with the latest methods.
Drawings
FIG. 1 is a flow chart of a video instance segmentation method based on cross-frame instance association in the present application;
FIG. 2 is a schematic diagram of a video example split network framework of the present application;
fig. 3 is a schematic diagram of a decoder unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a video instance segmentation method based on cross-frame instance association is proposed, including:
step S1, a video instance segmentation network is constructed and trained, wherein the video instance segmentation network comprises a multi-scale feature extractor, a transformer encoder, a pixel decoder and a transformer decoder.
As shown in fig. 2, the video instance segmentation network constructed in the present application includes a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder. Wherein the multi-scale feature extractor extracts multi-scale features from the input data; the transformer encoder is used for capturing multi-scale space-time characteristics; the pixel decoder performs cross fusion on the multi-scale features to generate high-resolution features for mask prediction; the transformer decoder is used to iteratively update the query feature.
Then, a video frame sequence is collected, training sample data is generated after preprocessing, and the constructed video instance segmentation network is trained. The training of the network model is already a relatively mature technology in the art, and will not be described in detail here.
In the training process, in order to enhance the robustness of the model and alleviate the overfitting problem generated during training, the method for preprocessing the collected video frame sequence to generate training sample data comprises the following steps:
taking two frame sequences with the same number of frames from the collected video frame sequence dataset, one as a target set and the other as a source set;
establishing a one-to-one correspondence of image frames between the target set and the source set in temporal order;
copying and pasting the instances in the source set images onto the corresponding image frames in the target set to generate a new frame sequence, and adding the new frame sequence to the video frame sequence dataset.
For example, a T-frame sequence is first taken as the target set Tgt = {Tgt_t}, and another T-frame video sequence is randomly selected from the dataset as the source set Src = {Src_t}. Next, a one-to-one correspondence is constructed between Src_t and Tgt_t in temporal order, and the instances in Src_t are copied and pasted onto Tgt_t, where t ∈ [1, T], thereby generating a new frame sequence.
Finally, the annotations (ground truth) of the new frame sequence are updated, including the masks, bounding boxes, and classes of partially occluded objects, while fully occluded objects are deleted.
It should be noted that the preprocessing of image frames in the present application further includes scaling the image frames to a certain size and normalizing their pixel values, which are common image preprocessing methods and are not described in detail here. The present application achieves data augmentation through sequential copy-paste, which can effectively enhance the robustness of the model and alleviate the overfitting problem that arises during training.
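As an illustration of the sequential copy-paste augmentation described above, the following Python sketch pastes the instances of each source frame onto the corresponding target frame and updates the target annotations. The annotation layout (per-frame lists of dicts with 'mask' and 'category' entries), the assumption that source and target frames share the same resolution, and the occlusion handling are simplifications made only for this example, not the exact data format used in this application.

```python
import numpy as np

def sequential_copy_paste(tgt_frames, tgt_anns, src_frames, src_anns):
    """Minimal sketch of sequential copy-paste over two T-frame sequences.

    tgt_frames / src_frames: lists of T HxWx3 uint8 arrays (same resolution).
    tgt_anns / src_anns: per-frame lists of dicts with a binary HxW 'mask'
    and a 'category' entry -- a simplified annotation layout assumed here.
    """
    assert len(tgt_frames) == len(src_frames)
    new_frames, new_anns = [], []
    for t, (tgt_img, src_img) in enumerate(zip(tgt_frames, src_frames)):
        img = tgt_img.copy()
        pasted = np.zeros(img.shape[:2], dtype=bool)
        # Paste every instance of source frame t onto target frame t.
        for ann in src_anns[t]:
            m = ann['mask'].astype(bool)
            img[m] = src_img[m]
            pasted |= m
        # Update target annotations: clip masks by the pasted region and
        # drop instances that become fully occluded.
        kept = []
        for ann in tgt_anns[t]:
            m = ann['mask'].astype(bool) & ~pasted
            if m.any():
                kept.append({**ann, 'mask': m.astype(np.uint8)})
        kept.extend(src_anns[t])  # pasted instances keep their own labels
        new_frames.append(img)
        new_anns.append(kept)
    return new_frames, new_anns
```

Bounding boxes of the partially occluded target instances would then be recomputed from the clipped masks, matching the annotation update described above.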
Step S2, inputting the video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps of different scales, denoted C2, C3, C4 and C5 in order of scale.
The application adopts ResNet-50, which is popular in the vision field, as the multi-scale feature extractor. The required input is a frame sequence of shape T×3×H×W, where T is the number of frames, 3 is the number of RGB channels, and H and W are the height and width of the pictures. The multi-scale feature extractor outputs a series of feature maps of different sizes, each of shape T×256×Hi×Wi, where 256 is the number of channels and Hi and Wi are the height and width of the feature map. In this embodiment, four feature maps of different scales are obtained, at 1/4, 1/8, 1/16 and 1/32 of the original video frame resolution, denoted C2, C3, C4 and C5 in turn.
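For concreteness, a minimal PyTorch sketch of such a multi-scale extractor is given below; it wraps torchvision's ResNet-50 and projects every stage output to 256 channels. The 1x1 projection convolutions and the example input size are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiScaleExtractor(nn.Module):
    """ResNet-50 backbone returning C2-C5 feature maps (1/4, 1/8, 1/16, 1/32
    of the input resolution), each projected to 256 channels (sketch)."""

    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2 = r.layer1, r.layer2
        self.layer3, self.layer4 = r.layer3, r.layer4
        # 1x1 projections so that every scale has 256 channels.
        self.proj = nn.ModuleList(
            nn.Conv2d(c, 256, kernel_size=1) for c in (256, 512, 1024, 2048))

    def forward(self, frames):                      # frames: (T, 3, H, W)
        x = self.stem(frames)
        c2 = self.layer1(x)                         # (T, 256,  H/4,  W/4)
        c3 = self.layer2(c2)                        # (T, 512,  H/8,  W/8)
        c4 = self.layer3(c3)                        # (T, 1024, H/16, W/16)
        c5 = self.layer4(c4)                        # (T, 2048, H/32, W/32)
        return [p(c) for p, c in zip(self.proj, (c2, c3, c4, c5))]

frames = torch.randn(5, 3, 384, 640)                # an illustrative 5-frame clip
C2, C3, C4, C5 = MultiScaleExtractor()(frames)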
Step S3, inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract the spatio-temporal feature E.
The present application captures rich multi-scale spatio-temporal features in the transformer encoder, including:
Step S301, performing position encoding on the feature maps C3, C4 and C5, then flattening the tensors and inputting them into a deformable attention module to generate a basic feature F.
In the transformer encoder, {C3, C4, C5} are position-encoded to obtain the features {P3, P4, P5}, each of shape T×Ai×256, where Ai = Hi×Wi. The features {P3, P4, P5} are flattened and input into a deformable attention module, and the basic feature F is generated by standard deformable attention computation. The deformable attention module is a relatively mature technique in transformer encoders and is not described in detail here.
Step S302, performing position encoding on the feature maps C3, C4 and C5 and inputting them into an S2S attention module to generate a basic spatio-temporal feature Fs.
The transformer encoder of this embodiment differs from a conventional transformer encoder in that an S2S attention module is added on top of the conventional encoder.
As shown in fig. 2, the S2S attention module is composed of an intra-scale temporal attention block and an inter-scale spatio-temporal attention block. The intra-scale temporal attention block uses a temporal attention mechanism (TA) to model the temporal correlation of cross-frame features using their interdependencies.
Let X^i of shape T×Ai×C denote the input of the intra-scale temporal attention block on each scale, where i ∈ [3, 5] indexes the scales, p ∈ [0, Ai−1] indexes the pixel positions on scale i, t ∈ [0, T−1] indexes the frames, and C is the feature dimension.
The intra-scale temporal attention is computed as follows: for each pixel position p on scale i, the weight matrices W_q, W_k and W_v generate queries, keys and values from the cross-frame features at that position, attention is computed across the T frames, and a residual connection followed by the normalization layer LN yields s^i, the feature with global temporal correlation on scale i.
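A rough PyTorch sketch of such a per-pixel temporal attention is given below; the use of nn.MultiheadAttention (which holds W_q, W_k and W_v internally), the head count, and the residual/LayerNorm placement are assumptions made for illustration rather than the exact configuration of this application.

```python
import torch
import torch.nn as nn

class IntraScaleTemporalAttention(nn.Module):
    """Per-pixel temporal attention: every spatial position attends across
    the T frames of the clip (a sketch of the TA mechanism)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                    # x: (T, A_i, C), A_i = H_i * W_i
        seq = x.permute(1, 0, 2)             # (A_i, T, C): one T-step sequence per pixel
        out, _ = self.attn(seq, seq, seq)    # W_q / W_k / W_v live inside MultiheadAttention
        s = self.norm(seq + out)             # residual connection followed by LN
        return s.permute(1, 0, 2)            # back to (T, A_i, C)

s3 = IntraScaleTemporalAttention()(torch.randn(5, 48 * 80, 256))   # e.g. scale i = 3
```

Because each pixel only attends over its own T time steps, the cost of this block grows linearly with the number of pixels rather than quadratically.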
Next, the inter-scale spatio-temporal attention block learns spatio-temporal features and cross-frame similarities in adjacent scale spaces using a deformable attention mechanism.
Specifically, the output s^i of the intra-scale temporal attention block serves as the input of the inter-scale spatio-temporal attention block. First, the lower-resolution scale feature is upsampled by bilinear interpolation, concatenated with the higher-resolution scale feature along the feature dimension, and projected back to C dimensions, giving a cross-scale feature z^i. The feature z^i is then reshaped for attention computation, and a flexible deformable attention mechanism (STDeformAttn) is chosen to reduce the expensive computational cost of computing the spatio-temporal features.
The inter-scale spatio-temporal attention is then computed as follows: queries are generated from z^i, and STDeformAttn samples keys and values by interpolation at reference positions obtained after applying learned position offsets, yielding a spatio-temporally correlated feature for each scale. The spatio-temporally correlated features of all scales are concatenated along the spatial dimension to generate the basic spatio-temporal feature Fs, which is then reshaped to the same dimension as the basic feature F.
In summary, in the S2S attention module, cross-frame instance correlations are implicitly constructed by modeling the spatio-temporal correlation of cross-frame pixel-by-pixel features.
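To make the deformable-attention idea concrete, the following is a heavily simplified, single-head, single-scale sketch in which each query predicts a few sampling offsets around its reference point and mixes the sampled values with learned weights; the multi-head and multi-scale machinery of the actual STDeformAttn module is omitted, and all layer names are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Reduced deformable-attention sketch: each query predicts K sampling
    offsets around its reference point, samples the value map there, and
    mixes the samples with learned attention weights."""

    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)   # (dx, dy) per sampling point
        self.weights = nn.Linear(dim, n_points)       # attention weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, value, h, w):
        # query, value: (N, L, C) with L = h * w; ref_points: (N, L, 2), (x, y) in [0, 1]
        N, L, C = query.shape
        v = self.value_proj(value).transpose(1, 2).reshape(N, C, h, w)
        off = self.offsets(query).reshape(N, L, self.n_points, 2)
        attn = self.weights(query).softmax(-1)                     # (N, L, K)
        # Offsets are predicted in pixels, normalized, and added to the reference points.
        loc = ref_points[:, :, None, :] + off / query.new_tensor([float(w), float(h)])
        grid = 2.0 * loc - 1.0                                     # grid_sample expects [-1, 1]
        sampled = F.grid_sample(v, grid, align_corners=False)      # (N, C, L, K)
        out = (sampled * attn[:, None, :, :]).sum(-1)              # weighted sum over K points
        return self.out_proj(out.transpose(1, 2))                  # (N, L, C)
```

Sampling only K points per query instead of attending over every pixel of every frame is what keeps the spatio-temporal attention tractable on high-resolution feature maps.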
Step S303, fusing the basic feature F and the basic spatio-temporal feature Fs to finally obtain the spatio-temporal feature E.
In this way, the basic feature F output by standard deformable attention and the basic spatio-temporal feature Fs output by the S2S attention module are generated in the transformer encoder. Finally, the two features F and Fs of the same dimension are fused to obtain the spatio-temporal feature E, which is sent to the pixel decoder.
Step S4, inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain the fused spatio-temporal features M4, M3 and M2.
The spatio-temporal features captured in the transformer encoder are input to the pixel decoder and gradually upsampled so that the low-resolution multi-scale spatio-temporal features are cross-fused into high-resolution features, which are used in the final mask prediction.
Specifically, the feature map C2 at 1/4 of the original scale obtained in step S2 and the spatio-temporal feature E captured in the transformer encoder are input to the pixel decoder, and the following operations are performed:
step S401, spatial-temporal characteristics
Figure BDA0004068141360000081
Separated into and characterized by C 3 、C 4 And C 5 Scale-corresponding features->
Figure BDA0004068141360000082
And
Figure BDA0004068141360000083
the present embodiment takes the resolution of the scale (i.e., width by height) as a condition to separate multi-scale spatiotemporal features corresponding to 1/8, 1/16 and 1/32 scale sizes of the image
Figure BDA0004068141360000084
And->
Figure BDA0004068141360000085
Step S402, pair of features
Figure BDA0004068141360000086
Upsampling, adjusting to the sum feature +.>
Figure BDA0004068141360000087
The same scale is then interpolated bilinear and is then used to fit +.>
Figure BDA0004068141360000088
Cross fusion to generate the spatiotemporal characteristics after fusion +.>
Figure BDA0004068141360000089
Step S403, for the fused space-time characteristics
Figure BDA00040681413600000810
Upsampling, adjusting to the sum feature +.>
Figure BDA00040681413600000811
The same scale is then interpolated by bilinear interpolation and compared with the features +.>
Figure BDA00040681413600000812
Cross fusion to generate the spatiotemporal characteristics after fusion +.>
Figure BDA00040681413600000813
Step S404, for the fused space-time characteristics
Figure BDA00040681413600000814
Upsampling and adjusting to the characteristic diagram C 2 The same scale is then interpolated by bilinear interpolation to the feature map C 2 Cross fusion to generate the spatiotemporal characteristics after fusion +.>
Figure BDA00040681413600000815
The embodiment fuses low-resolution multi-scale space-time features to generate high-resolution features, and finally generates a high-resolution feature map
Figure BDA00040681413600000816
(scale is the feature of original fig. 1/4) for final mask prediction.
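A compact sketch of this upsample-and-fuse path is shown below. The bilinear upsampling and same-scale cross fusion follow the steps above, while the concrete fusion operator used here (element-wise sum followed by a 3x3 convolution) is an assumption made for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoderSketch(nn.Module):
    """Progressive upsample-and-fuse path of the pixel decoder (sketch)."""

    def __init__(self, dim=256):
        super().__init__()
        # One fusion convolution per fusion step (an assumed fusion operator).
        self.fuse = nn.ModuleList(nn.Conv2d(dim, dim, 3, padding=1) for _ in range(3))

    @staticmethod
    def _up_to(x, ref):
        # Bilinear interpolation to the spatial size of the reference feature.
        return F.interpolate(x, size=ref.shape[-2:], mode='bilinear', align_corners=False)

    def forward(self, C2, E3, E4, E5):
        # C2: (T, 256, H/4, W/4); E3/E4/E5: spatio-temporal features at 1/8, 1/16, 1/32.
        M4 = self.fuse[0](self._up_to(E5, E4) + E4)   # fused feature at 1/16
        M3 = self.fuse[1](self._up_to(M4, E3) + E3)   # fused feature at 1/8
        M2 = self.fuse[2](self._up_to(M3, C2) + C2)   # high-resolution feature at 1/4
        return M2, (E5, M4, M3)                       # mask feature and decoder inputs
```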
Step S5, inputting the features E5, M4 and M3 into the transformer decoder to obtain the final embedded vectors.
The low-resolution multi-scale spatio-temporal features (at 1/32, 1/16 and 1/8 of the original scale) are sequentially input into decoder units of the corresponding scales, and the keys and values in each unit's attention computation are generated from these low-resolution multi-scale spatio-temporal features. The transformer decoder of this embodiment is composed of three decoder units corresponding to different scales; each decoder unit is shown in fig. 3.
In one embodiment, inputting the features E5, M4 and M3 into the transformer decoder to obtain the final embedded vectors comprises:
inputting the features E5, M4 and M3 into the decoder units of the corresponding scales, the input features serving as the attention mask, keys and values of each decoder unit, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, first performing a cross attention operation and then a self attention operation;
the last decoder unit outputs the final embedded vectors.
Specifically, the features at 1/32, 1/16 and 1/8 of the original scale from the pixel decoder are input into the decoder units of the respective scales in order from low to high resolution, and the input features serve as the attention mask, keys and values of each unit.
The attention of the transformer decoder is computed as follows:
X_l = softmax(M_{l-1} + Q_l K_l^T) V_l + X_{l-1}
where l is the layer index; Q_l is the C-dimensional query feature of layer l; K_l and V_l are the spatio-temporal features under the transformations f_K(·) and f_V(·), respectively, with T the number of frames and H_l and W_l the spatial resolution; f_Q, f_K and f_V are linear functions. In addition, the three-dimensional attention mask M_{l-1} at feature location (t, x, y) is 0 where the resized, binarized (threshold 0.5) mask prediction of the (l−1)-th transformer decoder layer is foreground, and −∞ elsewhere.
The first query of the transformer decoder is an initialized set of learnable query features (Init Query), which are iteratively updated as the decoder loops; the query features of each subsequent decoder unit are the features output by the previous decoder unit.
In each decoder unit, a cross attention operation is performed first, followed by a self attention operation; this embodiment introduces an attention mask into the cross attention computation, named masked attention in fig. 3. The query features output by the last decoder unit are passed through the MLP module to generate the final embedded vectors.
In one embodiment, the three decoder units of different scales iterate for a preset number of cycles.
That is, after the three decoder units have each run once, the loop repeats; the three decoder units go through 3 decoding cycles in total to generate the final query features. These query features then pass through the MLP module to generate a set of n query embedded vectors (embeddings); the MLP module consists, from input to output, of a linear layer, a batch-norm layer, and a ReLU activation function.
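The following PyTorch sketch shows one such decoder unit with masked cross attention followed by self attention and a feed-forward layer; the head count, the FFN width, the batch size of 1, and the exact normalization placement are assumptions made for illustration.

```python
import torch.nn as nn

class DecoderUnitSketch(nn.Module):
    """One decoder unit: masked cross attention over the flattened
    spatio-temporal features of one scale, then self attention among the
    queries, then a feed-forward layer (sketch)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 2048), nn.ReLU(), nn.Linear(2048, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, feats, bg_mask):
        # queries: (1, n, C); feats: (1, T*H_l*W_l, C) flattened spatio-temporal features;
        # bg_mask: (n, T*H_l*W_l) boolean, True where the previous layer's binarized
        # mask prediction is background (those positions are blocked, i.e. -inf attention).
        bg_mask = bg_mask & ~bg_mask.all(-1, keepdim=True)   # never block every position
        attn_mask = bg_mask.unsqueeze(0).repeat(self.cross.num_heads, 1, 1)
        x, _ = self.cross(queries, feats, feats, attn_mask=attn_mask)
        queries = self.n1(queries + x)                       # masked cross attention first
        x, _ = self.self_attn(queries, queries, queries)
        queries = self.n2(queries + x)                       # then self attention
        return self.n3(queries + self.ffn(queries))
```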
Step S6, performing a dot product operation between the embedded vectors and the high-resolution spatio-temporal feature M2 to obtain the instance segmentation result.
Finally, a simple dot product between this set of embedded vectors and the high-resolution feature obtained in the pixel decoder yields the n queried three-dimensional masks, i.e., a mask for each frame of the whole video; these n three-dimensional masks are the instance segmentation result.
Meanwhile, the embedded vectors are passed through a linear layer whose input size is the feature dimension and whose output size is the number of categories, giving the category prediction for every instance in the whole video.
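A minimal sketch of this prediction step is given below; the tensor sizes and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

n, C, T, H, W = 100, 256, 5, 96, 160           # illustrative sizes (1/4-scale H, W)
embed = torch.randn(n, C)                       # n query embedded vectors from the MLP module
M2 = torch.randn(T, C, H, W)                    # high-resolution fused feature at 1/4 scale
masks = torch.einsum('nc,tchw->nthw', embed, M2).sigmoid()   # (n, T, H, W) per-query video masks
cls_head = nn.Linear(C, 41)                     # e.g. 40 YouTube-VIS classes + background (assumed)
logits = cls_head(embed)                        # (n, 41) per-video category prediction
```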
Table 1 compares our method with previous state-of-the-art methods that use convolutional-neural-network backbones on the YouTube-VIS 2019 dataset. With ResNet-50, our approach achieves 47.4% mask AP, reaching leading performance. Without training on additional data, our method is 2.3% higher in mask AP than SeqFormer. Compared to the baseline (Mask2Former), we obtain an absolute gain of 1.0%. Similarly, with ResNet-101, our method is consistently better than all previous methods, with an overall mask AP of 49.5%. These results indicate that our method exploits multi-scale spatio-temporal features more effectively to build more stable cross-frame instance associations.
The experimental data in this application show that the method achieves better segmentation accuracy than other existing methods. The experimental data are shown in tables 1 and 2 below:
Method | Data | AP | AP50 | AP75 | AR1 | AR10
MaskTrack R-CNN | V | 30.3 | 51.1 | 32.6 | 31.0 | 35.5
CrossVIS | V | 36.3 | 56.8 | 38.9 | 35.6 | 40.7
VisTR | V | 35.6 | 56.8 | 37.0 | 35.2 | 40.2
IFC | V | 42.8 | 65.8 | 43.8 | 41.1 | 49.7
SeqFormer | V | 45.1 | 66.9 | 50.5 | 45.6 | 54.6
SeqFormer | V+C80k | 47.4 | 69.8 | 51.8 | 45.5 | 54.8
Mask2Former | V | 46.4 | 68.0 | 50.0 | - | -
The present application | V | 47.4 | 71.0 | 53.0 | 46.1 | 58.1
TABLE 1
Table 1 compares the performance of the method of the present application with other methods on the YouTube-VIS 2019 dataset. "V" means only the YouTube-VIS training set is used, and "V+C80k" means the overlapping categories of the MS-COCO synthetic videos are also used for joint training. AP is the average precision of the segmentation-mask predictions, AP50 is the precision at an IoU threshold of 0.5, and AP75 is the precision at an IoU threshold of 0.75. AR is the average recall; the subscript 1 means 1 detection per image and the subscript 10 means 10 detections per image.
Method | Data | AP | AP50 | AP75 | AR1 | AR10
MaskTrack R-CNN | V | 28.6 | 48.9 | 29.6 | 26.5 | 33.8
CrossVIS | V | 34.2 | 54.4 | 37.9 | 30.4 | 38.2
IFC | V | 36.6 | 57.9 | 39.3 | - | -
SeqFormer | V+C80k | 40.5 | 62.4 | 43.7 | 36.1 | 48.1
Mask2Former | V | 40.6 | 60.9 | 41.8 | - | -
The present application | V | 41.6 | 64.4 | 44.8 | 38.2 | 50.9
TABLE 2
As shown in Table 2, the method of the present application achieves 41.6% mask AP (precision of the segmentation-mask predictions) on the more recently introduced YouTube-VIS 2021 dataset, which is at least 1.0% better than the previous state-of-the-art offline video instance segmentation methods.
The foregoing examples merely illustrate embodiments of the invention in detail and are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, and these all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (6)

1. A video instance segmentation method based on cross-frame instance association, the video instance segmentation method based on cross-frame instance association comprising:
constructing and training a video instance segmentation network comprising a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder;
inputting a video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps of different scales, wherein the feature maps are, in order of scale, C2, C3, C4 and C5;
inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract a spatio-temporal feature E;
inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain fused spatio-temporal features M4, M3 and M2, where M2 is the high-resolution fused feature;
inputting the multi-scale features E5, M4 and M3 produced in the pixel decoder into the transformer decoder to obtain final embedded vectors;
performing a dot product operation between the embedded vectors and the high-resolution spatio-temporal feature M2 to obtain an instance segmentation result.
2. The cross-frame instance correlation based video instance segmentation method of claim 1 wherein training the video instance segmentation network comprises preprocessing an acquired sequence of video frames to generate training sample data, comprising:
taking two frame sequences with the same number of frames from the collected video frame sequence dataset, one as a target set and the other as a source set;
establishing a one-to-one correspondence of image frames between the target set and the source set in temporal order;
copying and pasting the instances in the source set images onto the corresponding image frames in the target set to generate a new frame sequence, and adding the new frame sequence to the video frame sequence dataset.
3. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein inputting the extracted feature maps C3, C4 and C5 into the transformer encoder to extract the spatio-temporal feature E comprises:
performing position encoding on the feature maps C3, C4 and C5, then flattening the tensors and inputting them into a deformable attention module to generate a basic feature F;
performing position encoding on the feature maps C3, C4 and C5 and inputting them into an S2S attention module to generate a basic spatio-temporal feature Fs; the S2S attention module comprises an intra-scale temporal attention module and an inter-scale spatio-temporal attention module, wherein the intra-scale temporal attention module adopts a temporal attention mechanism and the inter-scale spatio-temporal attention module adopts a deformable attention mechanism;
fusing the two features F and Fs of the same dimension to finally obtain the spatio-temporal feature E.
4. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then gradually upsampling and cross-fusing to obtain the fused spatio-temporal features M4, M3 and M2, comprises:
separating the spatio-temporal feature E into the features E3, E4 and E5 corresponding to the scales of C3, C4 and C5;
upsampling the feature E5, adjusting it to the same scale as the feature E4 by bilinear interpolation, and cross-fusing it with E4 to generate the fused spatio-temporal feature M4;
upsampling the fused spatio-temporal feature M4, adjusting it to the same scale as the feature E3 by bilinear interpolation, and cross-fusing it with E3 to generate the fused spatio-temporal feature M3;
upsampling the fused spatio-temporal feature M3, adjusting it to the same scale as the feature map C2 by bilinear interpolation, and cross-fusing it with C2 to generate the fused spatio-temporal feature M2.
5. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein the transformer decoder comprises three decoder units corresponding to different scales and an MLP module connected in series, and inputting the features E5, M4 and M3 into the transformer decoder to obtain the final embedded vectors comprises:
inputting the features E5, M4 and M3 into the decoder units of the corresponding scales, the input features serving as the attention mask, keys and values of each decoder unit, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, first performing a cross attention operation and then a self attention operation;
passing the query features output by the last decoder unit through the MLP module to generate the final embedded vectors.
6. The method of video instance segmentation based on cross-frame instance correlation of claim 5, wherein the three decoder units of different scales iterate a preset number of times.
CN202310083300.XA 2023-02-08 2023-02-08 Video instance segmentation method based on cross-frame instance association Pending CN116152710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310083300.XA CN116152710A (en) 2023-02-08 2023-02-08 Video instance segmentation method based on cross-frame instance association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310083300.XA CN116152710A (en) 2023-02-08 2023-02-08 Video instance segmentation method based on cross-frame instance association

Publications (1)

Publication Number Publication Date
CN116152710A true CN116152710A (en) 2023-05-23

Family

ID=86340262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310083300.XA Pending CN116152710A (en) 2023-02-08 2023-02-08 Video instance segmentation method based on cross-frame instance association

Country Status (1)

Country Link
CN (1) CN116152710A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117830874A (en) * 2024-03-05 2024-04-05 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition
CN117830874B (en) * 2024-03-05 2024-05-07 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition

Similar Documents

Publication Publication Date Title
Liu et al. Video super-resolution based on deep learning: a comprehensive survey
CN110059772B (en) Remote sensing image semantic segmentation method based on multi-scale decoding network
CN107808389B (en) Unsupervised video segmentation method based on deep learning
CN111652899B (en) Video target segmentation method for space-time component diagram
CN107679462B (en) Depth multi-feature fusion classification method based on wavelets
CN113177882B (en) Single-frame image super-resolution processing method based on diffusion model
CN113837938B (en) Super-resolution method for reconstructing potential image based on dynamic vision sensor
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN108280804B (en) Multi-frame image super-resolution reconstruction method
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111276240B (en) Multi-label multi-mode holographic pulse condition identification method based on graph convolution network
CN111696035A (en) Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm
CN113158723A (en) End-to-end video motion detection positioning system
CN115311720B (en) Method for generating deepfake based on transducer
CN109949217B (en) Video super-resolution reconstruction method based on residual learning and implicit motion compensation
CN116664397B (en) TransSR-Net structured image super-resolution reconstruction method
CN111696038A (en) Image super-resolution method, device, equipment and computer-readable storage medium
CN114170088A (en) Relational reinforcement learning system and method based on graph structure data
CN114494297A (en) Adaptive video target segmentation method for processing multiple priori knowledge
CN116152710A (en) Video instance segmentation method based on cross-frame instance association
CN116071748A (en) Unsupervised video target segmentation method based on frequency domain global filtering
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
CN114581762A (en) Road extraction method based on multi-scale bar pooling and pyramid pooling
CN117058043A (en) Event-image deblurring method based on LSTM
Liu et al. Arbitrary-scale super-resolution via deep learning: A comprehensive survey

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination