CN116152710A - Video instance segmentation method based on cross-frame instance association - Google Patents
- Publication number
- CN116152710A (application number CN202310083300.XA)
- Authority
- CN
- China
- Prior art keywords
- features
- scale
- cross
- video
- instance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a video instance segmentation method based on cross-frame instance association. A video frame sequence to be segmented is input into a multi-scale feature extractor to obtain feature maps of different scales; spatio-temporal features are extracted by a Transformer encoder and fused by a pixel decoder; a Transformer decoder then produces the final embedding vectors, and a dot-product operation between these embedding vectors and the high-resolution spatio-temporal features yields the instance segmentation result. The method learns the spatio-temporal correlation of dynamic instances through a multi-scale approach oriented to spatio-temporal features, establishes reliable and more stable cross-frame instance associations, improves the accuracy of the video instance segmentation task, and achieves leading performance on two popular data sets compared with the latest methods.
Description
Technical Field
The application belongs to the technical field of video instance segmentation, and particularly relates to a video instance segmentation method based on cross-frame instance association.
Background
Video instance segmentation aims to detect, segment, and track target instances in video simultaneously, which benefits many downstream tasks, including autonomous driving, video surveillance, and video understanding. Compared with image instance segmentation, video instance segmentation is more challenging because object instances in video are difficult to segment and track accurately due to factors such as appearance deformation, fast motion, and occlusion.
With the introduction of the DETR and deformable DETR frameworks, Transformer-based end-to-end methods have become the recent mainstream in video instance segmentation. Following the video-in, video-out paradigm, VisTR first applied a Transformer to the video instance segmentation problem and used instance queries to obtain instance sequences from the video; however, this approach learns one embedding for each instance in each frame, which makes it difficult to process variable-length or long video sequences. To reduce the explosive computational cost of VisTR and build cross-frame instance associations, subsequent studies exploited target queries and proposed novel variants, respectively: memory tokens that model contextual temporal dependencies, and a query-separation mechanism that builds cross-frame instance associations. These methods essentially detect instances from single-frame features and then perform cross-frame instance matching; however, this deliberately separates images from video and irreversibly discards the rich spatio-temporal context information present in video.
Furthermore, existing approaches focus mainly on network improvements but pay little attention to the data sets required for training and testing. Investigation shows that current data sets, with their insufficient amount of training data, easily lead to overfitting during training.
Disclosure of Invention
It is an object of the present application to provide a video instance segmentation method based on cross-frame instance association (also referred to as IAST in the present application) that overcomes the problems raised in the background art above.
In order to achieve the above purpose, the technical scheme of the application is as follows:
a video instance segmentation method based on cross-frame instance correlation, comprising:
constructing and training a video instance segmentation network comprising a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder;
inputting a video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps of different scales, denoted, in order of scale, C2, C3, C4 and C5;
inputting the extracted feature maps C3, C4 and C5 into the Transformer encoder to extract the spatio-temporal feature E;
inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then progressively upsampling and cross-fusing them to obtain the fused spatio-temporal features O4, O3 and O2;
inputting the multi-scale spatio-temporal features at 1/32, 1/16 and 1/8 of the original scale into the Transformer decoder to obtain the final embedding vectors;
performing a dot-product operation between the embedding vectors and the high-resolution spatio-temporal feature O2 to obtain the instance segmentation result.
Further, training the video instance segmentation network includes preprocessing the collected video frame sequences to generate training sample data, which comprises:
taking two frame sequences with the same number of frames from the collected video frame sequence data set, one serving as the target set and the other as the source set;
establishing a one-to-one correspondence between the image frames of the target set and those of the source set in temporal order;
copying the instances in each source-set image and pasting them onto the corresponding target-set image frame to generate a new frame sequence, which is added to the video frame sequence data set.
Further, inputting the extracted feature maps C3, C4 and C5 into the Transformer encoder to extract the spatio-temporal feature E comprises the following steps:
performing position encoding on the feature maps C3, C4 and C5, flattening the resulting tensors, and inputting them into a deformable attention module to generate the basic feature F;
performing position encoding on the feature maps C3, C4 and C5 and inputting them into an S2S attention module to generate the basic spatio-temporal feature S, wherein the S2S attention module comprises an intra-scale temporal attention module and an inter-scale spatio-temporal attention module, the intra-scale temporal attention module adopting a temporal attention mechanism and the inter-scale spatio-temporal attention module adopting a deformable attention mechanism;
fusing the two features F and S, which have the same dimension, to finally obtain the spatio-temporal feature E.
Further, inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then progressively upsampling and cross-fusing them to obtain the fused spatio-temporal features O4, O3 and O2, comprises the following steps:
separating the spatio-temporal feature E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5;
upsampling the feature E5 by bilinear interpolation to the same scale as the feature E4 and cross-fusing it with E4 to generate the fused spatio-temporal feature O4;
upsampling the fused spatio-temporal feature O4 by bilinear interpolation to the same scale as the feature E3 and cross-fusing it with E3 to generate the fused spatio-temporal feature O3;
upsampling the fused spatio-temporal feature O3 by bilinear interpolation to the same scale as the feature map C2 and cross-fusing it with C2 to generate the fused spatio-temporal feature O2.
Further, the Transformer decoder comprises three decoder units corresponding to different scales and an MLP module connected in series, and inputting the multi-scale spatio-temporal features into the Transformer decoder to obtain the final embedding vectors comprises:
inputting the multi-scale spatio-temporal features into the decoder units of the corresponding scales, the input features serving as the attention masks, keys and values of the decoder units, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, a cross attention operation is first performed, and then a self attention operation is performed;
the query features output by the last decoder unit are passed through the MLP module to generate the final embedded vector.
Further, the three decoder units of different scales are cycled for a preset number of iterations.
The video instance segmentation method based on cross-frame instance association addresses the problem that existing methods ignore the spatio-temporal context information in video. Specifically, based on the deformable DETR framework, the present application proposes a multi-scale method oriented to spatio-temporal features to learn the spatio-temporal correlation of dynamic instances and construct more stable cross-frame instance associations. Notably, compared with previous methods, this method establishes reliable cross-frame instance associations without requiring complex frame-by-frame processing. In addition, a data augmentation method named sequential copy-paste is proposed, which effectively alleviates the overfitting problem caused by insufficient training data and improves the robustness of the model. The method improves the accuracy of the video instance segmentation task and achieves leading performance on two popular data sets compared with the latest methods.
Drawings
FIG. 1 is a flow chart of a video instance segmentation method based on cross-frame instance association in the present application;
FIG. 2 is a schematic diagram of a video example split network framework of the present application;
fig. 3 is a schematic diagram of a decoder unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a video instance segmentation method based on cross-frame instance association is proposed, including:
step S1, a video instance segmentation network is constructed and trained, wherein the video instance segmentation network comprises a multi-scale feature extractor, a transformer encoder, a pixel decoder and a transformer decoder.
As shown in fig. 2, the video instance segmentation network constructed in the present application includes a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder. Wherein the multi-scale feature extractor extracts multi-scale features from the input data; the transformer encoder is used for capturing multi-scale space-time characteristics; the pixel decoder performs cross fusion on the multi-scale features to generate high-resolution features for mask prediction; the transformer decoder is used to iteratively update the query feature.
Then, a video frame sequence is collected, training sample data is generated after preprocessing, and the constructed video instance segmentation network is trained. The training of the network model is already a relatively mature technology in the art, and will not be described in detail here.
In the training process, in order to enhance the robustness of the model and alleviate the overfitting problem generated during training, the method for preprocessing the collected video frame sequence to generate training sample data comprises the following steps:
taking two frame sequences with the same number of frames from the collected video frame sequence data set, one serving as the target set and the other as the source set;
establishing a one-to-one correspondence between the image frames of the target set and those of the source set in temporal order;
copying the instances in each source-set image and pasting them onto the corresponding target-set image frame to generate a new frame sequence, which is added to the video frame sequence data set.
For example, a frame sequence of T frames is first selected as the target set Tgt, and another video frame sequence, also of T frames, is randomly selected from the data set as the source set Src. Next, a one-to-one correspondence is constructed in temporal order between the source frames Src_t and the target frames Tgt_t; the instances are then copied from Src_t and pasted onto Tgt_t, where t ∈ [1, T], thereby generating a new frame sequence.
Finally, the annotation (also referred to as true value) of the new frame sequence is updated, including the mask, target frame, and class of partially occluded objects, while the fully occluded objects are deleted.
It should be noted that the preprocessing of the image frames in the present application further includes scaling the image frames to a fixed size and normalizing their pixel values, both of which are common image preprocessing methods and are not described again here. Data augmentation through sequential copy-paste effectively enhances the robustness of the model and alleviates the overfitting problem arising during training.
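The patent provides no reference implementation of sequential copy-paste; the following is a minimal numpy sketch of the core paste step, under the assumption that instance masks are available as boolean arrays. The annotation update for partially and fully occluded objects described above is omitted, and all names are illustrative.

```python
import numpy as np

def sequence_copy_paste(tgt_frames, src_frames, src_masks):
    """Paste the source-set instance onto the corresponding target frame.

    tgt_frames, src_frames: (T, H, W, 3) uint8 frame sequences.
    src_masks: (T, H, W) boolean instance masks for the source sequence.
    Returns the new frame sequence and the pasted-instance masks.
    """
    assert tgt_frames.shape == src_frames.shape
    new_frames = tgt_frames.copy()
    # One-to-one correspondence in temporal order: frame t of Src -> frame t of Tgt.
    for t in range(tgt_frames.shape[0]):
        m = src_masks[t]
        new_frames[t][m] = src_frames[t][m]
    return new_frames, src_masks.copy()

# Toy example: a 2-frame sequence with a 2x2 "instance" moving one pixel.
T, H, W = 2, 4, 4
tgt = np.zeros((T, H, W, 3), dtype=np.uint8)
src = np.full((T, H, W, 3), 200, dtype=np.uint8)
masks = np.zeros((T, H, W), dtype=bool)
masks[0, 0:2, 0:2] = True
masks[1, 1:3, 1:3] = True
new_seq, new_masks = sequence_copy_paste(tgt, src, masks)
```

In practice the pasted masks would also be subtracted from any target-set instances they occlude before the new sequence is added to the data set.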
Step S2, inputting the video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps of different scales, denoted, in order of scale, C2, C3, C4 and C5.
The present application adopts ResNet-50, popular in the vision field, as the multi-scale feature extractor. The required input frame sequence is X ∈ R^(T×3×H×W), where T denotes the number of frames, 3 the number of RGB channels, and H and W the height and width of the picture. The multi-scale feature extractor outputs a series of feature maps of different sizes, F_i ∈ R^(T×256×H_i×W_i), where 256 is the number of channels and H_i and W_i are the height and width of the feature map. In this embodiment, four feature maps of different scales are obtained, at 1/4, 1/8, 1/16 and 1/32 of the original video frame, denoted in turn C2, C3, C4 and C5.
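To illustrate the shapes involved (this is not the actual ResNet-50), a toy numpy sketch that produces a four-level pyramid at 1/4 to 1/32 of the input resolution; block-average pooling stands in for the backbone's strided convolutions, and the channel count is left at 3 rather than the 256 used in the patent.

```python
import numpy as np

def downsample(x, factor):
    """Block-average pooling by an integer factor (stand-in for strided convs)."""
    T, C, H, W = x.shape
    return x.reshape(T, C, H // factor, factor, W // factor, factor).mean(axis=(3, 5))

T, H, W = 2, 64, 64
frames = np.random.rand(T, 3, H, W).astype(np.float32)
# C2..C5 at 1/4, 1/8, 1/16 and 1/32 of the input resolution.
pyramid = {f"C{i + 2}": downsample(frames, 4 * 2**i) for i in range(4)}
shapes = {k: v.shape for k, v in pyramid.items()}
```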
Step S3, inputting the extracted feature maps C3, C4 and C5 into the Transformer encoder to extract the spatio-temporal feature E.
The present application captures rich multi-scale spatio-temporal features in the Transformer encoder, as follows:
step S301, feature map C 3 、C 4 And C 5 And performing position coding, then respectively performing tensor flattening operation, and inputting the tensor flattened operation into a deformable attention module to generate a basic feature F.
In the transducer encoder { C is selected 3 ,C 4 ,C 5 Position coding to obtainWherein A is i =H i ×W i . Features { P 3 ,P 4 ,P 5 The tensor flattening flag operation is performed, then the tensor flattening flag operation is input into a deformable attention module, and basic characteristics F of standard deformable attention output are generated through deformable attention calculation. The deformable attention module is a relatively mature technology in the transducer encoder, and will not be described in detail herein.
Step S302, performing position encoding on the feature maps C3, C4 and C5 and inputting them into the S2S attention module to generate the basic spatio-temporal feature S.
The Transformer encoder of this embodiment differs from a conventional Transformer encoder in that it incorporates an S2S attention module on top of the conventional design.
As shown in fig. 2, the S2S attention module is composed of an intra-scale temporal attention block and an inter-scale spatio-temporal attention block. The intra-scale temporal attention block uses a temporal attention mechanism (TA) to model the temporal correlation of cross-frame features by exploiting their interdependencies.
UsingRepresenting the input of intra-scale time attention blocks on different scales, where i E [3,5 ]]Representing different scales; p epsilon [0, A i -1]Representing pixel locations on the i scale; t E [0, T-1 ]]Representing the number of frames; c is the feature dimension.
The intra-scale time attention calculation process is as follows:
s here i Representing features having global temporal correlation on the i scale; w (W) q ,W k ,W v The weight matrices at pixel position p for generating queries, keys and values respectively,LN represents the normalization layer.
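The intra-scale temporal attention can be sketched in numpy as follows. This is an illustrative reading of the formula, not the patent's code: scaled dot-product attention over the T frames, applied independently at each pixel position, with a residual connection; the layer normalization LN is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def intra_scale_temporal_attention(P, Wq, Wk, Wv):
    """Self-attention over the T frames at each pixel position.

    P: (T, A, C) features at one scale; Wq/Wk/Wv: (C, C) projections.
    Returns (T, A, C) features with global temporal correlation.
    """
    T, A, C = P.shape
    out = np.empty_like(P)
    for p in range(A):                    # each pixel position attends over time
        q, k, v = P[:, p] @ Wq, P[:, p] @ Wk, P[:, p] @ Wv   # each (T, C)
        attn = softmax(q @ k.T / np.sqrt(C))                  # (T, T)
        out[:, p] = attn @ v + P[:, p]                        # residual connection
    return out

T, A, C = 3, 4, 8
P = rng.normal(size=(T, A, C))
Wq, Wk, Wv = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
S = intra_scale_temporal_attention(P, Wq, Wk, Wv)
```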
Next, inter-scale spatiotemporal attention blocks learn spatiotemporal features and similarities across frames in adjacent scale space using deformable attention mechanisms.
Specifically, let Ŝ_i denote the output of the intra-scale temporal attention block, which serves as the input of the inter-scale spatio-temporal attention block. First, the low-resolution scale feature is upsampled by bilinear interpolation, concatenated with the high-resolution scale feature along the feature dimension, and projected back to C dimensions by a linear layer W_s:
Z_i = W_s [ Up(Ŝ_{i+1}) ; Ŝ_i ]
Z_i is then adjusted to R^(T×A_i×C), and a flexible deformable attention mechanism (STDeformAttn) is chosen to reduce the expensive computational cost required to compute the spatio-temporal features.
The inter-scale spatio-temporal attention is then computed as:
Ŝ'_i = STDeformAttn( Z_i , R_i )
Here STDeformAttn samples keys and values by position interpolation, and the reference points R_i are obtained from Z_i after position offsets. The spatio-temporally correlated features Ŝ'_i are concatenated along the spatial dimension to generate the basic spatio-temporal feature S, which is then reshaped to the multi-scale layout.
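A minimal numpy sketch of the upsample-concatenate-project step that forms the input of the inter-scale block, under stated simplifications: nearest-neighbour upsampling replaces bilinear interpolation, the deformable attention itself is not reproduced, and all symbols (s_hi, s_lo, Ws) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (bilinear in the patent; nearest keeps it short)."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def fuse_adjacent_scales(s_hi, s_lo, Ws):
    """Upsample the low-resolution scale, concatenate with the high-resolution
    scale along the feature dimension, and project back to C channels.

    s_hi: (T, C, H, W); s_lo: (T, C, H//2, W//2); Ws: (2C, C) projection.
    """
    up = upsample2x(s_lo)                           # (T, C, H, W)
    cat = np.concatenate([up, s_hi], axis=1)        # (T, 2C, H, W)
    return np.einsum("tchw,cd->tdhw", cat, Ws)      # (T, C, H, W)

T, C, H, W = 2, 8, 4, 4
s3 = rng.normal(size=(T, C, H, W))           # high-resolution scale
s4 = rng.normal(size=(T, C, H // 2, W // 2)) # adjacent low-resolution scale
Ws = rng.normal(size=(2 * C, C)) * 0.1
z3 = fuse_adjacent_scales(s3, s4, Ws)
```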
In summary, in the S2S attention module, cross-frame instance correlations are implicitly constructed by modeling the spatio-temporal correlation of cross-frame pixel-by-pixel features.
Step S303, fusing the basic feature F and the basic spatio-temporal feature S to finally obtain the spatio-temporal feature E.
In this way, the basic feature F output by the standard deformable attention and the basic spatio-temporal feature S output by the S2S attention module are both generated in the Transformer encoder. Finally, the two features F and S, which have the same dimension, are fused to obtain the spatio-temporal feature E, which is sent to the pixel decoder.
Step S4, inputting the feature map C2 and the spatio-temporal feature E into the pixel decoder, separating E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5, and then progressively upsampling and cross-fusing them to obtain the fused spatio-temporal features O4, O3 and O2.
The spatio-temporal features captured in the Transformer encoder are input to the pixel decoder and gradually upsampled, cross-fusing the low-resolution multi-scale spatio-temporal features to generate high-resolution features for the final mask prediction.
Specifically, the feature map C2 at 1/4 of the original scale obtained in step S2 and the spatio-temporal feature E captured in the Transformer encoder are input to the pixel decoder, and the following operations are performed:
Step S401, separating the spatio-temporal feature E into features E3, E4 and E5 corresponding to the scales of C3, C4 and C5.
This embodiment uses the resolution of each scale (i.e., width by height) as the condition for separating the multi-scale spatio-temporal features E3, E4 and E5, which correspond to the 1/8, 1/16 and 1/32 scales of the image.
Step S402, upsampling the feature E5 by bilinear interpolation to the same scale as the feature E4 and cross-fusing it with E4 to generate the fused spatio-temporal feature O4.
Step S403, upsampling the fused spatio-temporal feature O4 by bilinear interpolation to the same scale as the feature E3 and cross-fusing it with E3 to generate the fused spatio-temporal feature O3.
Step S404, upsampling the fused spatio-temporal feature O3 by bilinear interpolation to the same scale as the feature map C2 and cross-fusing it with C2 to generate the fused spatio-temporal feature O2.
This embodiment fuses the low-resolution multi-scale spatio-temporal features to generate high-resolution features, finally producing the high-resolution feature map O2 (at 1/4 of the original scale) for the final mask prediction.
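The progressive fusion of the pixel decoder can be sketched as follows, with illustrative names E3/E4/E5 for the separated encoder features (1/8, 1/16, 1/32 scale) and O4/O3/O2 for the fused outputs. Two stated simplifications: elementwise addition stands in for the patent's cross-fusion, and nearest-neighbour upsampling for bilinear interpolation.

```python
import numpy as np

rng = np.random.default_rng(2)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling as a stand-in for bilinear interpolation."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def pixel_decoder(C2, E3, E4, E5):
    """Progressively upsample and fuse (here: add) up to 1/4 resolution."""
    O4 = upsample2x(E5) + E4          # 1/32 -> 1/16
    O3 = upsample2x(O4) + E3          # 1/16 -> 1/8
    O2 = upsample2x(O3) + C2          # 1/8  -> 1/4
    return O4, O3, O2

T, C = 2, 8
C2 = rng.normal(size=(T, C, 16, 16))  # 1/4 scale of a 64x64 frame
E3 = rng.normal(size=(T, C, 8, 8))
E4 = rng.normal(size=(T, C, 4, 4))
E5 = rng.normal(size=(T, C, 2, 2))
O4, O3, O2 = pixel_decoder(C2, E3, E4, E5)
```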
Step S5, inputting the multi-scale spatio-temporal features into the Transformer decoder to obtain the final embedding vectors.
The low-resolution multi-scale spatio-temporal features (at 1/32, 1/16 and 1/8 of the original scale) are input in turn into decoder units of the corresponding scales; the keys and values in the attention computation are generated from these low-resolution multi-scale spatio-temporal features. The Transformer decoder of this embodiment is composed of three decoder units corresponding to different scales, each shown in detail in fig. 3.
In one embodiment, inputting the multi-scale spatio-temporal features into the Transformer decoder to obtain the final embedding vectors comprises:
inputting the multi-scale spatio-temporal features into the decoder units of the corresponding scales, the input features serving as the attention masks, keys and values of the decoder units, wherein the query features of the first decoder unit are initialized query features and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, a cross attention operation is first performed, and then a self attention operation is performed;
the last decoder unit outputs the final embedded vector.
Specifically, the features at 1/32, 1/16 and 1/8 of the original scale in the pixel decoder are input in turn, from low resolution to high resolution, into the decoder units of the respective scales, the input features serving in turn as the attention mask, keys and values.
The attention computation of the Transformer decoder is as follows:
X_l = softmax( M_{l−1} + Q_l K_l^T ) V_l + X_{l−1}
where l is the layer index; X_l ∈ R^(N×C) are the N C-dimensional query features at layer l; K_l and V_l ∈ R^(T·H_l·W_l×C) are the spatio-temporal features under the transformations f_K(X_{l−1}) and f_V(X_{l−1}), respectively, with T the number of frames and H_l, W_l the spatial resolution; f_Q, f_K and f_V are linear functions. In addition, the three-dimensional attention mask M_{l−1} at feature location (t, x, y) is:
M_{l−1}(t, x, y) = 0 if M̂_{l−1}(t, x, y) = 1, and −∞ otherwise.
Here M̂_{l−1} is the resized, binarized (threshold 0.5) mask prediction of the (l−1)-th Transformer decoder layer.
The first query of the Transformer decoder is an initialized set of learnable query features (Init Query), which are iteratively updated as the decoder loops; the query features of each subsequent decoder unit are the features output by the previous decoder unit.
In each decoder unit, a cross-attention operation is performed first, followed by a self-attention operation. This embodiment introduces an attention mask into the cross-attention computation, named masked attention in fig. 3. The query features output by the last decoder unit pass through the MLP module to generate the final embedding vectors.
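The masked cross-attention can be sketched in numpy as follows, an illustrative reading rather than the patent's code: identity projections stand in for the linear functions f_Q, f_K, f_V, and a large negative constant plays the role of −∞, so that locations the previous layer predicted as background receive (numerically) zero attention weight.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(X, feat, prev_mask):
    """X_l = softmax(M_{l-1} + Q K^T) V + X_{l-1} with identity projections.

    X: (N, C) query features; feat: (L, C) flattened key/value features;
    prev_mask: (N, L) boolean, True where the previous layer predicted foreground.
    """
    M = np.where(prev_mask, 0.0, -1e9)     # 0 on foreground, "-inf" elsewhere
    attn = softmax(X @ feat.T + M)         # (N, L)
    return attn @ feat + X                 # residual connection

N, L, C = 2, 6, 4
X = rng.normal(size=(N, C))
feat = rng.normal(size=(L, C))
mask = np.zeros((N, L), dtype=bool)
mask[:, :3] = True                         # each query may attend to 3 locations
X_new = masked_cross_attention(X, feat, mask)
attn_weights = softmax(X @ feat.T + np.where(mask, 0.0, -1e9))
```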
In one embodiment, the three decoder units of different scales iterate a preset number of times.
That is, after the three decoder units have run, loop iteration is performed; the three decoders go through 3 decoding loops in total to generate the final query features, which then pass through the MLP module to produce a set of n query embedding vectors. The input features of the MLP module pass in turn, from input side to output side, through a linear layer, a batch-norm layer and a ReLU activation function.
Step S6, performing a dot-product operation between the embedding vectors and the high-resolution spatio-temporal feature to obtain the instance segmentation result.
Finally, a simple dot-product operation between the set of embedding vectors and the high-resolution feature obtained in the pixel decoder yields the three-dimensional masks of the n queries, i.e., a mask for each frame of the whole video; these n three-dimensional masks are the instance segmentation result.
Meanwhile, each embedding vector is passed through a linear layer whose input size is the feature dimension and whose output size is the number of categories, yielding the category prediction results for all instances of the whole video.
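The final mask prediction can be sketched as a single einsum between the n embedding vectors and the high-resolution spatio-temporal feature (called O2 in this sketch), followed by a sigmoid and the 0.5 threshold; the classification head is omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def predict_masks(embeddings, O2, threshold=0.5):
    """Dot product between the n query embeddings and the high-resolution
    feature, followed by sigmoid + threshold.

    embeddings: (n, C); O2: (T, C, H, W). Returns (n, T, H, W) boolean masks.
    """
    logits = np.einsum("nc,tchw->nthw", embeddings, O2)
    return 1.0 / (1.0 + np.exp(-logits)) > threshold

n, T, C, H, W = 5, 2, 8, 16, 16
embeddings = rng.normal(size=(n, C))   # output of the MLP module
O2 = rng.normal(size=(T, C, H, W))     # 1/4-scale feature from the pixel decoder
masks = predict_masks(embeddings, O2)  # one 3D mask per query
```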
Table 1 compares our method, using convolutional backbones, with the previous state-of-the-art methods on the YouTube-VIS 2019 dataset. On ResNet-50, our approach achieves 47.4% mask AP, an advanced level of performance. Without training on additional data, our method is 2.3% higher in mask AP than SeqFormer. Compared with the baseline (Mask2Former), we obtain an absolute gain of 1.0%. Similarly, on ResNet-101, our method consistently outperforms all previous methods, with an overall mask AP of 49.5%. These results indicate that our method exploits multi-scale spatio-temporal features more effectively to build more stable cross-frame instance associations.
Experimental data in the application show that the method has better segmentation accuracy compared with other methods in the prior art. Experimental data are shown in tables 1 and 2 below:
| Method | Data | AP | AP50 | AP75 | AR1 | AR10 |
| --- | --- | --- | --- | --- | --- | --- |
| MaskTrack R-CNN | V | 30.3 | 51.1 | 32.6 | 31.0 | 35.5 |
| CrossVIS | V | 36.3 | 56.8 | 38.9 | 35.6 | 40.7 |
| VisTR | V | 35.6 | 56.8 | 37.0 | 35.2 | 40.2 |
| IFC | V | 42.8 | 65.8 | 43.8 | 41.1 | 49.7 |
| SeqFormer | V | 45.1 | 66.9 | 50.5 | 45.6 | 54.6 |
| SeqFormer | V+C80k | 47.4 | 69.8 | 51.8 | 45.5 | 54.8 |
| Mask2Former | V | 46.4 | 68.0 | 50.0 | - | - |
| The technical scheme of the present application | V | 47.4 | 71.0 | 53.0 | 46.1 | 58.1 |
TABLE 1
Table 1 compares the performance of the method of the present application with other methods on the YouTube-VIS 2019 dataset. "V" means that only the YouTube-VIS training set is used, and "V+C80k" means that composite videos built from the overlapping categories of MS-COCO are also used for joint training. AP denotes the average precision of the segmentation mask predictions, AP50 the precision at an IoU threshold of 0.5, and AP75 the precision at an IoU threshold of 0.75. AR denotes the average recall; the subscript 1 means 1 detection per image and the subscript 10 means 10 detections per image.
| Method | Data | AP | AP50 | AP75 | AR1 | AR10 |
| --- | --- | --- | --- | --- | --- | --- |
| MaskTrack R-CNN | V | 28.6 | 48.9 | 29.6 | 26.5 | 33.8 |
| CrossVIS | V | 34.2 | 54.4 | 37.9 | 30.4 | 38.2 |
| IFC | V | 36.6 | 57.9 | 39.3 | - | - |
| SeqFormer | V+C80k | 40.5 | 62.4 | 43.7 | 36.1 | 48.1 |
| Mask2Former | V | 40.6 | 60.9 | 41.8 | - | - |
| The technical scheme of the present application | V | 41.6 | 64.4 | 44.8 | 38.2 | 50.9 |
TABLE 2
As shown in Table 2, the method of the present application achieves 41.6% mask AP (precision of the segmentation-mask predictions) on the more recently introduced YouTube-VIS 2021 dataset, at least 1.0% better than the previous state-of-the-art offline video instance segmentation methods.
The foregoing examples merely illustrate embodiments of the invention in greater detail and are not to be construed as limiting its scope. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application is determined by the appended claims.
Claims (6)
1. A video instance segmentation method based on cross-frame instance association, the video instance segmentation method based on cross-frame instance association comprising:
constructing and training a video instance segmentation network comprising a multi-scale feature extractor, a transformer encoder, a pixel decoder, and a transformer decoder;
inputting a video frame sequence to be segmented into the multi-scale feature extractor to extract feature maps at different scales, denoted in order of scale as C₂, C₃, C₄ and C₅;
inputting the extracted feature maps C₃, C₄ and C₅ into the transformer encoder to extract a spatio-temporal feature E;
inputting the feature map C₂ and the spatio-temporal feature E into the pixel decoder, separating the spatio-temporal feature E into features E₃, E₄ and E₅ corresponding in scale to C₃, C₄ and C₅, and then progressively up-sampling and cross-fusing them to obtain the fused spatio-temporal features F₄ and F₃;
inputting the features E₅, F₄ and F₃ into the transformer decoder to obtain a final embedded vector.
2. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein training the video instance segmentation network comprises preprocessing a collected video frame sequence to generate training sample data, comprising:
taking two frame sequences with the same number of frames from the collected video frame sequence data set, one as a target set and the other as a source set;
establishing a one-to-one correspondence of image frames between the target set and the source set according to a time sequence;
and copying and pasting the instance in the source set image onto the corresponding image frame in the target set to generate a new frame sequence, and placing the new frame sequence into the video frame sequence data set.
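As an illustration outside the claim language, the copy-paste preprocessing of claim 2 can be sketched as follows. This is a minimal sketch assuming NumPy image arrays and a hypothetical helper name (`copy_paste_sequences`); the actual implementation may blend or rescale the pasted instance:

```python
import numpy as np

def copy_paste_sequences(target_frames, source_frames, source_masks):
    """Paste the masked instance from each source frame onto the
    time-aligned target frame (frame i of the source maps to frame i
    of the target), producing a new augmented frame sequence."""
    assert len(target_frames) == len(source_frames) == len(source_masks)
    out = []
    for tgt, src, m in zip(target_frames, source_frames, source_masks):
        frame = tgt.copy()          # leave the original target set intact
        frame[m] = src[m]           # overwrite target pixels under the mask
        out.append(frame)
    return out

# toy example: 2 frames of 4x4 grayscale images
T = [np.zeros((4, 4), np.uint8) for _ in range(2)]    # target set
S = [np.full((4, 4), 9, np.uint8) for _ in range(2)]  # source set
M = [np.zeros((4, 4), bool) for _ in range(2)]        # instance masks
M[0][1, 1] = M[1][2, 2] = True     # the instance moves across frames
aug = copy_paste_sequences(T, S, M)
print(aug[0][1, 1], aug[1][2, 2])  # 9 9
```

Because the instance is pasted onto corresponding frames in time order, the augmented sequence preserves a consistent cross-frame identity for the pasted instance, which is what the association training needs.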
3. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein inputting the extracted feature maps C₃, C₄ and C₅ into the transformer encoder to extract the spatio-temporal feature E comprises:
applying position encoding to the feature maps C₃, C₄ and C₅, then flattening each into a tensor and inputting them into a deformable attention module to generate a basic feature F;
applying position encoding to the feature maps C₃, C₄ and C₅ and inputting them into an S2S attention module to generate a basic spatio-temporal feature; the S2S attention module comprises an intra-scale temporal attention module and a cross-scale spatio-temporal attention module, wherein the intra-scale temporal attention module adopts a temporal attention mechanism and the cross-scale spatio-temporal attention module adopts a deformable attention mechanism;
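As an illustration outside the claim language, the intra-scale temporal attention of claim 3 can be sketched as plain dot-product attention applied along the time axis only. This is a minimal NumPy sketch with hypothetical names (`temporal_attention`), omitting learned projections, position encoding, and the deformable branch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(feats):
    """Self-attention along the time axis only, applied independently at
    each spatial position: feats has shape (T, H, W, C); every pixel
    attends to the same pixel in all T frames of its own scale."""
    T, H, W, C = feats.shape
    x = feats.transpose(1, 2, 0, 3).reshape(H * W, T, C)            # (HW, T, C)
    attn = softmax(x @ x.transpose(0, 2, 1) / np.sqrt(C), axis=-1)  # (HW, T, T)
    out = attn @ x                                                  # (HW, T, C)
    return out.reshape(H, W, T, C).transpose(2, 0, 1, 3)

feats = np.random.rand(3, 4, 4, 8)  # 3 frames, 4x4 grid, 8 channels
out = temporal_attention(feats)
print(out.shape)                    # (3, 4, 4, 8)
```

Restricting attention to the time axis within one scale keeps the cost linear in the number of pixels, while the cross-scale deformable branch handles spatial aggregation.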
4. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein inputting the feature map C₂ and the spatio-temporal feature E into the pixel decoder, separating the spatio-temporal feature E into features E₃, E₄ and E₅ corresponding in scale to C₃, C₄ and C₅, and then progressively up-sampling and cross-fusing them to obtain the fused spatio-temporal features F₄ and F₃, comprises:
separating the spatio-temporal feature E into features E₃, E₄ and E₅ corresponding in scale to C₃, C₄ and C₅;
up-sampling the feature E₅, adjusting it by bilinear interpolation to the same scale as the feature E₄, and then cross-fusing it with E₄ to generate the fused spatio-temporal feature F₄;
up-sampling the fused spatio-temporal feature F₄, adjusting it by bilinear interpolation to the same scale as the feature E₃, and then cross-fusing it with E₃ to generate the fused spatio-temporal feature F₃.
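As an illustration outside the claim language, the progressive top-down fusion of claim 4 can be sketched as follows. This is a toy NumPy sketch: nearest-neighbour upsampling stands in for the bilinear interpolation of the claim, and element-wise addition stands in for whatever learned cross-fusion the network uses:

```python
import numpy as np

def upsample2x(x):
    """2x spatial upsampling; nearest-neighbour stands in here for the
    bilinear interpolation recited in the claim."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def cross_fuse(coarse, fine):
    """Upsample the coarser feature to the finer scale, then fuse by
    element-wise addition (one simple stand-in for cross-fusion)."""
    return upsample2x(coarse) + fine

# E5 (2x2) fused into E4 (4x4) gives F4; F4 fused into E3 (8x8) gives F3
E5, E4, E3 = np.ones((2, 2)), np.ones((4, 4)), np.ones((8, 8))
F4 = cross_fuse(E5, E4)    # 4x4, values 1 + 1 = 2
F3 = cross_fuse(F4, E3)    # 8x8, values 2 + 1 = 3
print(F4.shape, F3.shape)  # (4, 4) (8, 8)
```

Each fusion step doubles the spatial resolution while injecting the finer scale's detail, so F₃ carries both coarse spatio-temporal context and fine spatial structure.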
5. The video instance segmentation method based on cross-frame instance association according to claim 1, wherein the transformer decoder comprises three decoder units corresponding to different scales and one MLP module connected in series, and inputting the features E₅, F₄ and F₃ into the transformer decoder to obtain a final embedded vector comprises:
inputting the features E₅, F₄ and F₃ into the decoder units of the corresponding scales, with the input features serving as the attention masks, keys and values of the decoder units, wherein the query features of the first decoder unit are initialized query features, and the query features of each subsequent decoder unit are the features output by the previous decoder unit;
in each decoder unit, a cross attention operation is first performed, and then a self attention operation is performed;
the query features output by the last decoder unit are passed through the MLP module to generate the final embedded vector.
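As an illustration outside the claim language, one decoder unit of claim 5 — cross-attention from the queries to a scale's pixel features followed by self-attention among the queries — can be sketched in NumPy. The helper names (`attention`, `decoder_unit`) are hypothetical and learned projections, normalization, and feed-forward layers are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; where mask is False the key is ignored."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        s = np.where(mask, s, -1e9)
    return softmax(s) @ v

def decoder_unit(queries, feats, attn_mask):
    """One decoder unit: cross-attention from the instance queries to the
    scale's pixel features first, then self-attention among the queries."""
    q = attention(queries, feats, feats, attn_mask)  # cross-attention
    return attention(q, q, q)                        # self-attention

Q = np.random.rand(10, 16)      # 10 instance queries, dimension 16
F = np.random.rand(64, 16)      # 8x8 pixel features of one scale, flattened
mask = np.ones((10, 64), bool)  # attention mask (all-on in this toy case)
out = decoder_unit(Q, F, mask)
print(out.shape)                # (10, 16)
```

Chaining three such units over the three scales, with each unit's output queries feeding the next, and passing the final queries through an MLP yields the embedded vectors described in the claim.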
6. The video instance segmentation method based on cross-frame instance association according to claim 5, wherein the three decoder units of different scales iterate a preset number of times.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310083300.XA CN116152710A (en) | 2023-02-08 | 2023-02-08 | Video instance segmentation method based on cross-frame instance association |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116152710A (en) | 2023-05-23 |
Family
ID=86340262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310083300.XA Pending CN116152710A (en) | 2023-02-08 | 2023-02-08 | Video instance segmentation method based on cross-frame instance association |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116152710A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830874A (en) * | 2024-03-05 | 2024-04-05 | Chengdu University of Technology | Remote sensing target detection method under multi-scale fuzzy boundary condition |
CN117830874B (en) * | 2024-03-05 | 2024-05-07 | Chengdu University of Technology | Remote sensing target detection method under multi-scale fuzzy boundary condition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Video super-resolution based on deep learning: a comprehensive survey | |
CN110059772B (en) | Remote sensing image semantic segmentation method based on multi-scale decoding network | |
CN107808389B (en) | Unsupervised video segmentation method based on deep learning | |
CN111652899B (en) | Video target segmentation method for space-time component diagram | |
CN107679462B (en) | Depth multi-feature fusion classification method based on wavelets | |
CN113177882B (en) | Single-frame image super-resolution processing method based on diffusion model | |
CN113837938B (en) | Super-resolution method for reconstructing potential image based on dynamic vision sensor | |
CN112396607A (en) | Streetscape image semantic segmentation method for deformable convolution fusion enhancement | |
CN108280804B (en) | Multi-frame image super-resolution reconstruction method | |
CN112232134B (en) | Human body posture estimation method based on hourglass network and attention mechanism | |
CN111276240B (en) | Multi-label multi-mode holographic pulse condition identification method based on graph convolution network | |
CN111696035A (en) | Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm | |
CN113158723A (en) | End-to-end video motion detection positioning system | |
CN115311720B (en) | Method for generating deepfake based on transducer | |
CN109949217B (en) | Video super-resolution reconstruction method based on residual learning and implicit motion compensation | |
CN116664397B (en) | TransSR-Net structured image super-resolution reconstruction method | |
CN111696038A (en) | Image super-resolution method, device, equipment and computer-readable storage medium | |
CN114170088A (en) | Relational reinforcement learning system and method based on graph structure data | |
CN114494297A (en) | Adaptive video target segmentation method for processing multiple priori knowledge | |
CN116152710A (en) | Video instance segmentation method based on cross-frame instance association | |
CN116071748A (en) | Unsupervised video target segmentation method based on frequency domain global filtering | |
Hua et al. | Dynamic scene deblurring with continuous cross-layer attention transmission | |
CN114581762A (en) | Road extraction method based on multi-scale bar pooling and pyramid pooling | |
CN117058043A (en) | Event-image deblurring method based on LSTM | |
Liu et al. | Arbitrary-scale super-resolution via deep learning: A comprehensive survey |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||