CN114943911A - Video object instance segmentation method based on improved hierarchical deep self-attention network - Google Patents
- Publication number
- CN114943911A (application number CN202111291423.XA)
- Authority
- CN
- China
- Prior art keywords
- msa
- video object
- video
- encoder
- instance segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
This application, titled "Video object instance segmentation method based on an improved hierarchical deep self-attention network," concerns instance segmentation of video objects: given a video clip, the object instances in every frame are segmented automatically. The technique makes video objects more convenient to segment by removing the need for manual intervention. Its key technical point is a hierarchical deep self-attention module, P-MSA, used at the junction of the encoder and the decoder so that encoded features are better discriminated and strengthened, ultimately yielding better video object instance segmentation.
Description
1. Technical field
Video object instance segmentation, video object segmentation, computer vision, Transformer attention mechanism, deep learning, artificial intelligence
2. Background art
2.1 General technical methods
A convolutional neural network extracts features by sliding convolution kernels over an image or over intermediate feature maps, and is a widely adopted technique.
The encoder-decoder architecture is widely used in image object segmentation and video object segmentation; methods built on it differ mainly in the structure of the encoder's feature extraction network and in how the decoder fuses features.
2.2 Similar methods
Document [1], Swin Transformer, proposes a multi-scale hierarchical Transformer model, whose structure is shown in FIG. 1; the size of the window fed to each Transformer block changes across stages in order to obtain features at different scales.
The Transformer block at each stage comprises a W-MSA (Window Multi-head Self-Attention) module with no shift and an SW-MSA (Shifted Window Multi-head Self-Attention) module shifted by half a window, as shown in FIG. 2. Within each stage, the W-MSA module is applied first and the SW-MSA module follows in series, as shown in FIG. 3.
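The window partitioning described above can be sketched in a few lines. This is a minimal NumPy illustration on a single-channel feature map; the function names are illustrative, and per-window self-attention itself is omitted — only the W-MSA layout and the SW-MSA cyclic half-window shift are shown.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W) feature map into non-overlapping win x win windows (W-MSA layout)."""
    H, W = x.shape
    return (x.reshape(H // win, win, W // win, win)
             .transpose(0, 2, 1, 3)
             .reshape(-1, win, win))

def shifted_window_partition(x, win, shift):
    """Cyclically shift the map before partitioning, as SW-MSA does."""
    return window_partition(np.roll(x, shift=(-shift, -shift), axis=(0, 1)), win)

feat = np.arange(64).reshape(8, 8)               # toy 8x8 single-channel feature map
plain = window_partition(feat, 4)                # 4 windows of size 4x4
shifted = shifted_window_partition(feat, 4, 2)   # windows after a half-window (2-pixel) shift
```

With an 8x8 map and 4x4 windows, the shifted partition lets pixels that sat on a window border in the plain partition attend to each other, which is the point of alternating the two modules.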
Document [2], STEm-Seg, proposes an encoder with a pyramid structure and a multi-scale fusion decoder; its structure is shown in FIG. 4.
3. Summary of the invention
Existing video object instance segmentation methods do not distinguish important from unimportant features well during feature extraction, so the accuracy of their segmentation results has room for improvement.
By combining a novel technique with existing methods, this application realizes automatic video object instance segmentation and improves feature extraction through an improved hierarchical deep self-attention network.
The application of the invention mainly comprises:
(1) An improved hierarchical deep self-attention module, P-MSA, which can extract a richer variety of features.
(2) Application of the P-MSA module to an existing video object instance segmentation method, combined with feature fusion, to achieve better video object instance segmentation.
4. Description of the drawings
FIG. 1 introduces the model of reference [1]; each level uses a self-attention window of a different size to extract features: small windows at the lower levels, large windows at the upper levels.
FIG. 2 illustrates the receptive-field regions and feature-extraction connection modes of the W-MSA and SW-MSA modules of reference [1]. There are two modules in total: the former divides the whole feature region into 4 equal parts along the x and y axes, while the latter is shifted by half a window along the x and y axes, forming 9 regions.
FIG. 3 is a network architecture level connection scheme illustration of the modules W-MSA, SW-MSA and others of reference [1 ].
FIG. 4 is a system architecture diagram of reference [2], upon which the method of the present application is improved.
FIG. 5 shows how the improved P-MSA module of this application differs from the W-MSA and SW-MSA modules of reference [1]. The main differences are: (1) P-MSA uses a parallel structure, whereas reference [1] connects W-MSA and SW-MSA in series; (2) by modifying how window features are computed, P-MSA computes features under multiple displacements, extracting and strengthening more kinds of features and achieving better feature expressiveness.
FIG. 6 is the system architecture of this application: the flow of the proposed video object instance segmentation method, which is based on STEm-Seg (reference [2]) with a P-MSA module added between the encoder and the decoder.
5. Detailed description of the invention
The original method of reference [1], shown in FIG. 5(a), consists of a W-MSA and an SW-MSA connected in series, with the SW-MSA shifted by half a window relative to the W-MSA. The improved P-MSA proposed here, shown in FIG. 5(b), is composed of one W-MSA and k SW-MSAs in parallel, and the branch outputs are combined by taking their arithmetic mean. Here k = (L/s) × (L/s) − 1 (an integer), where L is the window size and s is the step by which the window is shifted along the x or y axis each time.
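The parallel aggregation just described can be sketched as follows. This is a toy NumPy illustration, not the patented implementation: `window_mean` is a hypothetical stand-in for windowed self-attention (it simply averages each window), and all names are illustrative. What it does reproduce is the structure: one unshifted branch plus k = (L/s)² − 1 shifted branches, realigned and combined by an arithmetic mean.

```python
import numpy as np

def window_mean(x, L):
    """Toy stand-in for windowed self-attention: replace each L x L window by its mean."""
    H, W = x.shape
    m = x.reshape(H // L, L, W // L, L).mean(axis=(1, 3))
    return np.repeat(np.repeat(m, L, axis=0), L, axis=1)

def p_msa(x, L, s, window_op=window_mean):
    """One unshifted branch (W-MSA) plus k = (L//s)**2 - 1 shifted branches (SW-MSA),
    combined by an arithmetic mean, mirroring the parallel structure of FIG. 5(b)."""
    shifts = [(dy * s, dx * s) for dy in range(L // s) for dx in range(L // s)]
    outs = []
    for dy, dx in shifts:                       # shifts[0] == (0, 0): the W-MSA branch
        o = window_op(np.roll(x, (-dy, -dx), axis=(0, 1)), L)
        outs.append(np.roll(o, (dy, dx), axis=(0, 1)))  # undo the shift to realign branches
    return np.mean(outs, axis=0)

out = p_msa(np.arange(64, dtype=float).reshape(8, 8), L=4, s=2)
```

With L = 4 and s = 2 there are (4/2)² = 4 branches in total, i.e. k = 3 shifted branches, matching the formula above.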
On top of document [2], a P-MSA module provided by this application is attached after the convolution module that extracts each pyramid level; the output of each P-MSA module then serves as the encoder output at that scale and is fed into the decoder using the same connection mode as in document [2].
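The wiring just described can be summarized in a short sketch. This is a hypothetical skeleton, assuming each pyramid level is a (conv, pmsa) pair of callables; the toy stand-ins below only demonstrate the data flow, not real network layers.

```python
def encode_with_pmsa(x, levels):
    """levels: one (conv, pmsa) pair per pyramid scale (illustrative names).
    Each pyramid conv feeds the next scale; its P-MSA-refined output goes to the decoder."""
    decoder_inputs = []
    for conv, pmsa in levels:
        x = conv(x)                     # pyramid feature extraction at this scale
        decoder_inputs.append(pmsa(x))  # P-MSA strengthens the encoding before fusion
    return decoder_inputs

# toy stand-ins: "conv" adds 1 to its input, "pmsa" doubles it
feats = encode_with_pmsa(0, [(lambda v: v + 1, lambda v: v * 2)] * 3)
```

The decoder then fuses `decoder_inputs` across scales exactly as in document [2]; only the per-scale refinement step is new.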
The difference between this application and reference [2] is the P-MSA module added between the encoder and the decoder. Compared with reference [2], the method partitions feature regions better and predicts segmentations more accurately.
[1] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. CoRR abs/2103.14030 (2021)
[2] Ali Athar, Sabarinath Mahadevan, Aljosa Osep, Laura Leal-Taixé, Bastian Leibe: STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos. ECCV (11) 2020: 158-177
Claims (2)
1. A video object instance segmentation method based on an improved hierarchical deep self-attention network, characterized in that: multiple frames of a video clip are input; without requiring the first frame's object contour as input, an encoder-decoder framework is adopted in which the encoder uses a convolutional feature pyramid structure, and the improved hierarchical deep self-attention module P-MSA is added at the junction of the encoder and the decoder, so that the object contours in all frames are segmented automatically.
2. The hierarchical deep self-attention module P-MSA as claimed in claim 1, characterized in that: each input two-dimensional feature can be shifted by any distance within the window size along the x-axis direction, the y-axis direction, or a combination of the two, and the feature outputs under all the different shift distances are computed and combined in parallel, realizing a new mode of feature extraction and enhancement.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111291423.XA CN114943911A (en) | 2021-11-03 | 2021-11-03 | Video object instance segmentation method based on improved hierarchical deep self-attention network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114943911A (en) | 2022-08-26 |
Family
ID=82906073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111291423.XA Pending CN114943911A (en) | 2021-11-03 | 2021-11-03 | Video object instance segmentation method based on improved hierarchical deep self-attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114943911A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115439483A (en) * | 2022-11-09 | 2022-12-06 | 四川川锅环保工程有限公司 | High-quality welding seam and welding seam defect identification system, method and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Poudel et al. | Fast-scnn: Fast semantic segmentation network | |
Casser et al. | Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos | |
CN111275518B (en) | Video virtual fitting method and device based on mixed optical flow | |
Casser et al. | Unsupervised monocular depth and ego-motion learning with structure and semantics | |
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation | |
Liu et al. | A cross-modal adaptive gated fusion generative adversarial network for RGB-D salient object detection | |
Jin et al. | MoADNet: Mobile asymmetric dual-stream networks for real-time and lightweight RGB-D salient object detection | |
CN107229920B (en) | Behavior identification method based on integration depth typical time warping and related correction | |
Zhang et al. | RangeLVDet: Boosting 3D object detection in LiDAR with range image and RGB image | |
Saeedan et al. | Boosting monocular depth with panoptic segmentation maps | |
CN114943911A (en) | Video object instance segmentation method based on improved hierarchical deep self-attention network | |
Chen et al. | Denao: Monocular depth estimation network with auxiliary optical flow | |
Wang et al. | Deep unsupervised 3d sfm face reconstruction based on massive landmark bundle adjustment | |
CN106204456A (en) | Panoramic video sequences estimation is crossed the border folding searching method | |
CN117036770A (en) | Detection model training and target detection method and system based on cascade attention | |
Yi et al. | Cross-stage multi-scale interaction network for RGB-D salient object detection | |
CN107169498A (en) | It is a kind of to merge local and global sparse image significance detection method | |
CN101945299A (en) | Camera-equipment-array based dynamic scene depth restoring method | |
CN110390336B (en) | Method for improving feature point matching precision | |
CN114973305B (en) | Accurate human body analysis method for crowded people | |
CN111783497A (en) | Method, device and computer-readable storage medium for determining characteristics of target in video | |
CN115330655A (en) | Image fusion method and system based on self-attention mechanism | |
CN109801317A (en) | The image matching method of feature extraction is carried out based on convolutional neural networks | |
CN115578298A (en) | Depth portrait video synthesis method based on content perception | |
CN115131418A (en) | Monocular depth estimation algorithm based on Transformer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||