CN115797835A - Unsupervised video object segmentation algorithm based on heterogeneous Transformer - Google Patents

Unsupervised video object segmentation algorithm based on heterogeneous Transformer

Info

Publication number
CN115797835A
Authority
CN
China
Prior art keywords
transformer
fusion
feature vector
foreground
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211532178.1A
Other languages
Chinese (zh)
Inventor
王一帆
袁亦忱
卢湖川
王立君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Weishi Technology Co ltd
Dalian University of Technology
Ningbo Research Institute of Dalian University of Technology
Original Assignee
Dalian Weishi Technology Co ltd
Dalian University of Technology
Ningbo Research Institute of Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Weishi Technology Co ltd, Dalian University of Technology, Ningbo Research Institute of Dalian University of Technology
Priority to CN202211532178.1A
Publication of CN115797835A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

An unsupervised video object segmentation algorithm based on a heterogeneous Transformer designs two different Transformer-based fusion strategies for the shallow and deep stages of appearance and motion feature fusion in an unsupervised video object segmentation network: a global context-shared Transformer and a semantic aggregation-re-embedding Transformer. The global context-shared Transformer learns globally shared context information between video frames at low computational cost, while the semantic aggregation-re-embedding Transformer models the semantic relevance of foreground and background separately and further reduces computation through soft aggregation of feature vectors. Based on these two fusion modules, a hierarchical heterogeneous Transformer architecture is designed for the unsupervised video object segmentation task, achieving state-of-the-art performance at a lower computational cost.

Description

Unsupervised video object segmentation algorithm based on heterogeneous Transformer
Technical Field
The invention belongs to the fields of machine learning, semantic segmentation, and unsupervised video object segmentation, and relates to algorithms such as the Swin-Transformer feature extraction network, the SegFormer MLP segmentation head, the global context network GCNet, and the visual Transformer, and in particular to an unsupervised video object segmentation algorithm based on a heterogeneous Transformer.
Background
Semantic segmentation is one of the basic tasks in the field of computer vision and a core technology for understanding complex scenes. It is generally defined as predicting a class for each pixel, i.e., the class of the object to which the pixel belongs. FCN, the first deep-learning-based work in semantic segmentation, used fully convolutional networks and pooling operations to extract low-resolution features with deep semantic information. Subsequent work, represented by the DeepLab series and PSPNet, enhanced global spatial information by enlarging the receptive field of the neural network, thereby obtaining more accurate segmentation results. Much of the later work was inspired by non-local networks and improved segmentation performance by capturing global semantic context through attention mechanisms. Recent work has introduced visual Transformers into semantic segmentation with great success.
Unsupervised video object segmentation, a branch of the semantic segmentation task, aims to discover the most salient objects in a video sequence and can therefore be defined as a two-class video semantic segmentation problem. Unlike still-image segmentation, which relies primarily on appearance features, unsupervised video object segmentation further exploits temporal motion information to obtain reliable and temporally consistent segmentation results. Mainstream methods, such as FSNet (Full-duplex Strategy Network) proposed by Ji et al. and AMC-Net proposed by Yang et al., mainly adopt hand-designed feature fusion modules to aggregate appearance and motion information and apply the same fusion module indiscriminately to the multi-stage feature fusion process. Although these efforts have promoted the progress and development of unsupervised video object segmentation, how to design a multi-stage appearance-motion feature fusion method better suited to the task remains an open problem.
Recently, thanks to their powerful global attention modeling capability and flexibility in multi-modal fusion, visual Transformers have achieved great breakthroughs in many computer vision tasks. However, this advantage has not yet been fully explored in unsupervised video object segmentation. The baseline method of the invention uses a standard visual Transformer module as the fusion module for appearance and motion features. Preliminary experiments show that concatenating the appearance and motion features at every feature fusion stage and feeding them directly into a standard visual Transformer module achieves state-of-the-art performance, but at the cost of excessive computation and long inference time. Therefore, effectively reducing the computational cost while maintaining high accuracy is the key to successfully applying the visual Transformer to unsupervised video object segmentation.
Disclosure of Invention
To solve these problems, the invention designs two Transformer-based modules: a context-shared Transformer module and a semantic aggregation-re-embedding Transformer module. The two modules greatly reduce the computational cost while preserving the accuracy of the standard visual Transformer, so that the visual Transformer can be applied more efficiently to the unsupervised video object segmentation task. Based on the two modules, the invention provides a high-performance and lightweight heterogeneous Transformer network architecture for unsupervised video object segmentation.
The technical scheme of the invention is as follows:
an unsupervised video object segmentation algorithm based on heterogeneous transformers comprises a context-sharing Transformer module, a semantic aggregation-embedding Transformer module and a heterogeneous Transformer network architecture designed based on the two modules:
the heterogeneous transform network architecture comprises an appearance feature extraction network, a motion feature extraction network, two context-shared transform fusion modules, two semantic aggregation-re-embedding transform fusion modules and a decoder, wherein the two feature extraction networks both use Swin-Tiny, and the decoder uses a full-link-layer-based segmentation head designed in the Segformer. The two feature extraction networks respectively extract appearance and motion features of four stages, and a primary fusion feature is obtained in a mode of splicing the appearance and the motion features by channel dimensions in each stage l (l belongs to {1,2,3,4 })
Figure BDA0003974700270000034
Wherein c is l Represents the fused feature dimension of stage I, w l And h l Representing the width and height, respectively, of the fused feature resolution of the l-th stage. For convenience of presentation, the subscript l is no longer appended to the formula below after the feature fusion stage l is specified.
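By way of a non-limiting illustration (the module name StageFusion and all hyper-parameters below are assumptions for this sketch, not the invention's reference code), the per-stage fusion feature can be formed in PyTorch by concatenating the appearance and motion feature maps along the channel dimension and projecting them with a 1 × 1 convolution:

```python
import torch
import torch.nn as nn

class StageFusion(nn.Module):
    """Hypothetical sketch: concatenate the appearance and motion features of one
    stage along the channel dimension and project them with a 1x1 convolution."""

    def __init__(self, app_channels: int, mot_channels: int, fused_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(app_channels + mot_channels, fused_channels, kernel_size=1)

    def forward(self, app_feat: torch.Tensor, mot_feat: torch.Tensor) -> torch.Tensor:
        # app_feat: (B, C_app, H, W), mot_feat: (B, C_mot, H, W) from the same stage l
        x = torch.cat([app_feat, mot_feat], dim=1)   # channel-dimension concatenation
        return self.proj(x)                          # preliminary fusion feature X_l

# Usage with Swin-Tiny stage-1 dimensions (96 channels per branch):
if __name__ == "__main__":
    fuse = StageFusion(96, 96, 96)
    a, m = torch.randn(1, 96, 96, 96), torch.randn(1, 96, 96, 96)
    print(fuse(a, m).shape)  # torch.Size([1, 96, 96, 96])
```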
The standard visual Transformer module consists of a multi-head attention module with a residual structure and a feed-forward neural network module with a residual structure. The context-shared Transformer module simplifies the multi-head attention computation of the standard visual Transformer module through global context modeling, computing for all query feature vectors a single weight map that is shared and independent of the queries. Global context modeling comprises a spatial attention computation and a channel attention computation, both independent of the query feature vectors. Specifically, given the fusion feature of a shallow stage l (l ∈ {1,2}), $X \in \mathbb{R}^{c \times h \times w}$, which simultaneously serves as the query feature vectors for global context modeling, a single-channel attention weight map $A \in \mathbb{R}^{1 \times h \times w}$ is first generated by a 1 × 1 convolution and a SoftMax function. Weighting the fusion feature X with this map yields a weighted representation shared by all query feature vectors, $W_g \in \mathbb{R}^{c \times 1 \times 1}$. To further model the correlation between channels, two channel attention modules, each consisting of a 1 × 1 convolution, batch normalization, and a ReLU function, refine the weighted representation $W_g$. After global context modeling, the global context information and the fusion feature X are aggregated through the residual structure. The output of the global context modeling is then fed into the residual feed-forward neural network module of the standard visual Transformer to obtain the final fusion feature $S \in \mathbb{R}^{c \times h \times w}$.
Although algorithmically simpler, the context-shared Transformer module markedly accelerates inference relative to the standard Transformer (from 3 frames per second to 36 frames per second) without sacrificing its high performance.
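The following PyTorch sketch shows one plausible reading of the context-shared Transformer described above: a GCNet-style global context block (a 1 × 1 convolution and SoftMax producing a query-independent spatial weight map, a weighted sum over positions, and two conv-BN-ReLU channel attention groups) followed by the residual feed-forward module of a standard visual Transformer. The class name, hidden sizes, and normalization placement are illustrative assumptions rather than the patent's reference implementation:

```python
import torch
import torch.nn as nn

class ContextSharedTransformer(nn.Module):
    """Sketch of the context-shared Transformer fusion module (assumed layout)."""

    def __init__(self, channels: int, ffn_ratio: int = 4):
        super().__init__()
        # Spatial attention: one weight map shared by all query positions.
        self.weight_map = nn.Conv2d(channels, 1, kernel_size=1)
        # Channel attention: two conv-BN-ReLU groups refining the context vector
        # (batch statistics assume batch size > 1 during training).
        self.channel_attn = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        # Residual feed-forward module of a standard visual Transformer.
        self.norm = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(
            nn.Linear(channels, channels * ffn_ratio), nn.GELU(),
            nn.Linear(channels * ffn_ratio, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Query-independent spatial weights over the h*w positions.
        attn = self.weight_map(x).view(b, 1, h * w).softmax(dim=-1)      # (B, 1, N)
        context = torch.bmm(x.view(b, c, h * w), attn.transpose(1, 2))   # (B, C, 1)
        context = self.channel_attn(context.view(b, c, 1, 1))            # (B, C, 1, 1)
        x = x + context                                                  # residual aggregation
        # Token-wise residual FFN.
        tokens = x.flatten(2).transpose(1, 2)                            # (B, N, C)
        tokens = tokens + self.ffn(self.norm(tokens))
        return tokens.transpose(1, 2).view(b, c, h, w)
```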
The core idea of the semantic aggregation-re-embedding Transformer is to model the semantic relevance of the foreground and the background separately while reducing the computational cost. The module comprises two parallel, symmetric branches that process the foreground and background features respectively; each branch mainly consists of query feature vector selection, key-value feature vector soft aggregation, correlation modeling, and query feature vector re-embedding. The query feature vectors and key-value feature vectors are the inputs of the standard visual Transformer module.
For the fusion feature of a deep stage l (l ∈ {3,4}), $X \in \mathbb{R}^{n \times c}$ with n = h × w, a single-channel feature vector selection heat map $H \in \mathbb{R}^{n \times 1}$ is first generated by a 1 × 1 convolution and a Sigmoid function. Based on this heat map, the query feature vectors belonging to the foreground, $X_F = X[H_i \ge F_{th}]$, and the query feature vectors belonging to the background, $X_B = X[H_i < B_{th}]$, are selected respectively, where $F_{th}$ and $B_{th}$ are two thresholds that determine the foreground and background selection, and $H_i$ is the value of the heat map at position i.
Taking the foreground branch as an example, in order to obtain key-value feature vectors in which the foreground is highlighted, the fusion feature X and the heat map H are first multiplied element-wise to obtain a mask-enhanced foreground feature vector sequence $X_e \in \mathbb{R}^{n \times c}$. A feature vector soft aggregation mechanism is then employed to obtain a more compact compressed representation $X_{ce} \in \mathbb{R}^{k \times c}$ with k < n; this mechanism compresses the foreground feature vector sequence by learning a set of token transformation matrices $T \in \mathbb{R}^{k \times n}$. The foreground query feature vectors $X_F$ and the foreground-enhanced compressed key-value feature vectors $X_{ce}$ are then fed into a standard visual Transformer for attention computation, which models and enhances semantic relevance and updates the corresponding semantic representations. The background branch follows the same procedure as the foreground branch. This approach greatly reduces the computational cost. The output of the visual Transformer module is then embedded back into the original fusion feature X according to the indices recorded in the query feature vector selection stage, and the final fusion feature S is obtained.
In a practical implementation, k is set to n/9 in the key-value feature vector aggregation stage, which reduces the cost of the multi-head attention computation to 10/81 of the original without affecting performance.
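As a hedged illustration of the semantic aggregation-re-embedding Transformer, the sketch below processes the tokens of a single deep-stage sample: a heat map selects foreground and background queries by thresholding, an input-conditioned (TokenLearner-style) projection stands in for the learned token transformation matrices that softly aggregate the key-value tokens to k entries, and the attention outputs are written back at the selected indices. The class and parameter names, the use of a Linear layer in place of the 1 × 1 convolution on flattened tokens, and the choice of X · (1 − H) as the background-enhanced sequence are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class SemanticAggregationReembedding(nn.Module):
    """Illustrative single-sample sketch of the semantic aggregation-re-embedding
    Transformer; interfaces and hyper-parameters are assumptions."""

    def __init__(self, channels: int, num_tokens: int, num_heads: int = 8,
                 fg_th: float = 0.5, bg_th: float = 0.5):
        super().__init__()
        # Linear + Sigmoid stands in for the 1x1 convolution on flattened tokens.
        self.heat = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())
        self.fg_th, self.bg_th = fg_th, bg_th
        # Input-conditioned soft aggregation to k key-value tokens per branch.
        self.fg_aggregate = nn.Linear(channels, num_tokens)
        self.bg_aggregate = nn.Linear(channels, num_tokens)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                 nn.Linear(4 * channels, channels))

    def _branch(self, queries: torch.Tensor, enhanced: torch.Tensor,
                aggregate: nn.Linear) -> torch.Tensor:
        # Soft aggregation: (n, c) mask-enhanced tokens -> (k, c) compressed key-values.
        weights = aggregate(enhanced).softmax(dim=0)          # (n, k), sums to 1 over positions
        kv = weights.transpose(0, 1) @ enhanced               # (k, c)
        out, _ = self.attn(queries.unsqueeze(0), kv.unsqueeze(0), kv.unsqueeze(0))
        out = out.squeeze(0) + queries                        # residual connection
        return out + self.ffn(self.norm(out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n, c) fused tokens of one deep-stage feature map, n = h * w.
        h = self.heat(x)                                                  # (n, 1) heat map
        fg_idx = (h.squeeze(-1) >= self.fg_th).nonzero(as_tuple=True)[0]
        bg_idx = (h.squeeze(-1) < self.bg_th).nonzero(as_tuple=True)[0]
        out = x.clone()
        if fg_idx.numel() > 0:                                # foreground branch: X * H
            out[fg_idx] = self._branch(x[fg_idx], x * h, self.fg_aggregate)
        if bg_idx.numel() > 0:                                # background branch: X * (1 - H), assumed
            out[bg_idx] = self._branch(x[bg_idx], x * (1.0 - h), self.bg_aggregate)
        return out                                            # updated tokens re-embedded in place
```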
Compared with the baseline method, which fuses multi-scale features indiscriminately with a standard visual Transformer, the heterogeneous Transformer network architecture raises the inference speed from 3 frames per second to 39 frames per second while also improving segmentation performance, thus meeting the high-accuracy and real-time requirements of unsupervised video object segmentation.
The beneficial effects of the invention are as follows:
(1) Appearance and motion information is fused at multiple levels by a Transformer-based framework (comprising feature extraction networks and a feature fusion network), achieving higher accuracy than unsupervised video object segmentation networks based on convolutional neural networks.
(2) Heterogeneous feature fusion strategies are used in the shallow and deep fusion stages, addressing the requirements and characteristics of feature fusion at different levels in a more targeted way. The heterogeneous fusion strategy not only exploits the high accuracy of the Transformer structure but also makes the network lighter, thereby simultaneously meeting the high-accuracy and real-time requirements of the unsupervised video object segmentation task.
Drawings
Fig. 1 is the algorithm flowchart of the heterogeneous Transformer network for unsupervised video object segmentation.
Fig. 2 is the algorithm flowchart of the context-shared Transformer and the semantic aggregation-re-embedding Transformer.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
Fig. 1 is the algorithm flowchart of the heterogeneous Transformer network for unsupervised video object segmentation. The heterogeneous Transformer comprises an appearance feature extraction network, a motion feature extraction network, four multi-scale fusion modules, and a decoder. Given a video frame and an optical flow map computed from the current frame and an adjacent frame, the appearance and motion feature extraction networks extract multi-stage appearance and motion features, respectively. The multi-stage fusion modules receive the multi-stage appearance and motion features and feed their outputs to a multi-scale feature fusion decoder. The invention adopts the Swin-Tiny backbone as the feature extraction network and the lightweight SegFormer MLP head as the multi-scale fusion feature decoder.
Fig. 2 is the algorithm flowchart of the context-shared Transformer and the semantic aggregation-re-embedding Transformer. The appearance and motion features of the four stages, extracted by the appearance and motion feature extraction networks respectively, are first fused along the channel dimension with a 1 × 1 convolution to obtain preliminary fusion features $X_l \in \mathbb{R}^{c_l \times h_l \times w_l}$, l ∈ {1,2,3,4}. The fusion features of the first two stages are fed into the context-shared Transformer, and the fusion features of the last two stages are fed into the semantic aggregation-re-embedding Transformer, yielding the outputs $S_l$, l ∈ {1,2} and $S_l$, l ∈ {3,4}, respectively.
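A wiring sketch of the overall heterogeneous Transformer is given below. The interfaces and component names are assumed; the backbones, per-stage fusion layers, fusion modules, and decoder are injected as placeholders rather than reproduced from the patent:

```python
import torch
import torch.nn as nn
from typing import List

class HeterogeneousTransformerNet(nn.Module):
    """Wiring sketch with assumed interfaces: both backbones return four-stage
    feature lists, stages 1-2 use context-shared fusion, stages 3-4 use semantic
    aggregation-re-embedding fusion, and a decoder merges the four outputs."""

    def __init__(self,
                 appearance_backbone: nn.Module,   # frame -> [f1, f2, f3, f4]
                 motion_backbone: nn.Module,       # optical flow map -> [m1, m2, m3, m4]
                 fuse_layers: List[nn.Module],     # per-stage 1x1 convolution fusion
                 shallow_fusion: List[nn.Module],  # two context-shared Transformers
                 deep_fusion: List[nn.Module],     # two semantic aggregation-re-embedding Transformers
                 decoder: nn.Module):              # multi-scale features -> mask logits
        super().__init__()
        self.appearance_backbone = appearance_backbone
        self.motion_backbone = motion_backbone
        self.fuse_layers = nn.ModuleList(fuse_layers)
        self.shallow_fusion = nn.ModuleList(shallow_fusion)
        self.deep_fusion = nn.ModuleList(deep_fusion)
        self.decoder = decoder

    def forward(self, frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        app_feats = self.appearance_backbone(frame)   # four appearance stages
        mot_feats = self.motion_backbone(flow)        # four motion stages
        fused = [fuse(torch.cat([a, m], dim=1))
                 for fuse, a, m in zip(self.fuse_layers, app_feats, mot_feats)]
        outs = [self.shallow_fusion[i](fused[i]) for i in range(2)]        # stages 1-2
        outs += [self.deep_fusion[i - 2](fused[i]) for i in range(2, 4)]   # stages 3-4
        return self.decoder(outs)                     # binary segmentation logits
```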
The invention uses a pre-trained RAFT model to generate optical flow maps for the video data. All input pictures are scaled to a spatial resolution of 512 × 512. In the model training phase, data augmentation consisting of random horizontal flipping and random photometric distortion is used to enhance the generalization of the model, and the model is trained end to end with an AdamW optimizer at a fixed learning rate of 6e-5 and a binary cross-entropy loss function. The model is pre-trained on the YouTube-VOS dataset for 300 epochs and fine-tuned on the DAVIS-2016 and FBMS datasets for 100 epochs. The binarization threshold of the segmentation result is set to 0.5. The model is trained on 4 NVIDIA 3090 GPUs with a batch size of 8.
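A minimal training-loop sketch under the settings listed above (AdamW at a fixed learning rate of 6e-5, binary cross-entropy loss, 512 × 512 inputs, batch size 8); the dataset object, model interface, and RAFT-based optical flow pre-computation are placeholders, not the patent's code:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

def train(model: nn.Module, dataset: Dataset, epochs: int, device: str = "cuda") -> None:
    """Minimal end-to-end training sketch; `dataset` is assumed to yield
    (frame, flow, mask) tensors already resized to 512x512, with the optical
    flow maps pre-computed by RAFT."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)  # fixed learning rate
    criterion = nn.BCEWithLogitsLoss()                          # binary cross-entropy on logits
    model.to(device).train()
    for _ in range(epochs):
        for frame, flow, mask in loader:
            frame, flow, mask = frame.to(device), flow.to(device), mask.to(device)
            logits = model(frame, flow)              # (B, 1, 512, 512) segmentation logits
            loss = criterion(logits, mask.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Pre-train for 300 epochs on YouTube-VOS, then fine-tune for 100 epochs on
# DAVIS-2016 and FBMS; at evaluation, binarize with a 0.5 threshold:
#   pred_mask = torch.sigmoid(logits) > 0.5
```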
The feature extraction network structure is as follows:
operation of Down sampling ratio Attention head count Dimension (d) of Input size Output size
Swin Stage1 3 96 384×384 96×96
Swin Stage2 6 192 96×96 48×48
Swin Stage3 16× 12 384 48×48 24×24
Swin Stage4 32× 24 768 24×24 12×12

Claims (1)

1. An unsupervised video object segmentation algorithm based on a heterogeneous Transformer, characterized in that the algorithm comprises a context-shared Transformer module, a semantic aggregation-re-embedding Transformer module, and a heterogeneous Transformer network architecture designed based on the two modules:
the heterogeneous Transformer network architecture comprises an appearance feature extraction network, a motion feature extraction network, two context-shared Transformer fusion modules, two semantic aggregation-re-embedding Transformer fusion modules, and a decoder, wherein the appearance feature extraction network and the motion feature extraction network both use Swin-Tiny, and the decoder uses the fully-connected-layer-based segmentation head designed in SegFormer; the appearance feature extraction network and the motion feature extraction network respectively extract appearance features and motion features at four stages, and at each stage a preliminary fusion feature $X_l \in \mathbb{R}^{c_l \times h_l \times w_l}$ is obtained by concatenating the appearance features and motion features along the channel dimension, where $c_l$ denotes the fused feature dimension of stage l, and $w_l$ and $h_l$ respectively denote the width and height of the fused feature resolution at stage l, with l ∈ {1,2,3,4};
the standard vision Transformer module mainly comprises a multi-head attention calculation module with a residual structure and a feedforward neural network module with a residual structure; the context-sharing Transformer fusion module simplifies the multi-head attention calculation module of the residual error structure in the standard vision Transformer module by global context modeling, and the multi-head attention calculation module is used for all the queriesThe feature vector calculates a weight map which is shared and independent of the query feature vector; the global context modeling comprises a spatial attention calculation and a channel attention calculation which are independent of the query feature vector; in particular, the fusion characteristics at the stage of obtaining the shallow layer l
Figure FDA0003974700260000012
Then, l belongs to {1,2}, the fusion feature of the shallow stage l is simultaneously used as the query feature vector of the global context modeling, and a single-channel attention weight graph is generated by a 1 × 1 convolution layer and a SoftMax function
Figure FDA0003974700260000013
The single-channel attention weight map obtains a weighted representation shared by the query feature vectors through the fusion feature X of the shallow stage l
Figure FDA0003974700260000014
To further model the correlation between channels, W was characterized using two sets of channel attention module pairs consisting of 1 × 1 convolutional layers, batch normalization, and ReLU functions g Performing tuning; aggregating global context information and fusion features X of the shallow stage l using a residual structure after global context modeling; the output of the global context modeling is sent to a feedforward neural network module of a residual error structure of a standard visual Transformer module to obtain the final fusion characteristics
Figure FDA0003974700260000021
the semantic aggregation-re-embedding Transformer fusion module models the feature correlation of the foreground and the background separately and comprises two parallel branches that respectively process foreground and background features, wherein each branch comprises query feature vector selection, key-value feature vector soft aggregation, correlation modeling, and query feature vector re-embedding; the query feature vectors and key-value feature vectors are the inputs of the standard visual Transformer module; for the fusion feature of a deep stage l, $X \in \mathbb{R}^{n \times c}$ with l ∈ {3,4}, a single-channel feature vector selection heat map $H \in \mathbb{R}^{n \times 1}$ is first generated by a 1 × 1 convolutional layer and a Sigmoid function; based on the feature vector selection heat map, the query feature vectors belonging to the foreground, $X_F = X[H_i \ge F_{th}]$, and the query feature vectors belonging to the background, $X_B = X[H_i < B_{th}]$, are selected respectively, where $F_{th}$ and $B_{th}$ are two thresholds that determine the foreground and background selection, and $H_i$ is the value of the heat map at position i;
the correlation computation of the foreground branch is as follows: in order to obtain key-value feature vector pairs in which the foreground is highlighted, the fusion feature X of the deep stage l and the heat map H are first multiplied element-wise to obtain a mask-enhanced foreground feature vector sequence $X_e \in \mathbb{R}^{n \times c}$, n = h × w; a feature vector soft aggregation mechanism is then employed to obtain a more compact compressed representation $X_{ce} \in \mathbb{R}^{k \times c}$, k < n, which compresses the foreground feature vector sequence by learning a set of token transformation matrices $T \in \mathbb{R}^{k \times n}$; the foreground query feature vectors $X_F$ and the foreground-enhanced compressed key-value feature vectors $X_{ce}$ are then fed into a standard visual Transformer for attention computation to model and enhance semantic relevance and update the corresponding semantic representations; owing to the symmetry of the foreground and background branches, the flow of the background branch is consistent with that of the foreground branch; the outputs of the visual Transformer module are then embedded back into the original fusion feature X according to the indices recorded in the query feature vector selection stage, and the final fusion feature S is obtained.
CN202211532178.1A 2022-12-01 2022-12-01 Unsupervised video object segmentation algorithm based on heterogeneous Transformer Pending CN115797835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211532178.1A CN115797835A (en) 2022-12-01 2022-12-01 Unsupervised video object segmentation algorithm based on heterogeneous Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211532178.1A CN115797835A (en) 2022-12-01 2022-12-01 Unsupervised video object segmentation algorithm based on heterogeneous Transformer

Publications (1)

Publication Number Publication Date
CN115797835A true CN115797835A (en) 2023-03-14

Family

ID=85444630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211532178.1A Pending CN115797835A (en) 2022-12-01 2022-12-01 Unsupervised video object segmentation algorithm based on heterogeneous Transformer

Country Status (1)

Country Link
CN (1) CN115797835A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452931A (en) * 2023-04-11 2023-07-18 北京科技大学 Hierarchical sensitive image feature aggregation method
CN116452931B (en) * 2023-04-11 2024-03-19 北京科技大学 Hierarchical sensitive image feature aggregation method
CN116884067A (en) * 2023-07-12 2023-10-13 成都信息工程大学 Micro-expression recognition method based on improved implicit semantic data enhancement
CN116884067B (en) * 2023-07-12 2024-06-14 成都信息工程大学 Micro-expression recognition method based on improved implicit semantic data enhancement

Similar Documents

Publication Publication Date Title
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
WO2022252272A1 (en) Transfer learning-based method for improved vgg16 network pig identity recognition
CN111091045B (en) Sign language identification method based on space-time attention mechanism
US11810359B2 (en) Video semantic segmentation method based on active learning
CN109886225B (en) Image gesture action online detection and recognition method based on deep learning
CN111340814B (en) RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN115797835A (en) Unsupervised video object segmentation algorithm based on heterogeneous Transformer
CN110458085B (en) Video behavior identification method based on attention-enhanced three-dimensional space-time representation learning
Hara et al. Towards good practice for action recognition with spatiotemporal 3d convolutions
WO2023065759A1 (en) Video action recognition method based on spatial-temporal enhanced network
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN111062395A (en) Real-time video semantic segmentation method
CN115393396B (en) Unmanned aerial vehicle target tracking method based on mask pre-training
CN116168197A (en) Image segmentation method based on Transformer segmentation network and regularization training
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
Zhang et al. FCHP: Exploring the discriminative feature and feature correlation of feature maps for hierarchical DNN pruning and compression
US11881020B1 (en) Method for small object detection in drone scene based on deep learning
Wang et al. Exploring fine-grained sparsity in convolutional neural networks for efficient inference
Jiao et al. Realization and improvement of object recognition system on raspberry pi 3b+
CN115587628A (en) Deep convolutional neural network lightweight method
CN111881794B (en) Video behavior recognition method and system
Liang et al. Semi-supervised video object segmentation based on local and global consistency learning
CN114140667A (en) Small sample rapid style migration method based on deep convolutional neural network
CN114494284A (en) Scene analysis model and method based on explicit supervision area relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination