CN115797835A - Unsupervised video object segmentation algorithm based on heterogeneous Transformer - Google Patents

Unsupervised video object segmentation algorithm based on heterogeneous Transformer

Info

Publication number
CN115797835A
Authority
CN
China
Prior art keywords
transformer
fusion
feature vector
foreground
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211532178.1A
Other languages
Chinese (zh)
Inventor
王一帆
袁亦忱
卢湖川
王立君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Weishi Technology Co ltd
Dalian University of Technology
Ningbo Research Institute of Dalian University of Technology
Original Assignee
Dalian Weishi Technology Co ltd
Dalian University of Technology
Ningbo Research Institute of Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Weishi Technology Co ltd, Dalian University of Technology, Ningbo Research Institute of Dalian University of Technology
Priority to CN202211532178.1A
Publication of CN115797835A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

An unsupervised video object segmentation algorithm based on a heterogeneous Transformer designs two different Transformer-based fusion strategies for the shallow and deep stages of appearance and motion feature fusion in an unsupervised video object segmentation network: a global context-shared Transformer and a semantic aggregation-re-embedding Transformer. The global context-shared Transformer learns globally shared context information between video frames at low computational cost, while the semantic aggregation-re-embedding Transformer models the semantic relevance of foreground and background separately and further reduces computation through soft aggregation of feature vectors. Based on these two fusion modules, a hierarchical heterogeneous Transformer architecture is designed for the unsupervised video object segmentation task, achieving state-of-the-art performance at a lower computational cost.

Description

Unsupervised video object segmentation algorithm based on heterogeneous Transformer
Technical Field
The invention belongs to the fields of machine learning, semantic segmentation, and unsupervised video object segmentation, and relates to algorithms such as the Swin-Transformer feature extraction network, the SegFormer MLP segmentation head, the global context network GCNet, and the visual Transformer, and in particular to an unsupervised video object segmentation algorithm based on a heterogeneous Transformer.
Background
Semantic segmentation is one of the basic tasks in the field of computer vision and a core technology for understanding complex scenes. It is generally defined as predicting a class for each pixel, i.e., the class of the object to which the pixel belongs. FCN, the first deep-learning-based work in semantic segmentation, used fully convolutional networks and pooling operations to extract low-resolution features with deep semantic information. Subsequent work, represented by the DeepLab series and PSPNet, enhanced global spatial information by enlarging the receptive field of the neural network, thereby obtaining more accurate segmentation results. Much of the later work was inspired by non-local networks and improved segmentation performance by capturing global semantic context through attention mechanisms. Recent work has introduced visual Transformers into semantic segmentation with great success.
Unsupervised video object segmentation, a branch of the semantic segmentation task, aims to discover the most salient objects in a video sequence and can therefore be defined as a two-class video semantic segmentation problem. Unlike still-image segmentation, which relies primarily on appearance features, unsupervised video object segmentation further exploits temporal motion information to obtain reliable and temporally consistent segmentation results. Mainstream methods, such as FSNet (Full-duplex Strategy Network) proposed by Ji et al. and AMC-Net proposed by Yang et al., mainly adopt hand-designed feature fusion modules to aggregate appearance and motion information and apply the same fusion module indiscriminately to the multi-stage feature fusion process. Although these efforts have promoted the progress and development of unsupervised video object segmentation, how to design a multi-stage appearance-motion feature fusion method better suited to the task remains an open problem.
Recently, thanks to their powerful global attention modeling capability and flexibility in multi-modal fusion, visual Transformers have achieved great breakthroughs in many computer vision tasks. However, this advantage has not yet been fully explored in unsupervised video object segmentation. The baseline method of the invention uses a standard visual Transformer module as the fusion module for appearance and motion features. Preliminary experiments show that concatenating the appearance and motion features at every feature fusion stage and feeding them directly into a standard visual Transformer module achieves state-of-the-art performance, but at the cost of excessive computation and long inference time. Therefore, effectively reducing the computational cost while maintaining high accuracy is the key to successfully applying the visual Transformer to unsupervised video object segmentation.
Disclosure of Invention
To solve these problems, the invention designs two Transformer-based modules: a context-shared Transformer module and a semantic aggregation-re-embedding Transformer module. The two modules greatly reduce the computational cost while preserving the accuracy of the standard visual Transformer, so that the visual Transformer can be applied more efficiently to the unsupervised video object segmentation task. Based on the two modules, the invention provides a high-performance and lightweight heterogeneous Transformer network architecture for unsupervised video object segmentation.
The technical scheme of the invention is as follows:
an unsupervised video object segmentation algorithm based on heterogeneous transformers comprises a context-sharing Transformer module, a semantic aggregation-embedding Transformer module and a heterogeneous Transformer network architecture designed based on the two modules:
the heterogeneous transform network architecture comprises an appearance feature extraction network, a motion feature extraction network, two context-shared transform fusion modules, two semantic aggregation-re-embedding transform fusion modules and a decoder, wherein the two feature extraction networks both use Swin-Tiny, and the decoder uses a full-link-layer-based segmentation head designed in the Segformer. The two feature extraction networks respectively extract appearance and motion features of four stages, and a primary fusion feature is obtained in a mode of splicing the appearance and the motion features by channel dimensions in each stage l (l belongs to {1,2,3,4 })
Figure BDA0003974700270000034
Wherein c is l Represents the fused feature dimension of stage I, w l And h l Representing the width and height, respectively, of the fused feature resolution of the l-th stage. For convenience of presentation, the subscript l is no longer appended to the formula below after the feature fusion stage l is specified.
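By way of a non-limiting illustration (the module name StageFusion and all hyper-parameters below are assumptions for this sketch, not the invention's reference code), the per-stage fusion feature can be formed in PyTorch by concatenating the appearance and motion feature maps along the channel dimension and projecting them with a 1 × 1 convolution:

```python
import torch
import torch.nn as nn

class StageFusion(nn.Module):
    """Hypothetical sketch: concatenate the appearance and motion features of one
    stage along the channel dimension and project them with a 1x1 convolution."""

    def __init__(self, app_channels: int, mot_channels: int, fused_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(app_channels + mot_channels, fused_channels, kernel_size=1)

    def forward(self, app_feat: torch.Tensor, mot_feat: torch.Tensor) -> torch.Tensor:
        # app_feat: (B, C_app, H, W), mot_feat: (B, C_mot, H, W) from the same stage l
        x = torch.cat([app_feat, mot_feat], dim=1)   # channel-dimension concatenation
        return self.proj(x)                          # preliminary fusion feature X_l

# Usage with Swin-Tiny stage-1 dimensions (96 channels per branch):
if __name__ == "__main__":
    fuse = StageFusion(96, 96, 96)
    a, m = torch.randn(1, 96, 96, 96), torch.randn(1, 96, 96, 96)
    print(fuse(a, m).shape)  # torch.Size([1, 96, 96, 96])
```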
The standard visual Transformer module consists of a multi-head attention module with a residual structure and a feed-forward neural network module with a residual structure. The context-shared Transformer module simplifies the multi-head attention computation of the standard visual Transformer module through global context modeling, computing for all query feature vectors a single weight map that is shared and independent of the queries. Global context modeling comprises a spatial attention computation and a channel attention computation, both independent of the query feature vectors. Specifically, given the fusion feature of a shallow stage l (l ∈ {1,2}), $X \in \mathbb{R}^{c \times h \times w}$, which simultaneously serves as the query feature vectors for global context modeling, a single-channel attention weight map $A \in \mathbb{R}^{1 \times h \times w}$ is first generated by a 1 × 1 convolution and a SoftMax function. Weighting the fusion feature X with this map yields a weighted representation shared by all query feature vectors, $W_g \in \mathbb{R}^{c \times 1 \times 1}$. To further model the correlation between channels, two channel attention modules, each consisting of a 1 × 1 convolution, batch normalization, and a ReLU function, refine the weighted representation $W_g$. After global context modeling, the global context information and the fusion feature X are aggregated through the residual structure. The output of the global context modeling is then fed into the residual feed-forward neural network module of the standard visual Transformer to obtain the final fusion feature $S \in \mathbb{R}^{c \times h \times w}$.
Although algorithmically simpler, the context-shared Transformer module markedly accelerates inference relative to the standard Transformer (from 3 frames per second to 36 frames per second) without sacrificing its high performance.
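The following PyTorch sketch shows one plausible reading of the context-shared Transformer described above: a GCNet-style global context block (a 1 × 1 convolution and SoftMax producing a query-independent spatial weight map, a weighted sum over positions, and two conv-BN-ReLU channel attention groups) followed by the residual feed-forward module of a standard visual Transformer. The class name, hidden sizes, and normalization placement are illustrative assumptions rather than the patent's reference implementation:

```python
import torch
import torch.nn as nn

class ContextSharedTransformer(nn.Module):
    """Sketch of the context-shared Transformer fusion module (assumed layout)."""

    def __init__(self, channels: int, ffn_ratio: int = 4):
        super().__init__()
        # Spatial attention: one weight map shared by all query positions.
        self.weight_map = nn.Conv2d(channels, 1, kernel_size=1)
        # Channel attention: two conv-BN-ReLU groups refining the context vector
        # (batch statistics assume batch size > 1 during training).
        self.channel_attn = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        # Residual feed-forward module of a standard visual Transformer.
        self.norm = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(
            nn.Linear(channels, channels * ffn_ratio), nn.GELU(),
            nn.Linear(channels * ffn_ratio, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Query-independent spatial weights over the h*w positions.
        attn = self.weight_map(x).view(b, 1, h * w).softmax(dim=-1)      # (B, 1, N)
        context = torch.bmm(x.view(b, c, h * w), attn.transpose(1, 2))   # (B, C, 1)
        context = self.channel_attn(context.view(b, c, 1, 1))            # (B, C, 1, 1)
        x = x + context                                                  # residual aggregation
        # Token-wise residual FFN.
        tokens = x.flatten(2).transpose(1, 2)                            # (B, N, C)
        tokens = tokens + self.ffn(self.norm(tokens))
        return tokens.transpose(1, 2).view(b, c, h, w)
```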
The core idea of the semantic aggregation-re-embedding Transformer is to model the semantic relevance of the foreground and the background separately while reducing the computational cost. The module comprises two parallel, symmetric branches that process the foreground and background features respectively; each branch mainly consists of query feature vector selection, key-value feature vector soft aggregation, correlation modeling, and query feature vector re-embedding. The query feature vectors and key-value feature vectors are the inputs of the standard visual Transformer module.
For the fusion feature of a deep stage l (l ∈ {3,4}), $X \in \mathbb{R}^{n \times c}$ with n = h × w, a single-channel feature vector selection heat map $H \in \mathbb{R}^{n \times 1}$ is first generated by a 1 × 1 convolution and a Sigmoid function. Based on this heat map, the query feature vectors belonging to the foreground, $X_F = X[H_i \ge F_{th}]$, and the query feature vectors belonging to the background, $X_B = X[H_i < B_{th}]$, are selected respectively, where $F_{th}$ and $B_{th}$ are two thresholds that determine the foreground and background selection, and $H_i$ is the value of the heat map at position i.
Taking the foreground branch as an example, in order to obtain key-value feature vectors in which the foreground is highlighted, the fusion feature X and the heat map H are first multiplied element-wise to obtain a mask-enhanced foreground feature vector sequence $X_e \in \mathbb{R}^{n \times c}$. A feature vector soft aggregation mechanism is then employed to obtain a more compact compressed representation $X_{ce} \in \mathbb{R}^{k \times c}$ with k < n; this mechanism compresses the foreground feature vector sequence by learning a set of token transformation matrices $T \in \mathbb{R}^{k \times n}$. The foreground query feature vectors $X_F$ and the foreground-enhanced compressed key-value feature vectors $X_{ce}$ are then fed into a standard visual Transformer for attention computation, which models and enhances semantic relevance and updates the corresponding semantic representations. The background branch follows the same procedure as the foreground branch. This approach greatly reduces the computational cost. The output of the visual Transformer module is then embedded back into the original fusion feature X according to the indices recorded in the query feature vector selection stage, and the final fusion feature S is obtained.
In a practical implementation, k is set to n/9 in the key-value feature vector aggregation stage, which reduces the cost of the multi-head attention computation to 10/81 of the original without affecting performance.
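As a hedged illustration of the semantic aggregation-re-embedding Transformer, the sketch below processes the tokens of a single deep-stage sample: a heat map selects foreground and background queries by thresholding, an input-conditioned (TokenLearner-style) projection stands in for the learned token transformation matrices that softly aggregate the key-value tokens to k entries, and the attention outputs are written back at the selected indices. The class and parameter names, the use of a Linear layer in place of the 1 × 1 convolution on flattened tokens, and the choice of X · (1 − H) as the background-enhanced sequence are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class SemanticAggregationReembedding(nn.Module):
    """Illustrative single-sample sketch of the semantic aggregation-re-embedding
    Transformer; interfaces and hyper-parameters are assumptions."""

    def __init__(self, channels: int, num_tokens: int, num_heads: int = 8,
                 fg_th: float = 0.5, bg_th: float = 0.5):
        super().__init__()
        # Linear + Sigmoid stands in for the 1x1 convolution on flattened tokens.
        self.heat = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())
        self.fg_th, self.bg_th = fg_th, bg_th
        # Input-conditioned soft aggregation to k key-value tokens per branch.
        self.fg_aggregate = nn.Linear(channels, num_tokens)
        self.bg_aggregate = nn.Linear(channels, num_tokens)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(nn.Linear(channels, 4 * channels), nn.GELU(),
                                 nn.Linear(4 * channels, channels))

    def _branch(self, queries: torch.Tensor, enhanced: torch.Tensor,
                aggregate: nn.Linear) -> torch.Tensor:
        # Soft aggregation: (n, c) mask-enhanced tokens -> (k, c) compressed key-values.
        weights = aggregate(enhanced).softmax(dim=0)          # (n, k), sums to 1 over positions
        kv = weights.transpose(0, 1) @ enhanced               # (k, c)
        out, _ = self.attn(queries.unsqueeze(0), kv.unsqueeze(0), kv.unsqueeze(0))
        out = out.squeeze(0) + queries                        # residual connection
        return out + self.ffn(self.norm(out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n, c) fused tokens of one deep-stage feature map, n = h * w.
        h = self.heat(x)                                                  # (n, 1) heat map
        fg_idx = (h.squeeze(-1) >= self.fg_th).nonzero(as_tuple=True)[0]
        bg_idx = (h.squeeze(-1) < self.bg_th).nonzero(as_tuple=True)[0]
        out = x.clone()
        if fg_idx.numel() > 0:                                # foreground branch: X * H
            out[fg_idx] = self._branch(x[fg_idx], x * h, self.fg_aggregate)
        if bg_idx.numel() > 0:                                # background branch: X * (1 - H), assumed
            out[bg_idx] = self._branch(x[bg_idx], x * (1.0 - h), self.bg_aggregate)
        return out                                            # updated tokens re-embedded in place
```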
Compared with the baseline method, which fuses multi-scale features indiscriminately with a standard visual Transformer, the heterogeneous Transformer network architecture raises the inference speed from 3 frames per second to 39 frames per second while also improving segmentation performance, thus meeting the high-accuracy and real-time requirements of unsupervised video object segmentation.
The beneficial effects of the invention are as follows:
(1) Appearance and motion information is fused at multiple levels by a Transformer-based framework (comprising feature extraction networks and a feature fusion network), achieving higher accuracy than unsupervised video object segmentation networks based on convolutional neural networks.
(2) Heterogeneous feature fusion strategies are used in the shallow and deep fusion stages, addressing the requirements and characteristics of feature fusion at different levels in a more targeted way. The heterogeneous fusion strategy not only exploits the high accuracy of the Transformer structure but also makes the network lighter, thereby simultaneously meeting the high-accuracy and real-time requirements of the unsupervised video object segmentation task.
Drawings
Fig. 1 is the algorithm flowchart of the heterogeneous Transformer network for unsupervised video object segmentation.
Fig. 2 is the algorithm flowchart of the context-shared Transformer and the semantic aggregation-re-embedding Transformer.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
Fig. 1 is the algorithm flowchart of the heterogeneous Transformer network for unsupervised video object segmentation. The heterogeneous Transformer comprises an appearance feature extraction network, a motion feature extraction network, four multi-scale fusion modules, and a decoder. Given a video frame and an optical flow map computed from the current frame and an adjacent frame, the appearance and motion feature extraction networks extract multi-stage appearance and motion features, respectively. The multi-stage fusion modules receive the multi-stage appearance and motion features and feed their outputs to a multi-scale feature fusion decoder. The invention adopts the Swin-Tiny backbone as the feature extraction network and the lightweight SegFormer MLP head as the multi-scale fusion feature decoder.
Fig. 2 is the algorithm flowchart of the context-shared Transformer and the semantic aggregation-re-embedding Transformer. The appearance and motion features of the four stages, extracted by the appearance and motion feature extraction networks respectively, are first fused along the channel dimension with a 1 × 1 convolution to obtain preliminary fusion features $X_l \in \mathbb{R}^{c_l \times h_l \times w_l}$, l ∈ {1,2,3,4}. The fusion features of the first two stages are fed into the context-shared Transformer, and the fusion features of the last two stages are fed into the semantic aggregation-re-embedding Transformer, yielding the outputs $S_l$, l ∈ {1,2} and $S_l$, l ∈ {3,4}, respectively.
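A wiring sketch of the overall heterogeneous Transformer is given below. The interfaces and component names are assumed; the backbones, per-stage fusion layers, fusion modules, and decoder are injected as placeholders rather than reproduced from the patent:

```python
import torch
import torch.nn as nn
from typing import List

class HeterogeneousTransformerNet(nn.Module):
    """Wiring sketch with assumed interfaces: both backbones return four-stage
    feature lists, stages 1-2 use context-shared fusion, stages 3-4 use semantic
    aggregation-re-embedding fusion, and a decoder merges the four outputs."""

    def __init__(self,
                 appearance_backbone: nn.Module,   # frame -> [f1, f2, f3, f4]
                 motion_backbone: nn.Module,       # optical flow map -> [m1, m2, m3, m4]
                 fuse_layers: List[nn.Module],     # per-stage 1x1 convolution fusion
                 shallow_fusion: List[nn.Module],  # two context-shared Transformers
                 deep_fusion: List[nn.Module],     # two semantic aggregation-re-embedding Transformers
                 decoder: nn.Module):              # multi-scale features -> mask logits
        super().__init__()
        self.appearance_backbone = appearance_backbone
        self.motion_backbone = motion_backbone
        self.fuse_layers = nn.ModuleList(fuse_layers)
        self.shallow_fusion = nn.ModuleList(shallow_fusion)
        self.deep_fusion = nn.ModuleList(deep_fusion)
        self.decoder = decoder

    def forward(self, frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        app_feats = self.appearance_backbone(frame)   # four appearance stages
        mot_feats = self.motion_backbone(flow)        # four motion stages
        fused = [fuse(torch.cat([a, m], dim=1))
                 for fuse, a, m in zip(self.fuse_layers, app_feats, mot_feats)]
        outs = [self.shallow_fusion[i](fused[i]) for i in range(2)]        # stages 1-2
        outs += [self.deep_fusion[i - 2](fused[i]) for i in range(2, 4)]   # stages 3-4
        return self.decoder(outs)                     # binary segmentation logits
```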
The invention uses a pre-trained RAFT model to generate optical flow maps for the video data. All input pictures are scaled to a spatial resolution of 512 × 512. In the model training phase, data augmentation consisting of random horizontal flipping and random photometric distortion is used to enhance the generalization of the model, and the model is trained end to end with an AdamW optimizer at a fixed learning rate of 6e-5 and a binary cross-entropy loss function. The model is pre-trained on the YouTube-VOS dataset for 300 epochs and fine-tuned on the DAVIS-2016 and FBMS datasets for 100 epochs. The binarization threshold of the segmentation result is set to 0.5. The model is trained on 4 NVIDIA 3090 GPUs with a batch size of 8.
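A minimal training-loop sketch under the settings listed above (AdamW at a fixed learning rate of 6e-5, binary cross-entropy loss, 512 × 512 inputs, batch size 8); the dataset object, model interface, and RAFT-based optical flow pre-computation are placeholders, not the patent's code:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

def train(model: nn.Module, dataset: Dataset, epochs: int, device: str = "cuda") -> None:
    """Minimal end-to-end training sketch; `dataset` is assumed to yield
    (frame, flow, mask) tensors already resized to 512x512, with the optical
    flow maps pre-computed by RAFT."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)  # fixed learning rate
    criterion = nn.BCEWithLogitsLoss()                          # binary cross-entropy on logits
    model.to(device).train()
    for _ in range(epochs):
        for frame, flow, mask in loader:
            frame, flow, mask = frame.to(device), flow.to(device), mask.to(device)
            logits = model(frame, flow)              # (B, 1, 512, 512) segmentation logits
            loss = criterion(logits, mask.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Pre-train for 300 epochs on YouTube-VOS, then fine-tune for 100 epochs on
# DAVIS-2016 and FBMS; at evaluation, binarize with a 0.5 threshold:
#   pred_mask = torch.sigmoid(logits) > 0.5
```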
The feature extraction network structure is as follows:
operation of Down sampling ratio Attention head count Dimension (d) of Input size Output size
Swin Stage1 3 96 384×384 96×96
Swin Stage2 6 192 96×96 48×48
Swin Stage3 16× 12 384 48×48 24×24
Swin Stage4 32× 24 768 24×24 12×12

Claims (1)

1. An unsupervised video object segmentation algorithm based on a heterogeneous Transformer, characterized in that the algorithm comprises a context-shared Transformer module, a semantic aggregation-re-embedding Transformer module, and a heterogeneous Transformer network architecture designed based on the two modules:
the heterogeneous Transformer network architecture comprises an appearance feature extraction network, a motion feature extraction network, two context-shared Transformer fusion modules, two semantic aggregation-re-embedding Transformer fusion modules, and a decoder, wherein the appearance feature extraction network and the motion feature extraction network both use Swin-Tiny, and the decoder uses the fully-connected-layer-based segmentation head designed in SegFormer; the appearance feature extraction network and the motion feature extraction network respectively extract appearance features and motion features at four stages, and at each stage a preliminary fusion feature $X_l \in \mathbb{R}^{c_l \times h_l \times w_l}$ is obtained by concatenating the appearance features and motion features along the channel dimension, where $c_l$ denotes the fused feature dimension of stage l, and $w_l$ and $h_l$ respectively denote the width and height of the fused feature resolution at stage l, with l ∈ {1,2,3,4};
the standard vision Transformer module mainly comprises a multi-head attention calculation module with a residual structure and a feedforward neural network module with a residual structure; the context-sharing Transformer fusion module simplifies the multi-head attention calculation module of the residual error structure in the standard vision Transformer module by global context modeling, and the multi-head attention calculation module is used for all the queriesThe feature vector calculates a weight map which is shared and independent of the query feature vector; the global context modeling comprises a spatial attention calculation and a channel attention calculation which are independent of the query feature vector; in particular, the fusion characteristics at the stage of obtaining the shallow layer l
Figure FDA0003974700260000012
Then, l belongs to {1,2}, the fusion feature of the shallow stage l is simultaneously used as the query feature vector of the global context modeling, and a single-channel attention weight graph is generated by a 1 × 1 convolution layer and a SoftMax function
Figure FDA0003974700260000013
The single-channel attention weight map obtains a weighted representation shared by the query feature vectors through the fusion feature X of the shallow stage l
Figure FDA0003974700260000014
To further model the correlation between channels, W was characterized using two sets of channel attention module pairs consisting of 1 × 1 convolutional layers, batch normalization, and ReLU functions g Performing tuning; aggregating global context information and fusion features X of the shallow stage l using a residual structure after global context modeling; the output of the global context modeling is sent to a feedforward neural network module of a residual error structure of a standard visual Transformer module to obtain the final fusion characteristics
Figure FDA0003974700260000021
the semantic aggregation-re-embedding Transformer fusion module models the feature correlation of the foreground and the background separately and comprises two parallel branches that respectively process foreground and background features, wherein each branch comprises query feature vector selection, key-value feature vector soft aggregation, correlation modeling, and query feature vector re-embedding; the query feature vectors and key-value feature vectors are the inputs of the standard visual Transformer module; for the fusion feature of a deep stage l, $X \in \mathbb{R}^{n \times c}$ with l ∈ {3,4}, a single-channel feature vector selection heat map $H \in \mathbb{R}^{n \times 1}$ is first generated by a 1 × 1 convolutional layer and a Sigmoid function; based on the feature vector selection heat map, the query feature vectors belonging to the foreground, $X_F = X[H_i \ge F_{th}]$, and the query feature vectors belonging to the background, $X_B = X[H_i < B_{th}]$, are selected respectively, where $F_{th}$ and $B_{th}$ are two thresholds that determine the foreground and background selection, and $H_i$ is the value of the heat map at position i;
the correlation computation of the foreground branch is as follows: in order to obtain key-value feature vector pairs in which the foreground is highlighted, the fusion feature X of the deep stage l and the heat map H are first multiplied element-wise to obtain a mask-enhanced foreground feature vector sequence $X_e \in \mathbb{R}^{n \times c}$, n = h × w; a feature vector soft aggregation mechanism is then employed to obtain a more compact compressed representation $X_{ce} \in \mathbb{R}^{k \times c}$, k < n, which compresses the foreground feature vector sequence by learning a set of token transformation matrices $T \in \mathbb{R}^{k \times n}$; the foreground query feature vectors $X_F$ and the foreground-enhanced compressed key-value feature vectors $X_{ce}$ are then fed into a standard visual Transformer for attention computation to model and enhance semantic relevance and update the corresponding semantic representations; owing to the symmetry of the foreground and background branches, the flow of the background branch is consistent with that of the foreground branch; the outputs of the visual Transformer module are then embedded back into the original fusion feature X according to the indices recorded in the query feature vector selection stage, and the final fusion feature S is obtained.
CN202211532178.1A 2022-12-01 2022-12-01 Unsupervised video object segmentation algorithm based on heterogeneous Transformer Pending CN115797835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211532178.1A CN115797835A (en) 2022-12-01 2022-12-01 Unsupervised video object segmentation algorithm based on heterogeneous Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211532178.1A CN115797835A (en) 2022-12-01 2022-12-01 Unsupervised video object segmentation algorithm based on heterogeneous Transformer

Publications (1)

Publication Number Publication Date
CN115797835A true CN115797835A (en) 2023-03-14

Family

ID=85444630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211532178.1A Pending CN115797835A (en) 2022-12-01 2022-12-01 Unsupervised video object segmentation algorithm based on heterogeneous Transformer

Country Status (1)

Country Link
CN (1) CN115797835A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452931A (en) * 2023-04-11 2023-07-18 北京科技大学 Hierarchical sensitive image feature aggregation method
CN116452931B (en) * 2023-04-11 2024-03-19 北京科技大学 Hierarchical sensitive image feature aggregation method
CN116884067A (en) * 2023-07-12 2023-10-13 成都信息工程大学 Micro-expression recognition method based on improved implicit semantic data enhancement
CN116884067B (en) * 2023-07-12 2024-06-14 成都信息工程大学 Micro-expression recognition method based on improved implicit semantic data enhancement

Similar Documents

Publication Publication Date Title
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
WO2022252272A1 (en) Transfer learning-based method for improved vgg16 network pig identity recognition
CN111091045B (en) Sign language identification method based on space-time attention mechanism
US11810359B2 (en) Video semantic segmentation method based on active learning
CN109886225B (en) Image gesture action online detection and recognition method based on deep learning
CN111340814B (en) RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN115797835A (en) Unsupervised video object segmentation algorithm based on heterogeneous Transformer
CN110458085B (en) Video behavior identification method based on attention-enhanced three-dimensional space-time representation learning
Hara et al. Towards good practice for action recognition with spatiotemporal 3d convolutions
WO2023065759A1 (en) Video action recognition method based on spatial-temporal enhanced network
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN111062395A (en) Real-time video semantic segmentation method
CN115393396B (en) Unmanned aerial vehicle target tracking method based on mask pre-training
CN116168197A (en) Image segmentation method based on Transformer segmentation network and regularization training
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
Zhang et al. FCHP: Exploring the discriminative feature and feature correlation of feature maps for hierarchical DNN pruning and compression
US11881020B1 (en) Method for small object detection in drone scene based on deep learning
Wang et al. Exploring fine-grained sparsity in convolutional neural networks for efficient inference
Jiao et al. Realization and improvement of object recognition system on raspberry pi 3b+
CN115587628A (en) Deep convolutional neural network lightweight method
CN111881794B (en) Video behavior recognition method and system
Liang et al. Semi-supervised video object segmentation based on local and global consistency learning
CN114140667A (en) Small sample rapid style migration method based on deep convolutional neural network
CN114494284A (en) Scene analysis model and method based on explicit supervision area relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination