CN115797835A - Unsupervised video object segmentation algorithm based on heterogeneous Transformer - Google Patents

- Publication number: CN115797835A
- Application number: CN202211532178.1A
- Authority: CN (China)
- Prior art keywords: transformer, fusion, feature vector, foreground, feature
- Classification: Y02D10/00 (energy efficient computing)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
An unsupervised video object segmentation algorithm based on heterogeneous Transformers designs two different Transformer-based fusion strategies for the shallow and deep stages of appearance-motion feature fusion in an unsupervised video object segmentation network: a context-shared Transformer and a semantic aggregation-re-embedding Transformer. The context-shared Transformer learns globally shared context information between video frames at low computational cost, while the semantic aggregation-re-embedding Transformer models the semantic relevance of the foreground and the background separately and further reduces computation through soft aggregation of feature vectors. Based on these two fusion modules, a hierarchical heterogeneous Transformer architecture is designed for the unsupervised video object segmentation task, achieving state-of-the-art performance at lower computational cost.
Description
Technical Field
The invention belongs to the fields of machine learning, semantic segmentation, and unsupervised video object segmentation, and relates to algorithms such as the feature extraction network Swin-Transformer, the semantic segmentation decoder Segformer MLP head, the global context network GCNet, and the visual Transformer; in particular, it relates to an unsupervised video object segmentation algorithm based on heterogeneous Transformers.
Background
Semantic segmentation is one of the fundamental tasks in computer vision and a core technology for understanding complex scenes. It is generally defined as predicting a class for each pixel, i.e., the class of the object to which the pixel belongs. FCN, the first deep-learning work in the field of semantic segmentation, used fully convolutional networks and pooling operations to extract low-resolution features carrying deep semantic information. Subsequent work, represented by the DeepLab series and PSPNet, enhanced global spatial information by enlarging the receptive field of the network, thereby obtaining more accurate segmentation results. Much later work, inspired by Non-local Networks, improved segmentation performance by capturing global semantic context through attention mechanisms. Recent work has introduced visual Transformers into semantic segmentation with great success.
Unsupervised video object segmentation, as a branch of the semantic segmentation task, aims to discover the most salient objects in a video sequence and can therefore be defined as a two-class video semantic segmentation problem. Unlike still-image segmentation, which relies primarily on appearance features, unsupervised video object segmentation further exploits temporal motion information to obtain reliable and temporally consistent segmentation results. Mainstream methods such as FSNet (Full-duplex Strategy Network) proposed by Ji et al. and AMCNet proposed by Yang et al. mainly adopt manually designed feature fusion modules to aggregate appearance and motion information, applying the same fusion module indiscriminately across the multi-stage fusion process. Although these efforts have advanced unsupervised video object segmentation, how to design a multi-stage appearance-motion fusion method better suited to the task remains an open problem.
Recently, thanks to powerful global attention modeling and flexible multi-modal fusion, visual Transformers have made breakthroughs in many computer vision tasks. This advantage, however, has not yet been fully explored in unsupervised video object segmentation. The baseline method of the invention uses a standard visual Transformer module as the appearance-motion fusion module. Preliminary experiments show that concatenating the appearance and motion features at each fusion stage and feeding them directly into a standard visual Transformer achieves state-of-the-art performance, but at the cost of excessive computation and long inference time. How to effectively reduce computational cost while maintaining high accuracy is therefore the key to successfully applying visual Transformers to unsupervised video object segmentation.
Disclosure of Invention
To solve these problems, the invention designs two Transformer-based modules: a context-shared Transformer module and a semantic aggregation-re-embedding Transformer module. Both greatly reduce computational cost while retaining the accuracy of the standard visual Transformer, allowing the visual Transformer to be applied more efficiently to the unsupervised video object segmentation task. Based on these two modules, the invention provides a high-performance, lightweight heterogeneous Transformer network architecture for unsupervised video object segmentation.
The technical scheme of the invention is as follows:
An unsupervised video object segmentation algorithm based on heterogeneous Transformers comprises a context-shared Transformer module, a semantic aggregation-re-embedding Transformer module, and a heterogeneous Transformer network architecture designed from these two modules:
the heterogeneous transform network architecture comprises an appearance feature extraction network, a motion feature extraction network, two context-shared transform fusion modules, two semantic aggregation-re-embedding transform fusion modules and a decoder, wherein the two feature extraction networks both use Swin-Tiny, and the decoder uses a full-link-layer-based segmentation head designed in the Segformer. The two feature extraction networks respectively extract appearance and motion features of four stages, and a primary fusion feature is obtained in a mode of splicing the appearance and the motion features by channel dimensions in each stage l (l belongs to {1,2,3,4 })Wherein c is l Represents the fused feature dimension of stage I, w l And h l Representing the width and height, respectively, of the fused feature resolution of the l-th stage. For convenience of presentation, the subscript l is no longer appended to the formula below after the feature fusion stage l is specified.
The standard visual Transformer module consists of a multi-head attention module with a residual structure and a feed-forward neural network module with a residual structure. The context-shared Transformer module simplifies the multi-head attention computation of the standard visual Transformer through global context modeling, computing a single weight map that is shared by all query feature vectors (queries) and independent of them. Global context modeling involves a query-independent spatial attention computation and a channel attention computation. Specifically, given the fusion feature X ∈ R^{c×h×w} at a shallow stage l (l ∈ {1,2}), which simultaneously serves as the query feature vectors of global context modeling, a single-channel attention weight map H_a ∈ R^{1×h×w} is first generated by a 1×1 convolution and a SoftMax function. Weighting the fusion feature X by this map yields a weighted representation W_g ∈ R^{c×1×1} shared by all query feature vectors. To further model the correlation between channels, two sets of channel attention modules, each consisting of a 1×1 convolution, batch normalization, and a ReLU function, refine the weighted representation W_g. After global context modeling, the global context information is aggregated with the fusion feature X through a residual connection. The output of global context modeling is then fed into the residual feed-forward network module of the standard visual Transformer to obtain the final fusion feature S ∈ R^{c×h×w}.
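The steps above can be sketched as a small module. This is a hedged reconstruction assuming a GCNet-style global context block (single-channel softmax weight map, weighted pooling, two 1×1 conv + BN + ReLU refinement stacks, residual add) followed by a standard Transformer feed-forward block; the reduction ratio and FFN ratio are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class ContextSharedTransformer(nn.Module):
    """Sketch of the context-shared Transformer fusion module."""
    def __init__(self, c, reduction=4, ffn_ratio=4):
        super().__init__()
        self.weight = nn.Conv2d(c, 1, kernel_size=1)  # single-channel weight map H_a
        # Two 1x1 conv + BN + ReLU stacks refine the pooled context W_g channel-wise.
        self.refine = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1), nn.BatchNorm2d(c // reduction), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )
        # Residual feed-forward block of the standard visual Transformer.
        self.ffn = nn.Sequential(
            nn.Conv2d(c, ffn_ratio * c, 1), nn.GELU(), nn.Conv2d(ffn_ratio * c, c, 1),
        )

    def forward(self, x):                         # x: (b, c, h, w)
        b, c, h, w = x.shape
        a = self.weight(x).view(b, 1, h * w).softmax(dim=-1)      # shared weight map
        ctx = torch.bmm(x.view(b, c, h * w), a.transpose(1, 2))   # W_g: (b, c, 1)
        ctx = self.refine(ctx.view(b, c, 1, 1))                   # channel attention tuning
        x = x + ctx                                                # aggregate context (residual)
        return x + self.ffn(x)                                     # final fusion feature S
```

Because the weight map is shared by every query, the quadratic query-key attention of the standard Transformer collapses to a single spatial pooling, which is the source of the speedup claimed below.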
Although simpler in design, the context-shared Transformer module significantly accelerates inference over the standard Transformer (from 3 frames per second to 36 frames per second) without sacrificing performance.
The core idea of the semantic aggregation-re-embedding Transformer is to model the semantic relevance of the foreground and the background separately while reducing computational cost. The module comprises two parallel, symmetric branches that process foreground and background features respectively; each branch mainly consists of query feature vector (query) selection, key-value feature vector (key-value) soft aggregation, correlation modeling, and query feature vector re-embedding. The query and key-value feature vectors are the inputs of the standard visual Transformer module.
For the fusion feature X ∈ R^{c×h×w} from a deep stage l (l ∈ {3,4}), a single-channel feature vector selection heatmap H ∈ R^{1×h×w} is first generated using a 1×1 convolution and a Sigmoid function. Based on this heatmap, the query feature vectors belonging to the foreground, X_F = X[H_i ≥ F_th], and those belonging to the background, X_B = X[H_i < B_th], are selected respectively, where F_th and B_th are two thresholds that determine the foreground and background selection, and H_i is the value of the heatmap at position i.
Taking the foreground branch as an example: to obtain key-value feature vector pairs in which the foreground is prominent, the fusion feature X and the heatmap H are first multiplied element-wise to obtain the mask-enhanced foreground feature vector sequence X_e ∈ R^{n×c}, with n = h×w. A feature vector soft aggregation mechanism is then employed to obtain a more compact compressed representation X_ce ∈ R^{k×c} (k < n); this mechanism compresses the foreground feature vector sequence by learning a set of token transformation matrices. The foreground query feature vectors X_F and the foreground-enhanced compressed key-value feature vectors X_ce are then fed into a standard visual Transformer for attention computation, which models and enhances semantic relevance and updates the corresponding semantic representations. The background branch follows the same procedure as the foreground branch. This approach greatly reduces computational cost. Finally, the output of the visual Transformer module is embedded back into the original fusion feature X according to the indices recorded in the query selection stage, yielding the final fusion feature S.
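The foreground branch described above can be sketched as follows, assuming the spatial dimension is flattened to n = h×w. The token transformation matrix is a random stand-in here (the patent learns it), and the selection threshold and sizes are illustrative; only the four steps — query selection, soft aggregation, attention, re-embedding — are faithful to the description.

```python
import torch
import torch.nn as nn

def foreground_branch(x, heat, f_th, k, mha):
    # x:    (n, c) fusion feature flattened over space, n = h*w
    # heat: (n,) heatmap values H_i in [0, 1]
    # f_th: foreground selection threshold F_th
    # k:    number of aggregated key-value tokens, k < n
    n, c = x.shape
    idx = (heat >= f_th).nonzero(as_tuple=True)[0]   # 1) query selection -> X_F indices
    x_f = x[idx]                                     #    foreground queries X_F
    x_e = x * heat.unsqueeze(-1)                     # 2) mask-enhanced sequence X_e
    T = torch.softmax(torch.randn(k, n), dim=-1)     #    stand-in token matrix (learned in practice)
    x_ce = T @ x_e                                   #    soft-aggregated key-values X_ce: (k, c)
    out, _ = mha(x_f.unsqueeze(0), x_ce.unsqueeze(0), x_ce.unsqueeze(0))  # 3) attention
    y = x.clone()
    y[idx] = out.squeeze(0)                          # 4) re-embed by recorded indices
    return y

n, c, k = 144, 384, 16
x = torch.randn(n, c)
heat = torch.linspace(0.0, 1.0, n)   # synthetic heatmap so the selection is non-empty
mha = nn.MultiheadAttention(c, num_heads=8, batch_first=True)
y = foreground_branch(x, heat, 0.6, k, mha)
```

The background branch would mirror this with the complementary selection `heat < B_th`; positions not selected by either branch keep their original features.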
In the practical implementation, k is set to n/9 in the key-value feature vector aggregation stage, which reduces the cost of the multi-head attention computation to 10/81 of the original without affecting performance.
Compared with the baseline, which indiscriminately fuses multi-scale features with a standard visual Transformer, the heterogeneous Transformer network architecture improves segmentation performance while increasing inference speed from 3 frames per second to 39 frames per second, meeting the high-accuracy and real-time requirements of unsupervised video object segmentation.
The invention has the beneficial effects that:
(1) Appearance and motion information is fused in a multi-level manner using a fully Transformer-based framework (comprising a feature extraction network and a feature fusion network), achieving higher accuracy than unsupervised video object segmentation networks based on convolutional neural networks.
(2) Heterogeneous feature fusion modes are used in the shallow and deep fusion stages, addressing the distinct requirements and characteristics of feature fusion at different levels in a targeted way. The heterogeneous fusion strategy not only exploits the high accuracy of the Transformer structure but also makes the network lighter, thereby simultaneously meeting the high-accuracy and real-time requirements of the unsupervised video object segmentation task.
Drawings
Fig. 1 is a flowchart of the heterogeneous Transformer algorithm for unsupervised video object segmentation.
FIG. 2 is a flowchart of the context-shared Transformer and the semantic aggregation-re-embedding Transformer algorithms.
Detailed Description
The following describes a specific embodiment of the invention with reference to the drawings and the technical scheme.
Fig. 1 is a flowchart of the heterogeneous Transformer algorithm for unsupervised video object segmentation. The heterogeneous Transformer comprises an appearance feature extraction network, a motion feature extraction network, four multi-scale fusion modules, and a decoder. Given a video frame and an optical flow map computed from the current frame and an adjacent frame, the appearance and motion feature extraction networks extract appearance and motion features at multiple stages, respectively. The multi-stage fusion modules receive the multi-stage appearance and motion features and ultimately feed their outputs to a multi-scale feature fusion decoder. The invention adopts the advanced Swin-Tiny backbone as the feature extraction network and the lightweight Segformer MLP head as the multi-scale fusion feature decoder.
FIG. 2 is a flowchart of the context-shared Transformer and the semantic aggregation-re-embedding Transformer algorithms. The appearance and motion features extracted at the four stages by the two extraction networks are first fused along the channel dimension using a 1×1 convolution to obtain primary fusion representations X_l. The fusion representations of the first two stages are fed into the context-shared Transformer to obtain outputs S_1 and S_2, and those of the last two stages are fed into the semantic aggregation-re-embedding Transformer to obtain outputs S_3 and S_4.
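The stage-to-module routing above can be sketched with identity stand-ins for the two fusion modules (the classes here are placeholders, not the patent's implementation); only the routing of stages 1-2 versus stages 3-4 is illustrated.

```python
import torch
import torch.nn as nn

# Placeholder fusion modules: any nn.Module with matching I/O shapes fits here.
class ContextSharedTransformer(nn.Module):
    def forward(self, x):
        return x

class SemanticAggregationTransformer(nn.Module):
    def forward(self, x):
        return x

# Stages 1-2 -> context-shared; stages 3-4 -> semantic aggregation-re-embedding.
modules = [ContextSharedTransformer(), ContextSharedTransformer(),
           SemanticAggregationTransformer(), SemanticAggregationTransformer()]

def fuse_all(primary):
    # primary: list of primary fusion representations X_1..X_4
    # returns S_1..S_4 for the Segformer-style multi-scale decoder
    return [modules[l](x) for l, x in enumerate(primary)]
```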
The invention uses pre-trained RAFT to generate optical flow maps for the video data. All input images are scaled to a spatial resolution of 512×512. During training, data augmentation comprising random horizontal flipping and random photometric distortion is used to improve the generalization of the model, which is trained end-to-end with an AdamW optimizer at a fixed learning rate of 6e-5 and a binary cross-entropy loss function. The model is pre-trained on the YouTube-VOS dataset for 300 epochs and fine-tuned on the DAVIS-2016 and FBMS datasets for 100 epochs. The binarization threshold of the segmentation result is set to 0.5. All training uses 4 NVIDIA 3090 graphics cards with a batch size of 8.
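A minimal sketch of the training loop pieces named above, with a trivial placeholder network and random tensors standing in for the real model, frames, and masks; only the optimizer choice, learning rate, loss, and binarization threshold come from the description.

```python
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=1)          # placeholder for the full network
opt = torch.optim.AdamW(model.parameters(), lr=6e-5)  # fixed learning rate per the description
bce = torch.nn.BCEWithLogitsLoss()                    # binary cross-entropy loss

x = torch.randn(2, 3, 64, 64)                         # stand-in for 512x512 inputs
y = (torch.rand(2, 1, 64, 64) > 0.5).float()          # stand-in ground-truth masks

opt.zero_grad()
loss = bce(model(x), y)
loss.backward()
opt.step()

def binarize(prob, threshold=0.5):
    # The final segmentation mask uses a 0.5 binarization threshold.
    return (prob >= threshold).float()
```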
The feature extraction network structure is as follows:

| Operation | Downsampling ratio | Attention heads | Dimension | Input size | Output size |
|---|---|---|---|---|---|
| Swin Stage 1 | 4× | 3 | 96 | 384×384 | 96×96 |
| Swin Stage 2 | 8× | 6 | 192 | 96×96 | 48×48 |
| Swin Stage 3 | 16× | 12 | 384 | 48×48 | 24×24 |
| Swin Stage 4 | 32× | 24 | 768 | 24×24 | 12×12 |
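The table above is internally consistent, which a few plain assertions can confirm (the 384-pixel input and the per-stage doubling of channels and heads are read directly from the table):

```python
# Consistency check of the Swin-Tiny stage table: output size equals the
# input size divided by the downsampling ratio, and channels/heads double
# at every stage.
input_size = 384
stages = {  # stage: (downsampling ratio, attention heads, dimension, output size)
    1: (4, 3, 96, 96),
    2: (8, 6, 192, 48),
    3: (16, 12, 384, 24),
    4: (32, 24, 768, 12),
}
for s, (ratio, heads, dim, out) in stages.items():
    assert input_size // ratio == out      # spatial resolution halves per stage
    assert dim == 96 * 2 ** (s - 1)        # channel dimension doubles per stage
    assert heads == 3 * 2 ** (s - 1)       # attention heads double per stage
```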
Claims (1)
1. An unsupervised video object segmentation algorithm based on heterogeneous Transformers, characterized in that it comprises a context-shared Transformer module, a semantic aggregation-re-embedding Transformer module, and a heterogeneous Transformer network architecture designed from these two modules:
the heterogeneous Transformer network architecture comprises an appearance feature extraction network, a motion feature extraction network, two context-shared Transformer fusion modules, two semantic aggregation-re-embedding Transformer fusion modules, and a decoder, wherein both the appearance and motion feature extraction networks use Swin-Tiny, and the decoder uses the fully-connected-layer-based segmentation head designed in Segformer; the appearance and motion feature extraction networks respectively extract appearance and motion features at four stages, and at each stage a primary fusion feature X_l ∈ R^{c_l×h_l×w_l} is obtained by concatenating the appearance and motion features along the channel dimension, wherein c_l denotes the fused feature dimension of stage l, w_l and h_l respectively denote the width and height of the fused feature resolution at stage l, and l ∈ {1,2,3,4};
the standard vision Transformer module mainly comprises a multi-head attention calculation module with a residual structure and a feedforward neural network module with a residual structure; the context-sharing Transformer fusion module simplifies the multi-head attention calculation module of the residual error structure in the standard vision Transformer module by global context modeling, and the multi-head attention calculation module is used for all the queriesThe feature vector calculates a weight map which is shared and independent of the query feature vector; the global context modeling comprises a spatial attention calculation and a channel attention calculation which are independent of the query feature vector; in particular, the fusion characteristics at the stage of obtaining the shallow layer lThen, l belongs to {1,2}, the fusion feature of the shallow stage l is simultaneously used as the query feature vector of the global context modeling, and a single-channel attention weight graph is generated by a 1 × 1 convolution layer and a SoftMax functionThe single-channel attention weight map obtains a weighted representation shared by the query feature vectors through the fusion feature X of the shallow stage lTo further model the correlation between channels, W was characterized using two sets of channel attention module pairs consisting of 1 × 1 convolutional layers, batch normalization, and ReLU functions g Performing tuning; aggregating global context information and fusion features X of the shallow stage l using a residual structure after global context modeling; the output of the global context modeling is sent to a feedforward neural network module of a residual error structure of a standard visual Transformer module to obtain the final fusion characteristics
the semantic aggregation-re-embedding Transformer fusion module models the feature correlation of the foreground and the background separately and comprises two parallel branches that respectively process foreground and background features, wherein each branch comprises query feature vector selection, key-value feature vector soft aggregation, correlation modeling computation, and query feature vector re-embedding; the query feature vectors and the key-value feature vectors are the inputs of a standard visual Transformer module; for the fusion feature X ∈ R^{c×h×w} from a deep stage l, l ∈ {3,4}, a single-channel feature vector selection heatmap H ∈ R^{1×h×w} is first generated using a 1×1 convolutional layer and a Sigmoid function; based on the feature vector selection heatmap, the query feature vectors belonging to the foreground, X_F = X[H_i ≥ F_th], and the query feature vectors belonging to the background, X_B = X[H_i < B_th], are selected respectively, wherein F_th and B_th are two thresholds that determine the foreground and background selection, and H_i is the value of the heatmap at position i;
correlation computation of the foreground branch: to obtain key-value feature vector pairs in which the foreground is prominent, the fusion feature X of the deep stage l and the heatmap H are first multiplied element-wise to obtain the mask-enhanced foreground feature vector sequence X_e ∈ R^{n×c}, n = h×w; a feature vector soft aggregation mechanism is then employed to obtain a more compact compressed representation X_ce ∈ R^{k×c}, k < n, which compresses the foreground feature vector sequence by learning a set of token transformation matrices; the foreground query feature vectors X_F and the foreground-enhanced compressed key-value feature vectors X_ce are fed into a standard visual Transformer for attention computation to model and enhance semantic relevance and update the corresponding semantic representations; owing to the symmetry of the two branches, the flow of the background branch is consistent with that of the foreground branch; the output of the visual Transformer module is then embedded back into the original fusion feature X according to the indices recorded in the query feature vector selection stage, and the final fusion feature S is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211532178.1A CN115797835A (en) | 2022-12-01 | 2022-12-01 | Non-supervision video target segmentation algorithm based on heterogeneous Transformer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115797835A true CN115797835A (en) | 2023-03-14 |
Family
ID=85444630
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211532178.1A Pending CN115797835A (en) | 2022-12-01 | 2022-12-01 | Non-supervision video target segmentation algorithm based on heterogeneous Transformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115797835A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116452931A (en) * | 2023-04-11 | 2023-07-18 | 北京科技大学 | Hierarchical sensitive image feature aggregation method |
CN116452931B (en) * | 2023-04-11 | 2024-03-19 | 北京科技大学 | Hierarchical sensitive image feature aggregation method |
CN116884067A (en) * | 2023-07-12 | 2023-10-13 | 成都信息工程大学 | Micro-expression recognition method based on improved implicit semantic data enhancement |
CN116884067B (en) * | 2023-07-12 | 2024-06-14 | 成都信息工程大学 | Micro-expression recognition method based on improved implicit semantic data enhancement |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||