CN114943911A - Video object instance segmentation method based on improved hierarchical deep self-attention network - Google Patents
- Publication number
- CN114943911A (application number CN202111291423.XA)
- Authority
- CN
- China
- Prior art keywords
- msa
- video object
- video
- encoder
- instance segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
This application, titled "Video object instance segmentation method based on an improved hierarchical deep self-attention network," concerns instance segmentation of video objects: given a video clip, the object instances in every frame are segmented automatically. The technique makes video objects more convenient to segment by removing the need for manual intervention. Its key technical point is a hierarchical deep self-attention module, P-MSA, used at the junction of the encoder and the decoder so that encoded features are better discriminated and strengthened, ultimately yielding better video object instance segmentation.
Description
1. Technical field
Video object instance segmentation, video object segmentation, computer vision, Transformer attention mechanism, deep learning, artificial intelligence
2. Background art
2.1 General technical methods
A convolutional neural network extracts features by sliding convolution kernels over an image or over intermediate feature maps, and is a widely adopted technique.
The encoder-decoder architecture is widely used in image object segmentation and video object segmentation; methods built on it differ mainly in the structure of the encoder's feature extraction network and in how the decoder fuses features.
2.2 Similar methods
Document [1], Swin Transformer, proposes a multi-scale hierarchical Transformer model, whose structure is shown in FIG. 1; the size of the window fed to each Transformer block changes across stages in order to obtain features at different scales.
The Transformer block at each stage comprises a W-MSA (Window Multi-head Self-Attention) module with no shift and an SW-MSA (Shifted Window Multi-head Self-Attention) module shifted by half a window, as shown in FIG. 2. Within each stage, the W-MSA module is applied first and the SW-MSA module follows in series, as shown in FIG. 3.
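The window partitioning described above can be sketched in a few lines. This is a minimal NumPy illustration on a single-channel feature map; the function names are illustrative, and per-window self-attention itself is omitted — only the W-MSA layout and the SW-MSA cyclic half-window shift are shown.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W) feature map into non-overlapping win x win windows (W-MSA layout)."""
    H, W = x.shape
    return (x.reshape(H // win, win, W // win, win)
             .transpose(0, 2, 1, 3)
             .reshape(-1, win, win))

def shifted_window_partition(x, win, shift):
    """Cyclically shift the map before partitioning, as SW-MSA does."""
    return window_partition(np.roll(x, shift=(-shift, -shift), axis=(0, 1)), win)

feat = np.arange(64).reshape(8, 8)               # toy 8x8 single-channel feature map
plain = window_partition(feat, 4)                # 4 windows of size 4x4
shifted = shifted_window_partition(feat, 4, 2)   # windows after a half-window (2-pixel) shift
```

With an 8x8 map and 4x4 windows, the shifted partition lets pixels that sat on a window border in the plain partition attend to each other, which is the point of alternating the two modules.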
Document [2], STEm-Seg, proposes an encoder with a pyramid structure and a multi-scale fusion decoder; its structure is shown in FIG. 4.
3. Summary of the invention
Existing video object instance segmentation methods do not distinguish important from unimportant features well during feature extraction, so the accuracy of their segmentation results has room for improvement.
By combining a novel technique with existing methods, this application realizes automatic video object instance segmentation and improves feature extraction through an improved hierarchical deep self-attention network.
The application of the invention mainly comprises:
(1) An improved hierarchical deep self-attention module, P-MSA, which can extract a richer variety of features.
(2) Application of the P-MSA module to an existing video object instance segmentation method, combined with feature fusion, to achieve better video object instance segmentation.
4. Description of the drawings
FIG. 1 introduces the model of reference [1]; each level uses a self-attention window of a different size to extract features: small windows at the lower levels, large windows at the upper levels.
FIG. 2 illustrates the receptive-field regions and feature-extraction connection modes of the W-MSA and SW-MSA modules of reference [1]. There are two modules in total: the former divides the whole feature region into 4 equal parts along the x and y axes, while the latter is shifted by half a window along the x and y axes, forming 9 regions.
FIG. 3 is a network architecture level connection scheme illustration of the modules W-MSA, SW-MSA and others of reference [1 ].
FIG. 4 is a system architecture diagram of reference [2], upon which the method of the present application is improved.
FIG. 5 shows how the improved P-MSA module of this application differs from the W-MSA and SW-MSA modules of reference [1]. The main differences are: (1) P-MSA uses a parallel structure, whereas reference [1] connects W-MSA and SW-MSA in series; (2) by modifying how window features are computed, P-MSA computes features under multiple displacements, extracting and strengthening more kinds of features and achieving better feature expressiveness.
FIG. 6 is the system architecture of this application: the flow of the proposed video object instance segmentation method, which is based on STEm-Seg (reference [2]) with a P-MSA module added between the encoder and the decoder.
5. Detailed description of the invention
The original method of reference [1], shown in FIG. 5(a), consists of a W-MSA and an SW-MSA connected in series, with the SW-MSA shifted by half a window relative to the W-MSA. The improved P-MSA proposed here, shown in FIG. 5(b), is composed of one W-MSA and k SW-MSAs in parallel, and the branch outputs are combined by taking their arithmetic mean. Here k = (L/s) × (L/s) − 1 (an integer), where L is the window size and s is the step by which the window is shifted along the x or y axis each time.
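The parallel aggregation just described can be sketched as follows. This is a toy NumPy illustration, not the patented implementation: `window_mean` is a hypothetical stand-in for windowed self-attention (it simply averages each window), and all names are illustrative. What it does reproduce is the structure: one unshifted branch plus k = (L/s)² − 1 shifted branches, realigned and combined by an arithmetic mean.

```python
import numpy as np

def window_mean(x, L):
    """Toy stand-in for windowed self-attention: replace each L x L window by its mean."""
    H, W = x.shape
    m = x.reshape(H // L, L, W // L, L).mean(axis=(1, 3))
    return np.repeat(np.repeat(m, L, axis=0), L, axis=1)

def p_msa(x, L, s, window_op=window_mean):
    """One unshifted branch (W-MSA) plus k = (L//s)**2 - 1 shifted branches (SW-MSA),
    combined by an arithmetic mean, mirroring the parallel structure of FIG. 5(b)."""
    shifts = [(dy * s, dx * s) for dy in range(L // s) for dx in range(L // s)]
    outs = []
    for dy, dx in shifts:                       # shifts[0] == (0, 0): the W-MSA branch
        o = window_op(np.roll(x, (-dy, -dx), axis=(0, 1)), L)
        outs.append(np.roll(o, (dy, dx), axis=(0, 1)))  # undo the shift to realign branches
    return np.mean(outs, axis=0)

out = p_msa(np.arange(64, dtype=float).reshape(8, 8), L=4, s=2)
```

With L = 4 and s = 2 there are (4/2)² = 4 branches in total, i.e. k = 3 shifted branches, matching the formula above.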
On top of document [2], a P-MSA module provided by this application is attached after the convolution module that extracts each pyramid level; the output of each P-MSA module then serves as the encoder output at that scale and is fed into the decoder using the same connection mode as in document [2].
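The wiring just described can be summarized in a short sketch. This is a hypothetical skeleton, assuming each pyramid level is a (conv, pmsa) pair of callables; the toy stand-ins below only demonstrate the data flow, not real network layers.

```python
def encode_with_pmsa(x, levels):
    """levels: one (conv, pmsa) pair per pyramid scale (illustrative names).
    Each pyramid conv feeds the next scale; its P-MSA-refined output goes to the decoder."""
    decoder_inputs = []
    for conv, pmsa in levels:
        x = conv(x)                     # pyramid feature extraction at this scale
        decoder_inputs.append(pmsa(x))  # P-MSA strengthens the encoding before fusion
    return decoder_inputs

# toy stand-ins: "conv" adds 1 to its input, "pmsa" doubles it
feats = encode_with_pmsa(0, [(lambda v: v + 1, lambda v: v * 2)] * 3)
```

The decoder then fuses `decoder_inputs` across scales exactly as in document [2]; only the per-scale refinement step is new.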
The difference between this application and reference [2] is the P-MSA module added between the encoder and the decoder. Compared with reference [2], the method partitions feature regions better and predicts segmentations more accurately.
[1] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. CoRR abs/2103.14030 (2021)
[2] Ali Athar, Sabarinath Mahadevan, Aljosa Osep, Laura Leal-Taixé, Bastian Leibe: STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos. ECCV (11) 2020: 158-177
Claims (2)
1. A video object instance segmentation method based on an improved hierarchical deep self-attention network, characterized in that: multiple frames of a video clip are input; without requiring the first frame's object contour as input, an encoder-decoder framework is adopted in which the encoder uses a convolutional feature pyramid structure, and the improved hierarchical deep self-attention module P-MSA is added at the junction of the encoder and the decoder, so that the object contours in all frames are segmented automatically.
2. The hierarchical deep self-attention module P-MSA as claimed in claim 1, characterized in that: each input two-dimensional feature can be shifted by any distance within the window size along the x-axis direction, the y-axis direction, or a combination of the two, and the feature outputs under all the different shift distances are computed and combined in parallel, realizing a new mode of feature extraction and enhancement.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111291423.XA CN114943911A (en) | 2021-11-03 | 2021-11-03 | Video object instance segmentation method based on improved hierarchical deep self-attention network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114943911A (en) | 2022-08-26 |
Family
ID=82906073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111291423.XA Pending CN114943911A (en) | 2021-11-03 | 2021-11-03 | Video object instance segmentation method based on improved hierarchical deep self-attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114943911A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115439483A (en) * | 2022-11-09 | 2022-12-06 | 四川川锅环保工程有限公司 | High-quality welding seam and welding seam defect identification system, method and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Poudel et al. | Fast-scnn: Fast semantic segmentation network | |
Casser et al. | Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos | |
CN111275518B (en) | Video virtual fitting method and device based on mixed optical flow | |
Casser et al. | Unsupervised monocular depth and ego-motion learning with structure and semantics | |
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation | |
Liu et al. | A cross-modal adaptive gated fusion generative adversarial network for RGB-D salient object detection | |
Jin et al. | MoADNet: Mobile asymmetric dual-stream networks for real-time and lightweight RGB-D salient object detection | |
CN107229920B (en) | Behavior identification method based on integration depth typical time warping and related correction | |
Zhang et al. | RangeLVDet: Boosting 3D object detection in LiDAR with range image and RGB image | |
Saeedan et al. | Boosting monocular depth with panoptic segmentation maps | |
CN114943911A (en) | Video object instance segmentation method based on improved hierarchical deep self-attention network | |
Chen et al. | Denao: Monocular depth estimation network with auxiliary optical flow | |
Wang et al. | Deep unsupervised 3d sfm face reconstruction based on massive landmark bundle adjustment | |
CN106204456A (en) | Panoramic video sequences estimation is crossed the border folding searching method | |
CN117036770A (en) | Detection model training and target detection method and system based on cascade attention | |
Yi et al. | Cross-stage multi-scale interaction network for RGB-D salient object detection | |
CN107169498A (en) | It is a kind of to merge local and global sparse image significance detection method | |
CN101945299A (en) | Camera-equipment-array based dynamic scene depth restoring method | |
CN110390336B (en) | Method for improving feature point matching precision | |
CN114973305B (en) | Accurate human body analysis method for crowded people | |
CN111783497A (en) | Method, device and computer-readable storage medium for determining characteristics of target in video | |
CN115330655A (en) | Image fusion method and system based on self-attention mechanism | |
CN109801317A (en) | The image matching method of feature extraction is carried out based on convolutional neural networks | |
CN115578298A (en) | Depth portrait video synthesis method based on content perception | |
CN115131418A (en) | Monocular depth estimation algorithm based on Transformer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||