CN114943911A - Video object instance segmentation method based on improved hierarchical deep self-attention network - Google Patents

Video object instance segmentation method based on improved hierarchical deep self-attention network

Info

Publication number
CN114943911A
Authority
CN
China
Prior art keywords
msa
video object
video
encoder
instance segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111291423.XA
Other languages
Chinese (zh)
Inventor
王秋睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital University of Physical Education and Sports
Original Assignee
Capital University of Physical Education and Sports
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital University of Physical Education and Sports filed Critical Capital University of Physical Education and Sports
Priority to CN202111291423.XA priority Critical patent/CN114943911A/en
Publication of CN114943911A publication Critical patent/CN114943911A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses a video object instance segmentation method based on an improved hierarchical deep self-attention network. The technique performs instance segmentation of video objects: given a video clip, the objects in every frame of the clip are segmented automatically, making video object segmentation more convenient. The key technical point is a hierarchical deep self-attention module, P-MSA, used at the connection between the encoder and the decoder so as to better distinguish and strengthen encoded features, ultimately achieving better video object instance segmentation.

Description

Video object instance segmentation method based on improved hierarchical deep self-attention network
One, the technical field
Video object instance segmentation, video object segmentation, computer vision, Transformer attention mechanism, deep learning, artificial intelligence
Second, background art
2.1 general technical method introduction
A convolutional neural network extracts features by sliding convolution kernels over an image or over intermediate feature maps, and is a widely adopted technique.
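The sliding-kernel idea can be illustrated with a minimal 1-D example (this toy `conv1d_valid` is only illustrative; it is not any particular library's API):

```python
# Minimal illustration of the sliding-kernel idea: a 1-D "valid" convolution
# (really cross-correlation, as in most deep-learning libraries) slides the
# kernel over the input and sums elementwise products at each position.

def conv1d_valid(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# an edge-detector-like kernel responds where the signal changes
print(conv1d_valid([1, 1, 5, 5, 1], [-1, 1]))  # -> [0, 4, 0, -4]
```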
The encoder-decoder architecture is commonly used in the fields of image object segmentation and video object segmentation; methods adopting it differ mainly in the structure of the encoder's feature-extraction network and in the decoder's feature-fusion scheme.
2.2 introduction to similar methods
Reference [1], Swin Transformer, proposes a multi-scale hierarchical Transformer model, whose structure is shown in FIG. 1; the size of each window input to the Transformer block changes from level to level in order to obtain features of different scales.
The Transformer block at each level includes a W-MSA (Window Multi-head Self-Attention) module without shift and an SW-MSA (Shifted Window Multi-head Self-Attention) module shifted by half the window size; the structure is shown in FIG. 2. Within each level, the W-MSA module is applied first and the SW-MSA module is connected after it in series, as shown in FIG. 3.
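The window partitioning behind W-MSA and SW-MSA can be sketched as follows. This is a toy illustration, not reference [1]'s implementation; real SW-MSA implementations commonly realize the half-window shift cyclically, which is the assumption made here:

```python
# Hypothetical sketch of windowed partitioning: W-MSA splits the feature map
# into non-overlapping win x win windows; SW-MSA shifts the grid (here,
# cyclically) by win // 2 before splitting, so attention can mix features
# that lie across window borders of the unshifted grid.

def partition_windows(height, width, win, shift=0):
    """Return, for each window, the list of (row, col) positions it covers."""
    windows = {}
    for r in range(height):
        for c in range(width):
            # cyclic shift, then assign the position to its window index
            rs, cs = (r + shift) % height, (c + shift) % width
            windows.setdefault((rs // win, cs // win), []).append((r, c))
    return windows

# 4x4 feature map with 2x2 windows: W-MSA yields 4 windows of 4 positions
w_msa = partition_windows(4, 4, 2, shift=0)
# SW-MSA shifts by win // 2 = 1, regrouping positions across borders
sw_msa = partition_windows(4, 4, 2, shift=1)
```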
Reference [2], STEm-Seg, proposes a scheme with a pyramid-structured encoder and a multi-scale fusion decoder; the structure is shown in FIG. 4.
Third, the invention
Existing video object instance segmentation methods cannot distinguish important features from unimportant ones well during feature extraction, so the accuracy of their segmentation results needs further improvement.
By combining an innovative technique with an existing method, this application realizes automatic video object instance segmentation and optimizes feature extraction by adopting an improved hierarchical deep self-attention network.
The contributions of this application mainly include:
(1) An improved hierarchical deep self-attention module, P-MSA, which can extract more kinds of features.
(2) Application of the P-MSA module to an existing video object instance segmentation method, combined with feature fusion, to achieve a better video object instance segmentation effect.
Fourth, description of the drawings
FIG. 1 illustrates the model of reference [1], in which each level uses a self-attention window of a different size to extract features: lower levels use small windows and higher levels use large windows.
FIG. 2 illustrates the receptive-field regions and connection modes of feature extraction for the W-MSA and SW-MSA modules of reference [1]. There are 2 modules in total: the former divides the whole feature region into 4 equal parts along the x and y axes, while the latter is shifted by half a window size along the x and y axes, forming 9 regions.
FIG. 3 illustrates how the W-MSA and SW-MSA modules of reference [1] are connected with the other modules at the network-architecture level.
FIG. 4 is the system architecture diagram of reference [2], on which the method of the present application improves.
FIG. 5 shows the differences between the improved P-MSA module of the present application and the W-MSA and SW-MSA modules of reference [1]. The main differences are: (1) P-MSA adopts a parallel structure, whereas reference [1] connects W-MSA and SW-MSA in series; (2) by improving the way window features are computed, P-MSA can compute features under different displacements, so it can extract and strengthen more kinds of features and has better feature-expression capability.
FIG. 6 is the system architecture diagram of the present application: the flow of the proposed video object instance segmentation method is based on STEm-Seg (reference [2]), with a P-MSA module added between the encoder and the decoder.
Fifth, detailed description of the invention
The original method, from reference [1] and shown in FIG. 5(a), consists of a W-MSA and an SW-MSA connected in series, the SW-MSA being shifted by half a window size relative to the W-MSA. The improved method P-MSA proposed by this application, shown in FIG. 5(b), is formed by one W-MSA and k SW-MSAs in parallel, whose outputs are combined by taking the arithmetic mean. Here k = (L/s) × (L/s) - 1 (an integer), where L is the window size and s is the step by which the window is shifted each time along the x or y axis.
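A runnable sketch of this parallel scheme follows. It is only an illustration of the plumbing under stated assumptions: the toy `window_attention` simply averages each position with its window neighbours and stands in for real windowed self-attention, and the shift is realized cyclically; neither detail is specified by the patent text.

```python
# Hypothetical sketch of the P-MSA idea: one unshifted W-MSA-style branch
# plus k = (L/s)*(L/s) - 1 shifted branches run in parallel, and their
# outputs are combined by an arithmetic mean.

def window_attention(feat, win, shift_r, shift_c):
    """Toy stand-in for windowed attention: replace each value by the mean
    of its (cyclically shifted) win x win window."""
    h, w = len(feat), len(feat[0])
    groups = {}
    for r in range(h):
        for c in range(w):
            key = ((r + shift_r) % h // win, (c + shift_c) % w // win)
            groups.setdefault(key, []).append((r, c))
    out = [[0.0] * w for _ in range(h)]
    for cells in groups.values():
        mean = sum(feat[r][c] for r, c in cells) / len(cells)
        for r, c in cells:
            out[r][c] = mean  # each position "attends" to its window's mean
    return out

def p_msa(feat, win, step):
    """Average one unshifted branch and k = (win//step)**2 - 1 shifted ones."""
    shifts = [(i * step, j * step)
              for i in range(win // step) for j in range(win // step)]
    h, w = len(feat), len(feat[0])
    acc = [[0.0] * w for _ in range(h)]
    for sr, sc in shifts:              # (0, 0) is the plain W-MSA branch
        branch = window_attention(feat, win, sr, sc)
        for r in range(h):
            for c in range(w):
                acc[r][c] += branch[r][c] / len(shifts)
    return acc
```

With L = 2 and s = 1 this yields k = 3 shifted branches alongside the unshifted one, matching the k = (L/s) × (L/s) - 1 count in the text.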
On the basis of reference [2], a P-MSA module proposed by this application is attached after the convolution module that extracts each pyramid feature; the output of each P-MSA module then serves as the encoder output at that scale and is fed into the decoder using the connection scheme of reference [2].
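A hypothetical wiring sketch of this per-scale connection (the function and parameter names are illustrative, and the P-MSA block itself is passed in as a callable rather than implemented):

```python
# Hypothetical wiring of FIG. 6: one P-MSA block follows each pyramid level
# of the encoder, and its output replaces that level's feature before it
# enters the decoder's multi-scale fusion.

def apply_p_msa_per_scale(pyramid_feats, p_msa_block):
    """pyramid_feats: list of per-scale feature maps from the convolutional
    pyramid encoder of reference [2]; p_msa_block: a callable implementing
    the P-MSA module described in the text."""
    return [p_msa_block(feat) for feat in pyramid_feats]

# identity stand-in keeps the sketch runnable without a real P-MSA module
outputs = apply_p_msa_per_scale([[[1.0, 2.0]], [[3.0]]], lambda f: f)
```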
The difference between the method of the present application and that of reference [2] is the P-MSA module added between the encoder and the decoder. As a result, compared with reference [2], the method obtains a better partition of feature regions and better segmentation prediction accuracy.
[1] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. CoRR abs/2103.14030 (2021)
[2] Ali Athar, Sabarinath Mahadevan, Aljosa Osep, Laura Leal-Taixé, Bastian Leibe: STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos. ECCV (11) 2020: 158-177

Claims (2)

1. A video object instance segmentation method based on an improved hierarchical deep self-attention network, characterized in that: multi-frame images of a video clip are input, and an encoder-decoder framework is adopted without requiring the first-frame object contour as input; the encoder adopts a convolutional feature-pyramid structure, and the improved hierarchical deep self-attention module P-MSA is added at the junction of the encoder and the decoder, so that the object contours in all frames are segmented automatically.
2. The hierarchical deep self-attention module P-MSA according to claim 1, characterized in that: each input two-dimensional feature can be shifted by any distance, within the window size, along the x-axis direction, the y-axis direction, or a combination of the two; the feature outputs of all the different shifts are computed in parallel, realizing a new mode of feature extraction and enhancement.
CN202111291423.XA 2021-11-03 2021-11-03 Video object instance segmentation method based on improved hierarchical deep self-attention network Pending CN114943911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111291423.XA CN114943911A (en) 2021-11-03 2021-11-03 Video object instance segmentation method based on improved hierarchical deep self-attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111291423.XA CN114943911A (en) 2021-11-03 2021-11-03 Video object instance segmentation method based on improved hierarchical deep self-attention network

Publications (1)

Publication Number Publication Date
CN114943911A 2022-08-26

Family

ID=82906073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111291423.XA Pending CN114943911A (en) 2021-11-03 2021-11-03 Video object instance segmentation method based on improved hierarchical deep self-attention network

Country Status (1)

Country Link
CN (1) CN114943911A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439483A (en) * 2022-11-09 2022-12-06 四川川锅环保工程有限公司 High-quality welding seam and welding seam defect identification system, method and storage medium


Similar Documents

Publication Publication Date Title
Poudel et al. Fast-scnn: Fast semantic segmentation network
Casser et al. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos
CN111275518B (en) Video virtual fitting method and device based on mixed optical flow
Casser et al. Unsupervised monocular depth and ego-motion learning with structure and semantics
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
Liu et al. A cross-modal adaptive gated fusion generative adversarial network for RGB-D salient object detection
Jin et al. MoADNet: Mobile asymmetric dual-stream networks for real-time and lightweight RGB-D salient object detection
CN107229920B (en) Behavior identification method based on integration depth typical time warping and related correction
Zhang et al. RangeLVDet: Boosting 3D object detection in LiDAR with range image and RGB image
Saeedan et al. Boosting monocular depth with panoptic segmentation maps
CN114943911A (en) Video object instance segmentation method based on improved hierarchical deep self-attention network
Chen et al. Denao: Monocular depth estimation network with auxiliary optical flow
Wang et al. Deep unsupervised 3d sfm face reconstruction based on massive landmark bundle adjustment
CN106204456A (en) Panoramic video sequences estimation is crossed the border folding searching method
CN117036770A (en) Detection model training and target detection method and system based on cascade attention
Yi et al. Cross-stage multi-scale interaction network for RGB-D salient object detection
CN107169498A (en) It is a kind of to merge local and global sparse image significance detection method
CN101945299A (en) Camera-equipment-array based dynamic scene depth restoring method
CN110390336B (en) Method for improving feature point matching precision
CN114973305B (en) Accurate human body analysis method for crowded people
CN111783497A (en) Method, device and computer-readable storage medium for determining characteristics of target in video
CN115330655A (en) Image fusion method and system based on self-attention mechanism
CN109801317A (en) The image matching method of feature extraction is carried out based on convolutional neural networks
CN115578298A (en) Depth portrait video synthesis method based on content perception
CN115131418A (en) Monocular depth estimation algorithm based on Transformer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination