CN113744306A - Video target segmentation method based on time sequence content perception attention mechanism - Google Patents
- Publication number
- CN113744306A (application number CN202110634977.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- feature
- video
- time sequence
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/215—Motion-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/269—Analysis of motion using gradient-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20004—Adaptive image processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20092—Interactive image processing based on input by user
- G06T2207/20104—Interactive definition of region of interest [ROI]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a video target segmentation method based on a time sequence content perception attention mechanism, addressing the heavy computation of a global matching attention mechanism. Under the assumption of temporal continuity, several groups of optical flow fields between the current frame and past frames are learned, and feature vectors related to the features to be matched (the current-frame features) are sampled from the past frames according to the positional mapping of the flow fields. Matching is then performed between the features to be matched and the sampled past-frame feature vectors, i.e., similar-region matching is carried out locally rather than over the whole image. This reduces the computation of matching; since the similar features are gathered with the current-frame features as reference, the method runs fast, and removing irrelevant noise features yields higher matching accuracy.
Description
Technical Field
The invention relates to machine learning technology, and in particular to a video object segmentation technique based on machine learning.
Background
Video object segmentation is a fundamental task in computer vision. It requires assigning a label to every pixel of every frame in a video, i.e., separating foreground objects from the background with binary labels, and draws on pattern recognition, machine learning, and related fields. Video object segmentation is important for a wide range of applications such as video editing, object tracking, and scene understanding. With advances in computer science, deep learning, and real-world demand, it has attracted many researchers in recent years and has seen substantial research progress. By degree of supervision, video object segmentation tasks fall into three major categories: unsupervised, semi-supervised, and interactive. Unsupervised video object segmentation must discover and segment the main objects in the video, meaning the algorithm decides on its own which object is primary. The semi-supervised task is given a first frame or key frame with mask information. In interactive video object segmentation, a person first draws a rough outline of the target object with a mouse, and a segmentation algorithm then propagates the segmentation through the video.
Current attention-based video target segmentation algorithms perform global matching in the temporal feature matching stage: the features of the current frame are matched one by one against all features of the past frames. This entails a large amount of computation and makes the model slow.
Disclosure of Invention
The invention aims to solve the technical problem of providing a video target segmentation method that adjusts the matching position and performs local matching based on time sequence content perception.
The invention adopts the technical scheme that the video target segmentation method based on the time sequence content perception attention mechanism comprises the following steps:
1) training a video target segmentation system:
1-1) receiving a training sample video and a target mask for each frame in the video;
1-2) splicing the i-th frame of the video, serving as the past frame, with its corresponding target mask along the channel dimension, then inputting the result into encoder B of a feature extraction network; encoder B outputs the i-th frame features;
1-3) taking the (i+1)-th frame as the current frame and inputting it into encoder A of the feature extraction network; encoder A outputs the (i+1)-th frame features;
1-4) sending the (i+1)-th frame features and the i-th frame features into a time sequence content perception attention module, which outputs the time sequence content perception features of the (i+1)-th frame;
1-5) inputting the time sequence content perception features of the (i+1)-th frame into a decoder, which outputs the target mask of the (i+1)-th frame;
1-6) splicing the (i+1)-th frame and its target-of-interest mask along the channel dimension and inputting the result into encoder B of the feature extraction network; encoder B outputs the updated (i+1)-th frame features;
1-7) judging whether target masks have been output for all frames of the training sample video, or whether the convergence condition of the loss function in the video target segmentation system is met; if so, the training of the video target segmentation system is complete; otherwise, update i to i+1, take the updated i-th frame features as the past-frame (i-th frame) features, and return to step 1-3);
the specific processing mode when the attention module for time sequence content perception in the step 1-4) receives the i +1 th frame feature and the i frame feature is as follows: firstly, performing optical flow prediction processing on the (i + 1) th frame feature and the ith frame feature to obtain an optical flow field between the current frame feature and the past frame feature, and extracting a feature vector in the ith frame feature by using bilinear interpolation by using the optical flow field between the current frame feature and the past frame feature; matching the extracted feature vector with the features of the (i + 1) th frame to obtain the time sequence content perception features of the (i + 1) th frame;
2) video target segmentation system testing: input the video to be processed into the trained video target segmentation system, which outputs the target region of interest in the video.
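The flow-guided sampling and local matching described in step 1-4) can be sketched in PyTorch as follows. The function names, the use of one candidate vector per past frame, and the tensor shapes are assumptions for illustration, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def sample_by_flow(past_feat, flow):
    """Bilinearly sample past_feat at positions displaced by flow.
    past_feat: (B, C, H, W); flow: (B, 2, H, W) in pixels (dx, dy)."""
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    base = torch.stack((xs, ys), dim=0)              # (2, H, W), x first
    pos = base.unsqueeze(0).to(flow) + flow          # flow-displaced positions
    # Normalise pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * pos[:, 0] / max(W - 1, 1) - 1.0
    gy = 2.0 * pos[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)             # (B, H, W, 2)
    return F.grid_sample(past_feat, grid, mode="bilinear", align_corners=True)

def temporal_content_aware_attention(cur_feat, past_feats, flows):
    """Gather one candidate vector per past frame via its flow field, then
    match locally: dot-product similarity -> softmax -> weighted sum."""
    candidates = torch.stack([sample_by_flow(p, f)
                              for p, f in zip(past_feats, flows)])  # (K, B, C, H, W)
    sim = (cur_feat.unsqueeze(0) * candidates).sum(dim=2)           # (K, B, H, W)
    attn = F.softmax(sim, dim=0).unsqueeze(2)                       # over K candidates
    return (attn * candidates).sum(dim=0)                           # (B, C, H, W)
```

With a zero flow field, `sample_by_flow` reduces to the identity, which is a convenient sanity check of the coordinate normalisation.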
Aiming at the heavy computation of a global matching attention mechanism, the invention removes redundant computation from temporal feature matching: the target features of the current frame are matched only with the same target's features in the past frames, not with all past-frame features. Under the assumption of temporal continuity, the invention learns several groups of optical flow fields between the current frame and past frames and, following the positional mapping of these flow fields, samples the past frames to obtain feature vectors related to the features to be matched (the current-frame features). Matching is then performed between the features to be matched and the sampled past-frame feature vectors, i.e., similar-region matching is carried out locally rather than over the whole image. This reduces the computation of matching; since the similar features are gathered with the current-frame features as reference, the method runs fast, and removing irrelevant noise features yields higher matching accuracy.
The method has the advantage that, by adjusting the matching position based on time sequence content perception and performing local matching, it effectively increases the running speed of the temporal feature matching stage and improves matching accuracy.
Drawings
FIG. 1 is a schematic diagram of video object segmentation based on a temporal content aware attention mechanism.
Fig. 2 is a specific structure of a module for perceiving attention based on time series content.
Fig. 3 is a block specific structure of predicting an optical flow field between timing characteristics.
Detailed Description
The specific structure of the video object segmentation system used to carry out the method of the present invention is shown in Fig. 1: it comprises a feature extraction network, a time sequence content perception attention module, and a decoder. The feature extraction network includes an encoder A for the current frame and an encoder B for past frames.
The whole training process of the video target segmentation system is as follows:
1) collecting a video and providing a target mask for each frame in the video;
2) the first frame image, serving as the past frame, is spliced with its corresponding target mask along the channel dimension and sent into encoder B (which processes past frames) to obtain the first-frame features;
3) the second frame, as the current frame (the frame to be processed), is sent into encoder A (which processes the current frame) to obtain the second-frame features;
4) the second-frame features and the first-frame features are sent into the time sequence content perception attention module to obtain the time sequence content perception features of the second frame, which are then sent into the decoder to obtain the target-of-interest mask of the second frame;
5) the second frame image and its target-of-interest mask are spliced along the channel dimension and sent into encoder B to re-obtain the second-frame features;
6) the third frame is taken as the current frame, with the re-obtained second-frame features as the past frame; a predicted target mask for the third frame is obtained by the method of steps 2)-5), and the third-frame features are re-obtained as past-frame features by the method of step 5);
7) step 6) is repeated until target masks have been predicted for all frames of the video; a binary cross-entropy (BCE) loss function is used for convergence, completing the training of the video target segmentation system.
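A minimal sketch of the training pass in steps 2)-7), assuming simple stand-in module interfaces; `encoder_a`, `encoder_b`, `attention`, and `decoder` are hypothetical callables standing in for the ResNet50-based encoders and the modules described below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_on_video(encoder_a, encoder_b, attention, decoder,
                   frames, masks, optimizer):
    """One training pass over a video: encode frame 1 with its mask as the
    past frame; for each later frame predict a mask, accumulate BCE loss,
    and refresh the past-frame features from the prediction."""
    # Step 2): splice frame and mask along the channel dimension.
    past = encoder_b(torch.cat((frames[0], masks[0]), dim=1))
    loss = torch.zeros(())
    for t in range(1, len(frames)):
        cur = encoder_a(frames[t])                    # step 3)
        logits = decoder(attention(cur, past))        # step 4): mask logits
        loss = loss + F.binary_cross_entropy_with_logits(logits, masks[t])
        # Step 5): re-encode the frame with its predicted mask as a past frame.
        past = encoder_b(torch.cat((frames[t], logits.sigmoid().detach()), dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Detaching the predicted mask before re-encoding (an assumption here, not stated in the text) keeps the gradient graph from growing across the whole video.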
The current frame image passes through encoder A to obtain the current-frame features. Together with the past-frame features obtained from encoder B, the two feature maps are spliced along the channel dimension and sent into the time sequence content perception attention module to obtain the temporal features, which the decoder then maps to the output target mask. The current frame image is next spliced with its target mask and processed by encoder B (the frame thereby becomes a past frame), yielding new past-frame features for predicting the next frame. The feature extraction network consists of two encoders: encoder A extracts current-frame features, while encoder B extracts the features of past frames together with their prediction masks. The inputs of the time sequence content perception attention module are the current-frame features and the past-frame features produced by the feature extraction network, spliced along the channel dimension. The specific structure of the module is shown in Fig. 2: first, features on the past-frame features that are similar to the current-frame features are gathered according to the optical flow from an optical flow prediction module, forming a similar feature set.
As shown in Fig. 3, the optical flow prediction module comprises 2 channel splicing (concatenation) modules C and 6 convolution (Conv) modules of 3×3. The current-frame and past-frame features pass through the first channel splicing module into one 3×3 Conv, and are then split into 4 parallel 3×3 Conv branches, of which 3 are dilated (atrous) convolutions with dilation rates D of 2, 4, and 8 respectively; the outputs of the 4 parallel branches pass through the second channel splicing module to produce the optical flow field between the current-frame features and the past-frame features. This flow field is then used to extract feature vectors from the past-frame features by bilinear interpolation, and the extracted feature vectors (the sampled past-frame features) are matched with the current-frame features to obtain the time sequence content perception features of the current frame. The matching module matches the current-frame features against the similar feature set. Specifically, the time sequence content perception attention module predicts the optical flow fields between the current-frame features and the past-frame features in sequence, obtaining the position of each current-frame feature vector within the past-frame features. Since these positions are fractional, bilinear interpolation is used to extract the feature vector at each position. Performing this for all past-frame features yields a similar feature set for the current-frame features across all past frames.
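The optical flow prediction module described above might be sketched as follows. The channel widths (`mid_ch`) and the dilation rate 1 on the fourth branch are assumptions, since the text specifies only three dilation rates (2, 4, 8) for the four parallel branches.

```python
import torch
import torch.nn as nn

class FlowPredictionModule(nn.Module):
    """Sketch of the flow prediction block: concatenate current- and
    past-frame features, one 3x3 conv, four parallel 3x3 convs (three of
    them dilated with rates 2/4/8), concatenate, and a final 3x3 conv to
    a 2-channel flow field -- six 3x3 convs and two concatenations in all."""
    def __init__(self, feat_ch, mid_ch=64):
        super().__init__()
        self.stem = nn.Conv2d(2 * feat_ch, mid_ch, 3, padding=1)
        # Dilation enlarges the receptive field at no extra parameter cost;
        # padding = dilation keeps the spatial size for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(mid_ch, mid_ch, 3, padding=d, dilation=d)
            for d in (1, 2, 4, 8)
        ])
        self.head = nn.Conv2d(4 * mid_ch, 2, 3, padding=1)  # (dx, dy) flow

    def forward(self, cur_feat, past_feat):
        x = torch.relu(self.stem(torch.cat((cur_feat, past_feat), dim=1)))
        x = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return self.head(x)  # (B, 2, H, W) optical flow field
```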
The size of the similar feature set is far smaller than the total number of past-frame features, so matching the current frame against this set reduces the computation of matching; since the similar features are gathered with the current-frame features as reference, the matching is adaptive, content-aware local matching — fast, with no loss of accuracy. In the feature matching process, the dot-product similarity between each feature to be matched and the sampled past-frame feature vectors is computed (implemented as matrix multiplication); the similarity values are normalized with softmax, and the resulting probabilities weight the corresponding sampled feature vectors to give the temporal feature at the position to be matched. Performing the same operation for every feature to be matched yields a temporal feature map of the same size as the current-frame features. These temporal features are input into the decoder, which predicts the mask of the target of interest in the current frame. The decoder takes the temporal features output by the attention module as input and outputs the target prediction mask of the current frame.
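The dot-product matching just described (matrix multiplication, softmax normalization, weighted aggregation) could look like this sketch, with K sampled candidate vectors per position; the tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def match_local(query, keys):
    """Per-position attention over K sampled candidate vectors, written
    with batched matrix multiplication as the text describes.
    query: (B, C, H, W); keys: (B, K, C, H, W) -> (B, C, H, W)."""
    B, K, C, H, W = keys.shape
    q = query.permute(0, 2, 3, 1).reshape(B * H * W, 1, C)    # one query/pos
    k = keys.permute(0, 3, 4, 1, 2).reshape(B * H * W, K, C)  # K candidates/pos
    sim = torch.bmm(q, k.transpose(1, 2))                     # (N, 1, K) dot products
    attn = F.softmax(sim, dim=-1)                             # normalise similarities
    out = torch.bmm(attn, k)                                  # weighted candidate sum
    return out.reshape(B, H, W, C).permute(0, 3, 1, 2)
```

With K = 1 the softmax weight is exactly 1, so the output equals the single sampled candidate — a quick way to verify the reshapes.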
Specifically, the two encoders in the feature extraction network are built from ResNet50, with encoder B taking 4-channel input data. The decoder is formed by stacking two 2× upsampling modules, each consisting of a 2× upsampling interpolation layer, a convolution layer, a BatchNorm layer, and a ReLU layer.
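A sketch of one decoder upsampling module and of the 4-channel adaptation of encoder B; exact channel counts are not given in the text and are illustrative here.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """One decoder stage as described: 2x bilinear upsampling followed by
    convolution, BatchNorm, and ReLU. Channel counts are illustrative."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Encoder B takes 4-channel input (RGB frame + 1-channel mask); with a
# torchvision ResNet-50 this would presumably mean replacing the stem conv:
#   resnet.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
#                            padding=3, bias=False)
```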
In the testing step, the video to be processed is input to the trained video target segmentation system, and the interested target area in the video can be obtained.
Claims (1)
1. The video target segmentation method based on the time sequence content perception attention mechanism comprises the following steps:
1) training a video target segmentation system:
1-1) receiving a training sample video and a target mask for each frame in the video;
1-2) splicing the i-th frame of the video, serving as the past frame, with its corresponding target mask along the channel dimension, then inputting the result into encoder B of a feature extraction network; encoder B outputs the i-th frame features;
1-3) taking the (i+1)-th frame as the current frame and inputting it into encoder A of the feature extraction network; encoder A outputs the (i+1)-th frame features;
1-4) sending the (i+1)-th frame features and the i-th frame features into a time sequence content perception attention module, which outputs the time sequence content perception features of the (i+1)-th frame;
1-5) inputting the time sequence content perception features of the (i+1)-th frame into a decoder, which outputs the target mask of the (i+1)-th frame;
1-6) splicing the (i+1)-th frame and its target-of-interest mask along the channel dimension and inputting the result into encoder B of the feature extraction network; encoder B outputs the updated (i+1)-th frame features;
1-7) judging whether target masks have been output for all frames of the training sample video, or whether the convergence condition of the loss function in the video target segmentation system is met; if so, the training of the video target segmentation system is complete; otherwise, update i to i+1, take the updated i-th frame features as the past-frame (i-th frame) features, and return to step 1-3);
the time sequence content perception attention module in step 1-4) processes the received (i+1)-th frame features and i-th frame features as follows: first, optical flow prediction is performed on the (i+1)-th frame features and the i-th frame features to obtain an optical flow field between the current-frame features and the past-frame features; this flow field is used to extract feature vectors from the i-th frame features by bilinear interpolation; the extracted feature vectors are then matched with the (i+1)-th frame features to obtain the time sequence content perception features of the (i+1)-th frame;
2) video target segmentation system testing: input the video to be processed into the trained video target segmentation system, which outputs the target region of interest in the video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110634977.9A CN113744306B (en) | 2021-06-08 | 2021-06-08 | Video target segmentation method based on time sequence content perception attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110634977.9A CN113744306B (en) | 2021-06-08 | 2021-06-08 | Video target segmentation method based on time sequence content perception attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113744306A true CN113744306A (en) | 2021-12-03 |
CN113744306B CN113744306B (en) | 2023-07-21 |
Family
ID=78728416
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110634977.9A Active CN113744306B (en) | 2021-06-08 | 2021-06-08 | Video target segmentation method based on time sequence content perception attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113744306B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110868598A (en) * | 2019-10-17 | 2020-03-06 | Shanghai Jiao Tong University | Video content replacement method and system based on generative adversarial network |
CN111210446A (en) * | 2020-01-08 | 2020-05-29 | University of Science and Technology of China | Video target segmentation method, device and equipment |
CN111968123A (en) * | 2020-08-28 | 2020-11-20 | Beijing Jiaotong University | Semi-supervised video target segmentation method |
CN112085760A (en) * | 2020-09-04 | 2020-12-15 | Xiamen University | Foreground segmentation method for laparoscopic surgery video |
CN112529931A (en) * | 2020-12-23 | 2021-03-19 | Nanjing University of Aeronautics and Astronautics | Foreground segmentation method and system |
CN112749712A (en) * | 2021-01-22 | 2021-05-04 | Sichuan University | RGBD salient object detection method based on 3D convolutional neural network |
US20210150727A1 (en) * | 2019-11-19 | 2021-05-20 | Samsung Electronics Co., Ltd. | Method and apparatus with video segmentation |
Non-Patent Citations (5)
Title |
---|
YANG JIE et al.: "Context-aware deformable alignment for video object segmentation", 2022 26th International Conference on Pattern Recognition *
YAN Guangyu et al.: "Real-time semantic segmentation algorithm based on hybrid attention" (in Chinese), Modern Computer *
YANG Jie: "Research on video object segmentation algorithms based on spatio-temporal matching" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology *
TANG Yiming et al.: "A survey of visual single-object tracking algorithms" (in Chinese), Measurement & Control Technology *
WANG Ziyi et al.: "An improved DeeplabV3 network smoke segmentation algorithm" (in Chinese), Journal of Xidian University *
Also Published As
Publication number | Publication date |
---|---|
CN113744306B (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516536B (en) | Weak supervision video behavior detection method based on time sequence class activation graph complementation | |
CN111968150B (en) | Weak surveillance video target segmentation method based on full convolution neural network | |
CN112016682B (en) | Video characterization learning and pre-training method and device, electronic equipment and storage medium | |
CN110688927B (en) | Video action detection method based on time sequence convolution modeling | |
CN114494981B (en) | Action video classification method and system based on multi-level motion modeling | |
CN112364699A (en) | Remote sensing image segmentation method, device and medium based on weighted loss fusion network | |
CN111476133B (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN111526434A (en) | Converter-based video abstraction method | |
CN110852295A (en) | Video behavior identification method based on multitask supervised learning | |
CN115375737B (en) | Target tracking method and system based on adaptive time and serialized space-time characteristics | |
CN112163490A (en) | Target detection method based on scene picture | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN114996495A (en) | Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement | |
Wang et al. | Lightweight bilateral network for real-time semantic segmentation | |
CN113744306B (en) | Video target segmentation method based on time sequence content perception attention mechanism | |
CN116630850A (en) | Twin target tracking method based on multi-attention task fusion and bounding box coding | |
CN115797827A (en) | ViT human body behavior identification method based on double-current network architecture | |
CN113033283B (en) | Improved video classification system | |
CN112487927B (en) | Method and system for realizing indoor scene recognition based on object associated attention | |
CN113255493B (en) | Video target segmentation method integrating visual words and self-attention mechanism | |
CN114359786A (en) | Lip language identification method based on improved space-time convolutional network | |
CN111382761B (en) | CNN-based detector, image detection method and terminal | |
CN116170638B (en) | Self-attention video stream compression method and system for online action detection task | |
CN117115474A (en) | End-to-end single target tracking method based on multi-stage feature extraction | |
CN117558067A (en) | Action prediction method based on action recognition and sequence reasoning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||