CN113744306A - Video target segmentation method based on time sequence content perception attention mechanism - Google Patents

Video target segmentation method based on time sequence content perception attention mechanism

Info

Publication number
CN113744306A
Authority
CN
China
Prior art keywords
frame
feature
video
time sequence
target
Prior art date
Legal status
Granted
Application number
CN202110634977.9A
Other languages
Chinese (zh)
Other versions
CN113744306B (en)
Inventor
周雪 (Zhou Xue)
杨杰 (Yang Jie)
邹见效 (Zou Jianxiao)
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110634977.9A
Publication of CN113744306A
Application granted
Publication of CN113744306B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/215 Motion-based segmentation
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20004 Adaptive image processing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20092 Interactive image processing based on input by user
    • G06T 2207/20104 Interactive definition of region of interest [ROI]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The invention discloses a video target segmentation method based on a time sequence content perception attention mechanism, aimed at the huge computation of the global matching attention mechanism. The redundant computation of temporal feature matching is removed: the target features of the current frame are matched only with the same target's features in the past frames rather than with all past-frame features. Under the assumption of temporal continuity, several groups of optical flow fields between the current frame and the past frames are learned, and the feature vectors related to the features to be matched (the current-frame features) are sampled from the past frames according to the position mapping given by these optical flow fields. Finally, matching is carried out between the features to be matched and the sampled past-frame feature vectors, i.e., similar-region matching is performed locally rather than over the whole image, which reduces the computation of matching; since the similar features are gathered with the current-frame features as reference, the method runs fast, and because irrelevant noisy features are removed, the matching accuracy is also higher.

Description

Video target segmentation method based on time sequence content perception attention mechanism
Technical Field
The invention relates to machine learning technology, and in particular to machine-learning-based video object segmentation technology.
Background
Video object segmentation is a fundamental task in computer vision. It requires assigning a label to every pixel of every frame in a video, i.e., separating the foreground object from the background with binary labels, and it draws on knowledge from pattern recognition, machine learning, and related fields. Video object segmentation is of great importance for a wide range of applications such as video editing, object tracking, and scene understanding. With the development of computer science, deep learning, and real-world needs, video object segmentation has attracted the attention of many researchers in recent years and has seen considerable research progress. According to the degree of supervision, video object segmentation tasks can be divided into three categories: unsupervised, semi-supervised, and interactive video object segmentation. Unsupervised video object segmentation must discover and segment the main objects in a video, meaning the algorithm must decide on its own which object is the primary one. Semi-supervised video object segmentation is given a first frame or key frame with mask information. In interactive video object segmentation, a human first draws the approximate outline of the target with a mouse, and a video segmentation algorithm then performs the segmentation in a second step.
In current attention-based video target segmentation algorithms, the temporal feature matching step performs global matching, i.e., the features of the current frame are matched one by one against all features of the past frames; this incurs a large amount of computation and makes the model slow.
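For illustration, the cost of such global matching can be sketched as follows; the tensor sizes below are hypothetical and are not taken from the patent:

```python
import torch

# Illustrative cost of global matching attention (not the invention's method):
# every current-frame feature vector is compared with every past-frame feature vector.
B, C, H, W, T = 1, 256, 30, 54, 8          # hypothetical feature sizes, T past frames
cur  = torch.randn(B, H * W, C)            # current-frame features, (B, HW, C)
past = torch.randn(B, T * H * W, C)        # features of all past frames, (B, T*HW, C)

affinity = torch.bmm(cur, past.transpose(1, 2))   # (B, HW, T*HW) similarity matrix
# The affinity matrix grows linearly with the number of past frames T,
# so both memory and computation become prohibitive for long videos.
print(affinity.shape)   # torch.Size([1, 1620, 12960])
```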
Disclosure of Invention
The technical problem that the invention aims to solve is to provide a video target segmentation method that adjusts the matching position based on time sequence content perception and performs local matching.
The technical solution adopted by the invention is a video target segmentation method based on a time sequence content perception attention mechanism, comprising the following steps:
1) training a video target segmentation system:
1-1) receiving a training sample video and a target mask for each frame in the video;
1-2) concatenating the i-th frame of the video, serving as the past frame, with its corresponding target mask along the channel dimension, and inputting the result into encoder B of the feature extraction network; encoder B outputs the features of the i-th frame;
1-3) taking the (i+1)-th frame as the current frame and inputting it into encoder A of the feature extraction network; encoder A outputs the features of the (i+1)-th frame;
1-4) feeding the (i+1)-th frame features and the i-th frame features into the time sequence content perception attention module, which outputs the time sequence content perception features of the (i+1)-th frame;
1-5) inputting the time sequence content perception features of the (i+1)-th frame into the decoder, which outputs the target mask of the (i+1)-th frame;
1-6) concatenating the (i+1)-th frame with the target mask of interest of the (i+1)-th frame along the channel dimension, inputting the result into encoder B of the feature extraction network, and obtaining from encoder B the updated features of the (i+1)-th frame;
1-7) judging whether target masks have been output for all frames of the training sample video or whether the convergence condition of the loss function of the video target segmentation system is satisfied; if so, training of the video target segmentation system is finished; otherwise, updating i to i+1, taking the updated i-th frame features as the features of the i-th frame in the video, and returning to step 1-3);
the specific processing mode when the attention module for time sequence content perception in the step 1-4) receives the i +1 th frame feature and the i frame feature is as follows: firstly, performing optical flow prediction processing on the (i + 1) th frame feature and the ith frame feature to obtain an optical flow field between the current frame feature and the past frame feature, and extracting a feature vector in the ith frame feature by using bilinear interpolation by using the optical flow field between the current frame feature and the past frame feature; matching the extracted feature vector with the features of the (i + 1) th frame to obtain the time sequence content perception features of the (i + 1) th frame;
2) testing the video target segmentation system: the video to be processed is input into the trained video target segmentation system, which outputs the target region of interest in the video.
Aiming at the huge computation of the global matching attention mechanism, the invention removes the redundant computation of temporal feature matching: the target features of the current frame are matched only with the same target's features in the past frames, rather than with all features of the past frames. Under the assumption of temporal continuity, the invention learns several groups of optical flow fields between the current frame and the past frames and, according to the position mapping given by these optical flow fields, samples from the past frames the feature vectors related to the features to be matched (the current-frame features). Finally, matching is carried out between the features to be matched and the sampled past-frame feature vectors, i.e., similar-region matching is performed locally rather than over the whole image, which reduces the computation of matching; since the similar features are gathered with the current-frame features as reference, the method runs fast, and because irrelevant noisy features are removed, the matching accuracy is also higher.
The advantage of the method is that, by adjusting the matching position based on time sequence content perception and performing local matching, the running speed of the temporal feature matching stage is effectively increased and the matching accuracy is improved.
Drawings
FIG. 1 is a schematic diagram of video object segmentation based on a temporal content aware attention mechanism.
Fig. 2 shows the specific structure of the time sequence content perception attention module.
Fig. 3 shows the specific structure of the module that predicts the optical flow field between temporal features.
Detailed Description
The specific structure of the video target segmentation system used to carry out the method of the present invention is shown in fig. 1; it comprises a feature extraction network, a time sequence content perception attention module, and a decoder. The feature extraction network includes an encoder A for the current frame and an encoder B for the past frame.
The whole training process of the video target segmentation system is as follows:
1) collecting a video and providing a target mask for each frame in the video;
2) the first frame image, taken as the past frame, is concatenated with its corresponding target mask along the channel dimension and fed into encoder B, which processes past frames, to obtain the features of the first frame;
3) the second frame, taken as the current frame, i.e., the frame to be processed, is fed into encoder A, which processes the current frame, to obtain the features of the second frame;
4) the second-frame features and the first-frame features are fed into the time sequence content perception attention module to obtain the time sequence content perception features of the second frame, which are fed into the decoder to obtain the target mask of interest of the second frame;
5) the second frame image and the target mask of interest of the second frame are concatenated along the channel dimension and fed into encoder B, which processes past frames, to obtain the features of the second frame anew;
6) the third frame is regarded as the current frame and the re-obtained second-frame features as the past-frame features; the predicted target mask of the third frame is obtained by the method of steps 2)-5), and the third-frame features are obtained anew as the new past-frame features by the method of step 5);
7) step 6) is repeated until target masks have been predicted for all frames in the video; a Binary Cross Entropy loss function is used for convergence, which completes the training of the video target segmentation system. A sketch of this frame-by-frame loop is given below.
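The following is a minimal PyTorch-style sketch of the frame-by-frame training loop described above; the module names (encoder_a, encoder_b, tca_attention, decoder) and the use of binary cross entropy with logits are assumptions for illustration, not the patent's exact implementation:

```python
import torch
import torch.nn.functional as F

def train_on_video(frames, gt_masks, encoder_a, encoder_b, tca_attention, decoder, optimizer):
    """Frame-by-frame training as in steps 1)-7); module names are hypothetical.

    frames:   sequence of (B, 3, H, W) video frames.
    gt_masks: sequence of (B, 1, H, W) target masks, assumed to match the decoder's output size.
    """
    # Step 2): past frame = first frame concatenated with its target mask (4 channels total).
    past_feat = encoder_b(torch.cat([frames[0], gt_masks[0]], dim=1))
    total_loss = 0.0
    for t in range(1, len(frames)):
        cur_feat = encoder_a(frames[t])                    # step 3): current-frame features
        tca_feat = tca_attention(cur_feat, past_feat)      # step 4): time sequence content perception features
        pred_mask = decoder(tca_feat)                      # step 4): predicted mask logits of frame t
        total_loss = total_loss + F.binary_cross_entropy_with_logits(pred_mask, gt_masks[t])
        # Step 5): re-encode the current frame with its predicted mask as the new past frame.
        past_feat = encoder_b(torch.cat([frames[t], torch.sigmoid(pred_mask)], dim=1))
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```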
The current frame image is passed through encoder A to obtain the current-frame features; together with the past-frame features obtained by encoder B, the two sets of features are fed into the time sequence content perception attention module to obtain the temporal features, which the decoder then maps to the output target mask. The current frame image and this target mask are then concatenated and processed by encoder B (the frame thereby becoming a past frame), yielding new past-frame features for predicting the next frame. The feature extraction network consists of two encoders: encoder A extracts the current-frame features, and encoder B extracts the past-frame features from past frames and their predicted masks. The inputs of the time sequence content perception attention module are the current-frame features and past-frame features produced by the feature extraction network, concatenated along the channel dimension. The specific structure of the time sequence content perception attention module is shown in fig. 2: first, an optical flow prediction module gathers, according to the optical flow, those features on the past-frame features that are similar to the current-frame features, forming a similar-feature set. As shown in fig. 3, the optical flow prediction module comprises 2 channel-concatenation modules C and 6 convolution (Conv) modules of 3×3. The current-frame features and past-frame features pass through the first channel-concatenation module into one 3×3 Conv and are then split into 4 parallel 3×3 Conv branches, three of which are dilated (atrous) convolutions with dilation rates D of 2, 4, and 8, respectively; the outputs of the 4 parallel 3×3 Conv branches pass through the second channel-concatenation module and yield the optical flow field between the current-frame features and the past-frame features. Next, this optical flow field is used to extract feature vectors from the past-frame features by bilinear interpolation; the extracted feature vectors (the extracted past-frame features) are then matched with the current-frame features to obtain the time sequence content perception features of the current frame. The matching module matches the current-frame features against the similar-feature set. Specifically, the time sequence content perception attention module predicts the optical flow field between the current-frame features and each past frame's features in turn, obtaining the position of each current-frame feature vector within the past-frame features. Since these positions are fractional, bilinear interpolation is used to extract the feature vectors at those positions. This is done for all past-frame features, producing the similar-feature set of the current-frame features across all past frames.
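A hedged PyTorch sketch of the optical flow prediction module of fig. 3 and of the bilinear-interpolation sampling step follows; the channel widths, the placement of the final flow-regression convolution, and the use of F.grid_sample are assumptions, as the patent does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowPredictor(nn.Module):
    """Sketch of the optical flow prediction module of fig. 3 (channel widths assumed)."""
    def __init__(self, feat_ch=256, mid_ch=128):
        super().__init__()
        self.conv_in = nn.Conv2d(2 * feat_ch, mid_ch, 3, padding=1)       # after 1st channel concat
        self.branches = nn.ModuleList([
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),                      # plain 3x3 conv
            nn.Conv2d(mid_ch, mid_ch, 3, padding=2, dilation=2),          # dilated conv, D=2
            nn.Conv2d(mid_ch, mid_ch, 3, padding=4, dilation=4),          # dilated conv, D=4
            nn.Conv2d(mid_ch, mid_ch, 3, padding=8, dilation=8),          # dilated conv, D=8
        ])
        self.conv_out = nn.Conv2d(4 * mid_ch, 2, 3, padding=1)            # 2-channel flow field

    def forward(self, cur_feat, past_feat):
        x = F.relu(self.conv_in(torch.cat([cur_feat, past_feat], dim=1))) # 1st channel concat
        x = torch.cat([branch(x) for branch in self.branches], dim=1)     # 2nd channel concat
        return self.conv_out(x)                                           # flow: current -> past

def sample_past(past_feat, flow):
    """Bilinearly sample past-frame feature vectors at the flow-shifted (fractional) positions."""
    B, _, H, W = past_feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=flow.device, dtype=flow.dtype),
                            torch.arange(W, device=flow.device, dtype=flow.dtype), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1       # normalize x coordinates to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1       # normalize y coordinates to [-1, 1]
    grid = torch.stack([grid_x, grid_y], dim=-1)       # (B, H, W, 2) sampling grid
    return F.grid_sample(past_feat, grid, mode="bilinear", align_corners=True)
```

Applying sample_past once per past frame yields, for every current-frame position, one sampled feature vector per past frame, i.e., the similar-feature set described above.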
The size of the similar-feature set is far smaller than the number of features of all past frames, so matching this set against the current frame reduces the computation of matching; and because the similar features are gathered with the current-frame features as reference, the matching is an adaptive, time sequence content perception local matching that runs fast without loss of accuracy. In the feature matching process, each feature to be matched and the sampled past-frame feature vectors compute dot-product similarities (implemented with matrix multiplication); the similarity values are then normalized by softmax, and the normalized probability values weight the corresponding sampled feature vectors to produce the temporal feature at the position to be matched. Performing the same operation for every feature to be matched yields temporal features of the same size as the current-frame features. These temporal features are fed into the decoder, which outputs the predicted mask of the target of interest in the current frame.
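A minimal sketch of this matching step, assuming the sampled feature vectors are stacked into a tensor of shape (B, K, C, H, W) for K past frames; the shapes and the function name are illustrative only:

```python
import torch
import torch.nn.functional as F

def temporal_content_aware_matching(cur_feat, sampled_feats):
    """Match each current-frame feature vector against its sampled similar-feature set.

    cur_feat:      (B, C, H, W) current-frame features (queries).
    sampled_feats: (B, K, C, H, W) vectors sampled from K past frames at flow-predicted positions.
    Returns temporal features of the same size as cur_feat.
    """
    B, C, H, W = cur_feat.shape
    K = sampled_feats.shape[1]
    q = cur_feat.permute(0, 2, 3, 1).reshape(B, H * W, 1, C)              # (B, HW, 1, C)
    k = sampled_feats.permute(0, 3, 4, 1, 2).reshape(B, H * W, K, C)      # (B, HW, K, C)
    sim = torch.matmul(q, k.transpose(-1, -2))                            # dot-product similarity, (B, HW, 1, K)
    attn = F.softmax(sim, dim=-1)                                         # normalized probability values
    out = torch.matmul(attn, k)                                           # weighted sum of sampled vectors
    return out.reshape(B, H, W, C).permute(0, 3, 1, 2)                    # back to (B, C, H, W)
```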
Specifically, the two encoders in the feature extraction network are built from ResNet-50, with encoder B taking 4-channel input data. The decoder is formed by stacking two 2× upsampling modules, each consisting of a 2× upsampling interpolation layer, a convolution layer, a BatchNorm layer, and a ReLU layer.
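A sketch of the feature extraction network and decoder under these specifications; the channel widths, the final 1-channel mask head, and the torchvision constructor are assumptions not stated in the patent:

```python
import torch.nn as nn
import torchvision

def make_encoder(in_channels):
    """ResNet-50 backbone used as an encoder; encoder B takes 4 input channels (RGB + mask)."""
    backbone = torchvision.models.resnet50(weights=None)
    if in_channels != 3:
        backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool + fc, keep feature maps

class UpBlock(nn.Module):
    """One 2x upsampling module: 2x bilinear interpolation -> conv -> BatchNorm -> ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class Decoder(nn.Module):
    """Two stacked 2x upsampling modules followed by an assumed 1-channel mask head."""
    def __init__(self, in_ch=2048, mid_ch=256):
        super().__init__()
        self.up1 = UpBlock(in_ch, mid_ch)
        self.up2 = UpBlock(mid_ch, mid_ch // 2)
        self.head = nn.Conv2d(mid_ch // 2, 1, 1)
    def forward(self, x):
        return self.head(self.up2(self.up1(x)))

encoder_a = make_encoder(3)   # current frame (RGB)
encoder_b = make_encoder(4)   # past frame (RGB) concatenated with its mask
```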
In the testing step, the video to be processed is input into the trained video target segmentation system to obtain the target region of interest in the video.

Claims (1)

1. A video target segmentation method based on a time sequence content perception attention mechanism, comprising the following steps:
1) training a video target segmentation system:
1-1) receiving a training sample video and a target mask for each frame in the video;
1-2) concatenating the i-th frame of the video, serving as the past frame, with its corresponding target mask along the channel dimension, and inputting the result into encoder B of the feature extraction network; encoder B outputs the features of the i-th frame;
1-3) taking the (i+1)-th frame as the current frame and inputting it into encoder A of the feature extraction network; encoder A outputs the features of the (i+1)-th frame;
1-4) feeding the (i+1)-th frame features and the i-th frame features into the time sequence content perception attention module, which outputs the time sequence content perception features of the (i+1)-th frame;
1-5) inputting the time sequence content perception features of the (i+1)-th frame into the decoder, which outputs the target mask of the (i+1)-th frame;
1-6) concatenating the (i+1)-th frame with the target mask of interest of the (i+1)-th frame along the channel dimension, inputting the result into encoder B of the feature extraction network, and obtaining from encoder B the updated features of the (i+1)-th frame;
1-7) judging whether target masks have been output for all frames of the training sample video or whether the convergence condition of the loss function of the video target segmentation system is satisfied; if so, training of the video target segmentation system is finished; otherwise, updating i to i+1, taking the updated i-th frame features as the features of the i-th frame in the video, and returning to step 1-3);
the specific processing mode when the attention module for time sequence content perception in the step 1-4) receives the i +1 th frame feature and the i frame feature is as follows: firstly, performing optical flow prediction processing on the (i + 1) th frame feature and the ith frame feature to obtain an optical flow field between the current frame feature and the past frame feature, and extracting a feature vector in the ith frame feature by using bilinear interpolation by using the optical flow field between the current frame feature and the past frame feature; matching the extracted feature vector with the features of the (i + 1) th frame to obtain the time sequence content perception features of the (i + 1) th frame;
2) testing the video target segmentation system: the video to be processed is input into the trained video target segmentation system, which outputs the target region of interest in the video.
CN202110634977.9A 2021-06-08 2021-06-08 Video target segmentation method based on time sequence content perception attention mechanism Active CN113744306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110634977.9A CN113744306B (en) 2021-06-08 2021-06-08 Video target segmentation method based on time sequence content perception attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110634977.9A CN113744306B (en) 2021-06-08 2021-06-08 Video target segmentation method based on time sequence content perception attention mechanism

Publications (2)

Publication Number Publication Date
CN113744306A (en) 2021-12-03
CN113744306B CN113744306B (en) 2023-07-21

Family

ID=78728416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110634977.9A Active CN113744306B (en) 2021-06-08 2021-06-08 Video target segmentation method based on time sequence content perception attention mechanism

Country Status (1)

Country Link
CN (1) CN113744306B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110868598A (en) * 2019-10-17 2020-03-06 上海交通大学 Video content replacement method and system based on countermeasure generation network
US20210150727A1 (en) * 2019-11-19 2021-05-20 Samsung Electronics Co., Ltd. Method and apparatus with video segmentation
CN111210446A (en) * 2020-01-08 2020-05-29 中国科学技术大学 Video target segmentation method, device and equipment
CN111968123A (en) * 2020-08-28 2020-11-20 北京交通大学 Semi-supervised video target segmentation method
CN112085760A (en) * 2020-09-04 2020-12-15 厦门大学 Prospect segmentation method of laparoscopic surgery video
CN112529931A (en) * 2020-12-23 2021-03-19 南京航空航天大学 Foreground segmentation method and system
CN112749712A (en) * 2021-01-22 2021-05-04 四川大学 RGBD significance object detection method based on 3D convolutional neural network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YANG Jie et al.: "Context-aware deformable alignment for video object segmentation", 2022 26th International Conference on Pattern Recognition *
YAN Guangyu et al.: "Real-time semantic segmentation algorithm based on hybrid attention", 《现代计算机》 (Modern Computer) *
YANG Jie: "Research on video object segmentation algorithms based on spatio-temporal matching", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *
TANG Yiming et al.: "A survey of visual single-object tracking algorithms", 《测控技术》 (Measurement & Control Technology) *
WANG Ziyi et al.: "A smoke segmentation algorithm based on an improved DeeplabV3 network", 《西安电子科技大学学报》 (Journal of Xidian University) *

Also Published As

Publication number Publication date
CN113744306B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
CN112016682B (en) Video characterization learning and pre-training method and device, electronic equipment and storage medium
CN110688927B (en) Video action detection method based on time sequence convolution modeling
CN114494981B (en) Action video classification method and system based on multi-level motion modeling
CN112364699A (en) Remote sensing image segmentation method, device and medium based on weighted loss fusion network
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN111526434A (en) Converter-based video abstraction method
CN110852295A (en) Video behavior identification method based on multitask supervised learning
CN115375737B (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
CN112163490A (en) Target detection method based on scene picture
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
Wang et al. Lightweight bilateral network for real-time semantic segmentation
CN113744306B (en) Video target segmentation method based on time sequence content perception attention mechanism
CN116630850A (en) Twin target tracking method based on multi-attention task fusion and bounding box coding
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture
CN113033283B (en) Improved video classification system
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN113255493B (en) Video target segmentation method integrating visual words and self-attention mechanism
CN114359786A (en) Lip language identification method based on improved space-time convolutional network
CN111382761B (en) CNN-based detector, image detection method and terminal
CN116170638B (en) Self-attention video stream compression method and system for online action detection task
CN117115474A (en) End-to-end single target tracking method based on multi-stage feature extraction
CN117558067A (en) Action prediction method based on action recognition and sequence reasoning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant