WO2023035904A9 - 视频时序动作提名生成方法及系统 - Google Patents

视频时序动作提名生成方法及系统 (Video temporal action nomination generation method and system)

Info

Publication number
WO2023035904A9
WO2023035904A9 (PCT/CN2022/113540)
Authority
WO
WIPO (PCT)
Prior art keywords
features
nomination
video
segment
decoder
Prior art date
Application number
PCT/CN2022/113540
Other languages
English (en)
French (fr)
Other versions
WO2023035904A1 (zh)
Inventor
罗平
吴剑南
沈家骏
马岚
Original Assignee
港大科桥有限公司
Tcl科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 港大科桥有限公司, Tcl科技集团股份有限公司 filed Critical 港大科桥有限公司
Publication of WO2023035904A1 publication Critical patent/WO2023035904A1/zh
Publication of WO2023035904A9 publication Critical patent/WO2023035904A9/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition

Definitions

  • the present invention relates to video processing, and in particular to systems and methods for generating video temporal action nominations.
  • Generating video temporal action nominations is a key step in video temporal action detection. Its purpose is to detect action segments containing human behavior from an untrimmed long video, that is, to determine the start and end times of each action.
  • High-quality video temporal action nominations should have the following two key characteristics: (1) accurate temporal boundaries, that is, the generated action nominations should completely cover the region where the action occurs; (2) reliable confidence scores that accurately evaluate the quality of the generated nominations for subsequent retrieval ranking.
  • By combining video temporal action nominations with specific action categories, the subsequent video temporal action detection task can be completed. Generating video temporal action nominations efficiently and with high quality is beneficial to improving the recognition accuracy of video actions.
  • the purpose of the embodiments of the present invention is to provide a new video temporal action nomination generation method and system to quickly and efficiently generate high-quality video temporal action nominations.
  • a video temporal action nomination generation system which includes a feature extraction module, a feature processing module and a prediction module.
  • the feature extraction module is used to extract video features related to the video from the input video.
  • the feature processing module includes a pre-trained encoder and a decoder, in which the encoder obtains video coding features with global information based on the video features from the feature extraction module, and extracts, through several pre-trained nomination segments, the segment features of interest corresponding to each nomination segment from the video coding features.
  • the segment features of interest corresponding to the nomination segments are provided to the decoder, and the decoder generates segment features based on the segment features of interest corresponding to each nomination segment and the pre-trained nomination features corresponding to the nomination segments, and provides them to the prediction module.
  • the prediction module generates temporal action nomination results based on segment features from the decoder, which include nomination boundaries and confidence scores.
  • the encoder includes a graph attention layer, a multi-head self-attention layer and a feed-forward layer, wherein the encoder uses the sum of the video features and the position encoding as the value vector input of the multi-head self-attention layer, while providing that sum as input to the graph attention layer for processing; the graph attention layer's output is linearly transformed to obtain the query vector and key vector of the multi-head self-attention layer.
  • the decoder includes a multi-head self-attention layer, a sparse interaction module and a feed-forward layer, wherein the decoder processes the nomination features corresponding to the nomination segments through the multi-head self-attention layer and then provides them to the sparse interaction module, where they undergo sparse interaction with the segment features of interest corresponding to those nomination segments; the output of the sparse interaction module is processed by the feed-forward layer to obtain the segment features.
  • the feature processing module may be built based on the transformer model.
  • the prediction module may perform boundary regression and binary prediction based on segment features from the decoder.
  • step S1) extracting video features from the input video via the feature extraction module;
  • step S2) processing the extracted video features via the encoder to obtain video coding features with the global context information of the input video;
  • step S3) using each of several pre-trained nomination segments to extract the corresponding segment features of interest from the video coding features;
  • step S4) generating segment features via the decoder based on the segment features of interest corresponding to each nomination segment and the pre-trained nomination features corresponding to the nomination segments;
  • step S5) performing boundary regression and binary classification prediction via the prediction module according to the segment features from the decoder, and outputting the corresponding temporal action nomination results.
  • the encoder may include a graph attention layer, a multi-head self-attention layer and a feed-forward layer, wherein step S2) includes using the sum of the video features and the position encoding as the value vector input of the multi-head self-attention layer, while providing that sum as input to the graph attention layer for processing; the graph attention layer's output is linearly transformed to obtain the query vector and key vector of the multi-head self-attention layer.
  • a computer-readable storage medium is also provided, on which a computer program is stored.
  • when the program is executed, the method described in the second aspect of the above embodiments is implemented.
  • This solution can effectively capture the global context information of the video and obtain video coding features with stronger representation ability; moreover, by introducing several learnable nomination segments to extract the feature sequences at the corresponding positions from the video coding features for subsequent prediction, it greatly improves training convergence speed and significantly reduces the computational burden.
  • Figure 1 shows a schematic operational flow diagram of a video temporal action nomination generation system according to an embodiment of the present invention.
  • Figure 2 shows a schematic diagram of the sparse interaction process of the sparse interaction module according to an embodiment of the present invention.
  • Figure 3 shows a schematic flowchart of a method for generating video temporal action nominations according to an embodiment of the present invention.
  • Existing video temporal action nomination generation methods can be divided into anchor box-based methods and boundary-based methods.
  • In the anchor-box-based methods, boundary regression is performed on uniformly distributed anchor boxes with predefined sizes and aspect ratios, and a binary classifier is used to evaluate the confidence score of each nomination.
  • anchor boxes with predefined sizes and aspect ratios are laid at each position of the one-dimensional feature sequence of the video; if the length of the one-dimensional feature sequence is T and K anchor boxes are laid at each position, a total of TK anchor box results need to be predicted.
  • during training, the intersection-over-union (IoU) with the ground-truth label boxes is used to select positive and negative samples, and temporal boundary regression and binary classification of the anchor box confidence are performed on these TK anchor boxes.
  • the boundary-based method generates candidate nominations of any length by enumerating all candidate starting and ending points, and predicts the boundary probability of each candidate nomination to obtain a two-dimensional confidence map.
  • the basic module of this type of method is the convolution layer, which can only capture the information of local areas, but cannot capture the long-term semantic information of the video.
  • Methods of this type include BMN (Lin, T., Liu, X., Li, X., Ding, E., & Wen, S., "BMN: Boundary-Matching Network for Temporal Action Proposal Generation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3889-3898, 2019).
  • both methods have the following two disadvantages: first, as the video length increases, the number of predefined anchor boxes and the size of the generated confidence map grow dramatically, consuming huge computational resources and making the methods hard to apply in practical scenarios; second, both methods generate a large number of redundant nominations and require non-maximum suppression post-processing to remove redundant predictions, which not only demands careful parameter selection but also greatly reduces inference speed.
  • a video temporal action nomination generation system which includes a feature extraction module, a feature processing module and a prediction module.
  • the feature extraction module is used to extract video features related to the video from the input video.
  • the feature processing module is built based on the Transformer model, including an encoder and a decoder.
  • the encoder obtains video coding features with global information based on the video features from the feature extraction module, and extracts the segment features of interest corresponding to each nomination segment from the video coding features through several preset nomination segments and provides them to the decoder.
  • the decoder generates segment features based on the segment features of interest corresponding to each nominated segment and the nomination features corresponding to the nominated segment, and provides them to the prediction module.
  • the prediction module generates temporal action nomination results based on segment features from the decoder, which include nomination boundaries and confidence scores.
  • the system's feature processing module and prediction module are first jointly trained using a training set composed of a large number of video clips annotated with temporal action nominations as samples (which can be called the offline training phase); the video clips to be processed are then provided as input to the trained system, and the output is the temporal action nominations of the input video, including each nomination boundary and the corresponding confidence score (which can be called the online prediction stage).
  • when the system is initialized, the several preset nomination segments and their corresponding nomination features, as well as the parameters involved in the encoder, decoder and prediction module, are all randomly set.
  • the above-mentioned parameters are continuously adjusted during the training process until training ends; the trained parameters are then used in the subsequent online prediction stage.
  • the feature extraction module and prediction module can adopt any type of machine learning model suitable for extracting video features and for predicting nomination boundaries and confidence scores from input features, including but not limited to neural network models; no restriction is imposed on this herein.
  • since the extraction and processing of video features in the training phase and the online prediction phase are basically the same, the following mainly introduces the video feature processing procedure in the training phase with reference to Figure 1.
  • the feature extraction module extracts video features related to the video, such as image features (such as RGB features) and optical flow features of the video.
  • a neural network such as a Temporal Segment Network (TSN) can be used to extract video features.
  • the extracted video features of different dimensions are converted into a series of feature sequences with the same feature dimensions.
  • the feature dimensions of the feature sequence can be set according to actual needs and are not limited here.
  • the video features are denoted as $f \in \mathbb{R}^{M \times C}$, where $\mathbb{R}$ represents the real numbers, M represents the length of the video, which can be understood as the number of frames of the video, and C represents the dimension of the feature vector, that is, the dimension of the feature vector extracted from each video frame.
  • the video feature f can also be regarded as a video feature sequence composed of the feature vectors of M video frames, and each video frame has its own specific position in the sequence.
  • the video features extracted through the feature extraction module are provided to the feature processing module for processing. It should be understood that the above video features can be appropriately transformed to adapt to or match the feature dimensions set in the feature processing module.
  • the extracted features can be aligned in feature dimensions through a one-dimensional convolution layer with a convolution kernel size of 1, and the transformed video feature sequence can be used as the input of the encoder in the subsequent process.
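As an illustration of this alignment step, the following sketch (not part of the patent; the dimensions, variable names and the use of PyTorch are assumptions) projects RGB and optical-flow features of different dimensions to a common dimension d with kernel-size-1 one-dimensional convolutions:

```python
import torch
import torch.nn as nn

# Hypothetical per-frame features from a backbone such as TSN:
# M frames, RGB features of dimension c_rgb, optical-flow features of c_flow.
M, c_rgb, c_flow, d = 100, 2048, 1024, 256
rgb_feat = torch.randn(1, c_rgb, M)    # (batch, channels, time)
flow_feat = torch.randn(1, c_flow, M)

# Kernel-size-1 1-D convolutions align both modalities to dimension d.
align_rgb = nn.Conv1d(c_rgb, d, kernel_size=1)
align_flow = nn.Conv1d(c_flow, d, kernel_size=1)

video_feature = align_rgb(rgb_feat) + align_flow(flow_feat)   # (1, d, M)
video_feature = video_feature.transpose(1, 2)                 # (1, M, d), encoder input
```

Summing the two aligned streams is only one possible fusion; concatenation followed by another projection would serve the same purpose.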
  • the encoder mainly includes multi-head self-attention layers and feed-forward layers.
  • the multi-head self-attention layer consists of multiple independent self-attention layers.
  • the self-attention layer adopts a structure based on the attention mechanism. Its core idea is that, when encoding an element of the sequence, it can look at the other elements of the input sequence and connect the elements pairwise, thereby effectively capturing the global context information of the input sequence and building long-distance dependencies between sequence elements. In this way, relevant features are enhanced and irrelevant features are suppressed.
  • the input of the multi-head self-attention layer is a triplet, consisting of the query vector Q (query), the key vector K (key), and the value vector V (value).
  • the calculation process of each self-attention layer is as follows: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\big(QK^{T}/\sqrt{d_k}\big)\,V$, where $d_k$ is the scaling factor, T denotes the transpose, and softmax() denotes the activation function.
  • the score between two features in the sequence is calculated by performing a dot product operation on the query vector Q and the key vector K.
  • the score represents the correlation between the two features.
  • to keep gradients stable, the score is normalized by the scaling factor $d_k$ and then mapped to between 0 and 1 through the softmax() function.
  • the resulting scores are then used to weight the value vector V, so as to enhance relevant features and suppress irrelevant features.
  • the multi-head self-attention layer contains multiple independent self-attention layers, each of which focuses on a part of the contextual information; their outputs (each recorded as $\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$) are concatenated and further aggregated through a linear layer to obtain the more robust output $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W_O$.
  • h represents the total number of self-attention layers included in the multi-head self-attention layer.
  • $W_O$ is the parameter matrix of the linear layer used to aggregate the features.
  • the output of the multi-head self-attention layer will be further processed by addition and normalization operations before being input to the feedforward layer.
  • the feedforward layer can be composed of two linear transformation layers and a nonlinear activation function (ReLU).
  • the output of the feedforward layer is processed by addition and normalization operations to obtain the output of the entire encoder, which is the video coding feature.
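As a minimal sketch of this feed-forward sublayer with its "add and normalize" step (the hidden width and the use of LayerNorm are assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

class EncoderFeedForward(nn.Module):
    """Two linear layers with a ReLU in between, followed by add & norm."""
    def __init__(self, d=256, d_ff=1024):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(inplace=True), nn.Linear(d_ff, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, x):                       # x: (batch, M, d), output of the attention sublayer
        return self.norm(x + self.ffn(x))       # residual connection, then normalization

out = EncoderFeedForward()(torch.randn(1, 100, 256))
```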
  • in some embodiments, as in the usual Transformer model, the inputs Q, K and V of the multi-head self-attention layer in the encoder are obtained by mapping the input feature sequence through three linear transformation layers with parameter matrices $W_Q$, $W_K$ and $W_V$. For example, assuming the input sequence is $T_0$, then $Q = T_0 W_Q$, $K = T_0 W_K$, $V = T_0 W_V$ (see the sketch below).
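A minimal sketch of this standard attention computation (dimensions are illustrative; the patent does not prescribe them):

```python
import math
import torch
import torch.nn as nn

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # pairwise relevance scores
    return torch.softmax(scores, dim=-1) @ V            # weight the values

d, M = 256, 100
T0 = torch.randn(1, M, d)                               # input feature sequence
W_Q, W_K, W_V = (nn.Linear(d, d, bias=False) for _ in range(3))
single_head = attention(W_Q(T0), W_K(T0), W_V(T0))      # (1, M, d)

# A multi-head layer runs h such heads in parallel, concatenates them and
# aggregates with a linear layer W_O; torch provides an equivalent module.
mha = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
multi_head, _ = mha(T0, T0, T0)                         # (1, M, d)
```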
  • a graph attention layer is introduced in the encoder, on top of the multi-head self-attention layer, to preprocess the input sequence, so that the encoder can better focus on the segments of the video in which actions occur and build connections between action segments, thereby obtaining coding features with stronger representation ability.
  • to preserve the relative position relationships in the video features (for example, the relative positions and temporal order of the video frames), position encoding is used in the encoder, and the sum of the input video feature sequence and the position encoding, $x \in \mathbb{R}^{M \times d}$, is used as the encoder input, where d is the feature dimension used in the encoder.
  • the dimension of the position encoding is the same as the dimension of the input video feature, that is, the feature vector of each video frame in the input video feature sequence has its own corresponding position encoding.
  • the position encoding, which is one of the parameters of the encoder, is randomly set during system initialization and is continuously adjusted during subsequent training.
  • the input x obtained by adding the input video feature sequence and the position encoding is directly used as the value vector V of the multi-head self-attention layer.
  • the input x is provided to a graph attention layer for transformation processing.
  • the output of the graph attention layer is further subjected to linear layer transformation to obtain the query vector Q and key vector K of the multi-head self-attention layer.
  • the graph attention layer is used to further strengthen the connections between features at different time points in the video. Taking the i-th input vector $x_i$ as an example, it is transformed into $x_i' = \big\Vert_{k=1}^{K}\,\sigma\big(\sum_{j=1}^{M}\alpha_{ij}^{k}\,W^{k}x_j\big)$, where $\Vert$ denotes the concatenation operation, K is the number of heads of the graph attention layer, and i = 1, 2, ..., M.
  • M here represents the length of the video, which can be understood as the number of frames of the video.
  • $W^{k}$ is the learnable weight matrix of the k-th graph attention head.
  • $\sigma$ is a nonlinear activation function, for example, the Leaky ReLU function.
  • $\alpha_{ij}^{k}$ is the weight coefficient of feature vector $x_i$ with respect to $x_j$ in the k-th graph attention head, characterizing the correlation between the two; its calculation process is $\alpha_{ij}^{k} = \mathrm{softmax}_j\big(\sigma\big(\alpha_{k}^{T}\,[W^{k}x_i \,\Vert\, W^{k}x_j]\big)\big)$.
  • $\alpha_k$ is the learnable weight vector and T represents the transpose operation.
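The following sketch shows one way such a multi-head graph attention layer, operating on a fully connected graph over the M frames, could be realized together with the Q/K/V routing described above; the head splitting and the initialization of the learnable position encoding are assumptions rather than specifics from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Multi-head graph attention over all frame pairs (a GAT-style sketch).

    Each head k has a weight matrix W^k and a weight vector a_k; alpha_ij^k
    weights the contribution of frame j to frame i, and the K head outputs
    are concatenated, matching the symbols defined above."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        assert d % heads == 0
        self.heads, self.d_head = heads, d // heads
        self.W = nn.Linear(d, d, bias=False)                    # stacks W^1..W^K
        self.a = nn.Parameter(torch.randn(heads, 2 * self.d_head) * 0.02)

    def forward(self, x):                                       # x: (B, M, d)
        B, M, _ = x.shape
        h = self.W(x).view(B, M, self.heads, self.d_head)       # per-head features
        # e_ij^k = LeakyReLU(a_k^T [W^k x_i || W^k x_j]), computed as a sum of
        # a "source" term on i and a "destination" term on j.
        src = (h * self.a[:, :self.d_head]).sum(-1)             # (B, M, heads)
        dst = (h * self.a[:, self.d_head:]).sum(-1)             # (B, M, heads)
        e = F.leaky_relu(src.unsqueeze(2) + dst.unsqueeze(1))   # (B, M, M, heads)
        alpha = torch.softmax(e, dim=2)                         # normalize over j
        out = torch.einsum('bijk,bjkc->bikc', alpha, h)         # weighted aggregation
        return out.reshape(B, M, -1)                            # concatenate the heads

# Q/K/V routing described above: x = video features + (learnable) position
# encoding is used directly as V, while the graph attention output is linearly
# transformed into the query Q and key K of the multi-head self-attention layer.
d, M = 256, 100
pos_encoding = nn.Parameter(torch.randn(1, M, d) * 0.02)
x = torch.randn(1, M, d) + pos_encoding
gat, to_q, to_k = GraphAttentionLayer(d), nn.Linear(d, d), nn.Linear(d, d)
g = gat(x)
Q, K, V = to_q(g), to_k(g), x
```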
  • N learnable nomination segments and their corresponding nomination features are introduced to further process the video coding features output via the encoder.
  • Each nominated segment is used to extract the feature sequence of the corresponding position from the video coding features to obtain the segment features of interest and provide them as input to the decoder together with the nomination features corresponding to the nominated segment.
  • Each nomination segment is a normalized two-dimensional coordinate (value between 0-1), which represents a segment on the video timeline; each nomination feature is a vector with dimension d.
  • the length of each nomination segment may be different, and therefore the dimensions of the extracted feature sequence may also be different.
  • bilinear interpolation can be used to adjust all of the extracted feature sequences to the same length M′, so that each segment feature of interest has dimensions M′ × d (see the sketch below).
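A sketch of how each normalized nomination segment could crop its span from the video coding features and resample it to a fixed length M′ (function and variable names are assumptions; along a single temporal axis the patent's bilinear interpolation reduces to per-channel linear interpolation):

```python
import torch
import torch.nn.functional as F

def extract_segment_features(video_enc, segments, M_prime=16):
    """video_enc: (M, d) video coding features; segments: (N, 2) normalized
    (start, end) pairs in [0, 1]. Returns (N, M_prime, d) segment features of
    interest, each crop resampled to length M_prime by linear interpolation."""
    M, d = video_enc.shape
    out = []
    for start, end in segments.clamp(0, 1).tolist():
        lo = min(int(start * M), M - 1)
        hi = max(int(end * M), lo + 1)
        crop = video_enc[lo:hi]                        # (L, d), L varies per segment
        crop = F.interpolate(crop.t().unsqueeze(0),    # -> (1, d, L)
                             size=M_prime, mode='linear', align_corners=False)
        out.append(crop.squeeze(0).t())                # (M_prime, d)
    return torch.stack(out)

segments = torch.rand(8, 2).sort(dim=1).values          # 8 nomination segments
seg_feats = extract_segment_features(torch.randn(100, 256), segments)
```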
  • like the position encoding, these N nomination segments and their corresponding nomination features are also parameters obtained through the training process: they are randomly set during system initialization and continuously adjusted during subsequent training.
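For example, these learnable quantities could be registered as randomly initialized parameters of the model (the names and initial scales below are assumptions):

```python
import torch
import torch.nn as nn

class LearnableNominations(nn.Module):
    """Registers the randomly initialized, trainable nomination segments,
    nomination features and position encoding described above (sketch)."""
    def __init__(self, N=30, M=100, d=256):
        super().__init__()
        self.position_encoding = nn.Parameter(torch.randn(M, d) * 0.02)
        self.nomination_segments = nn.Parameter(torch.rand(N, 2))   # normalized (start, end), sorted/clamped at use time
        self.nomination_features = nn.Parameter(torch.randn(N, d))
```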
  • these N nominated features are first input to the multi-head self-attention layer.
  • through the multi-head self-attention layer, information about the long-distance dependencies between the nomination features is obtained.
  • the nomination features corresponding to each nomination segment and the segment features of interest corresponding to that nomination segment interact one-to-one in the sparse interaction module.
  • the output of the sparse interaction module is further added and normalized and then provided to the feed-forward layer.
  • N segment features are output, which is the output result of the decoder.
  • Figure 2 takes the k-th nominated feature as an example to show the sparse interaction process with the corresponding segment features of interest in the sparse interaction module.
  • the nomination feature vector of dimension d is passed through a linear layer and reshaped to obtain two parameter matrices of size d × d_h and d_h × d (where d_h can be set according to the specific decoder requirements).
  • the segment features of interest are matrix-multiplied with these two parameters in sequence to obtain segment features of size M′ × d.
  • this process can be regarded as the segment feature of interest passing through two one-dimensional convolution layers, so it can also be called a dynamic convolution operation (a sketch follows below).
  • the nomination features interact only with their corresponding segment features of interest, not with the global video coding features, which can greatly improve the speed of training convergence.
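A sketch of this one-to-one sparse interaction, in which each nomination feature generates the weights of two per-nomination 1x1 convolutions applied to its own segment feature of interest (the choice of d_h and the absence of intermediate normalization or activation are assumptions; the text specifies only the two matrix multiplications):

```python
import torch
import torch.nn as nn

class SparseInteraction(nn.Module):
    """Dynamic-convolution style interaction between nomination features and
    their segment features of interest (sketch)."""
    def __init__(self, d=256, d_h=64):
        super().__init__()
        self.d, self.d_h = d, d_h
        self.param_gen = nn.Linear(d, 2 * d * d_h)   # generates the two weight matrices

    def forward(self, nom_feat, seg_feat):
        # nom_feat: (N, d) nomination features; seg_feat: (N, M', d) segment features of interest
        N = nom_feat.size(0)
        params = self.param_gen(nom_feat)
        W1 = params[:, :self.d * self.d_h].view(N, self.d, self.d_h)   # (N, d, d_h)
        W2 = params[:, self.d * self.d_h:].view(N, self.d_h, self.d)   # (N, d_h, d)
        x = torch.bmm(seg_feat, W1)     # (N, M', d_h): first "1x1 convolution"
        x = torch.bmm(x, W2)            # (N, M', d):  second "1x1 convolution"
        return x                        # passed on to the decoder's feed-forward layer

out = SparseInteraction()(torch.randn(8, 256), torch.randn(8, 16, 256))
```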
  • the prediction module receives N segment features from the decoder for boundary regression and binary classification prediction, and outputs N nomination prediction results, including nomination boundaries and corresponding confidence scores.
  • the N nomination prediction results obtained through the above process are matched one-to-one with the real nomination labels of the sample using optimal bipartite matching.
  • the Focal loss function is used as the binary classification loss function
  • the L1 loss function and the GIoU loss function are used as the regression loss functions.
  • for each video, the sum of the classification cost and the regression cost of the N nomination prediction results with respect to each nomination label is calculated.
  • for each real nomination label, the unique nomination prediction result with the smallest total cost is selected as a positive sample, and the nomination prediction results that are not matched to any real nomination label are regarded as negative samples (a matching sketch follows below).
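A sketch of this optimal bipartite matching over a combined focal + L1 + GIoU cost; the cost weights, the simplified one-sided focal term and the 1-D GIoU formulation are assumptions made for illustration:

```python
import torch
from scipy.optimize import linear_sum_assignment

def giou_1d(pred, gt):
    """Pairwise 1-D generalized IoU between (N, 2) and (G, 2) [start, end] segments."""
    p_s, p_e = pred[:, None, 0], pred[:, None, 1]
    g_s, g_e = gt[None, :, 0], gt[None, :, 1]
    inter = (torch.min(p_e, g_e) - torch.max(p_s, g_s)).clamp(min=0)
    union = (p_e - p_s) + (g_e - g_s) - inter
    hull = torch.max(p_e, g_e) - torch.min(p_s, g_s)
    iou = inter / union.clamp(min=1e-6)
    return iou - (hull - union) / hull.clamp(min=1e-6)

def hungarian_match(pred_seg, pred_logit, gt_seg,
                    alpha=0.25, gamma=2.0, w_cls=2.0, w_l1=5.0, w_giou=2.0):
    """Match N predictions to G ground-truth nominations; unmatched predictions
    are treated as negatives."""
    prob = pred_logit.sigmoid()                                        # (N,) foreground probability
    cls_cost = alpha * (1 - prob) ** gamma * (-(prob + 1e-8).log())    # focal-style classification cost
    cost = (w_cls * cls_cost[:, None]
            + w_l1 * torch.cdist(pred_seg, gt_seg, p=1)                # boundary L1 cost
            - w_giou * giou_1d(pred_seg, gt_seg))                      # higher GIoU -> lower cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx

pred_idx, gt_idx = hungarian_match(torch.rand(8, 2).sort(1).values, torch.randn(8),
                                   torch.tensor([[0.20, 0.45], [0.60, 0.90]]))
```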
  • the prediction module consists of two independent feed-forward heads: one consists of a single linear layer and is used to evaluate the confidence score of the generated nomination results, and the other consists of three linear layers and is used to regress the nomination boundary coordinates.
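A sketch of these two prediction heads (the pooling over the M′ positions and the sigmoid outputs are assumptions about details the text leaves open):

```python
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    """One linear layer for the confidence score, three linear layers for the
    normalized (start, end) boundary regression, as described above."""
    def __init__(self, d=256):
        super().__init__()
        self.cls_head = nn.Linear(d, 1)
        self.reg_head = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(inplace=True),
            nn.Linear(d, d), nn.ReLU(inplace=True),
            nn.Linear(d, 2))

    def forward(self, seg_feat):                  # (N, M', d) segment features from the decoder
        x = seg_feat.mean(dim=1)                  # pool each segment feature to a single vector
        confidence = self.cls_head(x).sigmoid()   # (N, 1) confidence scores
        boundary = self.reg_head(x).sigmoid()     # (N, 2) normalized nomination boundaries
        return boundary, confidence

boundary, confidence = PredictionModule()(torch.randn(8, 16, 256))
```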
  • the above training process is continuously repeated on the training set for iterative optimization, in which the nomination boundaries output by the prediction module in each round of training are used as the N nomination segments in the next round of training.
  • after training is completed, the N nomination segments involved in the system and their corresponding nomination features, as well as the parameters involved in the encoder, decoder and prediction module, are determined and can be used in the subsequent online prediction stage.
  • N can be set according to the length of the video clips to be processed, the actual requirements and the system performance. For example, if there are usually 2 to 3 nominations in a 1-minute video clip, N should be set at least greater than the number of nominations that may exist in the clip, for example any integer greater than 3. It should be understood, however, that the larger N is, the more computation is consumed; therefore N is usually at most about 10 times the number of nominations that may exist in the video clip to be processed. For example, for a 1-minute video clip, N can be set to an integer between 3 and 30.
  • the video clips to be processed are fed to the system.
  • the system first extracts video features from it, transforms the extracted video features via the encoder into video coding features with the global context information of the input video, and uses each of the N pre-trained nomination segments to extract the corresponding segment features of interest from the video coding features. Then, through the decoder, the segment features of interest corresponding to each nomination segment interact one-to-one with its corresponding nomination features to obtain the segment features, which are provided to the prediction module. Finally, the prediction module performs boundary regression and binary classification prediction on the segment features from the decoder and outputs N nomination generation results corresponding to the video clip to be processed.
  • unlike the prior art, by introducing N learnable nomination segments and corresponding nomination features, the N action nomination results can be obtained directly without any non-maximum suppression post-processing, and the number of generated action nominations is independent of the video length, which significantly reduces the computational burden and greatly increases the speed of generating temporal action nominations.
  • the system according to the above embodiment can effectively capture the global context information of the video and obtain video coding features with stronger representation ability; moreover, by introducing several learnable nomination segments, the feature sequences at the corresponding positions are extracted from the video coding features for subsequent prediction, which greatly improves the training convergence speed and greatly reduces the computational burden.
  • FIG. 3 shows a schematic flowchart of a method for generating temporal action nominations using the above video temporal action nomination generation system according to an embodiment of the present invention.
  • the method includes: step S1) extracting video features from the input video via the feature extraction module; step S2) processing the extracted video features via the encoder to obtain video coding features with the global context information of the input video; step S3) using each of the preset plurality of nomination segments to extract the corresponding segment features of interest from the video coding features; step S4) via the decoder, letting the nomination features corresponding to each nomination segment interact with the segment features of interest corresponding to that nomination segment to obtain the segment features; and step S5) via the prediction module, performing boundary regression and binary classification prediction according to the segment features from the decoder and outputting the corresponding temporal action nomination results.
  • to better illustrate the performance of the present invention, the inventors also compared the temporal action nomination generation method of the present invention with existing commonly used temporal action nomination generation methods on the THUMOS14 and ActivityNet-1.3 data sets.
  • during training, the system structure shown in Figure 1 is trained iteratively on the training set for 20 epochs; after each epoch, the loss on the validation set is calculated to evaluate the performance of the system, and the system with the smallest validation loss is selected as the trained system.
  • the video features are input into the trained system, and the output of the prediction module is used as the final N nomination generation results.
  • the nomination generation results are compared with the real nomination labels, and the recall rate on the validation set is calculated to verify the performance of the trained model structure.
  • Table 1 shows the performance comparison between the method of the present invention and the current mainstream methods on the THUMOS14 data set. The recall rate of nominations is used as the evaluation index. The results show that the method of the present invention is better than other methods.
  • Table 2 shows the comparison of the inference speed between the method of the present invention and other mainstream algorithms on the ActivityNet-1.3 data set. For a fair comparison, the average inference time of each video is calculated, and the results show that the inventive method is at least 8 times faster than the existing method.
  • a computer-readable storage medium is also provided, on which a computer program or executable instructions are stored; when the computer program or executable instructions are executed by a processor or other computing unit, the technical solutions described in the foregoing embodiments are implemented.
  • a computer-readable storage medium may be any tangible medium capable of storing data and readable by a computing device. Examples of computer-readable storage media include hard drives, network attached storage (NAS), read-only memory, random access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tape, and other optical or non-optical data storage devices.
  • Computer-readable storage media may also include computer-readable media distributed over network-coupled computer systems so that the computer program or instructions may be stored and executed in a distributed manner.
  • appearances of the phrases “in various embodiments,” “in some embodiments,” “in one embodiment,” or “in an embodiment,” etc. in various places throughout this specification are not necessarily referring to the same embodiment.
  • specific features, structures, or properties may be combined in any suitable manner in one or more embodiments.
  • a particular feature, structure, or property shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures, or properties of one or more other embodiments without limitation, so long as the combination is not illogical or inoperative.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a video temporal action nomination generation system and method, in which video features extracted from an input video are processed via an encoder to obtain video coding features with global information, and multiple pre-trained nomination segments are used to extract the corresponding segment features of interest from the video coding features and provide them to a decoder; the decoder generates segment features based on the segment features of interest corresponding to each nomination segment and the pre-trained nomination features corresponding to the respective nomination segments, and provides them to a prediction module; the prediction module generates temporal action nomination results based on the segment features from the decoder. The solution of the embodiments of the present invention can effectively capture the global context information of the video and obtain video coding features with stronger representation ability; moreover, by introducing several learnable nomination segments to extract the feature sequences at the corresponding positions from the video coding features for subsequent prediction, it improves the training convergence speed and greatly reduces the computational burden.

Description

视频时序动作提名生成方法及系统 技术领域
本发明涉及视频处理,尤其涉及用于生成视频时序动作提名的系统及方法。
背景技术
生成视频时序动作提名是视频时序动作检测的关键步骤,其目的在于从一段未裁剪的长视频中检测出包含人类行为的动作片段,即确定动作发生的开始和结束时间。高质量的视频时序动作提名应当具有以下两个关键特性:(1)准确的时序边界,即生成的动作提名应完整地覆盖动作发生的区域;(2)可靠的置信度分数,用于准确评估所生成的提名的质量以用于后续的检索排序。通过视频时序动作提名与具体的动作类别结合可进一步完成后续的视频时序动作检测任务。高效且高质量地生成视频时序动作提名有利于改善和提高视频动作的识别精度。
发明内容
本发明实施例的目的在于提供一种新的视频时序动作提名生成方法和系统来快速、高效地生成高质量的视频时序动作提名。上述目的是通过以下技术方案实现的:
根据本发明实施例的第一方面,提供了一种视频时序动作提名生成系统,其包括特征提取模块、特征处理模块和预测模块。其中特征提取模块用于从输入的视频提取与该视频相关的视频特征。特征处理模块包括预先训练的编码器和解码器,其中编码器基于来自特征提取模块的视频特征获取带有全局信息的视频编码特征,并通过预先训练的若干个提名片段从视频编码特征中抽取各个提名片段对应的感兴趣片段特征提供至解码器,解码器基于每个提名片段对应的感兴趣片段特征和预先训练的与提名片段对应的提名特征生成片段特征,并将其提供至预测模块。预测模块基于来自解码器的片段特征生成时序动作提名结果,其包括提名边界和置信度分数。
在本发明的一些实施例中,编码器包括图注意力层、多头自注意力层和前馈层,其中所述编码器将视频特征和位置编码相加的结果作为多头自注意力层的值向量输入,同时将该结果作为输入提供给图注意力层处理,其输出经线性变换后得到多头自注意力层的查询向量和键向量。
在本发明的一些实施例中,解码器包括多头自注意力层、稀疏交互模块和前馈层,其中解码器将提名片段对应的提名特征经多头自注意力层处理后提供至稀疏交互模块与该提名片段对应的感兴趣片段特征进行稀疏交互;该稀疏交互模块的输出经前馈层处理后得到片段特征。
在本发明的一些实施例中,特征处理模块可以基于变换器模型构建。
在本发明的一些实施例中,预测模块可以基于来自解码器的片段特征进行边界回归和二分类预测。
根据本发明实施例的第二方面,还提供了一种采用根据本发明实施例的第一方面的系统生成时序动作提名生成的方法,包括:步骤S1)经由特征提取模块从输入的视频中提取视频特征;步骤S2)经由编码器对所提取的视频特征进行处理以得到具有该输入的视频的全局上下文信息的视频编码特征;步骤S3)利用预先训练的若干个提名片段中的每一个从视频编码特征中抽取相应的感兴趣片段特征;步骤S4)经由解码器基于每个提名片段对应的感兴趣片段特征和预先训练的与提名片段对应的提名特征生成片段特征;步骤S5)经由预测模块根据来自解码器的片段特征进行边界回归和二分类预测,输出相应的时序动作提名结果。
在本发明的一些实施例中,编码器可包括图注意力层、多头自注意力层和前馈层,其中步骤S2)包括将视频特征和位置编码相加的结果作为多头自注意力层的值向量输入,同时将该结果作为输入提供给图注意力层处理,其输出经线性变换后得到多头自注意力层的查询向量和键向量。
根据本发明实施例的第三方面,还提供了一种计算机可读存储介质,其上存储有计算机程序,所述程序被执行时实现如上述实施例第二方面所述的方法。
本发明实施例提供的技术方案可以包括以下有益效果:
该方案能有效捕捉视频的全局上下文信息,获取表征能力更强的视频编码特征;而且通过引入若干个可学习的提名片段来从视频编码特征中抽取对应位置的特征序列来用于后续预测,大大提高了训练收敛速度并大幅降低了计算负担。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本发明。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本发明的实施例,并与说明书一起用于解释本发明的原理。显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。在附图中:
图1示出了根据本发明一个实施例的视频时序动作提名生成系统的操作流程示意图。
图2示出了根据本发明一个实施例的稀疏交互模块的稀疏交互流程示意图。
图3示出了根据本发明一个实施例的视频时序动作提名生成方法的流程示意图。
具体实施方式
为了使本发明的目的,技术方案及优点更加清楚明白,以下结合附图通过具体实施例对本发明进一步详细说明。应当理解,所描述的实施例是本发明的一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在不经创造性劳动获得的所有其他实施例,都属于本发明保护的范围。
此外,所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施例中。在下面的描述中,提供许多具体细节从而给出对本发明的实施例的充分理解。然而,本领域技术人员将意识到,可以实践本发明的技术方案而没有特定细节中的一个或更多,或者可以采用其它的方法、组元、装置、步骤等。在其它情况下,不详细示出或描述公知方法、装置、实现或者操作以避免模糊本发明的各方面。
附图中所示的方框图仅仅是功能实体,不一定必须与物理上独立的实体相对应。即,可以采用软件形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。
附图中所示的流程图仅是示例性说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解,而有的操作/步骤可以合并或部分合并,因此实际执行的顺序有可能根据实际情况改变。
现有的视频时序动作提名生成方法可分为基于锚框方法和基于边界方法。基于锚框方法对预先定义好尺寸和比例且均匀分布的锚框进行边界回归,并采用一个二分类器来评估提名的置信度分数。具体地,在视频一维特征序列的每个位置上铺设预定义好大小和比例的锚框,若一维特征序列长度为T,每个位置铺设K个锚框,则共需预测TK个锚框结果。在训练阶段,采用与真实标签框的交并比(IOU)大小选择正负样本,对这TK个锚框进行时序边界的回归以及锚框置信度的二分类预测。在模型推理阶段,由于预测出的锚框结果会有大量的重叠,因此需要采用非极大值抑制方法去除冗余的预测结果,得到最终的提名生成结果。常见的方法有Prop-SSAD(Lin,T.,Zhao,X.,&Shou,Z.,Temporal convolution based action proposal:Submission to activitynet 2017.arXiv preprint arXiv:1707.06750.),RapNet(Gao,J.,Shi,Z.,Wang,G.,Li,J.,Yuan,Y.,Ge,S.,&Zhou,X..Accurate temporal action proposal generation with relation-aware pyramid network.In Proceedings of the AAAI Conference on Artificial Intelligence,Vol.34,No.07,pp.10810-10817,2020年4月)。该类方法的性能极度依赖于锚框的人工设计,因此难以扩展,在应用于不同场景时十分繁琐。而基于边界方法通过列举所有的候选起止点生成任意长度的候选提名,并对每个候选提名进行边界概率预测得到二维置信图。该类方法的基础模块是卷积层,只能捕捉局部区域的信息,而不能捕捉视频的长期语义信息。BMN(Lin,T.,Liu,X.,Li,X.,Ding,E.,&Wen,S.Bmn:Boundary-matching network for temporal action proposal generation.In Proceedings of the IEEE/CVF International Conference on Computer Vision,pp.3889-3898,2019),DBG(Lin,C.,Li,J.,Wang,Y.,Tai,Y.,Luo,D.,Cui,Z.,...&Ji,R..Fast learning of temporal action proposal via dense boundary generator.In Proceedings of the AAAI Conference on Artificial Intelligence,Vol.34,No.07,pp.11499-11506,2020年4月),BSN++(Su,H.,Gan,W.,Wu,W.,Qiao,Y.,&Yan,J.(2020).Bsn++:Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation.arXiv preprint arXiv:2009.07641.)属于该类方法。
此外,这两种方法均具有以下两个缺点。一是随着视频长度的增加,预定义的锚框数量以及生成的置信图尺寸都会大大增加,对计算资源消耗 巨大,难以应用到实际场景中;二是这两种方法均生成了大量的冗余提名,需要采用非极大值抑制的后处理方法消除冗余预测结果,后处理操作不仅需要细致的参数选择,而且大大降低了模型的推理速度。
在本发明的实施例中提供了一种视频时序动作提名生成系统,其包括特征提取模块、特征处理模块和预测模块。其中特征提取模块用于从输入的视频提取与该视频相关的视频特征。特征处理模块基于Transformer(变换器)模型构建,包括编码器和解码器。该编码器基于来自特征提取模块的视频特征获取带有全局信息的视频编码特征,并通过预设的若干个提名片段从视频编码特征中抽取各个提名片段对应的感兴趣片段特征提供至解码器,解码器基于每个提名片段对应的感兴趣片段特征和该提名片段对应的提名特征生成片段特征,并将其提供至预测模块。预测模块基于来自解码器的片段特征生成时序动作提名结果,其包括提名边界和置信度分数。
在该实施例中,首先利用以已标注了时序动作提名的大量视频片段作为样本构成的训练集对该系统的特征处理模块和预测模块进行统一训练(可以称为离线训练阶段),然后将待处理的视频片段作为输入提供给该训练好的系统进行处理,其输出为该输入视频的时序动作提名,其包括各个提名边界及对应置信度分数(可以称为在线预测阶段)。在系统初始时,预设的若干个提名片段及其对应的提名特征以及编码器、解码器和预测模块中涉及的参数均是随机设置。在训练过程中上述这些参数在训练过程中不断被调整直到训练结束,这些训练好的参数用于后续在线预测阶段。应指出,这里的特征提取模块和预测模块可以采用适用于进行视频特征提取和使用于根据输入的特征预测提名边界和置信度分数的任何类型的机器学习模型,包括但不限于神经网络模型,本文对此不进行限制。考虑到在训练阶段和在线处理阶段对于视频特征的提取和处理是基本相同的,下文主要结合图1对于训练阶段中视频特征的处理过程进行介绍。
首先对于输入的视频,通过特征提取模块提取与该视频相关的视频特征,例如视频的图像特征(如RGB特征)和光流特征等。在一个示例中,可以采用诸如时间片段网络(Temporal Segment Network,TSN)之类的神经网络来提取视频特征。对于所提取的不同维度的视频特征,将其转换成具备相同的特征维度的一系列特征序列。特征序列的特征维度可以根据实际需求进行设置,在此不进行限制。为方便描述,在下面的示例中将视频特征记为
$f \in \mathbb{R}^{M \times C}$
其中R代表实数,M代表视频的长度,可以理解为视 频的帧数,C代表特征向量的维度,即从每个视频帧提取的特征向量的维度。可以看出,视频特征f也可以被视为是由M个视频帧的特征向量构成的一个视频特征序列,每个视频帧在该序列中有自己特定的位置。经由特征提取模块提取的视频特征提供至特征处理模块进行处理。应理解,可以对上述视频特征进行适当变换处理以适应或匹配特征处理模块中设定的特征维度。例如,对于所提取的特征,可以经过一个卷积核大小为1的一维卷积层进行特征维度的对齐,变换后的视频特征序列可作为后续过程中编码器的输入。
参考图1,编码器主要包含多头自注意力层和前馈层。其中多头自注意层由多个独立的自注意层组成。自注意层采用基于注意力机制的结构,其核心内容是在编码相应序列的时候可以查看输入序列的其他序列信息,将序列两两连接,从而有效捕捉到输入序列的全局上下文信息,构建序列间的长距离依赖关系。因此可以达到增强相关特征,抑制无缘特征的目的。多头自注意层的输入为三元组,由查询向量Q(query)、键向量K(key)、值向量V(value)构成。每个自注意层的计算过程如下:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
其中d k为比例因子,这里的T表示转置,softmax()代表激活函数。如上面公式所示,通过将查询向量Q与键向量K做点积操作来计算序列中特征两两之间的分数,该分数代表着两个特征之间的关联性。为了保持梯度稳定,用一个比例因子d k来对分数进行归一化操作,然后再经过softmax()函数将数值标准化到0-1之间,最后得到的分数与值向量V进行加权,以达到增强相关特征,降低抑制无关特征的目的。在此基础上面,多头自注意力层包含了多个独立的自注意层来各自重点关注一部分上下文信息,这些自注意层的输出(每个自注意力层的输出可以记为head,head=Attention(Q,K,V))被拼接起来,并且经由一个线性层进一步聚合后得到的具有更好鲁棒性的多头自注意力层的输出MultiHead(Q,K,V),其计算公式如下:
MultiHead(Q,K,V)=Concat(head 1,...,head h)W O
其中,h代表多头自注意力层中包含的自注意力层的总数量,W o为用于聚合特征的线性层的参数矩阵。如图1所示,该多头自注意层的输出还会进一步经过相加和归一化操作处理后输入至前馈层。其中前馈层可以由两个线性变换层以及一个非线性激活函数Relu组成。前馈层的输出经相加和归一化操作处理后得到整个编码器的输出结果,即视频编码特征。
在一些实施例中,可以如通常的Transformer模型,编码器中的多头自注意力层的输入Q,K,V是将输入的特征序列经由3个含不同参数矩阵(W Q、W K、W V)的线性变换层进行映射的得到的。例如假设输入的序列为T 0,则Q,K,V以如下公式计算:
Q=T 0W Q,K=T 0W k,V=T 0W V
而在图1所示的实施例中,在编码器中在多头自注意力层的基础上引入了图注意力层来对输入的序列进行预处理,从而使编码器能更好地关注视频中动作发生的片段并构建动作片段之间的联系,从而获得表征能力更强的编码特征。并且为了视频特征中相对位置关系,例如各视频帧的相对位置和时序关系,在编码器中采用了位置编码,以输入的视频特征序列和位置编码的相加结果作为输入,
$x \in \mathbb{R}^{M \times d}$
其中d为在编码器中适用的特征的维度。位置编码的维度与输入的视频特征的维度相同,即输入的视频特征序列中每个视频帧的特征向量都有自己对应的位置编码。如上文提到的,作为编码器的其中一个参数的位置编码在系统初始化时随机设置的,在后续训练过程中不断进行调整。
如图1所示,将输入的视频特征序列与位置编码的相加后得到的输入x直接作为多头自注意力层的值向量V,同时该输入x被提供至一个图注意力层进行变换处理,图注意力层的输出进一步进行线性层变换后得到多头自注意力层的查询向量Q和键向量K。其中图注意力层用于进一步强化视频中不同时间点特征之间的联系,以输入的第i个向量x i为例,其经过图注意力层后变换为:
$x_i' = \big\Vert_{k=1}^{K}\,\sigma\Big(\sum_{j=1}^{M}\alpha_{ij}^{k}\,W^{k}x_j\Big)$
其中,||为拼接操作,K为图注意力层的多头数量,i=1,2,…,M,这里的M如上文提到的代表视频的长度,可以理解为视频的帧数。W k为第k个图自注意力层的可学习权重矩阵,σ为非线性激活函数,例如,Leaky ReLU 函数。
$\alpha_{ij}^{k}$
为第k个图注意力层中特征向量x i对x j的权重系数,表征了两者之间的相关性大小,其计算过程为:
$\alpha_{ij}^{k} = \dfrac{\exp\big(\sigma\big(\alpha_{k}^{T}\,[W^{k}x_i \,\Vert\, W^{k}x_j]\big)\big)}{\sum_{l=1}^{M}\exp\big(\sigma\big(\alpha_{k}^{T}\,[W^{k}x_i \,\Vert\, W^{k}x_l]\big)\big)}$
其中,α k为可学习的权重向量,T代表转置操作。通过在编码器中引入上述图注意力机制可以进一步动态地构建视频特征序列不同帧之间的联系,从而更准确地进行全局上下文信息的捕捉,帮助编码器获得表征能力更强的视频编码特征。
继续参考图1,在本发明的实施例中,引入了N个可学习的提名片段及其对应的提名特征来进一步对经由编码器输出的视频编码特征进行处理。利用每个提名片段从视频编码特征中抽取对应位置的特征序列以得到感兴趣片段特征并将其与该提名片段对应的提名特征一起作为输入提供给解码器。其中每个提名片段为一个归一化的二维坐标(数值在0-1之间),其代表视频时间轴上一个片段;每个提名特征为维度为d的向量。这里,各个提名片段的长度可以不同,因此所提取的特征序列的维度也可能不同。因此在一个示例中,在利用提名片段从视频编码特征中抽取对应位置的特征序列之后,可以利用双线性插值将所有抽取出的特征调整至同一长度M′,即每个感兴趣片段特征的维度为M′×d。如上文提到的,与编码器的位置编码一样,这N个提名片段及其对应的提名特征也都是要在经过训练过程中得到的参数,在系统初始化时随机设置,在后续训练过程中不断进行调整。
在解码器中,这N个提名特征首先输入至多头自注意力层,经过多头自注意力层以获取各提名特征之间长距离依赖关系的相关信息,在对多头自注意力层的输出经过相加和归一化处理后,每个提名片段对应的提名特征和该提名片段对应的感兴趣片段特征在稀疏交互模块中进行一对一的交互。该稀疏交互模块的输出进一步经过相加和归一化处理后提供至前馈层,前馈层的输出经相加和归一化处理后,输出N个片段特征,即解码器的输出结果。图2以第k个提名特征为例,展示了其与对应感兴趣片段特征在稀疏交互模块中的稀疏交互过程。具体地,维度为d的提名特征向量 经过线性层并进行尺度调整后得到大小为d×d h以及d h×d的两个参数(这里的d h可根据具体解码器需求设置),感兴趣片段特征分别与这两个参数依次进行矩阵乘法,得到大小为M′×d的片段特征。这一过程可视为感兴趣特征片段经过两层的一维卷积层,因此也可称为动态卷积操作。在上述的解码器中,提名特征至于对应的感兴趣片段特征进行交互,而不需要与全局的视频编码特征交互,从而可以大大提高训练收敛的速度。
继续参考图1,预测模块接收来自解码器的N个片段特征进行边界回归和二分类预测,输出N个提名预测结果,包括提名边界及对应置信度分数。在每次的训练中将经过上述过程预测得到的N个提名预测结果与样本对应的真实提名标签采用最优二分匹配进行一对一的匹配。例如,采用Focal损失函数为二分类损失函数,L1损失函数和GIOU损失函数为回归损失函数,对于一个视频,计算N个提名预测结果对每个提名标签的分类代价与回归代价之和,最终对于每个真实提名标签,选择总代价最小的唯一提名预测结果作为正样本,而不与真实提名标签匹配的提名预测结果均视为负样本。在该实施例中,预测模块由两个独立的前馈层组成,其中一个前馈层由一层线性层组成,用于评估所生成提名结果的置信度分数,另一个前馈层由三层线性层组成,用于对提名的边界坐标进行回归。在训练集上不断重复上述训练过程继续迭代优化,其中每一轮的训练中预测模块输出的提名边界作为下一轮训练中采用的N个提名片段。在训练完成后,该系统中涉及的N个提名片段及其对应的提名特征以及编码器、解码器和预测模块中涉及的参数都会被确定下来,从而可用于后续在线预测阶段。在本文中,N的取值可以依据待处理的视频片段长度、实际需求和系统性能来设置。例如,待处理的1分钟长度的视频片段上通常有2到3个提名,则可以将N设置为至少大于该视频片段上可能存在的提名的数量,例如将N设置为大于3的任意整数。但应理解,N越大,消耗的计算性能越大。因此,N通常最大不超过待处理视频片段上可能存在的提名的数量的10倍数的关系。例如对于待处理的1分钟长度的视频片段,可以将N设置为在3-30之间的整数。
在线预测阶段,将待处理的视频片段提供给该系统。该系统首先从中提取视频特征,经由编码器将所提取的视频特征变化为具有该输入的视频的全局上下文信息的视频编码特征,结合预先训练好的N个提名片段中的 每一个从视频编码特征中抽取相应的感兴趣片段特征。接着经由解码器对于每个提名片段对应的感兴趣片段特征和其对应的提名特征进行一对一交互后得到片段特征,并将其提供给预测模块。最后经由预测模块对来自解码器的片段特征进行边界回归和二分类预测,并输出与该待处理的视频片段相对应的N个提名生成结果。与现有技术不同,在该系统中通过引入N个可学习的提名片段和对应的提名特征,可以直接得到N个动作提名结果,而无需非极大值抑制的后处理过程,而且其生成的动作提名数量与视频长度无关,因此能大幅度降低计算负担,极大地提高时序动作提名的生成速度。
可以看出,根据上述实施例的系统能有效捕捉视频的全局上下文信息,获取表征能力更强的视频编码特征;而且通过引入若干个可学习的提名片段来从视频编码特征中抽取对应位置的特征序列来用于后续预测,大大提高了训练收敛速度并大幅降低了计算负担。
图3示出了利用上述根据本发明实施例的视频时序动作提名生成系统生成时序动作提名的方法的流程示意图。该方法包括:步骤S1)经由特征提取模块从输入的视频中提取视频特征;步骤S2)经由编码器对所提取的视频特征进行处理以得到具有该输入的视频的全局上下文信息的视频编码特征;步骤S3)利用预设的多个提名片段中的每一个从视频编码特征中抽取相应的感兴趣片段特征;步骤S4)经由解码器对于每个提名片段对应的提名特征与该提名片段对应的感兴趣片段特征进行交互以得到片段特征;步骤S5)经由预测模块根据来自解码器的片段特征进行边界回归和二分类预测,输出相应的时序动作提名结果。
为了更好地说明本发明的性能,发明人还基于THUMOS14数据集和ActivityNet-1.3数据集比较了本发明的时序动作提名生成方法与现有常用的时序动作提名生成方法的性能。
在训练过程中,利用图1所示的系统结构在训练集上进行20个周期的迭代训练,在每个周期完成后,计算验证集上的损失以评估该系统的性能,并选择验证集损失最小的系统结构作为训练完成的系统。
在预测阶段,将视频特征输入训练好的系统,将预测模块的输出结果作为最终的N个提名生成结果。将提名生成结果与真实提名标签进行比较,计算在验证集上的召回率以验证所训练的模型结构的性能。表1为用本发明的方法与目前主流方法在THUMOS14数据集上进行性能比较,以提名 的召回率作为评估指标,结果显示本发明的方法优于其他方法。表2为本发明的方法与其他主流算法在ActivityNet-1.3数据集上的推理速度的比较。为了公平比较,计算每个视频的平均推理时间,结果显示本发明的方法比现有方法至少快8倍。
表1
方法 AR@50 AR@100 AR@200 AR@500
BSN 37.46 46.06 53.21 60.64
BMN 39.36 47.72 54.70 62.07
RapNet 40.35 48.23 54.92 61.41
DBG 37.32 46.67 54.50 62.21
本发明 40.40 48.70 55.51 62.20
表2
方法 BSN BMN GTAD DBG 本发明
T_pro (sec) 0.671 0.118 0.103 0.219 0.056
T_all (sec) 0.815 0.745 0.862 0.596 0.074
在本发明的又一个实施例中,还提供了一种计算机可读存储介质,其上存储有计算机程序或可执行指令,当所述计算机程序或可执行指令被处理器或其他计算单元执行时实现如前述实施例中所述的技术方案,其实现原理类似,此处不再赘述。在本发明的实施例中,计算机可读存储介质可以是任何能够存储数据且可以被计算装置读取的有形介质。计算机可读存储介质的实例包括硬盘驱动器、网络附加存储器(NAS)、只读存储器、随机存取存储器、CD-ROM、CD-R、CD-RW、磁带以及其它光学或非光学数据存储装置。计算机可读存储介质也可以包括分布在网络耦合计算机系统上的计算机可读介质,以便可以分布式地存储和执行计算机程序或指令。
本说明书中针对“各个实施例”、“一些实施例”、“一个实施例”、或“实施例”等的参考指代的是结合所述实施例所描述的特定特征、结构、或性质包括在至少一个实施例中。因此,短语“在各个实施例中”、“在一些实施例中”、“在一个实施例中”、或“在实施例中”等在整个说明书中各地方的出现并非必须指代相同的实施例。此外,特定特征、结构、或性质可以在一个或多个实施例中以任何合适方式组合。因此,结合一个实施例中所示出或描述的特定特征、结构或性质可以整体地或部分地与一个或 多个其他实施例的特征、结构、或性质无限制地组合,只要该组合不是非逻辑性的或不能工作。
本说明书中“包括”和“具有”以及类似含义的术语表达,意图在于覆盖不排他的包含,例如包含了一系列步骤或单元的过程、方法、系统、产品或设备并不限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其他步骤或单元。“一”或“一个”也不排除多个的情况。另外,本申请附图中的各个元素仅仅为了示意说明,并非按比例绘制。
虽然本发明已经通过上述实施例进行了描述,然而本发明并非局限于这里所描述的实施例,在不脱离本发明范围的情况下还包括所做出的各种改变以及变化。

Claims (9)

  1. 一种视频时序动作提名生成系统,其包括特征提取模块、特征处理模块和预测模块,其中:
    特征提取模块,用于从输入的视频提取与该视频相关的视频特征;
    特征处理模块包括预先训练的编码器和解码器,其中编码器基于来自特征提取模块的视频特征获取带有全局信息的视频编码特征,并通过预先训练的若干个提名片段从视频编码特征中抽取各个提名片段对应的感兴趣片段特征提供至解码器,解码器基于每个提名片段对应的感兴趣片段特征和预先训练的与提名片段对应的提名特征生成片段特征,并将其提供至预测模块;
    预测模块基于来自解码器的片段特征生成时序动作提名结果,其包括提名边界和置信度分数。
  2. 根据权利要求1所述的系统,其中编码器包括图注意力层、多头自注意力层和前馈层,其中所述编码器将视频特征和位置编码相加的结果作为多头自注意力层的值向量输入,同时将该结果作为输入提供给图注意力层处理,其输出经线性变换后得到多头自注意力层的查询向量和键向量。
  3. 根据权利要求1所述的系统,其中解码器包括多头自注意力层、稀疏交互模块和前馈层,其中解码器将提名片段对应的提名特征经多头自注意力层处理后提供至稀疏交互模块与该提名片段对应的感兴趣片段特征进行稀疏交互;该稀疏交互模块的输出经前馈层处理后得到片段特征。
  4. 根据权利要求1所述的系统,其中特征处理模块基于变换器模型构建。
  5. 根据权利要求1所述的系统,其中预测模块基于来自解码器的片段特征进行边界回归和二分类预测。
  6. 一种采用如前述任一权利要求所述的系统生成时序动作提名生成的方法,包括:
    步骤S1)经由特征提取模块从输入的视频中提取视频特征;
    步骤S2)经由编码器对所提取的视频特征进行处理以得到具有该输入的视频的全局上下文信息的视频编码特征;
    步骤S3)利用预先训练的若干个提名片段中的每一个从视频编码特征中抽取相应的感兴趣片段特征;
    步骤S4)经由解码器基于每个提名片段对应的感兴趣片段特征和预先训练的与提名片段对应的提名特征生成片段特征;
    步骤S5)经由预测模块根据来自解码器的片段特征进行边界回归和二分类预测,输出相应的时序动作提名结果。
  7. 根据权利要求6所述的方法,其中编码器包括图注意力层、多头自注意力层和前馈层,其中步骤S2)包括将视频特征和位置编码相加的结果作为多头自注意力层的值向量输入,同时将该结果作为输入提供给图注意力层处理,其输出经线性变换后得到多头自注意力层的查询向量和键向量。
  8. 根据权利要求6所述的方法,其中解码器包括多头自注意力层、稀疏交互模块和前馈层,其中步骤S4)包括将提名片段对应的提名特征经多头自注意力层处理后提供至稀疏交互模块与该提名片段对应的感兴趣片段特征进行稀疏交互;该稀疏交互模块的输出经前馈层处理后得到片段特征。
  9. 一种计算机可读存储介质,其特征在于,该计算机可读存储介质上存储有计算机程序,所述程序被执行时实现权利要求6-8中任一项所述的方法。
PCT/CN2022/113540 2021-09-08 2022-08-19 视频时序动作提名生成方法及系统 WO2023035904A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111049034.6A CN115797818A (zh) 2021-09-08 2021-09-08 视频时序动作提名生成方法及系统
CN202111049034.6 2021-09-08

Publications (2)

Publication Number Publication Date
WO2023035904A1 WO2023035904A1 (zh) 2023-03-16
WO2023035904A9 true WO2023035904A9 (zh) 2024-03-14

Family

ID=85473422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113540 WO2023035904A1 (zh) 2021-09-08 2022-08-19 视频时序动作提名生成方法及系统

Country Status (2)

Country Link
CN (1) CN115797818A (zh)
WO (1) WO2023035904A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797972A (zh) * 2023-06-26 2023-09-22 中科(黑龙江)数字经济研究院有限公司 基于稀疏图因果时序编码的自监督群体行为识别方法及其识别系统
CN117601143A (zh) * 2023-11-23 2024-02-27 中建新疆建工集团第三建设工程有限公司 基于多传感器融合的智能巡检机器人及方法
CN117292307B (zh) * 2023-11-27 2024-01-30 江苏源驶科技有限公司 一种基于粗时间粒度的时序动作提名生成方法及系统

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8774499B2 (en) * 2011-02-28 2014-07-08 Seiko Epson Corporation Embedded optical flow features
CN110163129B (zh) * 2019-05-08 2024-02-13 腾讯科技(深圳)有限公司 视频处理的方法、装置、电子设备及计算机可读存储介质
CN111327949B (zh) * 2020-02-28 2021-12-21 华侨大学 一种视频的时序动作检测方法、装置、设备及存储介质
CN111372123B (zh) * 2020-03-03 2022-08-09 南京信息工程大学 基于从局部到全局的视频时序片段提取方法
CN112183588A (zh) * 2020-09-11 2021-01-05 上海商汤智能科技有限公司 视频处理方法及装置、电子设备及存储介质
CN112906586B (zh) * 2021-02-26 2024-05-24 上海商汤科技开发有限公司 时序动作提名生成方法和相关产品

Also Published As

Publication number Publication date
WO2023035904A1 (zh) 2023-03-16
CN115797818A (zh) 2023-03-14

Similar Documents

Publication Publication Date Title
WO2023035904A9 (zh) 视频时序动作提名生成方法及系统
Dosovitskiy et al. Generating images with perceptual similarity metrics based on deep networks
WO2019228317A1 (zh) 人脸识别方法、装置及计算机可读介质
Grcić et al. Densely connected normalizing flows
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
Gao et al. Domain-adaptive crowd counting via high-quality image translation and density reconstruction
Hu et al. Style transformer for image inversion and editing
CN109919221B (zh) 基于双向双注意力机制图像描述方法
Kim et al. Deep blind image quality assessment by employing FR-IQA
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
Wang et al. Adaptive convolutions with per-pixel dynamic filter atom
Chang et al. End-to-End ASR with Adaptive Span Self-Attention.
CN112200096A (zh) 基于压缩视频实现实时异常行为识别的方法、装置及其存储介质
Zhao et al. Transformer vision-language tracking via proxy token guided cross-modal fusion
Wang et al. Tmf: Temporal motion and fusion for action recognition
CN113239866B (zh) 一种时空特征融合与样本注意增强的人脸识别方法及系统
Abrol et al. Improving generative modelling in VAEs using multimodal prior
Chen et al. Talking head generation driven by speech-related facial action units and audio-based on multimodal representation fusion
Chien et al. Learning flow-based disentanglement
Shen et al. Bidirectional generative modeling using adversarial gradient estimation
Huang et al. Residual networks as flows of velocity fields for diffeomorphic time series alignment
WO2021223747A1 (zh) 视频处理方法、装置、电子设备、存储介质及程序产品
Lei et al. Masked Diffusion Models are Fast Learners
Wang et al. Online knowledge distillation for efficient action recognition
Maeda et al. Multi-view Convolution for Lipreading

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE