WO2023035904A1 - Method and system for generating video temporal action proposals - Google Patents

Method and system for generating video temporal action proposals

Info

Publication number
WO2023035904A1
WO2023035904A1 (PCT/CN2022/113540)
Authority
WO
WIPO (PCT)
Prior art keywords
features
segment
video
nominated
feature
Prior art date
Application number
PCT/CN2022/113540
Other languages
English (en)
French (fr)
Other versions
WO2023035904A9 (zh)
Inventor
罗平
吴剑南
沈家骏
马岚
Original Assignee
港大科桥有限公司
Tcl科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 港大科桥有限公司, Tcl科技集团股份有限公司 filed Critical 港大科桥有限公司
Publication of WO2023035904A1 publication Critical patent/WO2023035904A1/zh
Publication of WO2023035904A9 publication Critical patent/WO2023035904A9/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Definitions

  • The present invention relates to video processing, and in particular to a system and method for generating video temporal action proposals.
  • Generating video temporal action proposals is a key step in video temporal action detection. Its purpose is to detect action clips containing human actions in a long untrimmed video, that is, to determine the start and end times of the actions.
  • High-quality video temporal action proposals should have the following two key properties: (1) accurate temporal boundaries, i.e., the generated proposals should completely cover the region where the action occurs; (2) reliable confidence scores that accurately evaluate the quality of the generated proposals for subsequent retrieval and ranking. Combining temporal action proposals with specific action categories allows the subsequent temporal action detection task to be completed. Generating temporal action proposals efficiently and with high quality helps improve the recognition accuracy of video actions.
  • The purpose of the embodiments of the present invention is to provide a new method and system for generating video temporal action proposals, so as to generate high-quality proposals quickly and efficiently.
  • The above purpose is achieved through the following technical solutions:
  • A video temporal action proposal generation system is provided, which includes a feature extraction module, a feature processing module and a prediction module.
  • The feature extraction module is used to extract video features related to the video from the input video.
  • The feature processing module includes a pre-trained encoder and decoder. The encoder obtains video encoding features with global information based on the video features from the feature extraction module, and extracts, through several pre-trained proposal segments, the segment-of-interest features corresponding to each proposal segment from the video encoding features, providing them to the decoder. The decoder generates segment features based on the segment-of-interest feature corresponding to each proposal segment and the pre-trained proposal feature corresponding to the proposal segment, and provides them to the prediction module.
  • The prediction module generates temporal action proposal results based on the segment features from the decoder, which include proposal boundaries and confidence scores.
  • In some embodiments, the encoder includes a graph attention layer, a multi-head self-attention layer and a feed-forward layer. The encoder uses the sum of the video features and the position encoding as the value-vector input of the multi-head self-attention layer, and also provides this sum as input to the graph attention layer; the output of the graph attention layer is linearly transformed to obtain the query vector and key vector of the multi-head self-attention layer.
  • In some embodiments, the decoder includes a multi-head self-attention layer, a sparse interaction module and a feed-forward layer. The decoder processes the proposal features corresponding to the proposal segments with the multi-head self-attention layer and then provides them to the sparse interaction module for sparse interaction with the segment-of-interest features corresponding to the proposal segments; the output of the sparse interaction module is processed by the feed-forward layer to obtain the segment features.
  • The feature processing module can be constructed based on the Transformer model.
  • The prediction module can perform boundary regression and binary classification prediction based on the segment features from the decoder.
  • A method for generating temporal action proposals using the above system includes: step S1) extracting video features from the input video via the feature extraction module; step S2) processing the extracted video features via the encoder to obtain video encoding features carrying the global context information of the input video; step S3) using each of several pre-trained proposal segments to extract the corresponding segment-of-interest features from the video encoding features; step S4) generating segment features via the decoder based on the segment-of-interest feature corresponding to each proposal segment and the pre-trained proposal feature corresponding to the proposal segment; and step S5) performing boundary regression and binary classification prediction via the prediction module according to the segment features from the decoder, and outputting the corresponding temporal action proposal results.
  • In some embodiments, the encoder may include a graph attention layer, a multi-head self-attention layer and a feed-forward layer, and step S2) includes using the sum of the video features and the position encoding as the value-vector input of the multi-head self-attention layer, while providing this sum as input to the graph attention layer; the output of the graph attention layer is linearly transformed to obtain the query vector and key vector of the multi-head self-attention layer.
  • A computer-readable storage medium is also provided, on which a computer program is stored; when the program is executed, the method described in the second aspect of the above embodiments is implemented.
  • This scheme can effectively capture the global context information of the video and obtain video encoding features with stronger representation capability; moreover, by introducing several learnable proposal segments to extract the feature sequences at the corresponding positions from the video encoding features for subsequent prediction, the training convergence speed is greatly improved and the computational burden is greatly reduced.
  • Fig. 1 shows a schematic diagram of the operation flow of a video temporal action proposal generation system according to an embodiment of the present invention.
  • Fig. 2 shows a schematic diagram of the sparse interaction process of a sparse interaction module according to an embodiment of the present invention.
  • Fig. 3 shows a schematic flowchart of a method for generating video temporal action proposals according to an embodiment of the present invention.
  • Existing methods for generating video temporal action proposals can be divided into anchor-based methods and boundary-based methods.
  • Anchor-based methods perform boundary regression on uniformly distributed anchor boxes with predefined sizes and scales, and use a binary classifier to evaluate the confidence score of each proposal. Specifically, anchor boxes with predefined sizes and ratios are laid at each position of the one-dimensional video feature sequence; if the length of the one-dimensional feature sequence is T and K anchor boxes are laid at each position, a total of TK anchor box results need to be predicted.
  • During training, positive and negative samples are selected according to the intersection-over-union (IoU) with the ground-truth boxes, and temporal boundary regression and binary classification of anchor confidence are performed on the TK anchor boxes. During inference, the predicted anchor boxes overlap heavily, so non-maximum suppression is required to remove redundant predictions and obtain the final proposals.
  • The performance of this type of method depends heavily on the manual design of the anchor boxes, so it is hard to extend and cumbersome to apply to different scenarios.
  • Boundary-based methods generate candidate proposals of arbitrary length by enumerating all candidate start and end points, and predict the boundary probability of each candidate proposal to obtain a two-dimensional confidence map.
  • The basic module of this type of method is the convolutional layer, which can only capture information in a local region and cannot capture the long-range semantic information of the video.
  • Representative methods of this type include BMN (Lin, T., Liu, X., Li, X., Ding, E., & Wen, S. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3889-3898, 2019), DBG (Lin, C., Li, J., Wang, Y., Tai, Y., Luo, D., Cui, Z., ... & Ji, R. Fast learning of temporal action proposal via dense boundary generator. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 07, pp. 11499-11506, April 2020) and BSN++ (Su, H., Gan, W., Wu, W., Qiao, Y., & Yan, J. (2020). BSN++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. arXiv preprint arXiv:2009.07641).
  • Both types of methods share two drawbacks. First, as the video length increases, the number of predefined anchor boxes and the size of the generated confidence map grow substantially, consuming large amounts of computation and making them hard to apply in practice. Second, both generate a large number of redundant proposals and rely on non-maximum suppression as post-processing, which requires careful parameter tuning and greatly slows down inference.
  • An embodiment of the present invention provides a video temporal action proposal generation system, which includes a feature extraction module, a feature processing module and a prediction module.
  • The feature extraction module is used to extract video features related to the video from the input video.
  • The feature processing module is built on the Transformer model and includes an encoder and a decoder.
  • The encoder obtains video encoding features with global information based on the video features from the feature extraction module, and extracts the segment-of-interest features corresponding to each proposal segment from the video encoding features through a number of preset proposal segments, providing them to the decoder.
  • The decoder generates segment features based on the segment-of-interest feature corresponding to each proposal segment and the proposal feature corresponding to that proposal segment, and provides them to the prediction module.
  • The prediction module generates temporal action proposal results based on the segment features from the decoder, which include proposal boundaries and confidence scores.
  • In this embodiment, the feature processing module and the prediction module of the system are first trained jointly on a training set composed of a large number of video clips annotated with temporal action proposals (which can be called the offline training stage). The video clips to be processed are then provided as input to the trained system, whose output is the temporal action proposals of the input video, including each proposal boundary and the corresponding confidence score (which can be called the online prediction stage).
  • At system initialization, the preset proposal segments and their corresponding proposal features, as well as the parameters involved in the encoder, decoder and prediction module, are all set randomly.
  • During training, these parameters are continuously adjusted until training ends, and the trained parameters are used in the subsequent online prediction stage.
  • It should be noted that the feature extraction module and the prediction module here may adopt any type of machine learning model suitable for extracting video features and for predicting proposal boundaries and confidence scores from input features, including but not limited to neural network models, which is not limited herein.
  • Since the extraction and processing of video features are basically the same in the training stage and the online prediction stage, the following mainly describes the processing of video features in the training stage with reference to Fig. 1.
  • First, for the input video, video features related to the video are extracted by the feature extraction module, such as image features (e.g., RGB features) and optical flow features of the video.
  • In one example, a neural network such as a Temporal Segment Network (TSN) may be used to extract the video features.
  • The feature dimension of the feature sequence can be set according to actual needs and is not limited here.
  • For ease of description, in the following example the video features are denoted as f ∈ R^{M×C}, where R denotes the set of real numbers, M denotes the length of the video (which can be understood as the number of video frames), and C denotes the dimension of the feature vector extracted from each video frame.
  • The video feature f can also be regarded as a video feature sequence composed of the feature vectors of M video frames, with each video frame at its own specific position in the sequence.
  • The video features extracted by the feature extraction module are provided to the feature processing module for processing. It should be understood that the video features may be appropriately transformed to adapt to or match the feature dimension set in the feature processing module. For example, the extracted features may be passed through a one-dimensional convolutional layer with kernel size 1 to align the feature dimensions, and the transformed video feature sequence is used as the input of the encoder in the subsequent process.
  • With reference to Fig. 1, the encoder mainly consists of a multi-head self-attention layer and a feed-forward layer.
  • The multi-head self-attention layer is composed of multiple independent self-attention layers.
  • The self-attention layer adopts a structure based on the attention mechanism. Its core idea is that, when encoding one element of the input sequence, it can attend to the other elements of the sequence, connecting the elements pairwise, so as to effectively capture the global context information of the input sequence and build long-range dependencies between sequence elements. In this way, relevant features are enhanced and irrelevant features are suppressed.
  • The input of the multi-head self-attention layer is a triplet consisting of a query vector Q, a key vector K and a value vector V.
  • The computation of each self-attention layer is as follows:

    Attention(Q, K, V) = softmax(QK^T / √d_k) · V

    where d_k is a scaling factor, T denotes transposition, and softmax() denotes the activation function.
  • As shown in the formula above, the score between every pair of features in the sequence is computed by the dot product of the query vector Q and the key vector K; this score represents the correlation between the two features.
  • To keep gradients stable, the scores are normalized by the scaling factor d_k and then mapped to the range 0-1 by the softmax() function; the resulting scores are finally used to weight the value vector V, so that relevant features are enhanced and irrelevant features are suppressed.
  • On this basis, the multi-head self-attention layer contains multiple independent self-attention layers, each focusing on a part of the context information. The outputs of these self-attention layers (each denoted head, with head = Attention(Q, K, V)) are concatenated and further aggregated by a linear layer to obtain the more robust output of the multi-head self-attention layer, computed as follows:

    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

    where h is the number of self-attention layers contained in the multi-head self-attention layer and W^O is the parameter matrix of the linear layer used to aggregate the features.
  • As shown in Fig. 1, the output of the multi-head self-attention layer is further processed by addition and normalization operations and then fed into the feed-forward layer.
  • The feed-forward layer can be composed of two linear transformation layers and a nonlinear activation function ReLU.
  • The output of the feed-forward layer is processed by addition and normalization operations to obtain the output of the entire encoder, i.e., the video encoding features.
  • In some embodiments, as in a standard Transformer model, the inputs Q, K and V of the multi-head self-attention layer in the encoder are obtained by mapping the input feature sequence through three linear transformation layers with different parameter matrices (W^Q, W^K, W^V). For example, if the input sequence is T_0, then Q, K and V are computed as: Q = T_0 W^Q, K = T_0 W^K, V = T_0 W^V.
  • In the embodiment shown in Fig. 1, a graph attention layer is introduced in the encoder on top of the multi-head self-attention layer to pre-process the input sequence, so that the encoder can better focus on the segments of the video where actions occur and build connections between action segments, thereby obtaining encoding features with stronger representation capability.
  • In addition, to preserve the relative positional relationships in the video features, such as the relative positions and temporal order of the video frames, position encoding is used in the encoder, and the sum of the input video feature sequence and the position encoding, x ∈ R^{M×d}, is used as input, where d is the feature dimension used in the encoder. The dimension of the position encoding is the same as that of the input video features, i.e., the feature vector of each video frame in the input sequence has its own corresponding position code.
  • The position encoding, which is one of the parameters of the encoder, is set randomly at system initialization and is continuously adjusted during the subsequent training process.
  • As shown in Fig. 1, the input x obtained by adding the input video feature sequence and the position encoding is used directly as the value vector V of the multi-head self-attention layer; at the same time, x is provided to a graph attention layer for transformation, and the output of the graph attention layer is further transformed by linear layers to obtain the query vector Q and the key vector K of the multi-head self-attention layer.
  • The graph attention layer is used to further strengthen the connections between features at different time points in the video. Taking the i-th input vector x_i as an example, after passing through the graph attention layer it is transformed into:

    x̃_i = ‖_{k=1}^{K} σ( Σ_j α_{ij}^k W^k x_j )

    where ‖ denotes concatenation, K is the number of heads of the graph attention layer, i = 1, 2, ..., M (with M the length of the video as above), W^k is the learnable weight matrix of the k-th graph attention head, and σ is a nonlinear activation function, for example the Leaky ReLU function. α_{ij}^k is the weight coefficient of feature vector x_i with respect to x_j in the k-th head, representing the correlation between the two, and is computed as:

    α_{ij}^k = softmax_j( LeakyReLU( α_k^T [ W^k x_i ‖ W^k x_j ] ) )

    where α_k is a learnable weight vector and T denotes the transpose operation.
  • Continuing with Fig. 1, N learnable proposal segments and their corresponding proposal features are introduced to further process the video encoding features output by the encoder.
  • Each proposal segment is used to extract the feature sequence at the corresponding position from the video encoding features to obtain a segment-of-interest feature, which is provided together with the proposal feature corresponding to that proposal segment as input to the decoder.
  • Each proposal segment is a normalized two-dimensional coordinate (with values between 0 and 1) that represents a segment on the video timeline; each proposal feature is a vector of dimension d.
  • The lengths of the proposal segments can differ, so the dimensions of the extracted feature sequences can also differ.
  • Therefore, in one example, bilinear interpolation can be used to adjust all the extracted features to the same length M', i.e., each segment-of-interest feature has dimension M' × d.
  • Like the position encoding of the encoder, the N proposal segments and their corresponding proposal features are also parameters obtained through training: they are set randomly at system initialization and continuously adjusted during the subsequent training process.
  • In the decoder, the N proposal features are first input to the multi-head self-attention layer to capture the long-range dependencies between the proposal features. After the output of the multi-head self-attention layer is processed by addition and normalization, the proposal feature corresponding to each proposal segment and the segment-of-interest feature corresponding to that proposal segment interact one-to-one in the sparse interaction module.
  • The output of the sparse interaction module is further processed by addition and normalization and provided to the feed-forward layer; the output of the feed-forward layer, after addition and normalization, yields N segment features, which are the output of the decoder.
  • Fig. 2 takes the k-th proposal feature as an example and shows its sparse interaction with the corresponding segment-of-interest feature in the sparse interaction module.
  • Specifically, the proposal feature vector of dimension d is passed through a linear layer and rescaled to obtain two parameter matrices of sizes d × d_h and d_h × d (where d_h can be set according to the specific decoder requirements); the segment-of-interest feature is then matrix-multiplied with these two parameters in turn to obtain a segment feature of size M' × d.
  • This process can be regarded as passing the segment-of-interest feature through two one-dimensional convolutional layers, so it can also be called a dynamic convolution operation.
  • In the decoder described above, each proposal feature interacts only with its corresponding segment-of-interest feature rather than with the global video encoding features, which can greatly improve the training convergence speed.
  • Continuing with Fig. 1, the prediction module receives the N segment features from the decoder, performs boundary regression and binary classification prediction, and outputs N proposal prediction results, including proposal boundaries and the corresponding confidence scores.
  • In each training iteration, the N proposal predictions obtained through the above process are matched one-to-one with the ground-truth proposal labels of the sample by optimal bipartite matching.
  • For example, the focal loss is used as the binary classification loss, and the L1 loss and the GIoU loss are used as the regression losses. For a video, the sum of the classification cost and the regression cost of the N proposal predictions with respect to each ground-truth label is computed; for each ground-truth label, the unique proposal prediction with the minimum total cost is selected as a positive sample, and the proposal predictions not matched to any ground-truth label are treated as negative samples.
  • In this embodiment, the prediction module consists of two independent feed-forward heads: one composed of a single linear layer for evaluating the confidence scores of the generated proposals, and the other composed of three linear layers for regressing the proposal boundary coordinates.
  • The value of N can be set according to the length of the video segment to be processed, actual requirements and system performance. For example, a 1-minute video segment to be processed usually contains 2 to 3 proposals; N can then be set to be at least larger than the number of proposals that may exist in the video segment, for example any integer greater than 3.
  • However, the larger N is, the more computing resources are consumed. Therefore, N usually does not exceed about ten times the number of proposals that may exist in the video segment to be processed; for example, for a 1-minute video segment, N may be set to an integer between 3 and 30.
  • In the online prediction stage, the video segment to be processed is fed to the system.
  • The system first extracts video features from it, transforms the extracted video features via the encoder into video encoding features carrying the global context information of the input video, and extracts the corresponding segment-of-interest features from the video encoding features using each of the N pre-trained proposal segments.
  • The segment features are then obtained by one-to-one interaction between the segment-of-interest feature corresponding to each proposal segment and its corresponding proposal feature via the decoder, and provided to the prediction module.
  • Finally, boundary regression and binary classification prediction are performed on the segment features from the decoder by the prediction module, and the N proposal results corresponding to the video segment to be processed are output.
  • In this way, N action proposal results are obtained directly without non-maximum suppression post-processing, and the number of generated action proposals is independent of the video length, so the computational burden is greatly reduced and the generation speed of temporal action proposals is greatly improved.
  • It can be seen that the system according to the above embodiments can effectively capture the global context information of the video and obtain video encoding features with stronger representation capability; moreover, by introducing several learnable proposal segments to extract the feature sequences at the corresponding positions from the video encoding features for subsequent prediction, the training convergence speed is greatly improved and the computational burden is greatly reduced.
  • Fig. 3 shows a schematic flowchart of a method for generating temporal action proposals using the video temporal action proposal generation system according to an embodiment of the present invention.
  • The method comprises: step S1) extracting video features from the input video via the feature extraction module; step S2) processing the extracted video features via the encoder to obtain video encoding features carrying the global context information of the input video; step S3) using each of the preset plurality of proposal segments to extract the corresponding segment-of-interest features from the video encoding features; step S4) interacting, via the decoder, the proposal feature corresponding to each proposal segment with the segment-of-interest feature corresponding to that proposal segment to obtain segment features; and step S5) performing boundary regression and binary classification prediction via the prediction module according to the segment features from the decoder, and outputting the corresponding temporal action proposal results.
  • To better illustrate the performance of the present invention, the inventors also compared the temporal action proposal generation method of the present invention with existing, commonly used methods on the THUMOS14 and ActivityNet-1.3 datasets.
  • In the prediction stage, the video features are input into the trained system, and the outputs of the prediction module are used as the final N proposal results.
  • The proposal results are compared with the ground-truth labels, and the recall on the validation set is calculated to verify the performance of the trained model.
  • Table 1 compares the method of the present invention with current mainstream methods on the THUMOS14 dataset, using proposal recall as the evaluation metric; the results show that the method of the present invention outperforms the other methods.
  • Table 2 compares the inference speed of the method of the present invention with other mainstream algorithms on the ActivityNet-1.3 dataset. For a fair comparison, the average inference time per video is calculated; the results show that the method of the present invention is at least 8 times faster than existing methods.
  • A computer-readable storage medium is also provided, on which computer programs or executable instructions are stored; when the computer programs or executable instructions are executed by a processor or other computing unit, the technical solutions described in the foregoing embodiments are implemented.
  • A computer-readable storage medium may be any tangible medium capable of storing data and readable by a computing device. Examples of computer-readable storage media include hard drives, network attached storage (NAS), read-only memory, random access memory, CD-ROM, CD-R, CD-RW, magnetic tape, and other optical or non-optical data storage devices.
  • The computer-readable storage medium may also include computer-readable media distributed over network-coupled computer systems, so that the computer programs or instructions are stored and executed in a distributed manner.
  • References in this specification to "various embodiments", "some embodiments", "one embodiment" or "an embodiment" mean that a particular feature, structure or property described in connection with the embodiment is included in at least one embodiment.
  • Therefore, appearances of the phrases "in various embodiments", "in some embodiments", "in one embodiment" or "in an embodiment" in various places throughout this specification do not necessarily refer to the same embodiment.
  • Furthermore, the particular features, structures or properties may be combined in any suitable manner in one or more embodiments. Therefore, a particular feature, structure or property shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures or properties of one or more other embodiments without limitation, as long as the combination is not illogical or inoperative.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a video temporal action proposal generation system and method. Video features extracted from an input video are processed by an encoder to obtain video encoding features with global information, and a plurality of pre-trained proposal segments are used to extract the corresponding segment-of-interest features from the video encoding features and provide them to a decoder. The decoder generates segment features based on the segment-of-interest feature corresponding to each proposal segment and the pre-trained proposal feature corresponding to each proposal segment, and provides them to a prediction module; the prediction module generates temporal action proposal results based on the segment features from the decoder. The solution of the embodiments of the present invention can effectively capture the global context information of the video and obtain video encoding features with stronger representation capability; moreover, by introducing several learnable proposal segments to extract the feature sequences at the corresponding positions from the video encoding features for subsequent prediction, the training convergence speed is improved and the computational burden is greatly reduced.

Description

Method and system for generating video temporal action proposals
Technical Field
The present invention relates to video processing, and in particular to a system and method for generating video temporal action proposals.
Background Art
Generating video temporal action proposals is a key step in video temporal action detection. Its purpose is to detect action clips containing human actions in a long untrimmed video, that is, to determine the start and end times of the actions. High-quality video temporal action proposals should have the following two key properties: (1) accurate temporal boundaries, i.e., the generated proposals should completely cover the region where the action occurs; (2) reliable confidence scores that accurately evaluate the quality of the generated proposals for subsequent retrieval and ranking. Combining temporal action proposals with specific action categories allows the subsequent temporal action detection task to be completed. Generating temporal action proposals efficiently and with high quality helps improve the recognition accuracy of video actions.
Summary of the Invention
The purpose of the embodiments of the present invention is to provide a new method and system for generating video temporal action proposals, so as to generate high-quality video temporal action proposals quickly and efficiently. The above purpose is achieved through the following technical solutions:
According to a first aspect of the embodiments of the present invention, a video temporal action proposal generation system is provided, which includes a feature extraction module, a feature processing module and a prediction module. The feature extraction module is used to extract video features related to the video from the input video. The feature processing module includes a pre-trained encoder and decoder, where the encoder obtains video encoding features with global information based on the video features from the feature extraction module, and extracts, through several pre-trained proposal segments, the segment-of-interest features corresponding to each proposal segment from the video encoding features and provides them to the decoder; the decoder generates segment features based on the segment-of-interest feature corresponding to each proposal segment and the pre-trained proposal feature corresponding to the proposal segment, and provides them to the prediction module. The prediction module generates temporal action proposal results based on the segment features from the decoder, including proposal boundaries and confidence scores.
In some embodiments of the present invention, the encoder includes a graph attention layer, a multi-head self-attention layer and a feed-forward layer, where the encoder uses the sum of the video features and the position encoding as the value-vector input of the multi-head self-attention layer, and also provides this sum as input to the graph attention layer; the output of the graph attention layer is linearly transformed to obtain the query vector and key vector of the multi-head self-attention layer.
In some embodiments of the present invention, the decoder includes a multi-head self-attention layer, a sparse interaction module and a feed-forward layer, where the decoder processes the proposal features corresponding to the proposal segments with the multi-head self-attention layer and then provides them to the sparse interaction module for sparse interaction with the segment-of-interest features corresponding to the proposal segments; the output of the sparse interaction module is processed by the feed-forward layer to obtain the segment features.
In some embodiments of the present invention, the feature processing module may be constructed based on the Transformer model.
In some embodiments of the present invention, the prediction module may perform boundary regression and binary classification prediction based on the segment features from the decoder.
According to a second aspect of the embodiments of the present invention, a method for generating temporal action proposals using the system according to the first aspect is also provided, including: step S1) extracting video features from the input video via the feature extraction module; step S2) processing the extracted video features via the encoder to obtain video encoding features carrying the global context information of the input video; step S3) using each of several pre-trained proposal segments to extract the corresponding segment-of-interest features from the video encoding features; step S4) generating segment features via the decoder based on the segment-of-interest feature corresponding to each proposal segment and the pre-trained proposal feature corresponding to the proposal segment; step S5) performing boundary regression and binary classification prediction via the prediction module according to the segment features from the decoder, and outputting the corresponding temporal action proposal results.
In some embodiments of the present invention, the encoder may include a graph attention layer, a multi-head self-attention layer and a feed-forward layer, where step S2) includes using the sum of the video features and the position encoding as the value-vector input of the multi-head self-attention layer, while providing this sum as input to the graph attention layer; the output of the graph attention layer is linearly transformed to obtain the query vector and key vector of the multi-head self-attention layer.
According to a third aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, on which a computer program is stored; when the program is executed, the method described in the second aspect of the above embodiments is implemented.
The technical solutions provided by the embodiments of the present invention may include the following beneficial effects:
The solution can effectively capture the global context information of the video and obtain video encoding features with stronger representation capability; moreover, by introducing several learnable proposal segments to extract the feature sequences at the corresponding positions from the video encoding features for subsequent prediction, the training convergence speed is greatly improved and the computational burden is greatly reduced.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present invention.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the present invention. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort. In the drawings:
Fig. 1 shows a schematic diagram of the operation flow of a video temporal action proposal generation system according to an embodiment of the present invention.
Fig. 2 shows a schematic diagram of the sparse interaction process of a sparse interaction module according to an embodiment of the present invention.
Fig. 3 shows a schematic flowchart of a method for generating video temporal action proposals according to an embodiment of the present invention.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Furthermore, the described features, structures or properties may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a full understanding of the embodiments of the present invention. However, those skilled in the art will recognize that the technical solutions of the present invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other cases, well-known methods, devices, implementations or operations are not shown or described in detail to avoid obscuring aspects of the present invention.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are merely illustrative; they do not necessarily include all contents and operations/steps, nor must they be executed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be merged or partially merged, so the actual execution order may change according to the actual situation.
Existing methods for generating video temporal action proposals can be divided into anchor-based methods and boundary-based methods. Anchor-based methods perform boundary regression on uniformly distributed anchor boxes with predefined sizes and scales, and use a binary classifier to evaluate the confidence score of each proposal. Specifically, anchor boxes with predefined sizes and ratios are laid at each position of the one-dimensional video feature sequence; if the length of the one-dimensional feature sequence is T and K anchor boxes are laid at each position, a total of TK anchor box results need to be predicted. In the training stage, positive and negative samples are selected according to the intersection-over-union (IoU) with the ground-truth boxes, and temporal boundary regression and binary classification of anchor confidence are performed on the TK anchor boxes. In the inference stage, since the predicted anchor box results overlap heavily, non-maximum suppression is needed to remove redundant predictions and obtain the final proposal results. Common methods of this type include Prop-SSAD (Lin, T., Zhao, X., & Shou, Z., Temporal convolution based action proposal: Submission to ActivityNet 2017. arXiv preprint arXiv:1707.06750) and RapNet (Gao, J., Shi, Z., Wang, G., Li, J., Yuan, Y., Ge, S., & Zhou, X. Accurate temporal action proposal generation with relation-aware pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 07, pp. 10810-10817, April 2020). The performance of this type of method depends heavily on the manual design of the anchor boxes, so it is hard to extend and cumbersome to apply to different scenarios. Boundary-based methods, in contrast, generate candidate proposals of arbitrary length by enumerating all candidate start and end points, and predict the boundary probability of each candidate proposal to obtain a two-dimensional confidence map. The basic module of this type of method is the convolutional layer, which can only capture information in a local region and cannot capture the long-range semantic information of the video. BMN (Lin, T., Liu, X., Li, X., Ding, E., & Wen, S. BMN: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3889-3898, 2019), DBG (Lin, C., Li, J., Wang, Y., Tai, Y., Luo, D., Cui, Z., ... & Ji, R. Fast learning of temporal action proposal via dense boundary generator. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 07, pp. 11499-11506, April 2020) and BSN++ (Su, H., Gan, W., Wu, W., Qiao, Y., & Yan, J. (2020). BSN++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. arXiv preprint arXiv:2009.07641) belong to this type of method.
In addition, both types of methods have the following two disadvantages. First, as the video length increases, the number of predefined anchor boxes and the size of the generated confidence maps increase greatly, consuming large amounts of computing resources and making the methods hard to apply in practical scenarios. Second, both types of methods generate a large number of redundant proposals and require non-maximum suppression as post-processing to eliminate the redundant predictions; this post-processing not only requires careful parameter selection but also greatly reduces the inference speed of the model.
An embodiment of the present invention provides a video temporal action proposal generation system, which includes a feature extraction module, a feature processing module and a prediction module. The feature extraction module is used to extract video features related to the video from the input video. The feature processing module is built on the Transformer model and includes an encoder and a decoder. The encoder obtains video encoding features with global information based on the video features from the feature extraction module, and extracts, through several preset proposal segments, the segment-of-interest features corresponding to each proposal segment from the video encoding features and provides them to the decoder; the decoder generates segment features based on the segment-of-interest feature corresponding to each proposal segment and the proposal feature corresponding to that proposal segment, and provides them to the prediction module. The prediction module generates temporal action proposal results based on the segment features from the decoder, including proposal boundaries and confidence scores.
In this embodiment, the feature processing module and the prediction module of the system are first trained jointly on a training set composed of a large number of video clips annotated with temporal action proposals (which can be called the offline training stage); the video clips to be processed are then provided as input to the trained system, whose output is the temporal action proposals of the input video, including each proposal boundary and the corresponding confidence score (which can be called the online prediction stage). At system initialization, the preset proposal segments and their corresponding proposal features, as well as the parameters involved in the encoder, decoder and prediction module, are all set randomly. During training, these parameters are continuously adjusted until training ends, and the trained parameters are used in the subsequent online prediction stage. It should be noted that the feature extraction module and the prediction module here may adopt any type of machine learning model suitable for extracting video features and for predicting proposal boundaries and confidence scores from input features, including but not limited to neural network models, which is not limited herein. Since the extraction and processing of video features are basically the same in the training stage and the online processing stage, the following mainly describes the processing of video features in the training stage with reference to Fig. 1.
First, for the input video, video features related to the video are extracted by the feature extraction module, for example image features (such as RGB features) and optical flow features of the video. In one example, a neural network such as a Temporal Segment Network (TSN) may be used to extract the video features. The extracted video features of different dimensions are converted into a series of feature sequences with the same feature dimension. The feature dimension of the feature sequence can be set according to actual needs and is not limited here. For ease of description, in the following example the video features are denoted as f ∈ R^{M×C}, where R denotes the set of real numbers, M denotes the length of the video (which can be understood as the number of video frames), and C denotes the dimension of the feature vector, i.e., the dimension of the feature vector extracted from each video frame. It can be seen that the video feature f can also be regarded as a video feature sequence composed of the feature vectors of M video frames, with each video frame at its own specific position in the sequence. The video features extracted by the feature extraction module are provided to the feature processing module for processing. It should be understood that the video features may be appropriately transformed to adapt to or match the feature dimension set in the feature processing module. For example, the extracted features may be passed through a one-dimensional convolutional layer with kernel size 1 to align the feature dimensions, and the transformed video feature sequence serves as the input of the encoder in the subsequent process.
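As a concrete illustration of the kernel-size-1 convolution mentioned above, the following is a minimal PyTorch sketch of how per-frame features of dimension C might be aligned to the encoder dimension d; the class name, the TSN feature size of 2048 and the target dimension of 256 are illustrative assumptions, not part of the disclosure.

```python
import torch
import torch.nn as nn

class FeatureAlign(nn.Module):
    """Aligns per-frame video features of dimension C to the encoder dimension d
    with a kernel-size-1 1-D convolution (illustrative sketch)."""
    def __init__(self, in_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, model_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, M, C); Conv1d expects (batch, C, M)
        x = feats.transpose(1, 2)
        x = self.proj(x)
        return x.transpose(1, 2)          # (batch, M, d)

# Example: M = 100 frames of 2048-dim TSN features projected to d = 256
feats = torch.randn(2, 100, 2048)
aligned = FeatureAlign(2048, 256)(feats)  # (2, 100, 256)
```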
Referring to Fig. 1, the encoder mainly consists of a multi-head self-attention layer and a feed-forward layer. The multi-head self-attention layer is composed of multiple independent self-attention layers. The self-attention layer adopts a structure based on the attention mechanism; its core idea is that, when encoding one element of the sequence, it can attend to the other elements of the input sequence, connecting the elements pairwise, so as to effectively capture the global context information of the input sequence and build long-range dependencies between sequence elements, thereby enhancing relevant features and suppressing irrelevant ones. The input of the multi-head self-attention layer is a triplet consisting of a query vector Q, a key vector K and a value vector V. The computation of each self-attention layer is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

where d_k is a scaling factor, T denotes transposition, and softmax() denotes the activation function. As shown in the formula above, the score between every pair of features in the sequence is computed by the dot product of the query vector Q and the key vector K; this score represents the correlation between the two features. To keep gradients stable, the scores are normalized by the scaling factor d_k and then mapped to the range 0-1 by the softmax() function; the resulting scores are finally used to weight the value vector V, so that relevant features are enhanced and irrelevant features are suppressed. On this basis, the multi-head self-attention layer contains multiple independent self-attention layers, each focusing on a part of the context information. The outputs of these self-attention layers (each denoted head, with head = Attention(Q, K, V)) are concatenated and further aggregated by a linear layer to obtain the more robust output MultiHead(Q, K, V) of the multi-head self-attention layer, computed as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where h is the number of self-attention layers contained in the multi-head self-attention layer and W^O is the parameter matrix of the linear layer used to aggregate the features. As shown in Fig. 1, the output of the multi-head self-attention layer is further processed by addition and normalization operations and then fed into the feed-forward layer. The feed-forward layer can be composed of two linear transformation layers and a nonlinear activation function ReLU. The output of the feed-forward layer is processed by addition and normalization operations to obtain the output of the entire encoder, i.e., the video encoding features.
In some embodiments, as in a standard Transformer model, the inputs Q, K and V of the multi-head self-attention layer in the encoder can be obtained by mapping the input feature sequence through three linear transformation layers with different parameter matrices (W^Q, W^K, W^V). For example, if the input sequence is T_0, then Q, K and V are computed as:

Q = T_0 W^Q,  K = T_0 W^K,  V = T_0 W^V
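The following is a minimal PyTorch sketch of the scaled dot-product attention and multi-head aggregation formulas above. It accepts separate query/key/value inputs so that, as in the encoder described here, Q and K can be derived from the graph-attention output while V comes directly from the position-encoded features; all module and variable names are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadSelfAttention(nn.Module):
    """h independent attention heads, concatenated and aggregated by W_O (sketch)."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q_in, k_in, v_in):
        b, m, _ = q_in.shape
        def split(x):  # (b, m, d) -> (b, h, m, d_head)
            return x.view(b, -1, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(q_in)), split(self.w_k(k_in)), split(self.w_v(v_in))
        out = scaled_dot_product_attention(q, k, v)           # (b, h, m, d_head)
        out = out.transpose(1, 2).reshape(b, m, self.h * self.d_head)
        return self.w_o(out)                                  # Concat(head_1..h) W_O
```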
In the embodiment shown in Fig. 1, a graph attention layer is introduced in the encoder on top of the multi-head self-attention layer to pre-process the input sequence, so that the encoder can better focus on the segments of the video where actions occur and build connections between action segments, thereby obtaining encoding features with stronger representation capability. In addition, to preserve the relative positional relationships in the video features, such as the relative positions and temporal order of the video frames, position encoding is used in the encoder, and the sum of the input video feature sequence and the position encoding is used as input:

x ∈ R^{M×d}

where d is the feature dimension used in the encoder. The dimension of the position encoding is the same as that of the input video features, i.e., the feature vector of each video frame in the input video feature sequence has its own corresponding position code. As mentioned above, the position encoding, being one of the parameters of the encoder, is set randomly at system initialization and is continuously adjusted during the subsequent training process.
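A learnable position code of shape M × d that is simply added to the aligned features is one way to realize the randomly initialized, trainable position encoding described above; the sketch below assumes this interpretation, and the class name and initialization are illustrative.

```python
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Learnable position code of the same shape as the input sequence (M x d);
    it is added to the aligned video features before the encoder (sketch)."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))
        nn.init.normal_(self.pos, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M, d); use the first M position vectors
        return x + self.pos[: x.size(1)].unsqueeze(0)
```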
As shown in Fig. 1, the input x obtained by adding the input video feature sequence and the position encoding is used directly as the value vector V of the multi-head self-attention layer; at the same time, x is provided to a graph attention layer for transformation, and the output of the graph attention layer is further transformed by linear layers to obtain the query vector Q and the key vector K of the multi-head self-attention layer. The graph attention layer is used to further strengthen the connections between features at different time points in the video. Taking the i-th input vector x_i as an example, after passing through the graph attention layer it is transformed into:

x̃_i = ‖_{k=1}^{K} σ( Σ_j α_{ij}^k W^k x_j )

where ‖ denotes the concatenation operation, K is the number of heads of the graph attention layer, and i = 1, 2, ..., M, with M denoting the length of the video (i.e., the number of video frames) as mentioned above. W^k is the learnable weight matrix of the k-th graph attention head, and σ is a nonlinear activation function, for example the Leaky ReLU function. α_{ij}^k is the weight coefficient of feature vector x_i with respect to x_j in the k-th graph attention head, representing the correlation between the two; it is computed as:

α_{ij}^k = softmax_j( LeakyReLU( α_k^T [ W^k x_i ‖ W^k x_j ] ) )

where α_k is a learnable weight vector and T denotes the transpose operation. By introducing the above graph attention mechanism into the encoder, the connections between different frames of the video feature sequence can be further built dynamically, so that the global context information is captured more accurately, helping the encoder obtain video encoding features with stronger representation capability.
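The following sketch implements the graph attention transformation above over a fully connected graph of the M frame features, using the standard decomposition of α_k^T[W^k x_i ‖ W^k x_j] into a source term and a target term; the fully connected connectivity, the initialization and the class name are assumptions rather than details taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Multi-head graph attention over M frame features; the K head outputs
    are concatenated, as in the formula above (illustrative sketch)."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.k, self.d_head = num_heads, d_model // num_heads
        self.w = nn.Linear(d_model, d_model, bias=False)            # stacked W^k
        self.a_src = nn.Parameter(torch.empty(num_heads, self.d_head))
        self.a_dst = nn.Parameter(torch.empty(num_heads, self.d_head))
        nn.init.xavier_uniform_(self.a_src)
        nn.init.xavier_uniform_(self.a_dst)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, m, _ = x.shape
        h = self.w(x).view(b, m, self.k, self.d_head)                # W^k x_j
        # a_k^T [W^k x_i || W^k x_j] split into a source part and a target part
        e_src = (h * self.a_src).sum(-1)                             # (b, m, k)
        e_dst = (h * self.a_dst).sum(-1)                             # (b, m, k)
        e = F.leaky_relu(e_src.unsqueeze(2) + e_dst.unsqueeze(1))    # (b, m, m, k)
        alpha = torch.softmax(e, dim=2)                              # attention of i over j
        out = torch.einsum('bijk,bjkd->bikd', alpha, h)              # sum_j alpha_ij^k W^k x_j
        return out.reshape(b, m, self.k * self.d_head)               # concat over heads
```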
Continuing with Fig. 1, in an embodiment of the present invention, N learnable proposal segments and their corresponding proposal features are introduced to further process the video encoding features output by the encoder. Each proposal segment is used to extract the feature sequence at the corresponding position from the video encoding features to obtain a segment-of-interest feature, which is provided, together with the proposal feature corresponding to that proposal segment, as input to the decoder. Each proposal segment is a normalized two-dimensional coordinate (with values between 0 and 1) that represents a segment on the video timeline; each proposal feature is a vector of dimension d. Here, the lengths of the proposal segments can differ, so the dimensions of the extracted feature sequences may also differ. Therefore, in one example, after the feature sequence at the corresponding position is extracted from the video encoding features using a proposal segment, bilinear interpolation can be used to adjust all the extracted features to the same length M', i.e., each segment-of-interest feature has dimension M' × d. As mentioned above, like the position encoding of the encoder, the N proposal segments and their corresponding proposal features are also parameters to be obtained through training: they are set randomly at system initialization and continuously adjusted during the subsequent training process.
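One plausible way to realize the segment-of-interest extraction and the length normalization to M' is sketched below; the exact sampling and rounding scheme is an assumption, and one-dimensional linear interpolation is used as the counterpart of the bilinear interpolation mentioned above.

```python
import torch
import torch.nn.functional as F

def extract_segment_features(enc_feats: torch.Tensor,
                             segments: torch.Tensor,
                             out_len: int = 16) -> torch.Tensor:
    """enc_feats: (M, d) encoded video features; segments: (N, 2) normalised
    (start, end) pairs in [0, 1]. Returns (N, out_len, d) segment-of-interest
    features, resampled to a common length M' = out_len (illustrative sketch)."""
    m, d = enc_feats.shape
    out = []
    for start, end in segments.tolist():
        lo = min(int(start * (m - 1)), m - 2)                 # at least two frames
        hi = max(min(int(end * (m - 1)) + 1, m), lo + 2)
        seg = enc_feats[lo:hi]                                # (L, d), L varies
        seg = F.interpolate(seg.t().unsqueeze(0),             # (1, d, L)
                            size=out_len, mode='linear',
                            align_corners=True)
        out.append(seg.squeeze(0).t())                        # (out_len, d)
    return torch.stack(out)                                   # (N, out_len, d)
```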
In the decoder, the N proposal features are first input to the multi-head self-attention layer to capture the long-range dependencies between the proposal features. After the output of the multi-head self-attention layer is processed by addition and normalization, the proposal feature corresponding to each proposal segment and the segment-of-interest feature corresponding to that proposal segment interact one-to-one in the sparse interaction module. The output of the sparse interaction module is further processed by addition and normalization and provided to the feed-forward layer; the output of the feed-forward layer, after addition and normalization, yields N segment features, which are the output of the decoder. Fig. 2 takes the k-th proposal feature as an example and shows its sparse interaction with the corresponding segment-of-interest feature in the sparse interaction module. Specifically, the proposal feature vector of dimension d is passed through a linear layer and rescaled to obtain two parameter matrices of sizes d × d_h and d_h × d (where d_h can be set according to the specific decoder requirements); the segment-of-interest feature is then matrix-multiplied with these two parameters in turn to obtain a segment feature of size M' × d. This process can be regarded as passing the segment-of-interest feature through two one-dimensional convolutional layers, so it can also be called a dynamic convolution operation. In the decoder described above, each proposal feature interacts only with its corresponding segment-of-interest feature rather than with the global video encoding features, which can greatly improve the training convergence speed.
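The sparse interaction (dynamic convolution) step can be sketched as follows: each proposal feature generates two per-proposal parameter matrices that act only on its own segment-of-interest feature. The parameter-generation layer and the ReLU between the two matrix multiplications are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SparseInteraction(nn.Module):
    """One-to-one interaction between a proposal feature (dim d) and its
    segment-of-interest feature (M' x d): the proposal feature is mapped to
    two matrices of sizes d x d_h and d_h x d that act on the segment feature
    like two 1-D convolution layers ("dynamic convolution"). Sketch only."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.d, self.dh = d_model, d_hidden
        self.param_gen = nn.Linear(d_model, d_model * d_hidden * 2)
        self.act = nn.ReLU()

    def forward(self, proposal_feat: torch.Tensor, segment_feat: torch.Tensor):
        # proposal_feat: (N, d); segment_feat: (N, M', d)
        n = proposal_feat.size(0)
        params = self.param_gen(proposal_feat)                        # (N, 2*d*d_h)
        w1 = params[:, : self.d * self.dh].view(n, self.d, self.dh)   # (N, d, d_h)
        w2 = params[:, self.d * self.dh:].view(n, self.dh, self.d)    # (N, d_h, d)
        out = self.act(torch.bmm(segment_feat, w1))                   # (N, M', d_h)
        out = torch.bmm(out, w2)                                      # (N, M', d)
        return out
```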
Continuing with Fig. 1, the prediction module receives the N segment features from the decoder, performs boundary regression and binary classification prediction, and outputs N proposal prediction results, including proposal boundaries and the corresponding confidence scores. In each training iteration, the N proposal predictions obtained through the above process are matched one-to-one with the ground-truth proposal labels of the sample by optimal bipartite matching. For example, the focal loss is used as the binary classification loss, and the L1 loss and the GIoU loss are used as the regression losses; for a video, the sum of the classification cost and the regression cost of the N proposal predictions with respect to each ground-truth label is computed, and for each ground-truth label the unique proposal prediction with the minimum total cost is selected as a positive sample, while the proposal predictions not matched to any ground-truth label are all treated as negative samples. In this embodiment, the prediction module consists of two independent feed-forward heads: one composed of a single linear layer for evaluating the confidence scores of the generated proposals, and the other composed of three linear layers for regressing the proposal boundary coordinates. The above training process is repeated on the training set for continued iterative optimization, where the proposal boundaries output by the prediction module in each round are used as the N proposal segments in the next round of training. After training is completed, the N proposal segments and their corresponding proposal features, as well as the parameters involved in the encoder, decoder and prediction module, are all fixed and can be used in the subsequent online prediction stage. Herein, the value of N can be set according to the length of the video segment to be processed, actual requirements and system performance. For example, a video segment of 1 minute to be processed usually contains 2 to 3 proposals, so N can be set to be at least larger than the number of proposals that may exist in the video segment, for example any integer greater than 3. It should be understood, however, that the larger N is, the more computing resources are consumed; therefore, N usually does not exceed about ten times the number of proposals that may exist in the video segment to be processed. For example, for a video segment of 1 minute to be processed, N may be set to an integer between 3 and 30.
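A minimal sketch of the two feed-forward prediction heads is given below; pooling the M' × d segment feature over the temporal axis before the heads is an assumption, as the disclosure does not specify how the heads consume the segment feature, and the hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    """Two independent feed-forward heads: a single linear layer scores proposal
    confidence, and a three-layer MLP regresses normalised (start, end) boundaries."""
    def __init__(self, d_model: int):
        super().__init__()
        self.cls_head = nn.Linear(d_model, 1)                      # confidence logit
        self.reg_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2),                                 # (start, end)
        )

    def forward(self, segment_feats: torch.Tensor):
        # segment_feats: (N, M', d) -> pool over the temporal axis (assumption)
        pooled = segment_feats.mean(dim=1)                         # (N, d)
        scores = torch.sigmoid(self.cls_head(pooled)).squeeze(-1)  # (N,)
        bounds = torch.sigmoid(self.reg_head(pooled))              # (N, 2) in [0, 1]
        return bounds, scores
```

During training, the one-to-one optimal bipartite matching described above could be computed, for example, with scipy.optimize.linear_sum_assignment applied to the summed classification and regression costs between the N predictions and the ground-truth labels.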
In the online prediction stage, the video segment to be processed is provided to the system. The system first extracts video features from it, transforms the extracted video features via the encoder into video encoding features carrying the global context information of the input video, and extracts the corresponding segment-of-interest features from the video encoding features using each of the N pre-trained proposal segments. Then, via the decoder, the segment-of-interest feature corresponding to each proposal segment and its corresponding proposal feature interact one-to-one to obtain the segment features, which are provided to the prediction module. Finally, the prediction module performs boundary regression and binary classification prediction on the segment features from the decoder and outputs the N proposal results corresponding to the video segment to be processed. Unlike the prior art, by introducing N learnable proposal segments and corresponding proposal features, the system can obtain N action proposal results directly without non-maximum suppression post-processing, and the number of generated action proposals is independent of the video length, so the computational burden is greatly reduced and the generation speed of temporal action proposals is greatly improved.
It can be seen that the system according to the above embodiments can effectively capture the global context information of the video and obtain video encoding features with stronger representation capability; moreover, by introducing several learnable proposal segments to extract the feature sequences at the corresponding positions from the video encoding features for subsequent prediction, the training convergence speed is greatly improved and the computational burden is greatly reduced.
Fig. 3 shows a schematic flowchart of a method for generating temporal action proposals using the above video temporal action proposal generation system according to an embodiment of the present invention. The method comprises: step S1) extracting video features from the input video via the feature extraction module; step S2) processing the extracted video features via the encoder to obtain video encoding features carrying the global context information of the input video; step S3) using each of the preset plurality of proposal segments to extract the corresponding segment-of-interest features from the video encoding features; step S4) interacting, via the decoder, the proposal feature corresponding to each proposal segment with the segment-of-interest feature corresponding to that proposal segment to obtain segment features; step S5) performing boundary regression and binary classification prediction via the prediction module according to the segment features from the decoder, and outputting the corresponding temporal action proposal results.
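Putting the pieces together, the following sketch traces steps S1-S5 through hypothetical module objects; every name here (feature_extractor, align, pos_enc, encoder, decoder, predictor, and the extract_segment_features helper from the earlier sketch) is an assumption standing in for the trained components described above, not an interface defined by the disclosure.

```python
def generate_proposals(video_frames, feature_extractor, align, pos_enc,
                       encoder, segments, proposal_feats, decoder, predictor):
    """End-to-end sketch of steps S1-S5 under the assumptions stated above."""
    feats = feature_extractor(video_frames)          # S1: (1, M, C) video features
    x = pos_enc(align(feats))                        # align to d and add position code
    enc = encoder(x)                                 # S2: (1, M, d) encoded features
    soi = extract_segment_features(enc.squeeze(0),   # S3: (N, M', d) segment-of-interest
                                   segments)         #     features for N learned segments
    seg_feats = decoder(proposal_feats, soi)         # S4: (N, M', d) decoded segment features
    bounds, scores = predictor(seg_feats)            # S5: N proposal boundaries + scores
    return bounds, scores
```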
To better illustrate the performance of the present invention, the inventors also compared the temporal action proposal generation method of the present invention with existing, commonly used methods on the THUMOS14 dataset and the ActivityNet-1.3 dataset.
During training, the system structure shown in Fig. 1 is trained iteratively on the training set for 20 epochs; after each epoch, the loss on the validation set is computed to evaluate the performance of the system, and the system structure with the smallest validation loss is selected as the trained system.
In the prediction stage, the video features are input into the trained system, and the outputs of the prediction module are used as the final N proposal results. The proposal results are compared with the ground-truth labels, and the recall on the validation set is calculated to verify the performance of the trained model structure. Table 1 compares the method of the present invention with current mainstream methods on the THUMOS14 dataset, using proposal recall as the evaluation metric; the results show that the method of the present invention outperforms the other methods. Table 2 compares the inference speed of the method of the present invention with other mainstream algorithms on the ActivityNet-1.3 dataset. For a fair comparison, the average inference time per video is calculated; the results show that the method of the present invention is at least 8 times faster than existing methods.
Table 1
Method             AR@50   AR@100   AR@200   AR@500
BSN                37.46   46.06    53.21    60.64
BMN                39.36   47.72    54.70    62.07
RapNet             40.35   48.23    54.92    61.41
DBG                37.32   46.67    54.50    62.21
Present invention  40.40   48.70    55.51    62.20
Table 2
Method       BSN     BMN     GTAD    DBG     Present invention
T_pro (sec)  0.671   0.118   0.103   0.219   0.056
T_all (sec)  0.815   0.745   0.862   0.596   0.074
In yet another embodiment of the present invention, a computer-readable storage medium is also provided, on which computer programs or executable instructions are stored; when the computer programs or executable instructions are executed by a processor or other computing unit, the technical solutions described in the foregoing embodiments are implemented, with similar implementation principles that are not repeated here. In the embodiments of the present invention, the computer-readable storage medium may be any tangible medium capable of storing data and readable by a computing device. Examples of computer-readable storage media include hard drives, network attached storage (NAS), read-only memory, random access memory, CD-ROM, CD-R, CD-RW, magnetic tape, and other optical or non-optical data storage devices. The computer-readable storage medium may also include computer-readable media distributed over network-coupled computer systems, so that the computer programs or instructions can be stored and executed in a distributed manner.
References in this specification to "various embodiments", "some embodiments", "one embodiment" or "an embodiment" mean that a particular feature, structure or property described in connection with the embodiment is included in at least one embodiment. Therefore, appearances of the phrases "in various embodiments", "in some embodiments", "in one embodiment" or "in an embodiment" in various places throughout this specification do not necessarily refer to the same embodiment. Furthermore, the particular features, structures or properties may be combined in any suitable manner in one or more embodiments. Therefore, a particular feature, structure or property shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures or properties of one or more other embodiments without limitation, as long as the combination is not illogical or inoperative.
The terms "comprising" and "having" and similar expressions in this specification are intended to cover a non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to such processes, methods, products or devices. "A" or "an" does not exclude a plurality. In addition, the elements in the drawings of the present application are for illustration only and are not drawn to scale.
Although the present invention has been described through the above embodiments, the present invention is not limited to the embodiments described herein; various changes and modifications made without departing from the scope of the present invention are also included.

Claims (9)

  1. A video temporal action proposal generation system, comprising a feature extraction module, a feature processing module and a prediction module, wherein:
    the feature extraction module is configured to extract video features related to the video from the input video;
    the feature processing module comprises a pre-trained encoder and decoder, wherein the encoder obtains video encoding features with global information based on the video features from the feature extraction module, and extracts, through several pre-trained proposal segments, the segment-of-interest features corresponding to each proposal segment from the video encoding features and provides them to the decoder; the decoder generates segment features based on the segment-of-interest feature corresponding to each proposal segment and the pre-trained proposal feature corresponding to the proposal segment, and provides them to the prediction module;
    the prediction module generates temporal action proposal results based on the segment features from the decoder, including proposal boundaries and confidence scores.
  2. The system according to claim 1, wherein the encoder comprises a graph attention layer, a multi-head self-attention layer and a feed-forward layer, wherein the encoder uses the sum of the video features and the position encoding as the value-vector input of the multi-head self-attention layer, and also provides this sum as input to the graph attention layer; the output of the graph attention layer is linearly transformed to obtain the query vector and key vector of the multi-head self-attention layer.
  3. The system according to claim 1, wherein the decoder comprises a multi-head self-attention layer, a sparse interaction module and a feed-forward layer, wherein the decoder processes the proposal features corresponding to the proposal segments with the multi-head self-attention layer and then provides them to the sparse interaction module for sparse interaction with the segment-of-interest features corresponding to the proposal segments; the output of the sparse interaction module is processed by the feed-forward layer to obtain the segment features.
  4. The system according to claim 1, wherein the feature processing module is constructed based on the Transformer model.
  5. The system according to claim 1, wherein the prediction module performs boundary regression and binary classification prediction based on the segment features from the decoder.
  6. A method for generating temporal action proposals using the system according to any one of the preceding claims, comprising:
    step S1) extracting video features from the input video via the feature extraction module;
    step S2) processing the extracted video features via the encoder to obtain video encoding features carrying the global context information of the input video;
    step S3) using each of several pre-trained proposal segments to extract the corresponding segment-of-interest features from the video encoding features;
    step S4) generating segment features via the decoder based on the segment-of-interest feature corresponding to each proposal segment and the pre-trained proposal feature corresponding to the proposal segment;
    step S5) performing boundary regression and binary classification prediction via the prediction module according to the segment features from the decoder, and outputting the corresponding temporal action proposal results.
  7. The method according to claim 6, wherein the encoder comprises a graph attention layer, a multi-head self-attention layer and a feed-forward layer, wherein step S2) comprises using the sum of the video features and the position encoding as the value-vector input of the multi-head self-attention layer, while providing this sum as input to the graph attention layer; the output of the graph attention layer is linearly transformed to obtain the query vector and key vector of the multi-head self-attention layer.
  8. The method according to claim 6, wherein the decoder comprises a multi-head self-attention layer, a sparse interaction module and a feed-forward layer, wherein step S4) comprises processing the proposal features corresponding to the proposal segments with the multi-head self-attention layer and then providing them to the sparse interaction module for sparse interaction with the segment-of-interest features corresponding to the proposal segments; the output of the sparse interaction module is processed by the feed-forward layer to obtain the segment features.
  9. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the program is executed, the method according to any one of claims 6-8 is implemented.
PCT/CN2022/113540 2021-09-08 2022-08-19 Method and system for generating video temporal action proposals WO2023035904A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111049034.6A CN115797818A (zh) 2021-09-08 2021-09-08 Method and system for generating video temporal action proposals
CN202111049034.6 2021-09-08

Publications (2)

Publication Number Publication Date
WO2023035904A1 true WO2023035904A1 (zh) 2023-03-16
WO2023035904A9 WO2023035904A9 (zh) 2024-03-14

Family

ID=85473422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113540 WO2023035904A1 (zh) 2021-09-08 2022-08-19 视频时序动作提名生成方法及系统

Country Status (2)

Country Link
CN (1) CN115797818A (zh)
WO (1) WO2023035904A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797972A (zh) * 2023-06-26 2023-09-22 中科(黑龙江)数字经济研究院有限公司 基于稀疏图因果时序编码的自监督群体行为识别方法及其识别系统
CN117601143A (zh) * 2023-11-23 2024-02-27 中建新疆建工集团第三建设工程有限公司 基于多传感器融合的智能巡检机器人及方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292307B (zh) * 2023-11-27 2024-01-30 江苏源驶科技有限公司 一种基于粗时间粒度的时序动作提名生成方法及系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120219213A1 (en) * 2011-02-28 2012-08-30 Jinjun Wang Embedded Optical Flow Features
CN110163129A (zh) * 2019-05-08 2019-08-23 腾讯科技(深圳)有限公司 视频处理的方法、装置、电子设备及计算机可读存储介质
CN111327949A (zh) * 2020-02-28 2020-06-23 华侨大学 一种视频的时序动作检测方法、装置、设备及存储介质
CN111372123A (zh) * 2020-03-03 2020-07-03 南京信息工程大学 基于从局部到全局的视频时序片段提取方法
CN112183588A (zh) * 2020-09-11 2021-01-05 上海商汤智能科技有限公司 视频处理方法及装置、电子设备及存储介质
CN112906586A (zh) * 2021-02-26 2021-06-04 上海商汤科技开发有限公司 时序动作提名生成方法和相关产品

Also Published As

Publication number Publication date
WO2023035904A9 (zh) 2024-03-14
CN115797818A (zh) 2023-03-14

Similar Documents

Publication Publication Date Title
WO2023035904A1 (zh) Method and system for generating video temporal action proposals
Xu et al. LCANet: End-to-end lipreading with cascaded attention-CTC
Wang et al. Hierarchical feature selection for random projection
Cen et al. Deep feature augmentation for occluded image classification
CN113657124B (zh) 基于循环共同注意力Transformer的多模态蒙汉翻译方法
US11270124B1 (en) Temporal bottleneck attention architecture for video action recognition
WO2016037350A1 (en) Learning student dnn via output distribution
Yang et al. Transformer-based source-free domain adaptation
CN113822125B (zh) 唇语识别模型的处理方法、装置、计算机设备和存储介质
Lin et al. Fp-age: Leveraging face parsing attention for facial age estimation in the wild
Chang et al. End-to-End ASR with Adaptive Span Self-Attention.
Zhao et al. Transformer vision-language tracking via proxy token guided cross-modal fusion
Zhu et al. Multi-scale temporal network for continuous sign language recognition
Nguyen et al. Self-attention amortized distributional projection optimization for sliced Wasserstein point-cloud reconstruction
Liu et al. Bilaterally normalized scale-consistent sinkhorn distance for few-shot image classification
Zhang et al. Weakly-supervised action localization via embedding-modeling iterative optimization
Tao et al. An efficient and robust cloud-based deep learning with knowledge distillation
Abrol et al. Improving generative modelling in VAEs using multimodal prior
Wu et al. Structured discriminative tensor dictionary learning for unsupervised domain adaptation
Huang et al. Residual networks as flows of velocity fields for diffeomorphic time series alignment
Wang et al. MIFNet: Multiple instances focused temporal action proposal generation
Jiao et al. CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition
Lei et al. Masked Diffusion Models are Fast Learners
Viswanathan et al. Text to image translation using generative adversarial networks
Koohzadi et al. A context based deep temporal embedding network in action recognition

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE