CN114494314A - Temporal boundary detection method and temporal perceiver
- Publication number: CN114494314A (application number CN202111615241.3A)
- Authority: CN (China)
- Prior art keywords: boundary, video, feature, layer, query
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T 7/13 - Image analysis; Segmentation; Edge detection
- G06F 16/75 - Information retrieval of video data; Clustering; Classification
- G06N 3/02 - Computing arrangements based on biological models; Neural networks
- G06N 3/084 - Neural network learning methods; Backpropagation, e.g. using gradient descent
Abstract
Description
Technical Field

The invention belongs to the technical field of computer software and relates to video temporal boundary detection, specifically a temporal boundary detection method and a temporal perceiver.
Background

With the explosive growth of video data on the Internet, video content understanding has become an important problem in computer vision. In the existing literature, long-video understanding remains under-explored. Class-free temporal boundary detection is an effective technique for bridging the gap between long- and short-video understanding; its goal is to segment a long video into a series of video segments. Class-free temporal boundaries are temporal boundaries that arise naturally from semantic discontinuity and are not constrained by any pre-defined semantic category; existing datasets contain class-free temporal boundaries at different granularities, such as the sub-action, event, and scene levels. Detecting class-free temporal boundaries at different granularities requires information at different levels to capture temporal structure and context at different scales.
Currently, because boundary semantics and granularity differ, research on class-free temporal boundary detection is split across several distinct tasks. Temporal action segmentation detects sub-action-level class-free boundaries that split an action instance into several sub-action segments. Generic event boundary detection locates event-level class-free boundaries, i.e. moments at which the action, subject, or environment changes. Movie scene segmentation detects scene-level class-free boundaries, i.e. transitions between movie scenes that mark high-level plot turns. The target videos of these tasks share the same semantic structure, and their boundary detection paradigms exhibit similar characteristics. Previous work on these tasks has mainly focused on feature encodings carefully designed for a specific boundary type and has cast boundary detection as a dense prediction problem. At prediction time, such methods rely on complex post-processing to remove the many false positives that repeatedly predict the same ground-truth boundary. These elaborate designs and post-processing modules are tightly coupled to specific boundary types, so they do not generalize well to other kinds of class-free boundary detection.
Summary of the Invention

The problem to be solved by the invention is as follows: existing class-free temporal boundary detection paradigms share similar properties, yet because of differences in boundary semantics and granularity they are studied as separate tasks. Existing related work focuses on feature encodings carefully designed for specific boundaries and, owing to the dense prediction paradigm, relies on complex post-processing to eliminate false positives, so it does not generalize well to different types of class-free boundary detection.
The technical scheme of the invention is a temporal boundary detection method that builds a class-free temporal boundary detection network to detect temporal boundaries in a video. The detection network comprises a backbone network and a detection model, and is implemented as follows:
1) Detection samples are generated by the backbone network: the video is sampled at a fixed interval to obtain a video image sequence, and one video clip is generated per frame; the i-th clip is the image sequence formed by the k consecutive frames before and after the i-th frame image f_i. The backbone network produces, for each input clip, a video feature and a continuity score, where F_i and S_i are the RGB feature and the continuity score of clip i, respectively.
2) The detection model performs class-free temporal boundary detection based on the video features F and the continuity scores S. The detection model is configured as follows:
2.1) Encoder: the encoder E comprises N_e transformer decoding layers in series; each layer contains a multi-head self-attention layer, a multi-head cross-attention layer and a linear mapping layer, each followed by a residual structure. M latent feature queries Q_e are introduced into the encoder. The video features F are sorted in descending order of the continuity scores S and then fed to the encoder, which compresses the sorted video features into M-frame compressed features H; the initial compressed features H_0 are set to 0. In the j-th transformer decoding layer, the latent queries Q_e are added to the current compressed features H_j, passed through the self-attention layer and its residual structure, interact with the re-ranked video features in the cross-attention layer, and are then transformed by residual structure - linear mapping layer - residual structure to obtain the compressed features H_{j+1}, j ∈ [0, N_e - 1]. After the stacked N_e encoding layers, the input features are compressed and encoded into the compressed features H.
The latent feature queries are generated as follows: the latent queries Q_e are divided into M_b boundary queries and M_c context queries, randomly initialized and learned from the training samples while training the detection model. The boundary queries handle the boundary-region features in the video features and the context queries handle the context-region features; after re-ranking, the first M_b video features are boundary-region features and the rest are context features.
2.2) Decoder: the decoder D comprises N_d decoding layers in series; each layer contains a multi-head self-attention layer, a multi-head cross-attention layer and a linear mapping layer, each followed by a residual structure. For the compressed features H obtained by the encoder, the decoder parses the temporal boundary points through a transformer decoder structure. The decoder defines N_p proposal queries Q_d which, like the latent feature queries, are randomly initialized and learned during training, and the boundary proposals B_0 are initialized to 0. In the j-th layer, the proposal queries Q_d are added to the boundary proposals B_j, passed through the self-attention layer and a residual structure, interact with the compressed features H in the cross-attention layer, and are transformed by residual structure - linear mapping layer - residual structure to obtain the updated boundary proposals B_{j+1}. After the stacked N_d decoding layers, the compressed features are parsed into the temporal boundary proposal representations B.
2.3) Generation and scoring of class-free temporal boundaries: the obtained boundary proposal representations B are fed into two different fully connected branches, a localization branch and a classification branch, which output the moments and the confidence scores of the class-free temporal boundaries, respectively.
2.4) Assignment of training labels: a strict one-to-one training label matching strategy is adopted. According to the defined matching cost C, the Hungarian algorithm yields a set of optimal one-to-one matches; every prediction assigned to a class-free boundary ground truth receives a positive label, with the corresponding ground-truth boundary as its training target. The matching cost C consists of a localization cost and a classification cost: the localization cost is defined by the absolute distance between the predicted moment and the ground-truth boundary moment, and the classification cost is defined by the predicted confidence.
2.5) Submission of class-free temporal boundaries: after a series of class-free temporal boundaries is generated, the most credible boundary moments are selected by a confidence score threshold γ and submitted for subsequent performance measurement.
3) Training stage: the configured model is trained on the training samples, using cross entropy, L1 distance and a log function as loss functions and the AdamW optimizer; the network parameters are updated by back-propagation, and steps 1) and 2) are repeated until the number of iterations is reached.
4) Detection: the video feature sequence and continuity scores of the data to be tested are input into the trained detection model to generate class-free temporal boundary moments and scores, and the method of 2.3) then yields the class-free temporal boundary moment sequence used for performance measurement.
The invention further provides a temporal perceiver having a computer storage medium in which a computer program is configured; the computer program implements the above class-free temporal boundary detection network and, when executed, carries out the above temporal boundary detection method.
The invention proposes a general, unified architecture to handle different types of class-free temporal boundary detection. Based on an attention mechanism and the semantic structure of the video, it compresses redundant video input into a stable and reliable feature representation, reducing model complexity, and from a global-context perspective it sparsely, efficiently and accurately outputs the positions and confidence scores of arbitrary class-free temporal boundaries.
Compared with the prior art, the invention has the following advantages.
The invention proposes a general class-free temporal boundary detection paradigm that, based on a transformer structure and attention mechanism, provides an efficient and unified method for detecting arbitrary class-free temporal boundaries.

The invention introduces a small set of learnable latent feature queries that act as anchors for compressing the redundant video input. Through the cross-attention mechanism, the latent queries compress the input into a fixed-size latent feature space while preserving the important boundary information, reducing the model's space-time complexity from quadratic in the input length to linear.

The invention constructs an effective composition of latent feature queries for temporal boundary detection, comprising boundary queries and context queries. To better exploit the semantic structure of boundaries and context in a video, the video features are likewise divided into boundary-region features and context-region features: the boundary queries extract the boundary-region features in a targeted manner, while the context queries cluster the context regions and compress the redundant context content into several context cluster centers.

The invention uses an alignment loss function to align the boundary queries one-to-one with the boundary-region features, which effectively lowers the training difficulty, shortens the convergence time, yields stable compressed features, and improves localization performance.

The invention uses a transformer decoder together with a one-to-one training label matching strategy, effectively exploiting global context information for boundary prediction and sparsely, efficiently generating more accurately localized class-free temporal boundary positions without complex post-processing.

The invention is general, efficient and accurate on temporal boundary detection tasks. Compared with existing methods, it achieves better prediction accuracy and faster inference on sub-action-level, event-level and scene-level class-free temporal boundary datasets, demonstrating the generalization ability of the model.
Brief Description of the Drawings

Fig. 1 is the system framework diagram used in the invention.

Fig. 2 is a schematic diagram of the frame extraction processing of a video according to the invention.

Fig. 3 is a schematic diagram of the encoder of the invention.

Fig. 4 is a schematic diagram of the decoder of the invention.

Fig. 5 is a schematic diagram of boundary proposal generation and submission in the invention.

Fig. 6 shows the model efficiency comparison of the invention on samples from the MovieNet dataset.

Fig. 7 shows the comparison of the invention with previous work on samples from the MovieNet dataset.

Fig. 8 shows the comparison of the invention with previous work on samples from the Kinetics-GEBD and TAPOS datasets.

Fig. 9 shows a visualization of the results of the invention on samples from the MovieNet dataset.

Fig. 10 is a schematic diagram of the overall flow of the invention.
Detailed Description of the Embodiments

The invention builds a class-free temporal boundary detection network to detect temporal boundaries in videos; the detection network comprises a backbone network and a detection model. The detection model of the invention is the Temporal Perceiver, a general class-free temporal boundary detection framework that introduces a small set of latent feature queries as anchors and compresses the redundant input to a fixed dimension through a cross-attention mechanism, thereby accomplishing the class-free temporal boundary detection task. The method comprises a sample generation stage, a network configuration stage, a training stage and a testing stage, as shown in Fig. 10, described in detail below.
1) Sample generation stage: a backbone network based on ResNet50 and temporal convolution layers generates samples for the training and test videos; a sample comprises video features F and continuity scores S. For each video, all of its frames are sampled at an interval of τ frames to obtain a video image sequence L_f = {f_1, ..., f_{N_f}}, and N_f video clips of length 2k frames are collected from L_f, where the i-th clip is the image sequence formed by the k consecutive frames before and after the i-th frame image; N_f is the length of the video image sequence and also the number of clips. Each clip image sequence L_{s,i} is fed into the backbone network and, after the convolution, pooling and fully connected layers with pre-trained and fine-tuned parameters, the D-dimensional RGB feature F_i and continuity score S_i of the i-th frame are output. The features and scores of the different clips are concatenated in temporal order to obtain the whole-video features F = {F_1, ..., F_{N_f}} and continuity scores S = {S_1, ..., S_{N_f}}. The sampling interval τ controls the temporal granularity of the global partition, and the clip length 2k controls the local receptive field of the features. To reduce time complexity while keeping more local information, this embodiment preferably uses τ = 3, k = 5 and D = 2048. The specific implementation is as follows.
denseflow is used to extract the frames of the original video, all frames are sampled at an interval of 3 to obtain a video image sequence L_f of length N_f, and the transforms package of the torchvision library is called to resize each frame to 224*224. N_f clips of length 2k = 10 frames are collected from L_f, where the i-th clip L_{s,i} consists of the 5 consecutive frames before and the 5 consecutive frames after the i-th frame; the sequence of all clips is denoted V_f. The clip L_{s,i} ∈ R^{10×224×224} is fed into the backbone: the ResNet50 network with pre-trained and fine-tuned parameters yields the intermediate features F_{mid,i} ∈ R^{10×2048}; F_{mid,i} then passes through a temporal convolution layer and a pooling layer to give the RGB feature F_i ∈ R^{2048} of the clip, and finally F_i is fed into a fully connected layer to obtain the continuity score S_i ∈ R. The features F_i and scores S_i of the different clips are concatenated in temporal order to obtain the features and continuity scores of the whole video. The video features and continuity scores are split by a sliding-window method into a series of segments fed to the model, with window length N_ws = 100 and no overlap between windows. The details are as follows:
1. The overall clip sequence obtained after frame extraction and sampling:

V_f = {L_{s,1}, L_{s,2}, ..., L_{s,N_f}}

L_{s,i} = {f_{i-5}, f_{i-4}, f_{i-3}, f_{i-2}, f_{i-1}, f_{i+1}, f_{i+2}, f_{i+3}, f_{i+4}, f_{i+5}}

where V_f denotes the clip sequence, consisting of N_f image sequence clips L_{s,i}, each of which contains 2k = 10 images.

2. The backbone network processes the input video image sequence as follows:

F_{mid,i} = Resnet50(L_{s,i})

F_i = MaxPooling(Tconv(F_{mid,i}))

S_i = FC(F_i)

where F_{mid,i} denotes the intermediate features obtained by passing the input clip through the ResNet50 network, F_i is the clip feature obtained after the temporal convolution of F_{mid,i}, S_i is the continuity score, F is the feature sequence obtained by concatenating the features of the different clips in temporal order, and S is the continuity score sequence concatenated in temporal order.
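A minimal PyTorch sketch of the backbone described above (per-clip ResNet50 features, a temporal convolution, max pooling and a fully connected scoring head); the class name, the temporal-convolution kernel size and the pooling operator are illustrative assumptions rather than details taken from the patent:

```python
import torch
import torch.nn as nn
import torchvision

class BackboneSketch(nn.Module):
    """Illustrative backbone: ResNet50 frame features -> temporal conv -> clip feature F_i and continuity score S_i."""
    def __init__(self, feat_dim=2048, clip_len=10):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier, keep 2048-d pooled features
        self.tconv = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)  # temporal convolution over the clip
        self.pool = nn.AdaptiveMaxPool1d(1)                       # collapse the clip's temporal axis
        self.score_fc = nn.Linear(feat_dim, 1)                    # continuity score head

    def forward(self, clip):                              # clip: (batch, 2k, 3, 224, 224)
        b, t = clip.shape[:2]
        x = self.cnn(clip.flatten(0, 1)).flatten(1)       # (batch*2k, 2048)
        x = x.view(b, t, -1).transpose(1, 2)              # (batch, 2048, 2k)
        f = self.pool(self.tconv(x)).squeeze(-1)          # F_i: (batch, 2048)
        s = self.score_fc(f).squeeze(-1)                  # S_i: (batch,)
        return f, s
```

Running this on a stack of clips yields the per-clip features F_i and continuity scores S_i that are then concatenated in temporal order into F and S.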
2) Network configuration stage: based on the transformer decoder structure and attention mechanism, a general class-free temporal boundary detection model, the Temporal Perceiver, is built with the following configuration.
2.1) Encoder: based on the continuity scores S generated in 1), the video features F generated in 1) are projected and sorted in descending order to obtain F_rerank; M learnable latent feature queries Q_e and compressed features H_0 initialized to 0 are introduced, and the transformer decoder structure compresses the re-ranked features into the M-frame compressed features H. The encoder E comprises N_e transformer decoding layers in series, each denoted Encoder_j, where j is the layer index, j ∈ [0, N_e - 1]. Encoder_j takes H_j as input and outputs H_{j+1}. Each Encoder_j contains a multi-head self-attention layer MSA_j, a multi-head cross-attention layer MCA_j, a linear mapping layer FFN_j, and three residual structures formed by addition and layer normalization.
The multi-head self-attention layer MSA_j takes a key parameter K, a query parameter Q and a value parameter V; in the self-attention mechanism the key and query are the same input. The key and query parameters are multiplied together and normalized with a Softmax function to obtain the weight matrix A_s, and the value parameter is multiplied by this weight matrix to produce the output. The multi-head structure splits each input into several parts along the parameters, feeds them to separate self-attention branches, and finally concatenates the results along the channel dimension. The multi-head cross-attention layer MCA_j also takes key, query and value parameters; the key and query are multiplied and Softmax-normalized to obtain the cross-attention weights A_c, and the value is multiplied by A_c to produce the output. The multi-head structure likewise splits the inputs into several parts, feeds them to separate cross-attention branches, and concatenates the results along the channel dimension.
This embodiment preferably uses N_e = 6 and M = 60, with 8 branches in the multi-head self-attention layer and 8 branches in the multi-head cross-attention layer. In the encoder, the key and query parameters of MSA_j are the sum of the latent queries Q_e and the compressed features H_j, the value parameter is H_j, the weight matrix is A_s ∈ R^{M×M}, and the output of MSA_j after its residual structure is denoted H'_j. The key parameter of MCA_j is the sum of the re-ranked video features and their position encoding, F_rerank + P_rerank, the value parameter is F_rerank, the query parameter is the sum of the MSA_j output H'_j and the latent queries Q_e, the weight matrix is A_c ∈ R^{N×M}, and the output of MCA_j after its residual structure is denoted H''_j. By stacking N_e = 6 encoding layers, the encoder E compresses and encodes the input features, and the output of the last encoding layer Encoder_5 is the compressed feature H = H_6 output by the encoder. The specific calculations are as follows.
1. Re-ranking transformation of the video features and position encoding:

F_rerank = sort(fc_proj(F), S)

P_rerank = sort(P, S)

where P is the position encoding formed by applying the sin function to the relative temporal positions corresponding to F, in one-to-one correspondence with F, and fc_proj is the linear projection layer used to map the video features from the input dimension D = 2048 to the model dimension D_model = 512 (the subscript proj abbreviates projection).
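A small sketch of this projection and re-ranking step under the notation above; the simplified sine-only position encoding and the helper name rerank_features are assumptions introduced for illustration:

```python
import torch
import torch.nn as nn

def rerank_features(F, S, d_model=512, fc_proj=None):
    """Project features to the model dimension and reorder them (with their positions P)
    in descending order of the continuity score S. Shapes: F (N, 2048), S (N,)."""
    if fc_proj is None:
        fc_proj = nn.Linear(F.shape[-1], d_model)       # fc_proj: 2048 -> 512
    N = F.shape[0]
    # simplified sinusoidal position encoding P, one vector per temporal position
    pos = torch.arange(N, dtype=torch.float32).unsqueeze(1)
    dim = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)
    P = torch.sin(pos / (10000 ** (dim / d_model)))
    order = torch.argsort(S, descending=True)           # indices realising sort(., S)
    F_rerank = fc_proj(F)[order]                         # F_rerank = sort(fc_proj(F), S)
    P_rerank = P[order]                                  # P_rerank = sort(P, S)
    return F_rerank, P_rerank, order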
2. The encoding process of an encoding layer Encoder_j in the encoder:

H'_j = LayerNorm(H_j + MSA_j(H_j, Q_e))

H''_j = LayerNorm(MCA_j(H'_j, Q_e, F_rerank, P_rerank) + H'_j)

H_{j+1} = LayerNorm(FFN_j(H''_j) + H''_j)

3. The encoding process of the multi-head self-attention layer MSA:

MSA(x, q) = Concat(SA_1(x_1, q_1), ..., SA_{N_h}(x_{N_h}, q_{N_h})) W_o

SA_h(x_h, q_h) = x_h W_{v,h} A_{s,h}

A_{s,h} = Softmax((x_h + q_h) W_{k,h} · (x_h + q_h) W_{q,h})

where N_h denotes the number of head branches in the multi-head self-attention layer, W_{k,h}, W_{q,h}, W_{v,h} and W_o are the projection matrices for keys, queries, values and output, the subscript h indexes the branches, and the projection matrix parameters are not shared across layers or branches. SA_h denotes a single-head self-attention layer, x_h and q_h are the slices of the feature x and its position encoding q along the channel dimension (there are N_h slices in total), and A_{s,h} is the self-attention matrix of the h-th branch.
4. The encoding process of the multi-head cross-attention layer MCA:

MCA(x, q_x, y, q_y) = Concat(CA_1(x_1, q_{x,1}, y_1, q_{y,1}), ..., CA_{N_h}(x_{N_h}, q_{x,N_h}, y_{N_h}, q_{y,N_h})) W_o

CA_h(x_h, q_{x,h}, y_h, q_{y,h}) = y_h W_{v,h} A_{c,h}

A_{c,h} = Softmax((y_h + q_{y,h}) W_{k,h} · (x_h + q_{x,h}) W_{q,h})

where W_{k,h}, W_{q,h}, W_{v,h} and W_o are the projection matrices for keys, queries, values and output; the projection matrix parameters are not shared across layers or branches, and the projection matrices of the cross-attention layer are not shared with those of the self-attention layer. CA_h denotes a single-head cross-attention layer, x_h, y_h, q_{x,h}, q_{y,h} are the slices of x, y and the corresponding position encodings q_x, q_y along the channel dimension, and A_{c,h} is the cross-attention matrix of the h-th branch.
5. The encoding process of the encoder:

H_{j+1} = Encoder_j(H_j; Q_e, F_rerank, P_rerank)

H = H_6 = Encoder_5(Encoder_4(Encoder_3(Encoder_2(Encoder_1(Encoder_0(H_0))))))
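A hedged sketch of one encoder compression layer following the equations above, built on torch.nn.MultiheadAttention (which internally realises the multi-head split and the W_k/W_q/W_v/W_o projections); the class name, the FFN width and the sequence-first tensor layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """One compression layer: self-attention over (H_j + Q_e), cross-attention from the latent
    queries onto the reranked video features, then an FFN, each with add + LayerNorm."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.ReLU(),
                                 nn.Linear(d_model * 4, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, H, Q_e, F_rerank, P_rerank):
        # H: (M, batch, d); Q_e broadcastable to H; F_rerank/P_rerank: (N, batch, d)
        q = k = H + Q_e
        H = self.norm1(H + self.self_attn(q, k, value=H)[0])
        H = self.norm2(H + self.cross_attn(query=H + Q_e,
                                           key=F_rerank + P_rerank,
                                           value=F_rerank)[0])
        H = self.norm3(H + self.ffn(H))
        return H
```

Stacking N_e = 6 such layers and feeding H_0 = 0 reproduces the chain H_{j+1} = Encoder_j(H_j; Q_e, F_rerank, P_rerank) given above.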
2.2) The invention introduces the latent feature queries Q_e into the encoder; they are generated as follows. To better exploit the semantic structure of the video, the latent queries Q_e are divided into M_b boundary queries and M_c context queries, with M_b = 48 and M_c = 12. The re-ranked features are divided into boundary-region features and context-region features, the first M_b features being the boundary-region features. This division is based on the continuity scores S: after the video features are re-ordered in descending order of S, the M_b features with the highest scores form the boundary-region features and the rest are context-region features. The latent queries Q_e are learnable parameters: as stated above, they are randomly initialized and generated by learning during model training. The first M_b queries are defined as boundary queries and are combined one-to-one with the boundary-region features during compression; the remaining M_c queries are context queries that cluster the context-region features during compression, yielding M_c cluster centers that represent the context information.
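A minimal sketch of how such learnable latent queries could be declared and split into M_b boundary queries plus M_c context queries; the class name LatentQueries and the initialization scale are assumptions for illustration, the sizes follow the preferred embodiment (M_b = 48, M_c = 12, D_model = 512):

```python
import torch
import torch.nn as nn

class LatentQueries(nn.Module):
    def __init__(self, m_boundary=48, m_context=12, d_model=512):
        super().__init__()
        # randomly initialized, learned during training of the detection model
        self.queries = nn.Parameter(torch.randn(m_boundary + m_context, d_model) * 0.02)
        self.m_boundary = m_boundary

    def split(self):
        # boundary queries pair one-to-one with the top-M_b (high continuity score)
        # boundary-region features; context queries cluster the remaining context features
        return self.queries[:self.m_boundary], self.queries[self.m_boundary:]
```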
2.3) Further, to improve the encoding of the video features by the latent queries Q_e, the latent queries are aligned with the video features during training. To accelerate model convergence and obtain more stable compressed features, the invention introduces an additional supervision constraint: an alignment loss computed on the cross-attention matrix of the last encoder layer. The loss takes the negative logarithm of the sum of the diagonal weights of the leading M_b × M_b block of the matrix; minimizing the loss maximizes the diagonal weights and thus enforces a one-to-one alignment between the boundary queries and the boundary-region features. The specific calculation is as follows.

The calculation of the alignment loss L_align:

L_align = -α_align · log( Σ_{m=1}^{M_b} A_c[m, m] )

where α_align = 1 is the weight of the alignment loss, and the alignment loss is computed only on the cross-attention layer of the last encoder layer.
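A sketch of this alignment loss, i.e. the negative log of the summed diagonal weights of the leading M_b × M_b block of the last-layer cross-attention matrix; the tensor layout (batch, queries, features) and the small epsilon for numerical stability are assumptions:

```python
import torch

def alignment_loss(cross_attn, m_boundary=48, alpha_align=1.0, eps=1e-6):
    """cross_attn: (batch, M, N) attention weights of the last encoder cross-attention layer,
    rows indexing latent queries and columns indexing reranked features."""
    diag = torch.diagonal(cross_attn[:, :m_boundary, :m_boundary], dim1=1, dim2=2)  # (batch, M_b)
    return -alpha_align * torch.log(diag.sum(dim=-1) + eps).mean()
```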
2.4) Decoder: for the compressed features H obtained in 2.1), N_p learnable proposal queries Q_d and a boundary proposal representation B_0 initialized to 0 are used to parse the temporal boundary points through the transformer decoder structure. The proposal queries Q_d, like the latent feature queries, are randomly initialized and learned during training. The decoder D comprises N_d transformer decoding layers; the j-th decoding layer is denoted Decoder_j, takes the boundary proposal representation B_j as input and outputs B_{j+1}. Each decoding layer contains a multi-head self-attention layer, a multi-head cross-attention layer, a linear mapping layer and three residual structures formed by addition and layer normalization. In the j-th layer, the proposal queries Q_d are added to the boundary proposals B_j, passed through the self-attention layer and a residual structure, interact with the compressed features H in the cross-attention layer, and are transformed by residual structure - linear mapping layer - residual structure to obtain the updated boundary proposals B_{j+1}. After the stacked N_d decoding layers, the compressed features are parsed into the temporal boundary proposal representations B.

This embodiment preferably uses N_d = 6 and N_p = 10, with 8 branches in the multi-head self-attention layer and 8 in the multi-head cross-attention layer. The key and query parameters of MSA_j are the sum of the proposal queries Q_d and the boundary proposals B_j, the value parameter is B_j, and the output of MSA_j after its residual structure is denoted B'_j. The key parameter of MCA_j is the sum H + Q_e of the compressed features H and the latent queries Q_e acting as the position encoding of the compressed features, the value parameter is H, the query parameter is the sum of the MSA_j output B'_j and the proposal queries Q_d, and the output of MCA_j after its residual structure is denoted B''_j. The decoder D is symmetric to the encoder E; by stacking N_d = 6 decoding layers it decodes the boundary proposal representations, and the output of the last decoding layer Decoder_5, B_6 = B, is taken as the final decoder output and fed into the fully connected branches to predict the boundary positions and confidences. The specific calculations are as follows.
1. The decoding process of a decoding layer Decoder_j in the decoder:

B'_j = LayerNorm(B_j + MSA_j(B_j, Q_d))

B''_j = LayerNorm(MCA_j(B'_j, Q_d, H, Q_e) + B'_j)

B_{j+1} = LayerNorm(FFN_j(B''_j) + B''_j)

2. The decoding process of the decoder:

B_{j+1} = Decoder_j(B_j; Q_d, H, Q_e)

B = B_6 = Decoder_5(Decoder_4(Decoder_3(Decoder_2(Decoder_1(Decoder_0(B_0))))))
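A brief sketch of the decoder stack. It reuses the EncoderLayerSketch class from the encoder sketch above, since the decoder layer follows the same self-attention / cross-attention / FFN pattern with (B_j, Q_d, H, Q_e) in place of (H_j, Q_e, F_rerank, P_rerank); the proposal count matches the preferred N_p = 10, everything else is illustrative:

```python
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    def __init__(self, n_layers=6, d_model=512, n_heads=8, n_proposals=10):
        super().__init__()
        self.layers = nn.ModuleList([EncoderLayerSketch(d_model, n_heads) for _ in range(n_layers)])
        self.Q_d = nn.Parameter(torch.randn(n_proposals, d_model) * 0.02)   # proposal queries

    def forward(self, H, Q_e):
        # H: (M, batch, d) compressed features; Q_e acts as their position encoding
        B = torch.zeros_like(self.Q_d).unsqueeze(1).expand(-1, H.shape[1], -1)  # B_0 = 0
        Q_d = self.Q_d.unsqueeze(1).expand_as(B)
        for layer in self.layers:
            B = layer(B, Q_d, H, Q_e)   # same (x, q_x, y, q_y) pattern as the encoder layer
        return B                         # temporal boundary proposal representations
```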
2.5) Generation and scoring of class-free temporal boundaries: the boundary proposal representations B obtained in 2.4) are fed into two different fully connected branches, a localization branch Head_loc and a classification branch Head_cls, which output the moments and confidence scores of the class-free temporal boundaries, respectively. A predicted boundary moment is a decimal between 0 and 1 giving the relative position within the current window; the confidence consists of two scores, a positive-class confidence and a negative-class confidence, where a higher score means a higher probability of the corresponding class. The classification branch consists of one fully connected layer with input and output feature dimensions 512 and 2; the localization branch consists of a multilayer perceptron of three fully connected layers followed by a Sigmoid activation, with input/output dimensions 512/512, 512/512 and 512/1. The specific calculations are as follows.
1. Prediction of the class-free boundary moment t:

t = sigmoid(fc_2(fc_1(fc_0(B))))

where fc_0, fc_1 and fc_2 denote the three fully connected layers of the localization branch, with input/output dimensions 512/512, 512/512 and 512/1, respectively.
2. Generation of the binary classification confidence score p_pos:

p_pos, p_neg = fc(B)

where fc denotes the fully connected layer of the confidence branch, the positive-class score p_pos is generally taken as the confidence score, and the input and output dimensions are 512 and 2.
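A sketch of the two prediction branches with the stated dimensions (512 -> 512 -> 512 -> 1 with a final sigmoid for localization, 512 -> 2 for classification); the ReLU activations between the localization layers are an assumption, since the patent only specifies the fully connected layers and the sigmoid:

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1), nn.Sigmoid())   # relative boundary moment in [0, 1]
        self.cls = nn.Linear(d_model, 2)           # (positive, negative) confidence logits

    def forward(self, B):                 # B: (N_p, batch, d_model) boundary proposals
        t = self.loc(B).squeeze(-1)       # predicted boundary moments
        logits = self.cls(B)              # p_pos, p_neg after a softmax over the last dim
        return t, logits
```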
2.6) Assignment of training labels: a strict one-to-one training label matching strategy is adopted. According to the defined matching cost C, the Hungarian algorithm yields a set of optimal one-to-one matches; every prediction assigned to a class-free boundary ground truth receives a positive label, with the corresponding ground-truth boundary as its training target. The matching cost C consists of a localization cost and a classification cost: the localization cost is defined by the absolute distance between the predicted moment and the ground-truth boundary moment, and the classification cost is defined by the predicted confidence. The specific calculations are as follows.
1. The optimization objective of the Hungarian algorithm:

C = α_loc · L_loc + α_cls · L_cls

The optimization objective is denoted C; it has a localization component and a classification component, each with a corresponding weight, denoted α_loc and α_cls. This embodiment preferably uses α_loc = 5 and α_cls = 1.
2. Definition of the optimization components:

L_loc,n = | t_n - t_{σ(n)} |

L_cls,n = -p_pos,n

Among the optimization components, the localization component L_loc,n of the n-th prediction is measured by the absolute distance between the predicted boundary moment t_n and the position of the corresponding ground-truth boundary; the classification component L_cls,n of the n-th prediction is measured by the predicted confidence p_pos,n that this moment is a boundary, taken with a negative sign because C is a minimization objective. σ(·) is a mapping from predictions to ground-truth boundaries, and σ(n) is the ground-truth boundary corresponding to the n-th prediction.
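A sketch of the one-to-one assignment for a single window, building the cost C = α_loc·L_loc + α_cls·L_cls and solving it with scipy's Hungarian solver; the function name and exact tensor shapes are assumptions:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_proposals(t_pred, p_pos, t_gt, alpha_loc=5.0, alpha_cls=1.0):
    """t_pred, p_pos: (N_p,) predicted moments and positive-class confidences;
    t_gt: (G,) ground-truth boundary moments. Returns matched (prediction, gt) index pairs."""
    loc_cost = torch.abs(t_pred[:, None] - t_gt[None, :])      # (N_p, G) |t_n - t_gt|
    cls_cost = -p_pos[:, None].expand_as(loc_cost)             # (N_p, G) -p_pos,n
    C = alpha_loc * loc_cost + alpha_cls * cls_cost
    pred_idx, gt_idx = linear_sum_assignment(C.detach().cpu().numpy())
    return pred_idx, gt_idx   # matched predictions get positive labels, the rest are negatives
```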
2.7) Submission of class-free temporal boundaries: after a series of class-free temporal boundaries is generated, the most credible boundary moments are selected through the confidence score threshold γ and submitted for subsequent performance measurement. When p_pos,n ≥ γ, the predicted position is submitted as a result; when p_pos,n < γ, the predicted position is discarded. This embodiment preferably uses γ = 0.9.
3) Training stage: the configured model is trained on the training samples, using cross entropy and L1 distance as loss functions on the final outputs and the log function as the loss on the intermediate results, with the AdamW optimizer; the network parameters are updated by back-propagation, and steps 1) and 2) are repeated until the number of iterations is reached.
4) Testing stage: the video feature sequence and continuity scores of the data to be tested are input into the trained Temporal Perceiver model to generate class-free temporal boundary moments and scores, and the method of 2.5) then yields the class-free temporal boundary moment sequence used for performance measurement.
The invention proposes the Temporal Perceiver model, a general class-free temporal boundary detection framework, which is further explained below through a specific embodiment. After training and testing on the TAPOS, Kinetics-GEBD and MovieNet/MovieScenes datasets, it achieves high inference speed and high accuracy; the implementation preferably uses the Python 3.8.8 programming language and the PyTorch 1.7.0 deep learning framework.
Fig. 1 shows the system framework used by the invention; the specific implementation steps are as follows.
1) Preparation stage for sample generation, as shown in Fig. 2; training data and test data are processed in the same way. denseflow is used to extract the frames of the video, the frame sequence is sampled at an interval of τ = 3, torchvision's transforms package scales each frame to 224*224, and the frames are finally converted to tensors and normalized. Centred on each frame of the sampled sequence, the k = 5 frames before and after it are fed into the backbone network to obtain the feature and continuity score of that frame, and the results are concatenated along the temporal dimension to obtain the video features and video continuity scores. The video-level features and continuity scores are split into a series of video windows of equal length and fed to the model.
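A hedged preprocessing sketch under the settings above (τ = 3, k = 5, 224×224 frames, non-overlapping windows of 100); denseflow frame decoding is assumed to have produced a list of PIL images, and the ImageNet normalization statistics and the clamping of clips at the video boundaries are assumptions not stated in the patent:

```python
import torch
from torchvision import transforms

frame_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # ImageNet statistics assumed for normalization
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def make_clips(frames, tau=3, k=5):
    """Sample every tau-th frame, then build a 2k-frame clip around each sampled frame."""
    sampled = [frame_tf(f) for f in frames[::tau]]
    clips = []
    for i in range(len(sampled)):
        left = [sampled[max(i - j, 0)] for j in range(k, 0, -1)]            # k frames before
        right = [sampled[min(i + j, len(sampled) - 1)] for j in range(1, k + 1)]  # k frames after
        clips.append(torch.stack(left + right))          # (2k, 3, 224, 224)
    return torch.stack(clips)                             # (N_f, 2k, 3, 224, 224)

def make_windows(F, S, window=100):
    """Split per-frame features and continuity scores into non-overlapping windows."""
    return [(F[i:i + window], S[i:i + window]) for i in range(0, F.shape[0], window)]
```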
2) Model configuration stage: first, scoring by continuity, as shown in Fig. 3. For the extracted video features, the program projects them to a lower feature dimension and sorts them, together with the corresponding position encodings, in descending order of the continuity scores before feeding them to the encoder. The encoder consists of alternating multi-head self-attention modules, multi-head cross-attention modules, linear mapping layers and residual structures with addition and layer normalization. A set of learnable latent feature queries, comprising boundary queries and context queries, is fed into the encoder. The compressed features are initialized to 0, participate in the computation of the encoding layers, and accumulate the features that each encoding layer extracts from the input features. The latent queries can be regarded as the position encoding of the compressed features; the compressed features and this position encoding are added to obtain positional information, which participates in the attention computation.

First, the sum of the compressed features and the latent queries is fed into the self-attention layer as key and query, with the compressed features alone as the value; the self-attention layer models the relations among the compressed features through global self-attention and extracts mutual information to update them. The updated compressed features are added to the pre-update features and layer-normalized to obtain an intermediate representation. This intermediate representation is then added again to the latent queries (acting as position encoding) and fed to the cross-attention module as the query; the input video features serve as the value and, after being added to the corresponding position encoding, as the key. The cross-attention layer extracts boundary features from the sorted video features and clusters the context features: it computes dot-product activations over the input video sequence from the compressed features and latent queries to extract useful video features and update the compressed features, achieving feature distillation while reducing complexity. Finally, the result of the cross-attention layer passes through a residual structure, the projection of the linear mapping layer and another residual structure, yielding the accumulated, updated compressed features.
The step of decoding the encoded features to obtain the final result representation, i.e. step 2.4) above, is shown in Fig. 4. The input proposal queries and boundary proposals are fed into the multi-head self-attention layer to strengthen the proposal representations, and then into the cross-attention layer. The learned latent queries serve as the position encoding of the compressed features and participate in the cross-attention computation. The cross-attention layer extracts boundary position information from the compressed features: the cross-attention matrix obtained by multiplying the compressed features with the boundary proposals gives a weight for every temporal position of the compressed features, and the corresponding features are extracted accordingly. The boundary proposals accumulate the extracted boundary representations layer by layer, finally yielding the decoding result. Between the self-attention and cross-attention layers there is one additive, normalized residual transformation; after the cross-attention layer there are a second residual transformation, a linear projection layer and a third residual transformation.
Decoding and submission of the temporal boundary proposals are shown in Fig. 5. The boundary proposal representations are fed into the localization and classification branches, each realised by fully connected layers, to obtain the position and the confidence score. The localization branch comprises three fully connected (fc) layers and a sigmoid activation, finally giving the position; the classification branch is a binary classifier comprising one fc layer, giving the confidence score of the current predicted position. The confidence scores are filtered: if a score exceeds the threshold γ = 0.9, the prediction is submitted as a final prediction.
3) Training stage: this embodiment uses cross entropy, L1 distance and the negative log function as loss functions, uses the AdamW optimizer, and sets the batch size to 64, i.e. 64 window samples are drawn from the training set per iteration; the total number of training epochs is set to 100, the initial learning rate is 2e-4 with no decay schedule, and the model is trained on one NVIDIA RTX 2080ti GPU. The process from raw video to temporal boundary results has two stages: in the first stage, the backbone network is fine-tuned on the dataset from pre-trained parameters to obtain the video features and continuity scores; in the second stage, the Temporal Perceiver is trained and tested. In the positive/negative sample assignment stage, the model adopts the strict one-to-one matching strategy, reducing false positives and enabling sparse, efficient prediction of class-free temporal boundaries.
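A hedged sketch of this training setup (AdamW, learning rate 2e-4, 100 epochs, no decay). It is shown per window for clarity; the model interface (returning predictions, logits and the last cross-attention map), the positive-class index 0 and the unit loss weights are illustrative assumptions, and match_proposals / alignment_loss refer to the sketches above:

```python
import torch
import torch.nn as nn

def train(model, train_loader, device="cuda", epochs=100):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()
    model.to(device).train()
    for _ in range(epochs):
        for F, S, t_gt in train_loader:        # window features, continuity scores, GT boundary times
            F, S, t_gt = F.to(device), S.to(device), t_gt.to(device)
            # assumed interface: predictions (N_p,), logits (N_p, 2), cross_attn (1, M, N)
            t_pred, logits, cross_attn = model(F, S)
            p_pos = logits.softmax(-1)[..., 0]                 # index 0 assumed to be the boundary class
            pred_idx, gt_idx = match_proposals(t_pred, p_pos, t_gt)
            labels = torch.ones(t_pred.shape[0], dtype=torch.long, device=device)  # 1 = background
            labels[pred_idx] = 0                                                   # 0 = boundary (positive)
            loss = ce(logits, labels) \
                 + l1(t_pred[pred_idx], t_gt[gt_idx]) \
                 + alignment_loss(cross_attn)                  # intermediate-result log loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```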
4) Test phase
The test set is preprocessed in the same way as the training data: after frame sampling, the frames are resized to 224×224, and RGB features are extracted with the ResNet50-based backbone network. The evaluation metrics differ across the datasets of the different tasks: for sub-action and event-level class-agnostic temporal boundaries, the F1 score under different relative distances (Rel.Dis.) is used, while scene-level class-agnostic temporal boundaries are evaluated with AP and Miou. The F1 score is computed from recall and precision, where recall is the proportion of ground-truth boundaries that are correctly predicted and precision is the proportion of predictions that are correct; a prediction counts as correct if its error lies within the relative-distance threshold. AP (Average Precision) is the mean of the precision values obtained as the recall varies from 0 to 1. Miou is the weighted sum of the intersection-over-union between predicted and ground-truth scenes, weighted by the ground-truth scene length. Both AP and Miou require a prediction to coincide with the ground-truth position to be counted as correct.
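As an illustration of how an F1 score under a relative-distance threshold is typically computed, the sketch below counts a prediction as correct when it lies within rel_dis × duration of a still-unmatched ground-truth boundary; the exact matching rule used by each benchmark may differ in detail.

```python
import numpy as np

def f1_at_rel_dis(preds, gts, duration, rel_dis=0.05):
    """F1 score under a relative-distance threshold (a sketch).
    preds, gts: boundary timestamps in seconds; duration: video length in seconds."""
    if len(preds) == 0 or len(gts) == 0:
        return 0.0
    threshold = rel_dis * duration
    gts = np.asarray(gts, dtype=float)
    used = np.zeros(len(gts), dtype=bool)        # each ground truth may be matched once
    tp = 0
    for p in sorted(preds):
        dists = np.abs(gts - p)
        dists[used] = np.inf
        j = int(np.argmin(dists))
        if dists[j] <= threshold:                # prediction error within the relative distance
            used[j] = True
            tp += 1
    precision = tp / len(preds)                  # correct predictions over all predictions
    recall = tp / len(gts)                       # correct predictions over all ground truths
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```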
On three randomly selected videos from the MovieNet dataset, as shown in Figure 6, Temporal Perceiver achieves roughly 7× faster scene inference (scenes per second) and nearly 200× fewer floating-point operations than the classic LGSS approach, reflecting the advantages of sparse prediction without a post-processing module. Compared with a Transformer variant, Temporal Perceiver also achieves faster scene inference and fewer floating-point operations, demonstrating the contribution of feature compression to a lightweight model and further confirming its efficiency. In terms of prediction accuracy, Temporal Perceiver improves substantially over prior work on all metrics across all datasets, demonstrating the generality and generalization of the model: as shown in Figure 7, on the MovieScenes dataset it surpasses the classic LGSS approach by nearly 3% on both AP and Miou; as shown in Figure 8, on the Kinetics-GEBD and TAPOS datasets it outperforms the previous state-of-the-art method PC on all f1@Rel.Dis. metrics, exceeding PC by 12.6% on f1@0.05 on Kinetics-GEBD and by 9% on the average F1 score on TAPOS, showing that its predictions are more flexible and accurate. A more detailed visualization of predictions on the MovieNet dataset is shown in Figure 9: Temporal Perceiver avoids false positives near the ground-truth boundaries and accurately predicts the ground truth of each scene.