CN112488063B - Video statement positioning method based on multi-stage aggregation Transformer model - Google Patents

Video statement positioning method based on multi-stage aggregation Transformer model

Info

Publication number
CN112488063B
CN112488063B (application CN202011508292.1A)
Authority
CN
China
Prior art keywords
video
stage
sentence
sequence
word
Prior art date
Legal status
Active
Application number
CN202011508292.1A
Other languages
Chinese (zh)
Other versions
CN112488063A (en)
Inventor
杨阳
张明星
Current Assignee
Guizhou University
Original Assignee
Guizhou University
Priority date
Filing date
Publication date
Application filed by Guizhou University
Priority to CN202011508292.1A
Publication of CN112488063A
Application granted
Publication of CN112488063B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 20/00 Scenes; Scene-specific elements
            • G06V 20/40 Scenes; Scene-specific elements in video content
              • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F 18/22 Matching criteria, e.g. proximity measures
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
                • G06N 3/047 Probabilistic or stochastic networks
              • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video sentence positioning method based on a multi-stage aggregation Transformer model. Through multi-layer stacking, the resulting joint video sentence representation captures rich visual-linguistic cues and enables finer-grained matching. In the multi-stage aggregation module, the stage feature of the start stage, the stage feature of the intermediate stage and the stage feature of the end stage are concatenated to form the feature representation of a candidate segment. Because the obtained feature representation captures stage-specific information, it is well suited to accurately locating the start position and end position of a video segment. The two modules are integrated into an effective and efficient network that improves the accuracy of video sentence positioning.

Description

Video statement positioning method based on multi-stage aggregation Transformer model
Technical Field
The invention belongs to the technical field of video sentence positioning and retrieval, and particularly relates to a video sentence positioning method based on a multi-stage aggregation Transformer model.
Background
Video localization is a fundamental problem in computer vision and has wide application. Over the past decade, a great deal of research has gone into video action localization and related application systems. In recent years, with the growth of multimedia data and the diversification of user needs, the problem of locating sentences in video (video sentence positioning) has become increasingly important. The goal of video sentence positioning is to locate, within a long video, the video segment that corresponds to a query sentence. Compared with video action localization, sentence positioning poses greater challenges and has broader application prospects, such as video retrieval, automatic video captioning and intelligent human-machine interaction.
Video sentence localization is a challenging task. In addition to understanding the content of the video, the semantics of the video and the sentence must be matched.
Existing video sentence positioning methods can generally be divided into two categories: one-stage and two-stage methods. A one-stage method takes the video and the query sentence as input and directly predicts the start and end points of the queried video segment, directly generating the segment associated with the query. One-stage methods can be trained end to end, but they easily miss some correct video segments. Two-stage methods, in contrast, follow a candidate-segment generation and candidate-segment ranking pipeline: they first generate a series of candidate segments from the video and then rank the candidates according to their degree of match with the query sentence. Many approaches follow this route. Although two-stage methods can recall many possibly correct candidate video segments, several key problems remain unsolved:
1) How to efficiently perform fine-grained semantic matching between videos and sentences?
2) How to accurately locate the start and end points of a video segment matching a sentence in the original long video?
For the first problem, most existing methods process the video and the sentence sequence separately and then match them. However, processing them separately, for example encoding the sentence into a single vector and then matching, inevitably loses some detailed semantic content of the sentence, so fine-grained matching cannot be achieved.
For the second problem, existing methods typically use full convolution, average pooling or RoI pooling operations to obtain the feature representation of a candidate segment. However, the features obtained by these operations are not sufficiently discriminative in time. For example, a video segment usually contains several different stages, such as a start stage, an intermediate stage and an end stage. The information of these stages is very important for precisely locating the start and end points of the segment. Average pooling completely discards the stage information and cannot match the different stages exactly, so accurate localization cannot be achieved. Although full convolution or RoI pooling can delineate different stages to some extent, they do not rely on explicit stage-specific features and are therefore also limited for more accurate localization.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a video sentence positioning method based on a multi-stage aggregation Transformer model so as to improve the accuracy of video sentence positioning.
In order to achieve the above object, the present invention provides a video sentence positioning method based on a multi-stage aggregation Transformer model, comprising the following steps:
(1) video slice feature and word feature extraction
Uniformly dividing a video into N time points according to time; collecting a video slice (composed of consecutive multi-frame images, e.g. 50 frames) at each time point; performing feature extraction on each video slice to obtain a slice feature (N slice features in total); and placing the N slice features in temporal order to form a video feature sequence;
converting each word of the sentence into a word vector (Doc2Vec) to obtain word features, and then placing the word features in their order in the sentence to form a sentence feature sequence;
mapping the slice features in the video feature sequence and the word features in the sentence feature sequence to the same dimension, yielding a video feature sequence whose i-th element is the slice feature of the i-th video slice and a sentence feature sequence whose j-th element is the word feature of the j-th word of the sentence;
(2) Constructing a video sentence Transformer model, and calculating the video feature sequence and the sentence feature sequence
A D-layer video sentence Transformer model is constructed whose d-th layer, d = 1, 2, ..., D, updates the video features and the sentence features according to formula (1) (reproduced as an image in the original publication), where V and L denote the video and the sentence respectively, Q, K and W are learnable parameters whose different subscripts denote different parameter groups, and Att(·) is an attention calculation function;
the video feature sequence and the sentence feature sequence obtained in step (1) are fed as input to the video sentence Transformer model, yielding the D-th-layer output video feature sequence and sentence feature sequence;
(3) Constructing a multi-stage aggregation module, and calculating the stage feature sequences and prediction score sequences of three stages
The stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage are calculated, where the start-stage sequence r^sta consists of the stage features r_i^sta of the N slices, i = 1, 2, ..., N, the intermediate-stage sequence r^mid consists of the stage features r_i^mid of the N slices, the end-stage sequence r^end consists of the stage features r_i^end of the N slices, and MLP_1^sta, MLP_1^mid and MLP_1^end are the multi-layer perceptrons (MLPs) used to calculate the stage feature sequences of the three stages respectively;
computing a sequence of prediction scores p for a beginning stage, an intermediate stage and an end stagesta、pmid、pend
Figure BDA0002845561540000041
Wherein the fractional sequence p is predicted at the beginningstaPrediction score from N slices
Figure BDA0002845561540000042
Composition, intermediate stage prediction score sequence pmidPrediction score from N slices
Figure BDA0002845561540000043
Composition, end stage prediction fraction orderColumn pendPrediction score from N slices
Figure BDA0002845561540000044
The components of the composition are as follows,
Figure BDA0002845561540000045
Figure BDA0002845561540000046
the multilayer perceptron is used for calculating the prediction score sequences of the three stages;
(4) Training the multi-stage aggregation Transformer model
The video sentence Transformer model and the multi-stage aggregation module form the multi-stage aggregation Transformer model;
a video sentence training data set is constructed in which each data item comprises a video, a sentence, and the video slice start position and end position of the video segment located by the sentence;
a data item is extracted from the video sentence training data set, one word of the sentence is randomly masked and replaced with the token "[MASK]", the video and the sentence are processed according to steps (1) to (3), and the true scores of the start stage, the intermediate stage and the end stage of each video slice are calculated; these true scores follow unnormalized two-dimensional Gaussian distributions whose standard deviations σ^sta, σ^mid and σ^end are controlled by the positive scalars α^sta, α^mid and α^end;
4.1) the weighted cross-entropy loss L_stage on the prediction layer is calculated between the predicted score sequences and the true score sequences of the three stages;
4.2) for the z-th candidate segment, the predicted values of its video slice start position and end position and its matching score predicted value are calculated from the concatenation of the stage features taken from r^sta, r^mid and r^end at the candidate's start, middle and end positions;
4.3) the boundary regression loss L_regress is calculated from the predicted values of step 4.2) and the video slice start position and end position of the z-th candidate segment, where Z is the total number of candidate segments;
4.4) the matching score weighted cross-entropy loss L_match is calculated, where y_z is the degree of overlap between the z-th candidate segment and the video segment located by the sentence, i.e. the video segment from the ground-truth start position to the ground-truth end position;
4.5) the cross-entropy loss L_word of masked word prediction is calculated as L_word = -log p_mask, where p_mask is the probability of the masked word predicted from the sentence feature sequence;
4.6) the loss L_total of the whole network for training the multi-stage aggregation Transformer model is calculated as L_total = L_stage + L_regress + L_match + L_word;
4.7) Updating the parameters of the whole network
Data items are taken from the video sentence training data set one by one and the parameters of the whole network are updated according to the loss L_total until the video sentence training data set is exhausted, yielding the trained multi-stage aggregation Transformer model;
(5) Video sentence positioning
A video and a complete query sentence without any masked word are input and processed according to steps (1) to (3); according to step 4.2), the matching score predicted value of each candidate segment and the predicted values of its video slice start and end positions are calculated, forming a new candidate segment; the new candidate segments are sorted by matching score from high to low, new candidate segments whose overlap exceeds 70% are removed using non-maximum suppression (NMS), and the top 1 or top 5 new candidate segments are returned as the finally located video segments.
The object of the invention is thus achieved.
To address the problems of existing methods, the invention constructs a multi-stage aggregation Transformer model for the video sentence positioning network. The multi-stage aggregation Transformer model consists of two parts: a video sentence Transformer model and a multi-stage aggregation module placed on top of it. In the video sentence Transformer model, the single BERT framework is retained, but the BERT parameters are decoupled into different groups to process the video and sentence information separately. The video sentence Transformer model thus models the two modalities, video and sentence, more effectively while maintaining the compactness and efficiency of a single BERT structure. In the video sentence Transformer model, each video slice or word can adaptively aggregate and align information from all other video slices and words of both modalities according to its own semantics. Through multi-layer stacking, the final joint video sentence representation captures rich visual-linguistic cues and enables finer-grained matching. In the multi-stage aggregation module, three stage features corresponding to different stages, namely the stage features of the start stage, the intermediate stage and the end stage, are computed for each video slice. Then, for a given candidate segment, the stage feature of the start stage, the stage feature of the intermediate stage and the stage feature of the end stage are concatenated to form the feature representation of the candidate segment. Because the obtained feature representation captures stage-specific information, it is well suited to accurately locating the start position and end position of the video segment. The two modules are integrated into an effective and efficient network, improving the accuracy of video sentence positioning.
Drawings
FIG. 1 is a flowchart of an embodiment of the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention;
FIG. 2 is a schematic view of a video slice;
FIG. 3 is a schematic diagram illustrating an embodiment of the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention;
FIG. 4 is a flow diagram of one embodiment of video sentence positioning.
Detailed Description
The following description of embodiments of the invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
FIG. 1 is a flowchart of an embodiment of the video sentence positioning method based on a multi-stage aggregation Transformer model according to the present invention.
In this embodiment, as shown in FIG. 1, the video sentence positioning method based on the multi-stage aggregation Transformer model includes the following steps:
step S1: video slice feature and word feature extraction
In this embodiment, as shown in FIG. 2, the video is uniformly divided into N time points according to time; at each time point a video slice (composed of consecutive multi-frame images, e.g. 50 frames) is collected; feature extraction is performed on each video slice to obtain a slice feature (N slice features in total); and the N slice features are placed in temporal order to form the video feature sequence.
Each word of the sentence is converted into a word vector (Doc2Vec) to obtain word features, and the word features are then placed in their order in the sentence to form the sentence feature sequence.
The slice features in the video feature sequence and the word features in the sentence feature sequence are mapped to the same dimension, yielding a video feature sequence whose i-th element is the slice feature of the i-th video slice and a sentence feature sequence whose j-th element is the word feature of the j-th word of the sentence.
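A minimal sketch of this step, assuming pre-extracted C3D slice features and Doc2Vec word vectors; the class name, input dimensions and single linear projections are illustrative assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn

class FeatureProjection(nn.Module):
    """Maps pre-extracted video-slice features and word features to a common dimension."""
    def __init__(self, video_dim=4096, word_dim=300, model_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, model_dim)   # e.g. C3D slice features
        self.word_proj = nn.Linear(word_dim, model_dim)     # e.g. Doc2Vec word vectors

    def forward(self, slice_feats, word_feats):
        # slice_feats: (N, video_dim) placed in temporal order
        # word_feats:  (M, word_dim) placed in sentence order
        return self.video_proj(slice_feats), self.word_proj(word_feats)
```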
Step S2: constructing the video sentence Transformer model, and calculating the video feature sequence and the sentence feature sequence
Most existing methods process the video and sentence sequences separately and then match them. However, processing the two sequences separately, for example encoding the sentence into a single vector and then matching, inevitably loses some detailed semantic content of the sentence, so fine-grained matching cannot be achieved. To solve this problem, as shown in FIG. 3, the invention constructs a new video sentence Transformer model as the backbone network. Its d-th layer, d = 1, 2, ..., D, updates the video features and the sentence features according to formula (1), where V and L denote the video and the sentence respectively, Q, K and W are learnable parameters whose different subscripts denote different parameter groups, and Att(·) is an attention calculation function.
The video feature sequence and the sentence feature sequence obtained in step S1 are fed as input to the video sentence Transformer model and computed layer by layer according to formula (1), yielding the D-th-layer output video feature sequence and sentence feature sequence.
Compared with the single BERT model widely used in the prior art, the multi-stage aggregation Transformer model does not change the structure or introduce additional computation, but uses different parameters to process the content of different modalities, thereby maintaining the compactness and efficiency of the model while improving its multi-modal modeling capability. Meanwhile, the multi-stage aggregation Transformer model also differs from other multi-modal BERT models, which use two BERT streams to process the content of different modalities. Models based on two BERT streams introduce an additional cross-modal layer to realize multi-modal interaction, whereas the multi-stage aggregation Transformer model keeps the same structure as the original BERT model and is more compact and efficient.
The multi-stage aggregation Transformer model consists of multiple layers of the computation in formula (1). After multi-layer stacking, the obtained joint video sentence representation has a rich ability to aggregate and align visual-linguistic cues. Each slice in the video can interact with each word of the query sentence, enabling more detailed and accurate video sentence matching, which is essential for accurate localization.
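As an illustration of the decoupled-parameter idea, the sketch below computes a single joint attention over the concatenated video and sentence tokens while using separate query/key/value and output projections for each modality. It is a single-head simplification under assumed dimensions, not a reproduction of formula (1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDecoupledLayer(nn.Module):
    """One joint-attention layer over [video; sentence] tokens with modality-specific projections."""
    def __init__(self, dim=512):
        super().__init__()
        self.dim = dim
        # separate learnable parameter groups for video (V) and language (L)
        self.qkv_v = nn.Linear(dim, 3 * dim)
        self.qkv_l = nn.Linear(dim, 3 * dim)
        self.out_v = nn.Linear(dim, dim)
        self.out_l = nn.Linear(dim, dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, video, sent):
        # video: (N, dim) slice features, sent: (M, dim) word features
        n = video.size(0)
        qv, kv, vv = self.qkv_v(video).chunk(3, dim=-1)
        ql, kl, vl = self.qkv_l(sent).chunk(3, dim=-1)
        q = torch.cat([qv, ql], 0)                  # every token attends over
        k = torch.cat([kv, kl], 0)                  # both modalities jointly
        v = torch.cat([vv, vl], 0)
        att = F.softmax(q @ k.t() / self.dim ** 0.5, dim=-1)
        ctx = att @ v
        video = self.norm_v(video + self.out_v(ctx[:n]))
        sent = self.norm_l(sent + self.out_l(ctx[n:]))
        return video, sent
```

Stacking D such layers keeps the single-stream structure while letting each modality keep its own parameter group, which is the compactness argument made above.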
Step S3: constructing the multi-stage aggregation module, and calculating the stage feature sequences and prediction score sequences of the three stages
After the video sentence Transformer model, the obtained joint video sentence representation, i.e. the output video feature sequence and sentence feature sequence, carries richer information and supports finer matching. However, to overcome the problem that existing methods ignore the different stages contained in a video segment, the invention provides a multi-stage aggregation module on top of the video sentence Transformer model so as to accurately locate the start position and end position of the queried video segment. In the multi-stage aggregation module, the invention computes, for each video slice in the video sequence, stage features corresponding to three different temporal stages, namely the stage features of the start stage, the intermediate stage and the end stage. In order to improve the discriminability of the different stage features, the invention adds a prediction layer on top of the stage features to predict the scores of the start stage, the intermediate stage and the end stage respectively.
In this embodiment, as shown in FIG. 3, the multi-stage aggregation module computes the stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage, where the start-stage sequence r^sta consists of the stage features r_i^sta of the N slices, i = 1, 2, ..., N, the intermediate-stage sequence r^mid consists of the stage features r_i^mid of the N slices, the end-stage sequence r^end consists of the stage features r_i^end of the N slices, and MLP_1^sta, MLP_1^mid and MLP_1^end are the multi-layer perceptrons (MLPs) that compute the stage feature sequences of the three stages respectively.
The prediction score sequences p^sta, p^mid and p^end of the start stage, the intermediate stage and the end stage are then computed, where the start-stage sequence p^sta consists of the prediction scores of the N slices, the intermediate-stage sequence p^mid consists of the prediction scores of the N slices, the end-stage sequence p^end consists of the prediction scores of the N slices, and the corresponding multi-layer perceptrons compute the prediction score sequences of the three stages respectively.
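A sketch of the multi-stage aggregation head, assuming each MLP is applied independently to every slice feature output by the last Transformer layer; the hidden sizes, the sigmoid on the scores and the class names are assumptions:

```python
import torch.nn as nn

def make_mlp(in_dim=512, hidden=512, out_dim=512):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class MultiStageAggregation(nn.Module):
    """Per-slice stage features and stage scores for the start / middle / end stages."""
    def __init__(self, dim=512):
        super().__init__()
        # MLP_1^{sta/mid/end}: stage feature heads
        self.feat_heads = nn.ModuleDict({s: make_mlp(dim, dim, dim) for s in ("sta", "mid", "end")})
        # prediction-layer heads producing one score per slice and stage
        self.score_heads = nn.ModuleDict({s: make_mlp(dim, dim, 1) for s in ("sta", "mid", "end")})

    def forward(self, video_feats):
        # video_feats: (N, dim) slice features from the last Transformer layer
        feats = {s: self.feat_heads[s](video_feats) for s in ("sta", "mid", "end")}      # r^sta, r^mid, r^end
        scores = {s: self.score_heads[s](feats[s]).squeeze(-1).sigmoid()                 # p^sta, p^mid, p^end
                  for s in ("sta", "mid", "end")}
        return feats, scores
```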
Step S4: training the multi-stage aggregation Transformer model
The video sentence Transformer model and the multi-stage aggregation module form the multi-stage aggregation Transformer model.
A video sentence training data set is constructed in which each data item comprises a video, a sentence, and the video slice start position and end position of the video segment located by the sentence.
A data item is extracted from the video sentence training data set, one word of the sentence is randomly masked and replaced with the token "[MASK]", the video and the sentence are processed according to steps S1 to S3, and the true scores of the start stage, the intermediate stage and the end stage of each video slice are calculated. These true scores follow unnormalized two-dimensional Gaussian distributions whose standard deviations σ^sta, σ^mid and σ^end are controlled by the positive scalars α^sta, α^mid and α^end; the larger the value of α^sta / α^mid / α^end, the higher the scores of video slices near the start / middle / end position of the video segment.
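The exact label formula is not reproduced in this text; the sketch below shows one plausible one-dimensional form consistent with the description (an unnormalized Gaussian of the distance between a slice index and the segment's start, middle or end position, with the standard deviation controlled by the positive scalar α). The default α values are placeholders, not the patented settings:

```python
import numpy as np

def stage_soft_labels(n_slices, t_start, t_end, alpha_sta=0.25, alpha_mid=0.21, alpha_end=0.21):
    """Assumed form of the per-slice 'true scores': 1 at the stage position, Gaussian decay around it."""
    idx = np.arange(n_slices, dtype=np.float32)
    t_mid = 0.5 * (t_start + t_end)
    length = max(t_end - t_start, 1.0)
    labels = {}
    for name, centre, alpha in (("sta", t_start, alpha_sta),
                                ("mid", t_mid, alpha_mid),
                                ("end", t_end, alpha_end)):
        sigma = alpha * length                      # standard deviation controlled by the positive scalar alpha
        labels[name] = np.exp(-((idx - centre) ** 2) / (2.0 * sigma ** 2))
    return labels                                   # each entry: array of length n_slices with values in [0, 1]
```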
Step S4.1: the weighted cross-entropy loss L_stage on the prediction layer is computed between the predicted stage score sequences and the true score sequences of the three stages.
Step S4.2: for the z-th candidate segment, the predicted values of its video slice start position and end position and its matching score predicted value are computed from the concatenation of the stage features taken from r^sta, r^mid and r^end at the candidate's start, middle and end positions.
Because the three concatenated features are specific to the start, intermediate and end stages respectively, this concatenated representation is highly discriminative for accurate video segment localization.
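A sketch of step S4.2 under the assumption that the candidate representation is the concatenation of the start-stage feature at the candidate's start slice, the intermediate-stage feature at its middle slice and the end-stage feature at its end slice, from which small linear heads regress the boundaries and score the match; head shapes are illustrative:

```python
import torch
import torch.nn as nn

class CandidateHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.boundary = nn.Linear(3 * dim, 2)   # predicted start / end positions (or offsets)
        self.matching = nn.Linear(3 * dim, 1)   # matching score between candidate and sentence

    def forward(self, r_sta, r_mid, r_end, start_idx, end_idx):
        # r_sta, r_mid, r_end: (N, dim) stage feature sequences; start_idx, end_idx: slice indices
        mid_idx = (start_idx + end_idx) // 2
        # concatenate the stage-specific features of the three stages
        cand = torch.cat([r_sta[start_idx], r_mid[mid_idx], r_end[end_idx]], dim=-1)
        start_end = self.boundary(cand)                 # regressed boundary prediction
        score = torch.sigmoid(self.matching(cand))      # predicted matching / IoU score
        return start_end, score
```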
Step S4.3: the boundary regression loss L_regress is computed from the predicted values of step S4.2 and the video slice start position and end position of the z-th candidate segment, where Z is the total number of candidate segments.
Step S4.4: the matching score weighted cross-entropy loss L_match is computed, where y_z is the degree of overlap (IoU) between the z-th candidate segment and the video segment located by the sentence, i.e. the video segment from the ground-truth start position to the ground-truth end position.
Unlike previous methods that predict IoU without taking regression into account, the IoU predicted by the invention is the IoU between the regressed candidate segment and the ground truth, which allows the invention to measure the quality of the boundary regression.
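The temporal IoU between a (regressed) candidate and the ground-truth segment, used both as the matching target y_z and in evaluation, can be computed with a standard helper such as the following (a generic definition, not patent-specific):

```python
def temporal_iou(cand_start, cand_end, gt_start, gt_end):
    """Intersection-over-union of two temporal segments given by slice indices or times."""
    inter = max(0.0, min(cand_end, gt_end) - max(cand_start, gt_start))
    union = max(cand_end, gt_end) - min(cand_start, gt_start)
    return inter / union if union > 0 else 0.0
```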
For generating candidate segments, any candidate segment generation method may be applied within the framework of the invention. For simplicity, the invention first enumerates all possible video segments consisting of consecutive video slices. For shorter videos, the segments can be chosen densely as candidates; for longer videos, the sampling interval can be gradually increased so that candidates are selected sparsely. The main idea is to remove redundant candidate segments with large overlap.
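A sketch of this candidate-generation idea: dense enumeration for short spans, with the sampling interval widened gradually for longer ones; the stride-doubling rule and the length threshold are assumptions:

```python
def generate_candidates(n_slices, max_dense_len=16):
    """Enumerate (start, end) slice-index pairs; longer candidates are sampled more sparsely."""
    candidates = []
    for start in range(n_slices):
        stride = 1
        end = start + 1
        while end <= n_slices:
            candidates.append((start, end))
            if end - start >= max_dense_len:     # beyond this length, gradually widen the sampling interval
                stride *= 2
            end += stride
    return candidates
```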
Step S4.5: the cross-entropy loss L_word of masked word prediction is computed as L_word = -log p_mask, where p_mask is the probability of the masked word predicted from the sentence feature sequence.
During training, the invention takes video sentence pairs as the input of the network. Similar to the original Transformer model, one word in the sentence sequence is randomly masked and replaced with the special token "[MASK]". The model then predicts the masked word from the unmasked words and from information in the video sequence. Notably, predicting some important words, such as nouns corresponding to objects and verbs corresponding to actions, requires information from the video sequence. Masked word prediction therefore not only lets the Transformer model learn the language, but also better aligns the video and sentence modalities. The loss function for masked word prediction is the standard cross-entropy loss. The video sentence Transformer model is not pre-trained on any other data set; all parameters are randomly initialized.
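A minimal sketch of the masking step; the token string and word-level tokenization are assumptions:

```python
import random

def mask_one_word(words, mask_token="[MASK]"):
    """Randomly replace one word of the query sentence with the mask token."""
    idx = random.randrange(len(words))
    masked = list(words)
    masked[idx] = mask_token
    return masked, idx          # the model must predict words[idx] from the rest and from the video
```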
Step S4.6: the loss L_total of the whole network for training the multi-stage aggregation Transformer model is computed as L_total = L_stage + L_regress + L_match + L_word.
Step S4.7: updating the parameters of the whole network
Data items are taken from the video sentence training data set one by one and the parameters of the whole network are updated according to the loss L_total until the video sentence training data set is exhausted, yielding the trained multi-stage aggregation Transformer model.
step S5: video sentence localization
A video and a complete query sentence without any masked word are input, as shown in FIG. 4, and processed according to steps S1 to S3; according to step S4.2, the matching score predicted value of each candidate segment and the predicted values of its video slice start and end positions are calculated, forming a new candidate segment; the new candidate segments are sorted by matching score from high to low, new candidate segments whose overlap exceeds 70% are removed using non-maximum suppression (NMS), and the top 1 or top 5 new candidate segments are returned as the finally located video segments.
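Step S5 can be sketched as follows, assuming the temporal_iou helper from the earlier sketch is in scope; the 70% overlap threshold and the top-1/top-5 return follow the description, everything else is illustrative:

```python
def locate(candidates, scores, iou_threshold=0.7, top_k=5):
    """candidates: list of (start, end) after boundary regression; scores: their matching scores."""
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:                                  # non-maximum suppression over temporal segments
        if all(temporal_iou(*candidates[i], *candidates[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == top_k:
            break
    return [candidates[i] for i in kept]             # top-1 or top-5 localized segments
```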
Performance evaluation
The invention is evaluated experimentally on two large public datasets, ActivityNet_Captions [14] and TACoS [24]. ActivityNet_Captions contains 20K videos and 100K query sentences, and the average video length is 2 minutes. TACoS contains 127 videos related to cooking activities with an average duration of 4.79 minutes, on average 148 query sentences per video, and 18,818 segment sentence pairs in total. TACoS is a very challenging dataset: its query sentences describe activities at multiple levels with different amounts of detail.
The invention is evaluated using Rank n@IoU=m, the percentage of correct localizations among all localizations, where a localization is counted as correct if at least one of the top n output segments matches the ground truth; a segment matches the ground truth if the IoU between the segment and the ground truth is greater than m.
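Using the same temporal_iou helper, the metric can be written as follows (a standard formulation, not taken verbatim from the patent text):

```python
def rank_n_at_iou(predictions, ground_truths, n=1, m=0.5):
    """predictions: per query, a ranked list of (start, end); ground_truths: per query, one (start, end)."""
    correct = 0
    for preds, gt in zip(predictions, ground_truths):
        # correct if any of the top-n predicted segments overlaps the ground truth by more than m
        if any(temporal_iou(s, e, *gt) > m for (s, e) in preds[:n]):
            correct += 1
    return correct / len(ground_truths)
```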
Adam is used to optimize the network. The batch size is set to 16 and the learning rate to 0.0001. The number of Transformer layers is set to 6, and the feature dimension of all layers to 512. The scalar standard-deviation parameter is set to 0.25, and α_s and α_m to 0.21. In ActivityNet_Captions and TACoS, the number of Transformer attention heads is set to 16 and 32 respectively. Video slice features are extracted with a C3D network. The sampled video slice length is set to 32 for the ActivityNet_Captions dataset and to 128 for the TACoS dataset.
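For reference, the reported settings can be collected into an illustrative configuration dictionary; values not stated in the description are omitted:

```python
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "batch_size": 16,
    "learning_rate": 1e-4,
    "transformer_layers": 6,
    "feature_dim": 512,
    "attention_heads": {"ActivityNet_Captions": 16, "TACoS": 32},
    "video_slice_length": {"ActivityNet_Captions": 32, "TACoS": 128},
    "slice_feature_extractor": "C3D",
}
```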
The multi-stage aggregation Transformer network proposed by the present invention is compared with various current advanced methods, and the comparison results are shown in tables 1-2.
TABLE 1 (reproduced as an image in the original publication): comparison with other methods on the ActivityNet_Captions dataset.
TABLE 2 (reproduced as an image in the original publication): comparison with other methods on the TACoS dataset.
From the experimental results, it can be seen that the invention achieves a significant improvement over previous methods. Although the invention is 1.09 points lower than CSMGAN [18] on Rank1@IoU=0.5 on the ActivityNet_Captions dataset, it outperforms CSMGAN on all other criteria. In particular, it is 2.63 and 3.55 points higher than CSMGAN on Rank1@IoU=0.7 and Rank5@IoU=0.7 respectively. Note that IoU=0.7 is a stricter criterion for judging whether a segment is correct, indicating that the invention achieves higher-quality localization. Furthermore, on the TACoS dataset the invention is more than 10 percentage points higher than CSMGAN on all evaluation metrics, showing its superiority over the CSMGAN method. The invention also achieves overwhelming advantages over the other methods, and these results fully demonstrate its effectiveness. In the video sentence Transformer model of the invention, each video slice can interact with each word of the query sentence, yielding more detailed and accurate video sentence alignment. Thanks to the multi-stage aggregation module, the computed video segment representations can match the activities of the different stages. The two modules are tightly combined into a very effective and efficient segment localization network.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they fall within the spirit and scope of the invention as defined by the appended claims, and all inventive matter that uses the inventive concept is protected.

Claims (1)

1. A video statement positioning method based on a multi-stage aggregation Transformer model is characterized by comprising the following steps:
(1) video slice feature and word feature extraction
Uniformly dividing a video into N time points according to time, collecting a video slice at each time point, wherein the video slice consists of continuous multi-frame images, extracting the characteristics of each video slice to obtain N slice characteristics in total, and placing the N slice characteristics according to the time sequence to form a video characteristic sequence;
converting each word of the sentence into a word vector to obtain word features, and then placing the word features in their order in the sentence to form a sentence feature sequence;
mapping the slice features in the video feature sequence and the word features in the sentence feature sequence to the same dimension, yielding a video feature sequence whose i-th element is the slice feature of the i-th video slice and a sentence feature sequence whose j-th element is the word feature of the j-th word of the sentence;
(2) constructing a video sentence Transformer model, and calculating the video feature sequence and the sentence feature sequence
constructing a D-layer video sentence Transformer model whose d-th layer, d = 1, 2, ..., D, updates the video features and the sentence features, wherein V and L denote the video and the sentence respectively, Q, K and W are learnable parameters whose different subscripts denote different parameters, and Att(·) is an attention calculation function;
taking the video feature sequence and the sentence feature sequence as the input of the video sentence Transformer model and calculating to obtain the D-th-layer output video feature sequence and sentence feature sequence;
(3) constructing a multi-stage aggregation module, and calculating the stage feature sequences and prediction score sequences of three stages
calculating the stage feature sequences r^sta, r^mid and r^end of the start stage, the intermediate stage and the end stage, wherein the start-stage feature sequence r^sta consists of the stage features r_i^sta of the N slices, i = 1, 2, ..., N, the intermediate-stage feature sequence r^mid consists of the stage features r_i^mid of the N slices, i = 1, 2, ..., N, the end-stage feature sequence r^end consists of the stage features r_i^end of the N slices, i = 1, 2, ..., N, and MLP_1^sta, MLP_1^mid and MLP_1^end are the multi-layer perceptrons (MLPs) used for calculating the stage feature sequences of the three stages respectively;
calculating the prediction score sequences p^sta, p^mid and p^end of the start stage, the intermediate stage and the end stage, wherein the start-stage prediction score sequence p^sta consists of the prediction scores of the N slices, the intermediate-stage prediction score sequence p^mid consists of the prediction scores of the N slices, the end-stage prediction score sequence p^end consists of the prediction scores of the N slices, and the corresponding multi-layer perceptrons are used for calculating the prediction score sequences of the three stages;
(4) training the multi-stage aggregation Transformer model
the video sentence Transformer model and the multi-stage aggregation module form the multi-stage aggregation Transformer model;
constructing a video sentence training data set, wherein each data item comprises a video, a sentence, and the video slice start position and end position of the video segment located by the sentence;
extracting a data item from the video sentence training data set, randomly masking one word of the sentence and replacing it with the token "[MASK]", processing the video and the sentence according to steps (1) to (3), and calculating the true scores of the start stage, the intermediate stage and the end stage of each video slice, wherein the true scores follow unnormalized two-dimensional Gaussian distributions whose standard deviations σ^sta, σ^mid and σ^end are controlled by the positive scalars α^sta, α^mid and α^end;
4.1) calculating the weighted cross-entropy loss L_stage on the prediction layer;
4.2) calculating, for the z-th candidate segment, the predicted values of its video slice start position and end position and its matching score predicted value from the stage features taken from the stage feature sequences r^sta, r^mid and r^end obtained in step (3) at the candidate's video slice start position, middle position and end position respectively;
4.3) calculating the boundary regression loss L_regress, wherein Z is the total number of candidate segments;
4.4) calculating the matching score weighted cross-entropy loss L_match, wherein y_z is the degree of overlap between the z-th candidate segment and the video segment located by the sentence, i.e. the video segment from the start position to the end position;
4.5) calculating the cross-entropy loss L_word of masked word prediction as L_word = -log p_mask, wherein p_mask is the probability of the masked word predicted from the sentence feature sequence;
4.6) calculating the loss L_total of the whole network for training the multi-stage aggregation Transformer model as L_total = L_stage + L_regress + L_match + L_word;
4.7) updating the parameters of the whole network: sequentially taking data items from the video sentence training data set and updating the parameters of the whole network according to the loss L_total until the video sentence training data set is exhausted, thereby obtaining the trained multi-stage aggregation Transformer model;
(5) video sentence positioning
inputting a video and a complete query sentence without any masked word, processing them according to steps (1) to (3), calculating, according to step 4.2), the matching score predicted value of each candidate segment and the predicted values of its video slice start position and end position to form a new candidate segment, sorting the new candidate segments by matching score from high to low, removing new candidate segments whose overlap exceeds 70% using non-maximum suppression, and returning the top 1 or top 5 new candidate segments as the finally located video segments.
CN202011508292.1A 2020-12-18 2020-12-18 Video statement positioning method based on multi-stage aggregation Transformer model Active CN112488063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011508292.1A CN112488063B (en) 2020-12-18 2020-12-18 Video statement positioning method based on multi-stage aggregation Transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011508292.1A CN112488063B (en) 2020-12-18 2020-12-18 Video statement positioning method based on multi-stage aggregation Transformer model

Publications (2)

Publication Number Publication Date
CN112488063A CN112488063A (en) 2021-03-12
CN112488063B true CN112488063B (en) 2022-06-14

Family

ID=74914591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011508292.1A Active CN112488063B (en) 2020-12-18 2020-12-18 Video statement positioning method based on multi-stage aggregation Transformer model

Country Status (1)

Country Link
CN (1) CN112488063B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115708359B (en) * 2021-08-20 2024-09-03 小米科技(武汉)有限公司 Video clip interception method, device and storage medium
CN116740067B (en) * 2023-08-14 2023-10-20 苏州凌影云诺医疗科技有限公司 Infiltration depth judging method and system for esophageal lesions

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1144588A (en) * 1994-03-14 1997-03-05 美国赛特公司 A system for implanting image into video stream
CN110377792A (en) * 2019-06-14 2019-10-25 浙江大学 A method of task is extracted using the video clip that the cross-module type Internet solves Problem based learning
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN111814489A (en) * 2020-07-23 2020-10-23 苏州思必驰信息科技有限公司 Spoken language semantic understanding method and system
CN111931736A (en) * 2020-09-27 2020-11-13 浙江大学 Lip language identification method and system using non-autoregressive model and integrated discharge technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Video Action Transformer Network; Rohit Girdhar et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020-01-09; pp. 244-253 *
Research on lip localization, tracking and feature extraction techniques in lip-reading applications; 杨阳 (Yang Yang); China Master's Theses Full-text Database (Information Science and Technology); 2009-09-15; I138-563 *

Also Published As

Publication number Publication date
CN112488063A (en) 2021-03-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant