CN114064967A - Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network - Google Patents
Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
- Publication number
- CN114064967A (application number CN202210052687.8A)
- Authority
- CN
- China
- Prior art keywords
- video
- cross
- representation
- loss
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/146—Data rate or code amount at the encoder output
- H04N19/149—Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
- H04N19/21—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with binary alpha-plane coding for video objects, e.g. context-based arithmetic encoding [CAE]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Library & Information Science (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Algebra (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a cross-modal time sequence behavior positioning method and device based on a multi-granularity cascade interaction network, which address the problem of localizing a behavior in an untrimmed video given a text query. The invention implements a novel multi-granularity cascade cross-modal interaction network that performs cascaded cross-modal interaction in a coarse-to-fine manner to improve the cross-modal alignment capability of the model. In addition, the invention introduces a local-global context-aware video encoder to improve the modeling of contextual temporal dependencies in the video representation. The method is simple and flexible, improves vision-language cross-modal alignment precision, and markedly raises the temporal localization accuracy of the trained model on paired video-query test data.
Description
Technical Field
The invention relates to the field of vision-language cross-modal learning, and in particular to a cross-modal time sequence behavior positioning method and device.
Background
With the rapid development of multimedia and network technologies and the growing deployment of large-scale video surveillance in traffic, campus, shopping-mall and similar settings, the volume of video data is growing geometrically, and video understanding has become an important and urgent problem. Temporal behavior localization is a foundation and key component of video understanding. Research on temporal behavior localization based on the visual modality alone restricts the behaviors to be localized to a predefined behavior set; in the real world, however, behaviors are complex and diverse, and a predefined behavior set can hardly meet practical needs. As shown in fig. 1, the visual-language cross-modal time sequence behavior positioning task takes a text description of a certain behavior in a video as the query and temporally localizes the corresponding behavior segment in the video. Visual-language cross-modal time sequence behavior positioning is a very natural mode of human-computer interaction, and the technology has broad application prospects in short-video content retrieval and production, intelligent video surveillance, human-computer interaction and other fields.
Driven by deep learning, the visual-language cross-modal time sequence behavior positioning task has attracted great attention from industry and academia. Because a significant semantic gap exists between the heterogeneous text modality and visual modality, achieving semantic alignment between the modalities is the core problem of this task. Existing visual-language cross-modal time sequence behavior positioning methods mainly fall into three categories: candidate-segment-nomination-based methods, nomination-free methods, and sequential-decision-based methods. Visual-language cross-modal alignment is an indispensable link in all three. However, existing methods neither fully exploit multi-granularity text query information in the visual-language cross-modal interaction stage nor fully model the local contextual temporal dependencies of the video in the video representation encoding stage.
Disclosure of Invention
In order to overcome the shortcomings of the prior art and improve the vision-language cross-modal alignment precision in the vision-language cross-modal time sequence behavior positioning task, the invention adopts the following technical scheme:
a cross-modal time sequence behavior positioning method of a multi-granularity cascade interactive network comprises the following steps:
step S1: giving an unclipped video sample, performing initial extraction of video representation by using a visual pre-training model, and performing context-aware time sequence dependent coding on the initially extracted video representation in a local-global mode to obtain a final video representation, so that the context time sequence dependent modeling capability of the video representation is improved;
step S2: for text query corresponding to an untrimmed video, performing word embedding initialization on each word in a query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long-time memory network to obtain a word-level representation and a global-level representation of the text query;
step S3: for the extracted video representation and the text query representation, a multi-granularity cascade interaction network is adopted to carry out interaction between a video modality and a text query modality, so that an enhanced video representation guided by query is obtained, and the cross-modality alignment precision is improved;
step S4: for the video representation obtained after multi-granularity cascade interaction, predicting the time sequence position of a text query corresponding target video fragment by adopting an attention-based time sequence position regression module;
step S5: for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in the steps S1-S4, training of the model is carried out by utilizing a training sample set, and a total loss function adopted in the training comprises attention alignment loss and boundary loss, wherein the boundary loss comprises attention alignment loss and boundary lossInvolving smoothingThe loss and the time sequence generalized intersection are better adapted to the evaluation criterion of the time sequence positioning task than the loss, and the training sample set is composed of a plurality of { video, query, target video segment time sequence position mark } triple samples.
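As a reading aid, the following minimal PyTorch-style skeleton sketches how steps S1-S4 compose into a single forward pass. It is an illustrative assumption rather than the patented implementation: the module names (LocalGlobalVideoEncoder, QueryEncoder, CascadeInteraction, AttentionRegressor), argument names and tensor shapes are hypothetical placeholders for the components detailed below.

```python
import torch.nn as nn

class CrossModalGroundingModel(nn.Module):
    """Sketch of the overall pipeline: video encoding (S1), query encoding (S2),
    multi-granularity cascade interaction (S3) and temporal regression (S4)."""

    def __init__(self, video_encoder, query_encoder, interaction, regressor):
        super().__init__()
        self.video_encoder = video_encoder   # local-global context-aware video encoder (step S1)
        self.query_encoder = query_encoder   # word embedding + BLSTM query encoder (step S2)
        self.interaction = interaction       # multi-granularity cascade interaction (step S3)
        self.regressor = regressor           # attention-based temporal position regression (step S4)

    def forward(self, frame_feats, word_ids):
        video = self.video_encoder(frame_feats)              # (B, T, d) encoded frame features
        words, global_q = self.query_encoder(word_ids)       # (B, L, d) word-level, (B, d) global-level
        enhanced = self.interaction(video, words, global_q)  # (B, T, d) query-guided video representation
        center, width, attn = self.regressor(enhanced)       # normalized center, duration, attention
        return center, width, attn
```

Concrete sketches of the four sub-modules, together with the loss terms of step S5, are given alongside the corresponding steps in the detailed description.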
Further, in step S1, video frame features are extracted offline with the visual pre-training model and T frames are uniformly sampled; a set of video representations is then obtained through a linear transformation layer, in which each element is the representation of the i-th frame, and context-aware temporal dependency coding is performed on this video representation in a local-global manner.
Further, in the local-global context-aware coding of step S1, local context-aware coding is first applied to the video representation to obtain a locally encoded video representation, and global context-aware coding is then applied to obtain the globally encoded video representation.
Further, the local context-aware coding and the global context-aware coding in step S1 are respectively implemented as follows:
step S1.1, local context-aware coding adopts a stack of consecutive local transformer blocks equipped with one-dimensional shifted windows: the video representation is taken as the initial input of the first block, the output of each block is fed into the next, and the output of the last block is taken as the video representation output by local context-aware coding; the operations inside a consecutive local transformer block with a one-dimensional shifted window are as follows:
the input video representation is layer-normalized, passed through a one-dimensional window multi-head self-attention module and added to the block input; the result is layer-normalized, passed through a multi-layer perceptron and added back; the result is layer-normalized, passed through a one-dimensional shifted-window multi-head self-attention module and added back; finally, the result is layer-normalized, passed through a multi-layer perceptron and added back, and the outcome is output as the video representation of this block and serves as the input of the next block.
Specifically, denoting layer normalization by LN, the one-dimensional window multi-head self-attention module by 1D-W-MSA, the one-dimensional shifted-window multi-head self-attention module by 1D-SW-MSA and the multi-layer perceptron by MLP, a consecutive local transformer block with a one-dimensional shifted window maps its input X through X1 = 1D-W-MSA(LN(X)) + X, X2 = MLP(LN(X1)) + X1, X3 = 1D-SW-MSA(LN(X2)) + X2 and X4 = MLP(LN(X3)) + X3, and outputs X4.
step S1.2, global context-aware coding comprises a stack of conventional transformer blocks: the locally encoded video representation is taken as the initial input of the first conventional transformer block, the output of each block is fed into the next, and the output of the last conventional transformer block is taken as the video representation output by global context-aware coding; the operations inside a conventional transformer block are as follows:
the input video representation is passed through a conventional multi-head self-attention module, added to the block input and layer-normalized; the result is passed through a multi-layer perceptron, added to its input and layer-normalized, and the outcome is the output of the conventional transformer block.
Specifically, denoting the conventional multi-head self-attention module by MSA, a conventional transformer block maps its input Y through Y1 = LN(MSA(Y) + Y) and Y2 = LN(MLP(Y1) + Y1), and outputs Y2.
Further, in step S2, the learnable word embedding vector of each word in the query text is initialized with the pre-trained word embedding model to obtain the embedding vector sequence of the text query, in which each element corresponds to one word of the query. The embedding vector sequence is context-encoded by a multi-layer bidirectional long short-term memory network (BLSTM) to obtain the word-level text query representation; the final forward hidden state vector and the final backward hidden state vector of the BLSTM are concatenated to obtain the global-level text query representation, and the word-level and global-level representations together form the text query representation.
Further, in the multi-granularity cascade interaction network of step S3, the video representation and the text query representation are first passed through video-guided query decoding to obtain a video-guided query representation, which comprises a global-level video-guided query representation and a word-level video-guided query representation; the video-guided query representation and the video modality representation then undergo cascaded cross-modal fusion to obtain the final enhanced video representation. Video-guided query decoding serves to narrow the semantic gap between the video representation and the text query representation.
Further, the step S3 includes the following steps:
step S3.1, video-guided query decoding adopts a stack of cross-modal decoding blocks: the text query representation is taken as the initial input of the first cross-modal decoding block, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the video-guided query representation; the operations inside a cross-modal decoding block of step S3.1 are as follows:
the input text query representation is passed through a multi-head self-attention module; the result is used as the query of a multi-head cross-attention module, with the video representation as the keys and values; the output is passed through a conventional feed-forward network, and the result is the output of the cross-modal decoding block.
step S3.2, cascaded cross-modal fusion: first, the global-level video-guided query representation and the video modality representation are fused at the coarse-grained level by element-wise multiplication to obtain a coarse-level fused video representation; this coarse-level fusion serves to suppress background video frames and emphasize foreground video frames. Then, the word-level video-guided query representation and the coarse-level fused video representation are fused at the fine-grained level through another stack of cross-modal decoding blocks: the coarse-level fused video representation is taken as the initial input of the first cross-modal decoding block, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the enhanced video representation; the operations inside a cross-modal decoding block of step S3.2 are as follows:
the input video representation is passed through a multi-head self-attention module; the result is used as the query of a multi-head cross-attention module, with the word-level video-guided query representation as the keys and values; the output is passed through a conventional feed-forward network, and the result is the output of the cross-modal decoding block.
Further, the attention-based temporal position regression module of step S4 first passes the enhanced video sequence representation produced by the multi-granularity cascade interaction through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention score of each video frame; the enhanced video representation and the temporal attention scores are then aggregated by an attention pooling layer to obtain the representation of the target segment; finally, the normalized temporal center coordinate and the normalized duration of the target segment are directly regressed from the target segment representation through a multi-layer perceptron. In other words, the attention pooling layer condenses the enhanced video sequence representation output by the multi-granularity cascade interaction into a single segment representation, weighted by the temporal attention scores.
Further, the training of the model in step S5 includes the following steps:
step S5.1, calculating the attention alignment loss: for each frame i, the logarithm of its temporal attention score is multiplied by an indicator value that equals 1 if the i-th frame of the video lies inside the annotated temporal segment and 0 otherwise; these products are accumulated over the sampled frames, and the accumulated value is divided by the accumulated indicator values to compute the loss. The attention alignment loss encourages the video frames inside the annotated temporal segment to receive higher attention scores; its calculation can be expressed as
L_align = -( Σ_{i=1..T} y_i · log a_i ) / ( Σ_{i=1..T} y_i ),
where T is the number of sampled frames, a_i is the temporal attention score of the i-th frame, and y_i equals 1 if the i-th frame of the video lies inside the annotated temporal segment and 0 otherwise.
step S5.2, calculating the boundary loss by combining a smooth L1 loss and a temporal generalized IoU loss: a first smooth L1 loss is computed between the normalized temporal center coordinate of the predicted segment and that of the annotated segment, a second smooth L1 loss is computed between the duration of the predicted segment and that of the annotated segment, and the sum of the two is the smooth L1 term; the temporal generalized IoU loss is one minus the generalized IoU between the regressed segment and the corresponding annotated segment, where the generalized IoU is computed from the intersection-over-union of the two segments and the minimum time window covering both of them; the smooth L1 term and the temporal generalized IoU loss are summed to give the boundary loss.
The loss terms are balanced by weight hyperparameters, and the model parameters are updated with the optimizer during the training phase.
The cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network comprises one or more processors and is used for realizing the cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network.
The invention has the advantages and beneficial effects that:
the invention discloses a cross-modal time sequence behavior positioning method and device of a multi-granularity cascade interaction network, which fully utilize multi-granularity text query information in a coarse-to-fine mode in a vision-language cross-modal interaction link, fully model the local-global context time sequence dependence characteristic of a video in a video representation coding link, and solve the problem of time sequence behavior positioning based on text query in an untrimmed video. For given untrimmed videos and text queries, the method can improve the vision-language cross-modal alignment precision, and further improve the positioning accuracy of a cross-modal time sequence behavior positioning task.
Drawings
FIG. 1 is an exemplary diagram of a visual-language cross-modal temporal behavior localization task.
FIG. 2 is a block diagram of the cross-modal temporal behavior localization of the multi-granularity cascading interactive network of the present invention.
FIG. 3 is a flowchart of a cross-modal timing behavior localization method of a multi-granularity cascading interactive network according to the present invention.
Fig. 4 is a structural diagram of a cross-mode timing behavior positioning device of a multi-granularity cascade interaction network in the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The invention discloses a cross-modal time sequence behavior positioning method and device of a multi-granularity cascade interaction network, used to solve the problem of temporal behavior localization in untrimmed video given a text query. The method provides a simple and effective multi-granularity cascade cross-modal interaction network to improve the cross-modal alignment capability of the model, and introduces a local-global context-aware video encoder to improve the contextual temporal dependency modeling capability of the video encoder. As a result, the temporal localization accuracy of the trained model is markedly improved on paired video-query test data.
In the experiments, the cross-modal time sequence behavior positioning method of the multi-granularity cascade interaction network is implemented on the PyTorch framework. Video frame features are extracted offline with a pre-trained C3D network, each video is uniformly sampled to 256 frames, and the number of heads of all self-attention and cross-attention sub-modules is set to 8. The model is trained with the Adam optimizer at a fixed learning rate of 0.0004, and each batch consists of 100 video-query pairs. Performance is measured with the "R@n, IoU=m" criterion, which reports the percentage of correctly localized queries in the evaluation data set: a query is considered correctly localized if, among the n predicted segments with the highest confidence, the maximum temporal intersection-over-union (IoU) with the ground-truth annotation exceeds m.
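As a reading aid, the "R@n, IoU=m" criterion can be sketched as follows, assuming each prediction and annotation is a (start, end) pair in seconds; the function names and the data layout are illustrative assumptions, not part of the patent.

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(predictions, ground_truths, n=1, m=0.5):
    """Percentage of queries whose best of the top-n predictions exceeds IoU m."""
    hits = sum(
        1 for preds, gt in zip(predictions, ground_truths)
        if max(temporal_iou(p, gt) for p in preds[:n]) > m
    )
    return 100.0 * hits / len(ground_truths)

# Example: one of the two queries is localized correctly at IoU > 0.5.
preds = [[(4.2, 9.8)], [(0.0, 3.0)]]          # top-1 predicted segment per query
gts = [(5.0, 10.0), (6.0, 9.0)]
print(recall_at_n(preds, gts, n=1, m=0.5))    # 50.0
```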
In a specific embodiment, an untrimmed video is uniformly sampled into a sequence of video frames, and a text description of a behavior segment in the video is given as the query. The visual-language cross-modal time sequence behavior positioning task is to predict the start time and end time of the video segment corresponding to the text description. The training data set of the task can be defined as a set of triples, each consisting of a video, a query, and the ground-truth start time and end time of the target video segment.
As shown in fig. 2 and fig. 3, the cross-modal time sequence behavior positioning method of the multi-granularity cascade interaction network includes the following steps:
step S1: given an untrimmed video sample, initially extract a video representation with a visual pre-training model, and perform context-aware temporal dependency coding on the initial representation in a local-global manner to obtain the final video representation, thereby improving the contextual temporal dependency modeling capability of the video representation;
In step S1, video frame features are extracted offline with the visual pre-training model and T frames are uniformly sampled; a set of video representations is then obtained through a linear transformation layer, in which each element is the representation of the i-th frame, and context-aware temporal dependency coding is performed on this video representation in a local-global manner.
In the local-global context-aware coding of step S1, local context-aware coding is first applied to the video representation to obtain a locally encoded video representation, and global context-aware coding is then applied to obtain the globally encoded video representation.
The local context-aware coding and the global context-aware coding in step S1 are implemented as follows:
step S1.1, local context-aware coding adopts a stack of consecutive local transformer blocks equipped with one-dimensional shifted windows: the video representation is taken as the initial input of the first block, the output of each block is fed into the next, and the output of the last block is taken as the video representation output by local context-aware coding; the operations inside a consecutive local transformer block with a one-dimensional shifted window are as follows:
the input video representation is layer-normalized, passed through a one-dimensional window multi-head self-attention module and added to the block input; the result is layer-normalized, passed through a multi-layer perceptron and added back; the result is layer-normalized, passed through a one-dimensional shifted-window multi-head self-attention module and added back; finally, the result is layer-normalized, passed through a multi-layer perceptron and added back, and the outcome is output as the video representation of this block and serves as the input of the next block.
Specifically, denoting layer normalization by LN, the one-dimensional window multi-head self-attention module by 1D-W-MSA, the one-dimensional shifted-window multi-head self-attention module by 1D-SW-MSA and the multi-layer perceptron by MLP, a consecutive local transformer block with a one-dimensional shifted window maps its input X through X1 = 1D-W-MSA(LN(X)) + X, X2 = MLP(LN(X1)) + X1, X3 = 1D-SW-MSA(LN(X2)) + X2 and X4 = MLP(LN(X3)) + X3, and outputs X4.
step S1.2, global context-aware coding comprises a stack of conventional transformer blocks: the locally encoded video representation is taken as the initial input of the first conventional transformer block, the output of each block is fed into the next, and the output of the last conventional transformer block is taken as the video representation output by global context-aware coding; the operations inside a conventional transformer block are as follows:
the input video representation is passed through a conventional multi-head self-attention module, added to the block input and layer-normalized; the result is passed through a multi-layer perceptron, added to its input and layer-normalized, and the outcome is the output of the conventional transformer block.
Specifically, denoting the conventional multi-head self-attention module by MSA, a conventional transformer block maps its input Y through Y1 = LN(MSA(Y) + Y) and Y2 = LN(MLP(Y1) + Y1), and outputs Y2.
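The local-global context-aware video encoder of step S1 can be sketched in PyTorch as follows. It is a minimal sketch under several assumptions: the 4096-dimensional C3D input, the window size of 16, the block counts, and the emulation of one-dimensional window attention by reshaping frames into non-overlapping windows (and rolling the sequence for the shifted window, without cross-boundary masking) are illustrative choices, not details taken from the patent.

```python
import torch
import torch.nn as nn

class WindowBlock(nn.Module):
    """Consecutive local transformer block with a 1-D shifted window (step S1.1):
    LN -> window MSA -> add, LN -> MLP -> add, then the same pair with a shifted window."""
    def __init__(self, d=512, heads=8, window=16):
        super().__init__()
        self.window = window
        self.norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])
        self.attn1 = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.mlp2 = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def _windowed(self, attn, x):
        b, t, d = x.shape                                  # assumes t divisible by the window size
        w = x.reshape(b * (t // self.window), self.window, d)
        out, _ = attn(w, w, w)
        return out.reshape(b, t, d)

    def forward(self, x):
        x = x + self._windowed(self.attn1, self.norms[0](x))   # 1-D window multi-head self-attention
        x = x + self.mlp1(self.norms[1](x))
        shift = self.window // 2
        s = torch.roll(x, -shift, dims=1)                       # 1-D shifted window
        s = s + self._windowed(self.attn2, self.norms[2](s))
        x = torch.roll(s, shift, dims=1)
        x = x + self.mlp2(self.norms[3](x))
        return x

class GlobalBlock(nn.Module):
    """Conventional (post-norm) transformer block for global context (step S1.2)."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = self.n1(x + self.attn(x, x, x)[0])
        return self.n2(x + self.mlp(x))

class LocalGlobalVideoEncoder(nn.Module):
    def __init__(self, in_dim=4096, d=512, local_blocks=2, global_blocks=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d)                         # linear transformation layer
        self.local = nn.Sequential(*[WindowBlock(d) for _ in range(local_blocks)])
        self.glob = nn.Sequential(*[GlobalBlock(d) for _ in range(global_blocks)])

    def forward(self, frame_feats):                              # (B, T, in_dim)
        return self.glob(self.local(self.proj(frame_feats)))
```

Under these assumptions, LocalGlobalVideoEncoder()(torch.randn(2, 256, 4096)) yields a (2, 256, 512) video representation.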
Step S2: for text query corresponding to an untrimmed video, performing word embedding initialization on each word in a query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long-time memory network to obtain a word-level representation and a global-level representation of the text query;
in step S2, the learnable word embedded vector corresponding to each word in the text is queried, and the word embedded model is initialized using the pre-trained word embedded model to obtain an embedded vector sequence of the text query,For characterization of ith word of video, embedded vector sequence of text query is processed by multilayer bidirectional long-and-short memory network (BLSTM)Context coding is carried out to obtain the word-level text query representation of the queryBy passingForward hidden state vector sum ofSplicing the backward hidden state vectors to obtain a global level text query representationFinally, the text query representation is obtained。
The specific implementation mode is as follows:
whereinIs composed ofForward hidden state vector sum ofAnd (4) splicing the backward hidden state vectors.
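The text query encoder of step S2 can be sketched as follows; the vocabulary size, embedding dimension and number of BLSTM layers are assumptions, and loading actual pre-trained word vectors (for example GloVe) into the embedding table is omitted for brevity.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Word embedding + multi-layer bidirectional LSTM (step S2).
    Returns word-level representations and a global-level representation."""
    def __init__(self, vocab_size=10000, emb_dim=300, d=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)    # initialize from pre-trained vectors in practice
        self.blstm = nn.LSTM(emb_dim, d // 2, num_layers=layers,
                             bidirectional=True, batch_first=True)

    def forward(self, word_ids):                          # (B, L) integer word indices
        h, (h_n, _) = self.blstm(self.embed(word_ids))    # h: (B, L, d) word-level representation
        # global-level representation: concatenate the final forward and backward hidden states
        global_q = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (B, d)
        return h, global_q
```

For a batch of index sequences, QueryEncoder()(torch.randint(0, 10000, (2, 12))) returns a (2, 12, 512) word-level representation and a (2, 512) global-level representation.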
Step S3: for the extracted video representation and the text query representation, a multi-granularity cascade interaction network is adopted to carry out interaction between a video modality and a text query modality, so that an enhanced video representation guided by query is obtained, and the cross-modality alignment precision is improved;
in the multi-granularity cascade interactive network in the step S3, firstly, the video is characterizedAnd text query characterizationObtaining a video-guided query representation by video-guided query decoding,A query token representing a global level video guide,representing a query characterization of a word-level video guide, and then characterizing the video-guided queryAnd video modality characterizationAnd finally obtaining the enhanced video representation through cascade cross-modal fusion. Video-guided query decoding to narrow down video representationsAnd text query characterizationThe semantic gap between modalities.
The step S3 specifically includes the following steps:
step S3.1, video-guided query decoding adopts a stack of cross-modal decoding blocks: the text query representation is taken as the initial input of the first cross-modal decoding block, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the video-guided query representation; the operations inside a cross-modal decoding block of step S3.1 are as follows:
the input text query representation is passed through a multi-head self-attention module; the result is used as the query of a multi-head cross-attention module, with the video representation as the keys and values; the output is passed through a conventional feed-forward network, and the result is the output of the cross-modal decoding block.
step S3.2, cascaded cross-modal fusion: first, the global-level video-guided query representation and the video modality representation are fused at the coarse-grained level by element-wise multiplication to obtain a coarse-level fused video representation; this coarse-level fusion serves to suppress background video frames and emphasize foreground video frames. Then, the word-level video-guided query representation and the coarse-level fused video representation are fused at the fine-grained level through another stack of cross-modal decoding blocks: the coarse-level fused video representation is taken as the initial input of the first cross-modal decoding block, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the enhanced video representation; the operations inside a cross-modal decoding block of step S3.2 are as follows:
the input video representation is passed through a multi-head self-attention module; the result is used as the query of a multi-head cross-attention module, with the word-level video-guided query representation as the keys and values; the output is passed through a conventional feed-forward network, and the result is the output of the cross-modal decoding block.
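A condensed PyTorch sketch of the multi-granularity cascade interaction of step S3 follows. The decoder block mirrors a standard transformer decoder layer (self-attention, cross-attention, feed-forward); the post-norm residual placement, the block counts, and the handling of the global-level query as a token prepended to the word sequence during video-guided decoding are assumptions made for brevity rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class CrossModalDecoderBlock(nn.Module):
    """Self-attention -> cross-attention (input attends to memory) -> feed-forward."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, memory):
        x = self.n1(x + self.self_attn(x, x, x)[0])
        x = self.n2(x + self.cross_attn(x, memory, memory)[0])
        return self.n3(x + self.ffn(x))

class CascadeInteraction(nn.Module):
    """Step S3: video-guided query decoding, coarse fusion by element-wise
    multiplication with the global query, then fine-grained cross-modal decoding."""
    def __init__(self, d=512, blocks=2):
        super().__init__()
        self.query_decoder = nn.ModuleList([CrossModalDecoderBlock(d) for _ in range(blocks)])
        self.video_decoder = nn.ModuleList([CrossModalDecoderBlock(d) for _ in range(blocks)])

    def forward(self, video, words, global_q):
        # S3.1: video-guided query decoding (queries attend to the video)
        q = torch.cat([global_q.unsqueeze(1), words], dim=1)   # prepend the global-level token
        for blk in self.query_decoder:
            q = blk(q, video)
        global_guided, word_guided = q[:, 0], q[:, 1:]
        # S3.2: coarse-level fusion suppresses background frames ...
        v = video * global_guided.unsqueeze(1)
        # ... then fine-grained fusion with the word-level guided queries
        for blk in self.video_decoder:
            v = blk(v, word_guided)
        return v
```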
Step S4: for the video representation obtained after multi-granularity cascade interaction, predicting the time sequence position of a text query corresponding target video fragment by adopting an attention-based time sequence position regression module;
the attention-based time sequence position regression module in the step S4 characterizes the video sequence subjected to the multi-granularity cascade interactionObtaining the time sequence attention score of the video through a multilayer perceptron and a SoftMax active layer(ii) a Then the enhanced video is characterizedAnd time series attention pointsObtaining a representation of the target segment by means of an attention pooling layer(ii) a Finally, the characterization of the target fragmentNormalizing the time sequence center coordinates of the target segment by the multilayer perceptronAnd segment durationDirect regression was performed.
The particular attention-based time series position regression is represented as:
wherein,in order to enhance video characterization, i.e., video sequence characterization through multi-granularity cascade interaction, the attention pooling layer is used for converging video sequence characterization,
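The attention-based temporal position regression of step S4 can be sketched as follows; the two-layer perceptrons and the use of a sigmoid to keep the regressed center and duration normalized to [0, 1] are assumptions.

```python
import torch
import torch.nn as nn

class AttentionRegressor(nn.Module):
    """Step S4: temporal attention scores -> attention pooling -> regression of
    the normalized segment center and duration."""
    def __init__(self, d=512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))
        self.regress = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2))

    def forward(self, enhanced):                                         # (B, T, d)
        attn = torch.softmax(self.score(enhanced).squeeze(-1), dim=1)    # (B, T) attention scores
        segment = torch.bmm(attn.unsqueeze(1), enhanced).squeeze(1)      # attention pooling -> (B, d)
        center, width = torch.sigmoid(self.regress(segment)).unbind(-1)  # normalized center / duration
        return center, width, attn
```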
step S5: for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in steps S1-S4, train the model on a training sample set; the total loss function used in training comprises an attention alignment loss and a boundary loss, where the boundary loss combines a smooth L1 loss with a temporal generalized IoU loss so as to better match the evaluation criterion of the temporal localization task, and the training sample set consists of a number of {video, query, target video segment temporal position annotation} triple samples.
The training of the model in step S5 includes the following steps:
step S5.1, calculating the attention alignment loss: for each frame i, the logarithm of its temporal attention score is multiplied by an indicator value that equals 1 if the i-th frame of the video lies inside the annotated temporal segment and 0 otherwise; these products are accumulated over the sampled frames, and the accumulated value is divided by the accumulated indicator values to compute the loss. The attention alignment loss encourages the video frames inside the annotated temporal segment to receive higher attention scores; its calculation can be expressed as
L_align = -( Σ_{i=1..T} y_i · log a_i ) / ( Σ_{i=1..T} y_i ),
where T is the number of sampled frames, a_i is the temporal attention score of the i-th frame, and y_i equals 1 if the i-th frame of the video lies inside the annotated temporal segment and 0 otherwise.
step S5.2, calculating the boundary loss by combining a smooth L1 loss and a temporal generalized IoU loss: a first smooth L1 loss is computed between the normalized temporal center coordinate of the predicted segment and that of the annotated segment, a second smooth L1 loss is computed between the duration of the predicted segment and that of the annotated segment, and the sum of the two is the smooth L1 term; the temporal generalized IoU loss is one minus the generalized IoU between the regressed segment and the corresponding annotated segment, where the generalized IoU is computed from the intersection-over-union of the two segments and the minimum time window covering both of them; the smooth L1 term and the temporal generalized IoU loss are summed to give the boundary loss.
The loss terms are balanced by weight hyperparameters, and the model parameters are updated with the optimizer during the training phase.
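Assuming predictions and annotations are expressed as normalized (center, duration) pairs, the two loss terms of step S5 can be sketched as below; the exact weighting of the terms is an assumption, since the weight hyperparameters are only named, not valued, in the description.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(attn, inside):
    """attn: (B, T) temporal attention scores; inside: (B, T) with 1 for frames that
    lie in the annotated segment, else 0. Encourages high attention inside the segment."""
    per_sample = -(inside * torch.log(attn + 1e-8)).sum(dim=1) / inside.sum(dim=1).clamp(min=1)
    return per_sample.mean()

def boundary_loss(pred_c, pred_w, gt_c, gt_w, giou_weight=1.0):
    """Smooth L1 on center and duration plus a 1-D generalized IoU term."""
    l1 = F.smooth_l1_loss(pred_c, gt_c) + F.smooth_l1_loss(pred_w, gt_w)
    p0, p1 = pred_c - pred_w / 2, pred_c + pred_w / 2          # predicted (start, end)
    g0, g1 = gt_c - gt_w / 2, gt_c + gt_w / 2                  # annotated (start, end)
    inter = (torch.min(p1, g1) - torch.max(p0, g0)).clamp(min=0)
    union = ((p1 - p0) + (g1 - g0) - inter).clamp(min=1e-8)
    hull = (torch.max(p1, g1) - torch.min(p0, g0)).clamp(min=1e-8)  # smallest window covering both
    giou = inter / union - (hull - union) / hull
    return l1 + giou_weight * (1 - giou).mean()
```

The total training loss would then be a weighted sum of attention_alignment_loss and boundary_loss, with the weights treated as hyperparameters and the parameters updated by the Adam optimizer as described above.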
The accuracy of the method of the present invention is compared with other representative methods on the TACoS test set, as shown in Table 1, using the "R@n, IoU=m" evaluation criterion with n = 1 and m = {0.1, 0.3, 0.5}.
TABLE 1
Corresponding to the embodiment of the cross-modal time sequence behavior positioning method, the invention also provides an embodiment of a cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network.
Referring to fig. 4, the cross-modal timing behavior positioning apparatus of the multi-granularity cascading interactive network provided in the embodiment of the present invention includes one or more processors, and is configured to implement the cross-modal timing behavior positioning method of the multi-granularity cascading interactive network in the embodiment.
The cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network can be applied to any equipment with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, the device, as a logical device, is formed by the processor of the equipment reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, fig. 4 shows the hardware structure of the equipment in which the cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network is located; besides the processor, memory, network interface and non-volatile memory shown in fig. 4, the equipment may also include other hardware according to its actual function, which is not described here again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the cross-modal time series behavior positioning method of the multi-granularity cascading interactive network in the above embodiments.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A cross-modal time sequence behavior positioning method of a multi-granularity cascade interactive network is characterized by comprising the following steps:
step S1: giving an untrimmed video sample, performing initial extraction of a video representation by using a visual pre-training model, and performing context-aware temporal dependency coding on the initially extracted video representation in a local-global manner to obtain a final video representation;
step S2: for the text query corresponding to the untrimmed video, performing word embedding initialization on each word in the query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long short-term memory network to obtain a word-level representation and a global-level representation of the text query;
step S3: for the extracted video representation and text query representation, performing interaction between the video modality and the text query modality by adopting a multi-granularity cascade interaction network to obtain a query-guided enhanced video representation;
step S4: for the enhanced video representation obtained after multi-granularity cascade interaction, predicting the temporal position of the target video segment corresponding to the text query by adopting an attention-based temporal position regression module;
step S5: for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in steps S1-S4, training the model by utilizing a training sample set, wherein the total loss function adopted in the training comprises an attention alignment loss and a boundary loss, and the boundary loss comprises a smooth L1 loss and a temporal generalized IoU loss.
2. The method according to claim 1, wherein in step S1, video frame features are extracted in an off-line manner based on the visual pre-training model and T frames are uniformly sampled; a set of video representations is then obtained through a linear transformation layer, in which each element is the representation of the i-th frame, and context-aware temporal dependency coding is performed on the video representation in a local-global manner.
3. The method as claimed in claim 2, wherein in the local-global context-aware coding of step S1, local context-aware coding is first applied to the video representation to obtain a locally encoded video representation, and global context-aware coding is then applied to obtain the globally encoded video representation.
4. The cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network as claimed in claim 3, wherein the local context-aware coding and the global context-aware coding in step S1 are implemented as follows:
step S1.1, local context-aware coding adopts a stack of consecutive local transformer blocks equipped with one-dimensional shifted windows: the video representation is taken as the initial input of the first block, the output of each block is fed into the next, and the output of the last block is taken as the video representation output by local context-aware coding; the operations inside a consecutive local transformer block with a one-dimensional shifted window are as follows:
the input video representation is layer-normalized, passed through a one-dimensional window multi-head self-attention module and added to the block input; the result is layer-normalized, passed through a multi-layer perceptron and added back; the result is layer-normalized, passed through a one-dimensional shifted-window multi-head self-attention module and added back; finally, the result is layer-normalized, passed through a multi-layer perceptron and added back, and the outcome is output as the video representation of this consecutive local transformer block with a one-dimensional shifted window;
step S1.2, global context-aware coding comprises a stack of conventional transformer blocks: the locally encoded video representation is taken as the initial input of the first conventional transformer block, the output of each block is fed into the next, and the output of the last conventional transformer block is taken as the video representation output by global context-aware coding; the operations inside a conventional transformer block are as follows:
the input video representation is passed through a conventional multi-head self-attention module, added to the block input and layer-normalized; the result is passed through a multi-layer perceptron, added to its input and layer-normalized, and the outcome is the output of the conventional transformer block.
5. The method according to claim 1, wherein in step S2, the learnable word embedding vector of each word in the query text is initialized with the pre-trained word embedding model to obtain the embedding vector sequence of the text query, in which each element corresponds to one word of the query; the embedding vector sequence of the text query is context-encoded by the multi-layer bidirectional long short-term memory network to obtain the word-level text query representation, the final forward hidden state vector and the final backward hidden state vector are concatenated to obtain the global-level text query representation, and the word-level and global-level representations together form the text query representation.
6. The method as claimed in claim 1, wherein in step S3, the video representation and the text query representation are first passed through video-guided query decoding to obtain a video-guided query representation comprising a global-level video-guided query representation and a word-level video-guided query representation, and the video-guided query representation and the video modality representation then undergo cascaded cross-modal fusion to obtain the final enhanced video representation.
7. The method according to claim 6, wherein the video-guided query decoding and the cascaded cross-modal fusion in step S3 are respectively implemented as follows:
step S3.1, the video-guided query decoding adopts a group of cross-modal decoding blocks: the text query representation is taken as the initial representation and input into the first cross-modal decoding block, the result obtained is input into the second cross-modal decoding block, and so on; the output of the last cross-modal decoding block is taken as the video-guided query representation; the internal operation of a cross-modal decoding block in step S3.1 is as follows:
the acquired text query representation is passed through a multi-head self-attention module to obtain an updated text query representation; taking the updated text query representation as the query and the video representation as the keys and values, a multi-head cross-attention module produces a further text query representation; this representation is passed through a conventional feed-forward network, and the result is taken as the output of the corresponding cross-modal decoding block;
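A minimal sketch of one cross-modal decoding block follows; the residual connections and layer normalization shown here are assumptions (the claim only specifies the self-attention, cross-attention, and feed-forward stages), and the dimensions are illustrative.

```python
import torch.nn as nn

class CrossModalDecoderBlock(nn.Module):
    """Self-attention on the query-side sequence, cross-attention against the
    other modality (keys/values), then a feed-forward network. The same block
    structure is reused in step S3.1 (text attends to video) and in step S3.2
    (video attends to the word-level video-guided query representation)."""
    def __init__(self, dim=256, heads=4, mlp_ratio=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, mem):        # x: (B, Lx, C) query side; mem: (B, Lm, C)
        x = self.n1(x + self.self_attn(x, x, x)[0])
        x = self.n2(x + self.cross_attn(x, mem, mem)[0])   # x as query, mem as key/value
        x = self.n3(x + self.ffn(x))
        return x
```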
step S3.2, the cascaded cross-modal fusion first performs coarse-grained cross-modal fusion by element-wise multiplication of the global-level video-guided query representation and the video modality representation, obtaining a coarse-grained fused video representation; the word-level video-guided query representation and the coarse-grained fused video representation are then fused at the fine-grained level through another group of cross-modal decoding blocks: the coarse-grained fused video representation is taken as the initial representation and input into the first cross-modal decoding block, the result obtained is input into the second cross-modal decoding block, and so on; the output of the last cross-modal decoding block is taken as the enhanced video representation; the internal operation of a cross-modal decoding block in step S3.2 is as follows:
the acquired video representation is passed through a multi-head self-attention module to obtain an updated video representation; taking the updated video representation as the query and the word-level video-guided query representation as the keys and values, a multi-head cross-attention module produces a further video representation; this representation is passed through a conventional feed-forward network, and the result is taken as the output of the corresponding cross-modal decoding block.
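An illustrative sketch of the cascaded fusion, reusing the CrossModalDecoderBlock sketched above; the number of fine-grained decoding blocks is an assumed hyperparameter.

```python
import torch.nn as nn

class CascadedFusion(nn.Module):
    """Coarse-grained fusion by element-wise multiplication with the global-level
    video-guided query representation, followed by fine-grained fusion through a
    stack of cross-modal decoder blocks attending to the word-level
    video-guided query representation."""
    def __init__(self, dim=256, num_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(CrossModalDecoderBlock(dim)
                                    for _ in range(num_blocks))

    def forward(self, video, q_global, q_words):
        # video: (B, T, C); q_global: (B, C); q_words: (B, L, C)
        fused = video * q_global.unsqueeze(1)      # coarse-grained, element-wise
        for blk in self.blocks:                    # fine-grained, video attends to words
            fused = blk(fused, q_words)
        return fused
```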
8. The method according to claim 1, wherein in step S4 the attention-based temporal position regression module takes the enhanced video representation output by the multi-granularity cascade interaction, obtains a temporal attention score for each frame through a multilayer perceptron and a SoftMax activation layer, pools the enhanced video representation with the temporal attention scores through an attention pooling layer to obtain the representation of the target segment, and finally regresses the normalized temporal center coordinate and the segment duration of the target segment directly from the target segment representation through a multilayer perceptron.
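A sketch of such a regression head is shown below; the two-layer perceptrons and the Sigmoid used to keep the regressed center and duration in [0, 1] are assumptions, not details fixed by the claim.

```python
import torch
import torch.nn as nn

class AttentiveRegressionHead(nn.Module):
    """Predicts a temporal attention score per frame, pools the enhanced video
    representation with those scores, and regresses the normalized segment
    center and duration from the pooled vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))
        self.regress = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 2), nn.Sigmoid())

    def forward(self, video):                      # video: (B, T, C)
        a = torch.softmax(self.score(video).squeeze(-1), dim=1)   # (B, T) attention
        pooled = torch.bmm(a.unsqueeze(1), video).squeeze(1)      # (B, C) attention pooling
        center, duration = self.regress(pooled).unbind(-1)        # normalized (c, d)
        return a, center, duration
```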
9. The method according to claim 1, wherein the training of the model in step S5 comprises the following steps:
step S5.1, calculating the attention alignment loss: for each sampled frame i, the logarithm of the corresponding temporal attention score is multiplied by an indicator value, which equals 1 when the i-th frame of the video lies inside the temporally annotated segment and 0 otherwise; the products are accumulated over the sampled frames, the indicator values are likewise accumulated over the sampled frames, and the loss is computed from the ratio of the two accumulated results; the calculation of the attention alignment loss can be expressed as:
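(The formula itself is not reproduced in this text; the following is a plausible reconstruction from the description above, using assumed notation: $a_i$ for the temporal attention score of the $i$-th sampled frame, $m_i \in \{0,1\}$ for the indicator, and $T$ for the number of sampled frames.)

$$
\mathcal{L}_{att} \;=\; -\,\frac{\sum_{i=1}^{T} m_i \,\log a_i}{\sum_{i=1}^{T} m_i}
$$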
step S5.2, calculating the boundary loss, which combines a smooth L1 loss and a temporal generalized intersection-over-union loss: a first smooth L1 loss is computed between the normalized temporal center coordinate of the predicted segment and that of the temporally annotated segment, a second smooth L1 loss is computed between the segment duration of the predicted segment and that of the annotated segment, and the two are summed to form the smooth L1 loss term; the temporal generalized intersection-over-union loss is obtained by adding 1 to the negative generalized intersection-over-union between the regressed segment and the corresponding annotated segment; the sum of the smooth L1 loss term and the temporal generalized intersection-over-union loss is taken as the boundary loss; the calculation of the boundary loss can be expressed as follows:
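(The formula itself is not reproduced in this text; the following is a plausible reconstruction from the description above, using assumed notation: predicted center $\hat{c}$ and duration $\hat{d}$, annotated center $c$ and duration $d$, regressed segment $\hat{s}$, annotated segment $s$, $\mathrm{SL}_1$ the smooth L1 loss, and $E$ the minimal time span covering both segments.)

$$
\mathcal{L}_{b} \;=\; \mathrm{SL}_1(\hat{c}-c) \;+\; \mathrm{SL}_1(\hat{d}-d) \;+\; 1 \;-\; \mathrm{IoU}(\hat{s},s) \;+\; \frac{|E|-|\hat{s}\cup s|}{|E|}
$$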
wherein the quantities appearing in the expression are the smooth L1 loss function, the intersection-over-union of the two segments, and the minimum time span covering both the regressed segment and the corresponding annotated segment;
10. A cross-modal time sequence behavior positioning apparatus of a multi-granularity cascade interactive network, comprising one or more processors configured to implement the cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210052687.8A CN114064967B (en) | 2022-01-18 | 2022-01-18 | Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114064967A true CN114064967A (en) | 2022-02-18 |
CN114064967B CN114064967B (en) | 2022-05-06 |
Family
ID=80231249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210052687.8A Active CN114064967B (en) | 2022-01-18 | 2022-01-18 | Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114064967B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107346328A (en) * | 2017-05-25 | 2017-11-14 | 北京大学 | A kind of cross-module state association learning method based on more granularity hierarchical networks |
CN109858032A (en) * | 2019-02-14 | 2019-06-07 | 程淑玉 | Merge more granularity sentences interaction natural language inference model of Attention mechanism |
CN111309971A (en) * | 2020-01-19 | 2020-06-19 | 浙江工商大学 | Multi-level coding-based text-to-video cross-modal retrieval method |
CN111310676A (en) * | 2020-02-21 | 2020-06-19 | 重庆邮电大学 | Video motion recognition method based on CNN-LSTM and attention |
CN111782871A (en) * | 2020-06-18 | 2020-10-16 | 湖南大学 | Cross-modal video time positioning method based on space-time reinforcement learning |
CN111930999A (en) * | 2020-07-21 | 2020-11-13 | 山东省人工智能研究院 | Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation |
CN112115849A (en) * | 2020-09-16 | 2020-12-22 | 中国石油大学(华东) | Video scene identification method based on multi-granularity video information and attention mechanism |
EP3933686A2 (en) * | 2020-11-27 | 2022-01-05 | Beijing Baidu Netcom Science Technology Co., Ltd. | Video processing method, apparatus, electronic device, storage medium, and program product |
CN113111837A (en) * | 2021-04-25 | 2021-07-13 | 山东省人工智能研究院 | Intelligent monitoring video early warning method based on multimedia semantic analysis |
CN113204675A (en) * | 2021-07-07 | 2021-08-03 | 成都考拉悠然科技有限公司 | Cross-modal video time retrieval method based on cross-modal object inference network |
CN113934887A (en) * | 2021-12-20 | 2022-01-14 | 成都考拉悠然科技有限公司 | No-proposal time sequence language positioning method based on semantic decoupling |
Non-Patent Citations (5)
Title |
---|
JONGHWAN MUN: "Local-Global Video-Text Interactions for Temporal Grounding", 《ARXIV》 * |
SHIZHE CHEN: "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning", 《ARXIV》 * |
ZHENZHI WANG: "Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding", 《ARXIV》 *
DAI SIDA (戴思达): "Research on Deep Multimodal Fusion Technology and Time-Series Analysis Algorithms", China Master's Theses Full-text Database *
ZHAO CAIRONG (赵才荣), QI DING (齐鼎), et al.: "Key Technologies of Intelligent Video Surveillance: A Survey of Person Re-identification", SCIENTIA SINICA Informationis *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114581821A (en) * | 2022-02-23 | 2022-06-03 | 腾讯科技(深圳)有限公司 | Video detection method, system, storage medium and server |
CN114581821B (en) * | 2022-02-23 | 2024-11-08 | 腾讯科技(深圳)有限公司 | Video detection method, system, storage medium and server |
CN114357124A (en) * | 2022-03-18 | 2022-04-15 | 成都考拉悠然科技有限公司 | Video paragraph positioning method based on language reconstruction and graph mechanism |
CN114792424A (en) * | 2022-05-30 | 2022-07-26 | 北京百度网讯科技有限公司 | Document image processing method and device and electronic equipment |
CN114925232A (en) * | 2022-05-31 | 2022-08-19 | 杭州电子科技大学 | Cross-modal time domain video positioning method under text segment question-answering framework |
CN115131655A (en) * | 2022-09-01 | 2022-09-30 | 浙江啄云智能科技有限公司 | Training method and device of target detection model and target detection method |
CN115187783A (en) * | 2022-09-09 | 2022-10-14 | 之江实验室 | Multi-task hybrid supervision medical image segmentation method and system based on federal learning |
CN115223086A (en) * | 2022-09-20 | 2022-10-21 | 之江实验室 | Cross-modal action positioning method and system based on interactive attention guidance and correction |
CN115223086B (en) * | 2022-09-20 | 2022-12-06 | 之江实验室 | Cross-modal action positioning method and system based on interactive attention guidance and correction |
CN115238130B (en) * | 2022-09-21 | 2022-12-06 | 之江实验室 | Time sequence language positioning method and device based on modal customization collaborative attention interaction |
CN115238130A (en) * | 2022-09-21 | 2022-10-25 | 之江实验室 | Time sequence language positioning method and device based on modal customization cooperative attention interaction |
CN116385070A (en) * | 2023-01-18 | 2023-07-04 | 中国科学技术大学 | Multi-target prediction method, system, equipment and storage medium for short video advertisement of E-commerce |
CN116385070B (en) * | 2023-01-18 | 2023-10-03 | 中国科学技术大学 | Multi-target prediction method, system, equipment and storage medium for short video advertisement of E-commerce |
CN116246213A (en) * | 2023-05-08 | 2023-06-09 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and medium |
CN116824461B (en) * | 2023-08-30 | 2023-12-08 | 山东建筑大学 | Question understanding guiding video question answering method and system |
CN117076712B (en) * | 2023-10-16 | 2024-02-23 | 中国科学技术大学 | Video retrieval method, system, device and storage medium |
CN117076712A (en) * | 2023-10-16 | 2023-11-17 | 中国科学技术大学 | Video retrieval method, system, device and storage medium |
CN117724153A (en) * | 2023-12-25 | 2024-03-19 | 北京孚梅森石油科技有限公司 | Lithology recognition method based on multi-window cascading interaction |
CN117724153B (en) * | 2023-12-25 | 2024-05-14 | 北京孚梅森石油科技有限公司 | Lithology recognition method based on multi-window cascading interaction |
CN117876929A (en) * | 2024-01-12 | 2024-04-12 | 天津大学 | Sequential target positioning method for progressive multi-scale context learning |
CN117876929B (en) * | 2024-01-12 | 2024-06-21 | 天津大学 | Sequential target positioning method for progressive multi-scale context learning |
CN117609553A (en) * | 2024-01-23 | 2024-02-27 | 江南大学 | Video retrieval method and system based on local feature enhancement and modal interaction |
CN117609553B (en) * | 2024-01-23 | 2024-03-22 | 江南大学 | Video retrieval method and system based on local feature enhancement and modal interaction |
Also Published As
Publication number | Publication date |
---|---|
CN114064967B (en) | 2022-05-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114064967B (en) | Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network | |
WO2021121198A1 (en) | Semantic similarity-based entity relation extraction method and apparatus, device and medium | |
CN107832476B (en) | Method, device, equipment and storage medium for understanding search sequence | |
CN113987169A (en) | Text abstract generation method, device and equipment based on semantic block and storage medium | |
JP2023022845A (en) | Method of processing video, method of querying video, method of training model, device, electronic apparatus, storage medium and computer program | |
CN112541125B (en) | Sequence annotation model training method and device and electronic equipment | |
CN115983271B (en) | Named entity recognition method and named entity recognition model training method | |
CN113128431B (en) | Video clip retrieval method, device, medium and electronic equipment | |
CN111353311A (en) | Named entity identification method and device, computer equipment and storage medium | |
CN118132803B (en) | Zero sample video moment retrieval method, system, equipment and medium | |
CN113420212A (en) | Deep feature learning-based recommendation method, device, equipment and storage medium | |
CN114420107A (en) | Speech recognition method based on non-autoregressive model and related equipment | |
CN112446209A (en) | Method, equipment and device for setting intention label and storage medium | |
JP2023017759A (en) | Training method and training apparatus for image recognition model based on semantic enhancement | |
CN112199954A (en) | Disease entity matching method and device based on voice semantics and computer equipment | |
US20230326178A1 (en) | Concept disambiguation using multimodal embeddings | |
CN114241411B (en) | Counting model processing method and device based on target detection and computer equipment | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN114882874A (en) | End-to-end model training method and device, computer equipment and storage medium | |
CN112528040B (en) | Detection method for guiding drive corpus based on knowledge graph and related equipment thereof | |
CN117874234A (en) | Text classification method and device based on semantics, computer equipment and storage medium | |
CN113297525A (en) | Webpage classification method and device, electronic equipment and storage medium | |
CN116022154A (en) | Driving behavior prediction method, device, computer equipment and storage medium | |
CN115062136A (en) | Event disambiguation method based on graph neural network and related equipment thereof | |
CN114091451A (en) | Text classification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||