CN114064967A - Cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network


Info

Publication number
CN114064967A
Authority
CN
China
Prior art keywords
video
representation
cross-modal
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210052687.8A
Other languages
Chinese (zh)
Other versions
CN114064967B (en)
Inventor
王聪
鲍虎军
宋明黎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202210052687.8A
Publication of CN114064967A
Application granted
Publication of CN114064967B
Active (current legal status)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 - Querying
    • G06F16/735 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7844 - Retrieval characterised by using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 - Retrieval characterised by using metadata, using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 - Methods or arrangements using adaptive coding
    • H04N19/134 - Adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146 - Data rate or code amount at the encoder output
    • H04N19/149 - Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20 - Methods or arrangements using video object coding
    • H04N19/21 - Video object coding with binary alpha-plane coding for video objects, e.g. context-based arithmetic encoding [CAE]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network, which are used to solve the problem of localizing temporal behavior in untrimmed video given a text query. The invention implements a new multi-granularity cascaded cross-modal interaction network that performs cascaded cross-modal interaction in a coarse-to-fine manner to improve the cross-modal alignment capability of the model. In addition, the invention introduces a local-global context-aware video encoder to improve the encoder's ability to model contextual temporal dependencies. The method is simple to implement and flexible, offers advantages in improving vision-language cross-modal alignment accuracy, and the trained model significantly improves temporal localization accuracy on paired video-query test data.

Description

Cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network

Technical Field

The present invention relates to the field of vision-language cross-modal learning, and in particular to a cross-modal temporal behavior localization method and device.

Background

With the rapid development of multimedia and network technologies, and the increasing deployment of large-scale video surveillance in places such as transportation hubs, campuses, and shopping malls, the volume of video data is growing rapidly, and video understanding has become an important and pressing problem. Temporal behavior localization is a fundamental component of video understanding. Research on temporal behavior localization based on the visual modality alone restricts the behaviors to be localized to a predefined behavior set; however, real-world behaviors are complex and diverse, and a predefined behavior set can hardly meet real-world needs. As shown in Figure 1, the vision-language cross-modal temporal behavior localization task takes a textual description of a behavior in a video as a query and temporally localizes the corresponding behavior segment in the video. Vision-language cross-modal temporal behavior localization is a very natural form of human-computer interaction, and the technology has broad application prospects in short-video content retrieval and production, intelligent video surveillance, and human-computer interaction.

Driven by deep learning, the vision-language cross-modal temporal behavior localization task has attracted extensive attention from both industry and academia. Because of the significant semantic gap between the heterogeneous text and visual modalities, achieving semantic alignment between modalities is a core issue in localizing temporal behavior from a text query. Existing vision-language cross-modal temporal behavior localization methods fall into three main categories: methods based on candidate segment proposals, proposal-free methods, and methods based on sequential decision making. Vision-language cross-modal alignment is an indispensable component in all three categories. However, existing methods neither fully exploit multi-granularity textual query information in the cross-modal interaction stage nor fully model the local contextual temporal dependencies of the video in the video representation encoding stage.

Summary of the Invention

To overcome the deficiencies of the prior art and improve vision-language cross-modal alignment accuracy in the cross-modal temporal behavior localization task, the present invention adopts the following technical solution:

A cross-modal temporal behavior localization method based on a multi-granularity cascaded interaction network includes the following steps:

Step S1: Given an untrimmed video sample, a visual pre-training model is used to perform a preliminary extraction of the video representation, and context-aware temporal dependency encoding is applied to the preliminarily extracted representation in a local-global manner to obtain the final video representation, thereby improving the ability of the video representation to model contextual temporal dependencies.

Step S2: For the text query corresponding to the untrimmed video, a pre-trained word embedding model is used to initialize the embedding of each word in the query text, and a multi-layer bidirectional long short-term memory network is then used for context encoding to obtain word-level and global-level representations of the text query.

Step S3: For the extracted video representation and text query representation, a multi-granularity cascaded interaction network is used to perform interaction between the video modality and the text query modality, yielding a query-guided enhanced video representation and thereby improving cross-modal alignment accuracy.

Step S4: For the video representation obtained after the multi-granularity cascaded interaction, an attention-based temporal position regression module is used to predict the temporal position of the target video segment corresponding to the text query.

Step S5: The cross-modal temporal behavior localization model based on the multi-granularity cascaded interaction network, composed of Steps S1 to S4, is trained on a training sample set. The total loss function used for training consists of an attention alignment loss and a boundary loss, where the boundary loss combines a smooth L1 loss and a temporal generalized IoU loss so as to better match the evaluation criteria of the temporal localization task. The training sample set consists of {video, query, temporal annotation of the target video segment} triplet samples.

Further, in Step S1, video frame features are extracted offline with the visual pre-training model and T frames are sampled uniformly; a linear transformation layer is then applied to obtain a set of video representations V = {v_1, ..., v_T}, where v_i is the representation of the i-th frame. Context-aware temporal dependency encoding is then applied to V in a local-global manner.

Further, in the local-global context-aware encoding of Step S1, local context-aware encoding is first applied to the video representation V to obtain the video representation V^loc; global context-aware encoding is then applied to V^loc to obtain the video representation V^glo.

Further, the local context-aware encoding and the global context-aware encoding in Step S1 are implemented as follows:

Step S1.1: The local context-aware encoding uses a stack of successive local transformer blocks equipped with one-dimensional shifted windows. The video representation V serves as the initial input to the first block, the output of the first block is fed into the second block, and so on; the output of the last block is taken as the video representation V^loc produced by the local context-aware encoding. The internal operations of a pair of successive local transformer blocks with one-dimensional shifted windows are as follows:

Given the video representation X^{l-1} from the previous block, layer normalization is applied and the result is passed through a one-dimensional window-based multi-head self-attention module; the output is added to X^{l-1} to obtain the video representation X'^{l}. Layer normalization is applied to X'^{l} and the result is passed through a multi-layer perceptron; the output is added to X'^{l} to obtain the video representation X^{l}. Layer normalization is applied to X^{l} and the result is passed through a one-dimensional shifted-window multi-head self-attention module; the output is added to X^{l} to obtain the video representation X'^{l+1}. Finally, layer normalization is applied to X'^{l+1} and the result is passed through a multi-layer perceptron; the output is added to X'^{l+1} to produce the video representation X^{l+1}, which is the output of this pair of successive local transformer blocks with one-dimensional shifted windows, where l indexes the block.

Specifically, the l-th pair of successive local transformer blocks with one-dimensional shifted windows is expressed as:

X'^{l} = 1D-W-MSA(LN(X^{l-1})) + X^{l-1}
X^{l} = MLP(LN(X'^{l})) + X'^{l}
X'^{l+1} = 1D-SW-MSA(LN(X^{l})) + X^{l}
X^{l+1} = MLP(LN(X'^{l+1})) + X'^{l+1}

where LN denotes layer normalization, 1D-W-MSA denotes the one-dimensional window-based multi-head self-attention module, MLP denotes the multi-layer perceptron, and 1D-SW-MSA denotes the one-dimensional shifted-window multi-head self-attention module.
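The following is a minimal PyTorch sketch of one pair of successive local transformer blocks with one-dimensional (shifted) window multi-head self-attention, following the equations above. Module and parameter names (WindowMSA1D, LocalSwinBlock1D, window_size, etc.) are illustrative assumptions rather than the patent's actual identifiers, and the attention mask that a full shifted-window implementation would use to separate rolled-over windows is omitted for brevity.

```python
import torch
import torch.nn as nn

class WindowMSA1D(nn.Module):
    """Multi-head self-attention applied independently within 1-D temporal windows."""
    def __init__(self, dim, num_heads, window_size, shift=False):
        super().__init__()
        self.window_size = window_size
        self.shift = window_size // 2 if shift else 0
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, C), T divisible by window_size
        B, T, C = x.shape
        if self.shift:                         # shifted window: roll the sequence
            x = torch.roll(x, -self.shift, dims=1)
        w = x.reshape(B * (T // self.window_size), self.window_size, C)
        w, _ = self.attn(w, w, w)              # self-attention inside each window
        x = w.reshape(B, T, C)
        if self.shift:                         # undo the roll
            x = torch.roll(x, self.shift, dims=1)
        return x

class LocalSwinBlock1D(nn.Module):
    """Two successive blocks: regular-window MSA followed by shifted-window MSA."""
    def __init__(self, dim, num_heads=8, window_size=8, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.w_msa = WindowMSA1D(dim, num_heads, window_size, shift=False)
        self.sw_msa = WindowMSA1D(dim, num_heads, window_size, shift=True)

        def make_mlp():
            return nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.mlp1, self.mlp2 = make_mlp(), make_mlp()

    def forward(self, x):                      # x: (B, T, C)
        x = self.w_msa(self.norm1(x)) + x      # X'^{l}
        x = self.mlp1(self.norm2(x)) + x       # X^{l}
        x = self.sw_msa(self.norm3(x)) + x     # X'^{l+1}
        x = self.mlp2(self.norm4(x)) + x       # X^{l+1}
        return x
```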

Step S1.2: The global context-aware encoding consists of a stack of standard transformer blocks. The video representation V^loc serves as the initial input to the first block, the output of the first block is fed into the second block, and so on; the output of the last block is taken as the video representation V^glo produced by the global context-aware encoding. The internal operations of a standard transformer block are as follows:

Given the video representation Z^{n-1} from the previous block, it is passed through a standard multi-head self-attention module, the result is added to Z^{n-1}, and layer normalization is applied to obtain the video representation Z'^{n}. Z'^{n} is passed through a multi-layer perceptron, the result is added to Z'^{n}, and layer normalization is applied to obtain the video representation Z^{n}, which is the output of the block, where n indexes the standard transformer block.

Specifically, the n-th standard transformer block is expressed as:

Z'^{n} = LN(MSA(Z^{n-1}) + Z^{n-1})
Z^{n} = LN(MLP(Z'^{n}) + Z'^{n})

where MSA denotes the standard multi-head self-attention module, LN denotes layer normalization, and MLP denotes the multi-layer perceptron.
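Below is a minimal sketch of the global context-aware encoder's standard transformer block in the post-norm residual form described above; names and layer sizes are illustrative assumptions.

```python
import torch.nn as nn

class GlobalTransformerBlock(nn.Module):
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):                         # z: (B, T, C)
        z = self.norm1(self.msa(z, z, z)[0] + z)  # Z'^{n}
        z = self.norm2(self.mlp(z) + z)           # Z^{n}
        return z
```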

Further, in Step S2, the learnable word embedding vector of each word in the query text is initialized with a pre-trained word embedding model, giving the embedding sequence E = {e_1, ..., e_N} of the text query, where e_j is the embedding of the j-th word of the query. A multi-layer bidirectional long short-term memory network (BLSTM) performs context encoding on E to obtain the word-level text query representation Q^w = {h_1, ..., h_N}. The global-level text query representation q^g is obtained by concatenating the forward hidden state vector of the last word and the backward hidden state vector of the first word, finally yielding the text query representation Q = {Q^w, q^g}.

The specific implementation is as follows:

Q^w = {h_1, ..., h_N} = BLSTM(E)
q^g = [h_N^fw ; h_1^bw]

where h_N^fw is the forward hidden state vector of the last word, h_1^bw is the backward hidden state vector of the first word, and [· ; ·] denotes their concatenation.
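A minimal sketch of the query encoder follows: pre-trained word embeddings fed into a multi-layer BLSTM, producing the word-level representation Q^w and the global-level representation q^g. The class name, hidden size, and the use of GloVe-style pre-trained vectors are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    def __init__(self, pretrained_embeddings, hidden_dim=256, num_layers=2):
        super().__init__()
        # pretrained_embeddings: (vocab_size, embed_dim) tensor, e.g. GloVe-style vectors
        self.embed = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
        self.blstm = nn.LSTM(pretrained_embeddings.size(1), hidden_dim,
                             num_layers=num_layers, bidirectional=True,
                             batch_first=True)

    def forward(self, word_ids):                # word_ids: (B, N)
        e = self.embed(word_ids)                # embedding sequence E: (B, N, embed_dim)
        h, _ = self.blstm(e)                    # word-level Q^w: (B, N, 2 * hidden_dim)
        fwd_last = h[:, -1, :h.size(-1) // 2]   # forward hidden state of the last word
        bwd_first = h[:, 0, h.size(-1) // 2:]   # backward hidden state of the first word
        q_global = torch.cat([fwd_last, bwd_first], dim=-1)  # global-level q^g
        return h, q_global
```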

Further, in the multi-granularity cascaded interaction network of Step S3, the video representation V^glo and the text query representation Q = {Q^w, q^g} are first passed through video-guided query decoding to obtain the video-guided query representation Q_v = {q_v^g, Q_v^w}, where q_v^g denotes the global-level video-guided query representation and Q_v^w denotes the word-level video-guided query representation. The video-guided query representation Q_v and the video modality representation V^glo are then combined through cascaded cross-modal fusion to obtain the final enhanced video representation. The video-guided query decoding serves to narrow the semantic gap between the video representation V^glo and the text query representation Q.

Further, Step S3 includes the following steps:

Step S3.1: Video-guided query decoding uses a stack of cross-modal decoding blocks. The text query representation Q is fed into the first cross-modal decoding block as the initial representation, the output of the first block is fed into the second block, and so on; the output of the last block is taken as the video-guided query representation Q_v. The internal operations of a cross-modal decoding block in Step S3.1 are as follows:

Given the text query representation D^{m-1} from the previous block, it is passed through a multi-head self-attention module to obtain the text query representation D'^{m}. With D'^{m} as the query and the video representation V^glo as the keys and values, a multi-head cross-attention module produces the text query representation D''^{m}. D''^{m} is then passed through a standard feed-forward network, and the resulting text query representation D^{m} is the output of the block, where m indexes the cross-modal decoding block.

Specifically, the m-th cross-modal decoding block is expressed as:

D'^{m} = MSA(D^{m-1})
D''^{m} = MCA(D'^{m}, V^glo, V^glo)
D^{m} = FFN(D''^{m})

where MSA and MCA denote the multi-head self-attention module and the multi-head cross-attention module, respectively, and FFN denotes a standard feed-forward network.
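The following is a minimal sketch of a cross-modal decoding block: self-attention over the query stream, cross-attention onto the video stream, then a feed-forward network. Whether residual connections or normalization are used inside the block is not spelled out in the text, so they are omitted here; names and sizes are illustrative assumptions.

```python
import torch.nn as nn

class CrossModalDecodingBlock(nn.Module):
    def __init__(self, dim, num_heads=8, ffn_ratio=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * ffn_ratio), nn.ReLU(),
                                 nn.Linear(dim * ffn_ratio, dim))

    def forward(self, d, memory):
        # d: (B, N, C) decoded stream (queries); memory: (B, T, C) keys and values
        d = self.self_attn(d, d, d)[0]             # D'^{m}
        d = self.cross_attn(d, memory, memory)[0]  # D''^{m}: queries attend to memory
        return self.ffn(d)                         # D^{m}
```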

Step S3.2: Cascaded cross-modal fusion. First, the global-level video-guided query representation q_v^g and the video modality representation V^glo are fused at the coarse-grained level by element-wise multiplication, giving the coarsely fused video representation V^c. Then the word-level video-guided query representation Q_v^w and the coarsely fused video representation V^c are fused at the fine-grained level by another stack of cross-modal decoding blocks: V^c is fed into the first cross-modal decoding block as the initial representation, the output of the first block is fed into the second block, and so on; the output of the last block is taken as the enhanced video representation V^e. The internal operations of a cross-modal decoding block in Step S3.2 are as follows:

Given the video representation F^{k-1} from the previous block, it is passed through a multi-head self-attention module to obtain the video representation F'^{k}. With F'^{k} as the query and the word-level video-guided query representation Q_v^w as the keys and values, a multi-head cross-attention module produces the video representation F''^{k}. F''^{k} is then passed through a standard feed-forward network, and the resulting video representation F^{k} is the output of the block, where k indexes the cross-modal decoding block. The coarse-grained cross-modal fusion serves to suppress background video frames and emphasize foreground video frames, and can be expressed as V^c = V^glo ⊙ q_v^g, where ⊙ denotes element-wise multiplication.

The k-th cross-modal decoding block is expressed as:

F'^{k} = MSA(F^{k-1})
F''^{k} = MCA(F'^{k}, Q_v^w, Q_v^w)
F^{k} = FFN(F''^{k})

where MSA and MCA denote the multi-head self-attention module and the multi-head cross-attention module, respectively, and FFN denotes a standard feed-forward network.
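A minimal sketch of the coarse-to-fine cascaded cross-modal fusion follows: coarse gating of the video features by the global-level video-guided query representation, then fine-grained fusion with the word-level representation through cross-modal decoding blocks (reusing the CrossModalDecodingBlock sketched earlier). The class name and number of blocks are illustrative assumptions.

```python
import torch.nn as nn

class CascadedCrossModalFusion(nn.Module):
    def __init__(self, dim, num_blocks=2, num_heads=8):
        super().__init__()
        self.fine_blocks = nn.ModuleList(
            [CrossModalDecodingBlock(dim, num_heads) for _ in range(num_blocks)])

    def forward(self, video, q_global, q_words):
        # video: (B, T, C) = V^glo, q_global: (B, C) = q_v^g, q_words: (B, N, C) = Q_v^w
        v = video * q_global.unsqueeze(1)       # coarse fusion V^c = V^glo ⊙ q_v^g
        for blk in self.fine_blocks:            # fine-grained fusion with Q_v^w
            v = blk(v, q_words)
        return v                                # enhanced video representation V^e
```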

Further, the attention-based temporal position regression module of Step S4 passes the video sequence representation V^e obtained from the multi-granularity cascaded interaction through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention scores a of the video. The enhanced video representation V^e and the temporal attention scores a are then combined through an attention pooling layer to obtain the representation f^s of the target segment. Finally, the representation f^s of the target segment is passed through a multi-layer perceptron, which directly regresses the normalized temporal center coordinate t^c and the segment duration t^l of the target segment.

Specifically, the attention-based temporal position regression is expressed as:

a = SoftMax(MLP(V^e))
f^s = Σ_{i=1}^{T} a_i · v^e_i
(t^c, t^l) = MLP(f^s)

where V^e is the enhanced video representation, i.e., the video sequence representation output after the multi-granularity cascaded interaction, v^e_i is its i-th frame representation, and the attention pooling layer aggregates the video sequence representation.
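Below is a minimal sketch of the attention-based temporal position regression head: a SoftMax attention over frames, attention pooling, and direct regression of the normalized segment center and duration. Layer sizes and the final Sigmoid (assumed here to keep the outputs in [0, 1]) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentiveRegressionHead(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.score_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))
        self.reg_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2), nn.Sigmoid())

    def forward(self, v_e):                    # v_e: (B, T, C) enhanced video repr. V^e
        a = torch.softmax(self.score_mlp(v_e).squeeze(-1), dim=1)  # (B, T) attention
        f_s = torch.bmm(a.unsqueeze(1), v_e).squeeze(1)            # attention pooling f^s
        center, length = self.reg_mlp(f_s).unbind(-1)              # t^c, t^l
        return a, center, length
```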

Further, the training of the model in Step S5 includes the following steps:

Step S5.1: Compute the attention alignment loss L_att. The product of the logarithm of the temporal attention score of the i-th frame and the indicator value y_i is accumulated over the sampled frames, and the loss L_att is obtained by dividing this accumulated value by the accumulation of y_i over the sampled frames; y_i = 1 indicates that the i-th frame of the video lies within the annotated temporal segment, and otherwise y_i = 0. The attention alignment loss L_att encourages video frames within the annotated temporal segment to have higher attention scores. The specific computation can be expressed as:

L_att = − ( Σ_{i=1}^{T} y_i · log(a_i) ) / ( Σ_{i=1}^{T} y_i )

where T denotes the number of sampled frames, a_i denotes the temporal attention score of the i-th frame, and y_i = 1 indicates that the i-th frame of the video lies within the annotated temporal segment, otherwise y_i = 0.
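A minimal sketch of the attention alignment loss follows, assuming the formulation above (the negative mean log attention score over frames inside the annotated segment). Names and the epsilon stabilizer are illustrative assumptions.

```python
import torch

def attention_alignment_loss(attn, inside_mask, eps=1e-8):
    # attn: (B, T) SoftMax temporal attention scores a
    # inside_mask: (B, T) with 1 for frames inside the annotated segment (y_i), else 0
    log_a = torch.log(attn + eps)
    per_sample = -(inside_mask * log_a).sum(dim=1) / (inside_mask.sum(dim=1) + eps)
    return per_sample.mean()
```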

Step S5.2: Compute the boundary loss L_b, which combines a smooth L1 loss L_reg and a temporal generalized IoU loss L_giou. A first smooth L1 loss is computed on the difference between the normalized temporal center coordinate t^c of the predicted segment and the normalized temporal center coordinate g^c of the annotated segment, and a second smooth L1 loss is computed on the difference between the duration t^l of the predicted segment and the duration g^l of the annotated segment; the sum of the first and second smooth L1 losses is taken as the loss L_reg. The generalized IoU between the regressed segment P and the corresponding annotated segment G is computed, and one plus the negative of this generalized IoU is taken as the temporal generalized IoU loss L_giou. The sum of L_reg and L_giou is taken as the boundary loss L_b. The specific computation of the boundary loss L_b can be expressed as follows:

L_reg = SmoothL1(t^c − g^c) + SmoothL1(t^l − g^l)
L_giou = 1 − ( IoU(P, G) − |C \ (P ∪ G)| / |C| )
L_b = L_reg + L_giou

where SmoothL1(·) denotes the smooth L1 loss function, IoU(P, G) denotes the intersection-over-union of the two segments, and C denotes the smallest temporal box covering the regressed segment P and the corresponding annotated segment G.
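The following is a minimal sketch of the boundary loss: smooth L1 on the (center, duration) offsets plus a one-dimensional generalized IoU term, assuming segments are given as normalized (center, duration) pairs. Function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def boundary_loss(pred_c, pred_l, gt_c, gt_l, eps=1e-8):
    # Smooth L1 on normalized center and duration
    l_reg = F.smooth_l1_loss(pred_c, gt_c) + F.smooth_l1_loss(pred_l, gt_l)

    # Convert (center, duration) to (start, end)
    p_s, p_e = pred_c - pred_l / 2, pred_c + pred_l / 2
    g_s, g_e = gt_c - gt_l / 2, gt_c + gt_l / 2

    inter = (torch.min(p_e, g_e) - torch.max(p_s, g_s)).clamp(min=0)
    union = (p_e - p_s) + (g_e - g_s) - inter
    iou = inter / (union + eps)

    # Smallest temporal box covering both segments
    cover = torch.max(p_e, g_e) - torch.min(p_s, g_s)
    giou = iou - (cover - union) / (cover + eps)
    l_giou = (1 - giou).mean()

    return l_reg + l_giou
```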

Step S5.3: The weighted sum of the attention alignment loss L_att and the boundary loss L_b is taken as the total loss for model training.

Specifically, the total loss function L is:

L = λ_1 · L_att + λ_2 · L_b

where λ_1 and λ_2 are weight hyperparameters, and an optimizer is used to update the model parameters during the training phase.
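A minimal sketch of one training step is shown below, reusing attention_alignment_loss and boundary_loss from the sketches above; `model`, the batch keys, and the lambda values are assumptions standing in for the full network and data pipeline rather than the patent's actual implementation.

```python
import torch

lambda_att, lambda_b = 1.0, 1.0                 # weight hyperparameters (assumed values)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)

def train_step(batch):
    # model is assumed to return (attention scores, predicted center, predicted duration)
    attn, pred_c, pred_l = model(batch["video_feats"], batch["query_ids"])
    loss = (lambda_att * attention_alignment_loss(attn, batch["inside_mask"])
            + lambda_b * boundary_loss(pred_c, pred_l,
                                       batch["gt_center"], batch["gt_length"]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```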

A cross-modal temporal behavior localization apparatus of the multi-granularity cascaded interaction network includes one or more processors configured to implement the above cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network.

The advantages and beneficial effects of the present invention are as follows:

The cross-modal temporal behavior localization method and device of the multi-granularity cascaded interaction network of the present invention make full use of multi-granularity text query information in a coarse-to-fine manner during vision-language cross-modal interaction, and fully model the local-global contextual temporal dependencies of the video during video representation encoding, in order to solve the problem of text-query-based temporal behavior localization in untrimmed video. For a given untrimmed video and text query, the present invention improves vision-language cross-modal alignment accuracy and thereby improves the localization accuracy of the cross-modal temporal behavior localization task.

Brief Description of the Drawings

Figure 1 is an example of the vision-language cross-modal temporal behavior localization task.

Figure 2 is a block diagram of the cross-modal temporal behavior localization process of the multi-granularity cascaded interaction network of the present invention.

Figure 3 is a flowchart of the cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network of the present invention.

Figure 4 is a structural diagram of the cross-modal temporal behavior localization apparatus of the multi-granularity cascaded interaction network of the present invention.

Detailed Description of the Embodiments

The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are only used to illustrate and explain the present invention, not to limit it.

The invention discloses a cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network, in which vision-language cross-modal temporal behavior localization is used to solve the problem of localizing temporal behavior in untrimmed video given a text query. The method proposes a simple and effective multi-granularity cascaded cross-modal interaction network to improve the cross-modal alignment ability of the model. In addition, the invention introduces a local-global context-aware video encoder to improve the encoder's ability to model contextual temporal dependencies. The trained model therefore significantly improves temporal localization accuracy on paired video-query test data.

In the experiments, the cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network is implemented on the PyTorch framework. A pre-trained C3D network is used to extract video frame features offline, each video is uniformly sampled to 256 frames, and the number of heads of all self-attention and cross-attention sub-modules in the method is set to 8. During training, the Adam optimizer is used with a fixed learning rate of 0.0004, and each batch consists of 100 video-query pairs. Performance is evaluated with the "R@n, IoU=m" criterion, which measures the percentage of queries in the evaluation dataset that are correctly localized: a query is considered correctly localized if the maximum intersection-over-union (IoU) between its n most confident predicted segments and the ground-truth annotation is greater than m.
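The following is a minimal sketch of the "R@n, IoU=m" evaluation criterion described above: a query counts as correctly localized if the best temporal IoU between its top-n predicted segments and the ground-truth segment exceeds m. Segments are assumed to be (start, end) pairs; function names are illustrative.

```python
def temporal_iou(pred, gt):
    # pred, gt: (start, end) tuples in seconds (or normalized units)
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(predictions, ground_truths, n=1, m=0.5):
    # predictions: per-query lists of segments ordered by confidence
    # ground_truths: per-query ground-truth segments
    hits = sum(
        max(temporal_iou(p, gt) for p in preds[:n]) > m
        for preds, gt in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```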

In a specific embodiment, given an untrimmed video that is uniformly sampled into a frame sequence, together with a text description S of a behavior segment in the video, the vision-language cross-modal temporal behavior localization task is to predict the start time t_s and end time t_e of the video segment in the video that corresponds to the text description S. The training dataset of this task can be defined as a set of {video, query, target segment temporal annotation} triplets, where each annotation gives the ground-truth start time and end time of the target video segment.

As shown in Figures 2 and 3, the cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network includes the following steps:

Step S1: Given an untrimmed video sample, a visual pre-training model is used to perform a preliminary extraction of the video representation, and context-aware temporal dependency encoding is applied to the preliminarily extracted representation in a local-global manner to obtain the final video representation, thereby improving the ability of the video representation to model contextual temporal dependencies.

In Step S1, video frame features are extracted offline with the visual pre-training model and T frames are sampled uniformly; a linear transformation layer is then applied to obtain a set of video representations V = {v_1, ..., v_T}, where v_i is the representation of the i-th frame. Context-aware temporal dependency encoding is then applied to V in a local-global manner.

In the local-global context-aware encoding of Step S1, local context-aware encoding is first applied to the video representation V to obtain the video representation V^loc; global context-aware encoding is then applied to V^loc to obtain the video representation V^glo.

The local context-aware encoding and the global context-aware encoding in Step S1 are implemented as follows:

Step S1.1: The local context-aware encoding uses a stack of successive local transformer blocks equipped with one-dimensional shifted windows. The video representation V serves as the initial input to the first block, the output of the first block is fed into the second block, and so on; the output of the last block is taken as the video representation V^loc produced by the local context-aware encoding. The internal operations of a pair of successive local transformer blocks with one-dimensional shifted windows are as follows:

Given the video representation X^{l-1} from the previous block, layer normalization is applied and the result is passed through a one-dimensional window-based multi-head self-attention module; the output is added to X^{l-1} to obtain the video representation X'^{l}. Layer normalization is applied to X'^{l} and the result is passed through a multi-layer perceptron; the output is added to X'^{l} to obtain the video representation X^{l}. Layer normalization is applied to X^{l} and the result is passed through a one-dimensional shifted-window multi-head self-attention module; the output is added to X^{l} to obtain the video representation X'^{l+1}. Finally, layer normalization is applied to X'^{l+1} and the result is passed through a multi-layer perceptron; the output is added to X'^{l+1} to produce the video representation X^{l+1}, which is the output of this pair of successive local transformer blocks with one-dimensional shifted windows, where l indexes the block.

Specifically, the l-th pair of successive local transformer blocks with one-dimensional shifted windows is expressed as:

X'^{l} = 1D-W-MSA(LN(X^{l-1})) + X^{l-1}
X^{l} = MLP(LN(X'^{l})) + X'^{l}
X'^{l+1} = 1D-SW-MSA(LN(X^{l})) + X^{l}
X^{l+1} = MLP(LN(X'^{l+1})) + X'^{l+1}

where LN denotes layer normalization, 1D-W-MSA denotes the one-dimensional window-based multi-head self-attention module, MLP denotes the multi-layer perceptron, and 1D-SW-MSA denotes the one-dimensional shifted-window multi-head self-attention module.

Step S1.2: The global context-aware encoding consists of a stack of standard transformer blocks. The video representation V^loc serves as the initial input to the first block, the output of the first block is fed into the second block, and so on; the output of the last block is taken as the video representation V^glo produced by the global context-aware encoding. The internal operations of a standard transformer block are as follows:

Given the video representation Z^{n-1} from the previous block, it is passed through a standard multi-head self-attention module, the result is added to Z^{n-1}, and layer normalization is applied to obtain the video representation Z'^{n}. Z'^{n} is passed through a multi-layer perceptron, the result is added to Z'^{n}, and layer normalization is applied to obtain the video representation Z^{n}, which is the output of the block, where n indexes the standard transformer block.

Specifically, the n-th standard transformer block is expressed as:

Z'^{n} = LN(MSA(Z^{n-1}) + Z^{n-1})
Z^{n} = LN(MLP(Z'^{n}) + Z'^{n})

where MSA denotes the standard multi-head self-attention module, LN denotes layer normalization, and MLP denotes the multi-layer perceptron.

Step S2: For the text query corresponding to the untrimmed video, a pre-trained word embedding model is used to initialize the embedding of each word in the query text, and a multi-layer bidirectional long short-term memory network is then used for context encoding to obtain word-level and global-level representations of the text query.

In Step S2, the learnable word embedding vector of each word in the query text is initialized with a pre-trained word embedding model, giving the embedding sequence E = {e_1, ..., e_N} of the text query, where e_j is the embedding of the j-th word of the query. A multi-layer bidirectional long short-term memory network (BLSTM) performs context encoding on E to obtain the word-level text query representation Q^w = {h_1, ..., h_N}. The global-level text query representation q^g is obtained by concatenating the forward hidden state vector of the last word and the backward hidden state vector of the first word, finally yielding the text query representation Q = {Q^w, q^g}.

The specific implementation is as follows:

$$W = \{w_i\}_{i=1}^{N} = \mathrm{BLSTM}\big(\{e_i\}_{i=1}^{N}\big)$$

$$g = \big[\overrightarrow{h}_{N} ; \overleftarrow{h}_{1}\big]$$

where $g$ is the concatenation of the forward hidden state vector $\overrightarrow{h}_{N}$ of the last word $w_N$ and the backward hidden state vector $\overleftarrow{h}_{1}$ of the first word $w_1$.
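A minimal sketch of this text encoding step is shown below; the embedding dimension, hidden size, and the use of randomly initialized embeddings in place of a real pre-trained table are assumptions made for the example.

```python
import torch
import torch.nn as nn

embed_dim, hidden = 300, 128
n_words = 9                                                # words in the query

# a pretrained word embedding model would normally initialise this table
embeddings = torch.randn(1, n_words, embed_dim)            # (batch, N, 300)

blstm = nn.LSTM(embed_dim, hidden, num_layers=2, bidirectional=True, batch_first=True)
word_level, _ = blstm(embeddings)                          # (1, N, 2*hidden): word-level representation W

# global-level representation g: forward hidden state of the last word
# concatenated with the backward hidden state of the first word
forward_last = word_level[:, -1, :hidden]
backward_first = word_level[:, 0, hidden:]
global_level = torch.cat([forward_last, backward_first], dim=-1)   # (1, 2*hidden)

print(word_level.shape, global_level.shape)
```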

Step S3: for the extracted video representation and text query representation, a multi-granularity cascade interaction network performs the interaction between the video modality and the text query modality to obtain a query-guided enhanced video representation, thereby improving cross-modal alignment accuracy.

The multi-granularity cascade interaction network in step S3 first passes the video representation $\bar{F}$ and the text query representation $Q = \{W, g\}$ through video-guided query decoding to obtain the video-guided query representation $\tilde{Q} = \{\tilde{g}, \tilde{W}\}$, where $\tilde{g}$ denotes the global-level video-guided query representation and $\tilde{W}$ the word-level video-guided query representation. The video-guided query representation $\tilde{Q}$ and the video modality representation $\bar{F}$ are then combined through cascaded cross-modal fusion to obtain the final enhanced video representation. Video-guided query decoding serves to narrow the semantic gap between the video representation $\bar{F}$ and the text query representation $Q$.

Step S3 specifically includes the following steps:

Step S3.1: video-guided query decoding employs a set of cross-modal decoding blocks. The text query representation $Q$ is fed into the first cross-modal decoding block as the initial representation, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the video-guided query representation $\tilde{Q}$. The internal operation of the cross-modal decoding block in step S3.1 is as follows:

The input text query representation $Q^{n-1}$ is passed through a multi-head self-attention module to obtain the text query representation $\hat{Q}^{n}$. Taking $\hat{Q}^{n}$ as the query and the video representation $\bar{F}$ as the keys and values, a multi-head cross-attention module yields the text query representation $\check{Q}^{n}$. Finally, $\check{Q}^{n}$ is passed through a conventional feed-forward network, and the resulting text query representation $Q^{n}$ is the output of the block, where $n$ indexes the $n$-th cross-modal decoding block.

Specifically, the $n$-th cross-modal decoding block is expressed as:

$$\hat{Q}^{n} = \mathrm{MSA}(Q^{n-1})$$

$$\check{Q}^{n} = \mathrm{MCA}(\hat{Q}^{n}, \bar{F})$$

$$Q^{n} = \mathrm{FFN}(\check{Q}^{n})$$

where $Q^{0}$ is initialized with the text query representation $Q$, $\mathrm{MSA}(\cdot)$ and $\mathrm{MCA}(\cdot)$ are the multi-head self-attention module and the multi-head cross-attention module, respectively, and $\mathrm{FFN}(\cdot)$ is a conventional feed-forward network.
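A minimal sketch of one such cross-modal decoding block is given below; it mirrors the three operations described above (self-attention on the query, cross-attention with the video as keys and values, then a feed-forward network), while the layer sizes and the number of stacked blocks are assumptions of the example.

```python
import torch
import torch.nn as nn


class CrossModalDecodingBlock(nn.Module):
    """Self-attention on the text query, cross-attention to the video, then a feed-forward network."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, query_repr: torch.Tensor, video_repr: torch.Tensor) -> torch.Tensor:
        q_hat, _ = self.self_attn(query_repr, query_repr, query_repr)    # MSA over the query tokens
        q_chk, _ = self.cross_attn(q_hat, video_repr, video_repr)        # query attends to video keys/values
        return self.ffn(q_chk)                                           # FFN gives the block output


if __name__ == "__main__":
    text_q = torch.randn(2, 10, 256)     # text query representation (word-level tokens plus global token, say)
    video = torch.randn(2, 32, 256)      # globally encoded video representation
    blocks = nn.ModuleList([CrossModalDecodingBlock() for _ in range(2)])
    out = text_q
    for blk in blocks:                   # stacked blocks give the video-guided query representation
        out = blk(out, video)
    print(out.shape)
```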

Step S3.2: cascaded cross-modal fusion. First, the global-level video-guided query representation $\tilde{g}$ and the video modality representation $\bar{F}$ are fused at the coarse-grained level by element-wise multiplication, giving the coarse-grained fused video representation $F^{c}$. Then the word-level video-guided query representation $\tilde{W}$ and the coarse-grained fused video representation $F^{c}$ are fused at the fine-grained level through another set of cross-modal decoding blocks: $F^{c}$ is fed into the first cross-modal decoding block as the initial representation, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the enhanced video representation $F^{e}$. The internal operation of the cross-modal decoding block in step S3.2 is as follows:

The input video representation $M^{k-1}$ is passed through a multi-head self-attention module to obtain the video representation $\hat{M}^{k}$. Taking $\hat{M}^{k}$ as the query and the word-level video-guided query representation $\tilde{W}$ as the keys and values, a multi-head cross-attention module yields the video representation $\check{M}^{k}$. Finally, $\check{M}^{k}$ is passed through a conventional feed-forward network, and the resulting video representation $M^{k}$ is the output of the block, where $k$ indexes the $k$-th cross-modal decoding block. The coarse-grained cross-modal fusion serves to suppress background video frames and emphasize foreground video frames, and can be expressed as $F^{c} = \bar{F} \odot \tilde{g}$, where $\odot$ denotes element-wise multiplication.

The $k$-th cross-modal decoding block is expressed as:

$$\hat{M}^{k} = \mathrm{MSA}(M^{k-1})$$

$$\check{M}^{k} = \mathrm{MCA}(\hat{M}^{k}, \tilde{W})$$

$$M^{k} = \mathrm{FFN}(\check{M}^{k})$$

where $M^{0}$ is initialized with the coarse-grained fused video representation $F^{c}$, $\mathrm{MSA}(\cdot)$ and $\mathrm{MCA}(\cdot)$ are the multi-head self-attention module and the multi-head cross-attention module, respectively, and $\mathrm{FFN}(\cdot)$ is a conventional feed-forward network.
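Putting the two fusion stages together, a minimal sketch of the cascaded cross-modal fusion could look as follows; the decoding block repeats the structure sketched for step S3.1, and the tensor shapes, block count, and names are assumptions of the example.

```python
import torch
import torch.nn as nn


class FusionDecodingBlock(nn.Module):
    """Same structure as the cross-modal decoding block, used here at the fine-grained level."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, video_repr: torch.Tensor, word_query_repr: torch.Tensor) -> torch.Tensor:
        v_hat, _ = self.self_attn(video_repr, video_repr, video_repr)
        v_chk, _ = self.cross_attn(v_hat, word_query_repr, word_query_repr)
        return self.ffn(v_chk)


def cascaded_cross_modal_fusion(video, global_query, word_query, blocks):
    # coarse-grained level: element-wise multiplication with the global-level guided query
    coarse = video * global_query.unsqueeze(1)          # (B, T, D) * (B, 1, D)
    # fine-grained level: stacked decoding blocks attending to the word-level guided query
    out = coarse
    for blk in blocks:
        out = blk(out, word_query)
    return out                                          # enhanced video representation


if __name__ == "__main__":
    video = torch.randn(2, 32, 256)
    g_query = torch.randn(2, 256)                        # global-level video-guided query representation
    w_query = torch.randn(2, 9, 256)                     # word-level video-guided query representation
    fused = cascaded_cross_modal_fusion(video, g_query, w_query,
                                        nn.ModuleList([FusionDecodingBlock() for _ in range(2)]))
    print(fused.shape)
```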

Step S4: for the enhanced video representation obtained after the multi-granularity cascade interaction, an attention-based temporal position regression module is used to predict the temporal position of the target video segment corresponding to the text query.

The attention-based temporal position regression module in step S4 passes the video sequence representation $F^{e}$ obtained from the multi-granularity cascade interaction through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention scores $a$ of the video. The enhanced video representation $F^{e}$ and the temporal attention scores $a$ are then combined by an attention pooling layer to obtain the representation $r$ of the target segment. Finally, the representation $r$ of the target segment is fed into a multi-layer perceptron, which directly regresses the normalized temporal center coordinate $\hat{c}$ and the segment duration $\hat{d}$ of the target segment.

The attention-based temporal position regression is specifically expressed as:

$$a = \mathrm{SoftMax}\big(\mathrm{MLP}(F^{e})\big)$$

$$r = \sum_{i=1}^{T} a_{i}\, f^{e}_{i}$$

$$(\hat{c}, \hat{d}) = \mathrm{MLP}(r)$$

where $F^{e} = \{f^{e}_{i}\}_{i=1}^{T}$ is the enhanced video representation, i.e., the video sequence representation obtained after the multi-granularity cascade interaction, and the attention pooling layer aggregates the video sequence representation into the target segment representation $r$.
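A minimal sketch of this regression head is given below; mapping the two regression outputs through a sigmoid to keep them in $[0, 1]$ and the hidden sizes are assumptions of the example.

```python
import torch
import torch.nn as nn


class TemporalRegressionHead(nn.Module):
    """Attention-based temporal position regression: scores -> attention pooling -> (center, duration)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.score_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.reg_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, enhanced_video: torch.Tensor):
        # temporal attention scores over the T frames
        scores = torch.softmax(self.score_mlp(enhanced_video).squeeze(-1), dim=-1)   # (B, T)
        # attention pooling: weighted sum of frame representations
        segment_repr = torch.einsum("bt,btd->bd", scores, enhanced_video)            # (B, D)
        # direct regression of the normalized center coordinate and segment duration
        center, duration = torch.sigmoid(self.reg_mlp(segment_repr)).unbind(dim=-1)
        return scores, center, duration


if __name__ == "__main__":
    head = TemporalRegressionHead()
    a, c, d = head(torch.randn(2, 32, 256))
    print(a.shape, c.shape, d.shape)
```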

Step S5: for the cross-modal temporal behavior localization model based on the multi-granularity cascade interaction network composed of steps S1 to S4, the model is trained on a training sample set. The total loss function used during training comprises an attention alignment loss and a boundary loss, where the boundary loss comprises a smooth $L_1$ loss and a temporal generalized IoU loss, so as to better match the evaluation criteria of the temporal localization task. The training sample set consists of a number of {video, query, temporal position annotation of the target video segment} triplet samples.

The training of the model in step S5 includes the following steps:

Step S5.1: compute the attention alignment loss $\mathcal{L}_{att}$. The logarithm of the temporal attention score of the $i$-th frame is multiplied by the indicator value $\delta_{i}$, the products are accumulated over the sampled frames, and the accumulated result is divided by the accumulation of $\delta_{i}$ over the sampled frames to obtain the loss $\mathcal{L}_{att}$; $\delta_{i} = 1$ indicates that the $i$-th frame of the video lies within the annotated temporal segment, and $\delta_{i} = 0$ otherwise. The attention alignment loss $\mathcal{L}_{att}$ encourages video frames within the annotated temporal segment to receive higher attention scores. The specific calculation can be expressed as:

$$\mathcal{L}_{att} = -\frac{\sum_{i=1}^{T} \delta_{i} \log a_{i}}{\sum_{i=1}^{T} \delta_{i}}$$

where $T$ denotes the number of sampled frames, $a_{i}$ denotes the temporal attention score of the $i$-th frame, and $\delta_{i} = 1$ indicates that the $i$-th frame of the video lies within the annotated temporal segment, $\delta_{i} = 0$ otherwise.
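Under the formulation above, a minimal sketch of the attention alignment loss might be the following; the small epsilon and the averaging over the batch are assumptions added for numerical stability and batching.

```python
import torch


def attention_alignment_loss(scores: torch.Tensor, inside: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """scores: (B, T) temporal attention scores; inside: (B, T) with 1 for frames
    inside the annotated segment and 0 otherwise."""
    per_sample = -(inside * torch.log(scores + eps)).sum(dim=-1) / (inside.sum(dim=-1) + eps)
    return per_sample.mean()


if __name__ == "__main__":
    scores = torch.softmax(torch.randn(2, 32), dim=-1)
    inside = torch.zeros(2, 32)
    inside[:, 10:20] = 1.0               # frames 10..19 lie inside the annotated segment
    print(attention_alignment_loss(scores, inside).item())
```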

Step S5.2: compute the boundary loss $\mathcal{L}_{b}$ by combining a smooth $L_1$ loss $\mathcal{L}_{reg}$ and a temporal generalized IoU loss $\mathcal{L}_{gIoU}$. A first smooth $L_1$ loss is computed on the difference between the normalized temporal center coordinate $\hat{c}$ of the predicted segment and the normalized temporal center coordinate $c$ of the annotated segment, and a second smooth $L_1$ loss is computed on the difference between the predicted segment duration $\hat{d}$ and the annotated segment duration $d$; the sum of the first and second smooth $L_1$ losses is taken as the loss $\mathcal{L}_{reg}$. The generalized IoU between the regressed segment $\hat{\tau}$ and the corresponding annotated segment $\tau$ is computed, and one plus the negative of this generalized IoU is taken as the temporal generalized IoU loss $\mathcal{L}_{gIoU}$. The sum of $\mathcal{L}_{reg}$ and the temporal generalized IoU loss $\mathcal{L}_{gIoU}$ is taken as the boundary loss $\mathcal{L}_{b}$. The specific calculation of the boundary loss $\mathcal{L}_{b}$ can be expressed as follows:

$$\mathcal{L}_{reg} = S_{L1}(c - \hat{c}) + S_{L1}(d - \hat{d})$$

$$\mathcal{L}_{gIoU} = 1 - \Big(\mathrm{IoU}(\hat{\tau}, \tau) - \frac{|C \setminus (\hat{\tau} \cup \tau)|}{|C|}\Big)$$

$$\mathcal{L}_{b} = \mathcal{L}_{reg} + \mathcal{L}_{gIoU}$$

where $S_{L1}(\cdot)$ denotes the smooth $L_1$ loss function, $\mathrm{IoU}(\hat{\tau}, \tau)$ denotes the intersection-over-union of the two segments, and $C$ denotes the smallest temporal box covering the regressed segment $\hat{\tau}$ and the corresponding annotated segment $\tau$.
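A minimal sketch of this boundary loss, with each segment expressed by its normalized center and duration, could be the following; the conversion to start/end boundaries and the small epsilon are assumptions of the example.

```python
import torch
import torch.nn.functional as F


def boundary_loss(pred_c, pred_d, gt_c, gt_d, eps=1e-8):
    """pred_c/pred_d and gt_c/gt_d: (B,) normalized centers and durations."""
    # smooth L1 terms on center and duration
    l_reg = F.smooth_l1_loss(pred_c, gt_c) + F.smooth_l1_loss(pred_d, gt_d)

    # convert (center, duration) to segment boundaries
    p0, p1 = pred_c - pred_d / 2, pred_c + pred_d / 2
    g0, g1 = gt_c - gt_d / 2, gt_c + gt_d / 2

    inter = (torch.minimum(p1, g1) - torch.maximum(p0, g0)).clamp(min=0)
    union = (p1 - p0) + (g1 - g0) - inter
    iou = inter / (union + eps)

    # smallest temporal box covering both segments
    hull = torch.maximum(p1, g1) - torch.minimum(p0, g0)
    giou = iou - (hull - union) / (hull + eps)
    l_giou = (1.0 - giou).mean()

    return l_reg + l_giou


if __name__ == "__main__":
    print(boundary_loss(torch.tensor([0.4]), torch.tensor([0.2]),
                        torch.tensor([0.45]), torch.tensor([0.25])).item())
```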

Step S5.3: the weighted sum of the attention alignment loss $\mathcal{L}_{att}$ and the boundary loss $\mathcal{L}_{b}$ is taken as the total loss for model training.

The total loss function $\mathcal{L}$ is specifically:

$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{att} + \lambda_{2}\,\mathcal{L}_{b}$$

where $\lambda_{1}$ and $\lambda_{2}$ are weight hyperparameters, and an optimizer is used to update the model parameters during the training phase.
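For illustration, a minimal sketch of assembling the total loss and performing one optimizer update is shown below; the loss weights, the Adam optimizer, and the placeholder model and loss values are assumptions of the example.

```python
import torch
import torch.nn as nn

# placeholder regression head standing in for the full localization model
model = nn.Linear(256, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

lambda_att, lambda_b = 1.0, 5.0                       # weight hyperparameters (values assumed)

segment_repr = torch.randn(4, 256)                    # pooled target-segment representations
pred = torch.sigmoid(model(segment_repr))             # predicted (center, duration)
gt = torch.rand(4, 2)                                 # annotated (center, duration)

l_b = nn.functional.smooth_l1_loss(pred, gt)          # stands in for the boundary loss above
l_att = torch.tensor(0.5)                             # stands in for the attention alignment loss

total_loss = lambda_att * l_att + lambda_b * l_b      # L = lambda1 * L_att + lambda2 * L_b
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
print(total_loss.item())
```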

The accuracy of the method of the present invention is compared with other existing representative methods on the TACoS test set, as shown in Table 1, using the evaluation criterion "R@n, IoU=m", where n=1 and m={0.1, 0.3, 0.5}.

Table 1: accuracy comparison on the TACoS test set (the table content is provided as an image in the original document).

Corresponding to the foregoing embodiments of the cross-modal temporal behavior localization method, the present invention further provides embodiments of a cross-modal temporal behavior localization apparatus of a multi-granularity cascade interaction network.

Referring to FIG. 4, the cross-modal temporal behavior localization apparatus of a multi-granularity cascade interaction network provided by an embodiment of the present invention comprises one or more processors configured to implement the cross-modal temporal behavior localization method of a multi-granularity cascade interaction network of the above embodiments.

The embodiments of the cross-modal temporal behavior localization apparatus of a multi-granularity cascade interaction network of the present invention can be applied to any device with data processing capability, such as a computer. The apparatus embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking software implementation as an example, the apparatus in a logical sense is formed by the processor of the device with data processing capability in which it is located reading the corresponding computer program instructions from the non-volatile memory into the memory for execution. In terms of hardware, FIG. 4 shows a hardware structure diagram of a device with data processing capability in which the cross-modal temporal behavior localization apparatus of the multi-granularity cascade interaction network of the present invention is located. In addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 4, the device with data processing capability in which the apparatus of the embodiment is located may further include other hardware according to its actual function, which is not described in detail here.

For the implementation process of the functions and effects of each unit in the above apparatus, reference is made to the implementation process of the corresponding steps in the above method, which is not repeated here.

Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the relevant descriptions of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without creative effort.

An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the cross-modal temporal behavior localization method of a multi-granularity cascade interaction network of the above embodiments is implemented.

The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device with data processing capability. The computer-readable storage medium is used to store the computer program and other programs and data required by the device, and may also be used to temporarily store data that has been output or will be output.

The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A cross-modal temporal behavior localization method of a multi-granularity cascade interaction network, characterized by comprising the following steps:
Step S1: given an untrimmed video sample, performing preliminary extraction of the video representation with a visual pre-training model, and applying context-aware temporal dependency encoding to the preliminarily extracted video representation in a local-global manner to obtain the final video representation;
Step S2: for the text query corresponding to the untrimmed video, initializing the word embedding of each word in the query text with a pre-trained word embedding model, and performing context encoding with a multi-layer bidirectional long short-term memory network to obtain the word-level representation and the global-level representation of the text query;
Step S3: for the extracted video representation and text query representation, performing the interaction between the video modality and the text query modality with a multi-granularity cascade interaction network to obtain a query-guided enhanced video representation;
Step S4: for the enhanced video representation obtained after the multi-granularity cascade interaction, predicting the temporal position of the target video segment corresponding to the text query with an attention-based temporal position regression module;
Step S5: for the cross-modal temporal behavior localization model based on the multi-granularity cascade interaction network composed of steps S1 to S4, training the model on a training sample set, wherein the total loss function used during training comprises an attention alignment loss and a boundary loss, and the boundary loss comprises a smooth $L_1$ loss and a temporal generalized IoU loss.
2. The cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to claim 1, characterized in that in step S1, video frame features are extracted offline based on the visual pre-training model and $T$ frames are uniformly sampled, a linear transformation layer is then applied to obtain a set of video representations $F = \{f_i\}_{i=1}^{T}$, where $f_i$ is the representation of the $i$-th video frame, and the video representation $F$ is then subjected to context-aware temporal dependency encoding in a local-global manner.
3. The cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to claim 2, characterized in that the local-global context-aware encoding in step S1 first applies local context-aware encoding to the video representation $F$ to obtain the video representation $\hat{F}$, and then applies global context-aware encoding to the video representation $\hat{F}$ to obtain the video representation $\bar{F}$.
4. The cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to claim 3, characterized in that the local context-aware encoding and the global context-aware encoding in step S1 are implemented as follows:
Step S1.1: the local context-aware encoding employs a set of successive local transformer blocks equipped with one-dimensional shifted windows; the video representation $F$ is fed into the first block as the initial representation, the output of each block is fed into the next, and the output of the last block is taken as the video representation $\hat{F}$ output by the local context-aware encoding; the internal operation of a successive local transformer block with one-dimensional shifted windows is as follows:
the input video representation $z^{k-1}$ is layer-normalized and passed through the one-dimensional window multi-head self-attention module, and the result is added to $z^{k-1}$ to obtain the video representation $\hat{z}^{k}$; $\hat{z}^{k}$ is layer-normalized and passed through a multi-layer perceptron, and the result is added to $\hat{z}^{k}$ to obtain the video representation $\tilde{z}^{k}$; $\tilde{z}^{k}$ is layer-normalized and passed through the one-dimensional shifted-window multi-head self-attention module, and the result is added to $\tilde{z}^{k}$ to obtain the video representation $\check{z}^{k}$; $\check{z}^{k}$ is layer-normalized and passed through a multi-layer perceptron, and the result is added to $\check{z}^{k}$, outputting the video representation $z^{k}$ as the output of the block, where $k$ indexes the $k$-th successive local transformer block equipped with one-dimensional shifted windows;
Step S1.2: the global context-aware encoding comprises a set of conventional transformer blocks; the video representation $\hat{F}$ is fed into the first conventional transformer block as the initial representation, the output of each block is fed into the next, and the output of the last conventional transformer block is taken as the video representation $\bar{F}$ output by the global context-aware encoding; the internal operation of a conventional transformer block is as follows:
the input video representation $Z^{l-1}$ is passed through the conventional multi-head self-attention module, the result is added to $Z^{l-1}$, and layer normalization is applied to obtain the video representation $\hat{Z}^{l}$; $\hat{Z}^{l}$ is passed through a multi-layer perceptron, the result is added to $\hat{Z}^{l}$, and layer normalization is applied to obtain the video representation $Z^{l}$ as the output of the block, where $l$ indexes the $l$-th conventional transformer block.
5. The cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to claim 1, characterized in that in step S2, the learnable word embedding vector corresponding to each word in the query text is initialized with a pre-trained word embedding model to obtain the embedding vector sequence $E = \{e_i\}_{i=1}^{N}$ of the text query, where $e_i$ is the representation of the $i$-th word of the query; the embedding vector sequence $E$ is context-encoded by a multi-layer bidirectional long short-term memory network to obtain the word-level text query representation $W = \{w_i\}_{i=1}^{N}$; the global-level text query representation $g$ is obtained by concatenating the forward hidden state vector of the last word $w_N$ and the backward hidden state vector of the first word $w_1$, finally giving the text query representation $Q = \{W, g\}$.
6. The cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to claim 1, characterized in that the multi-granularity cascade interaction network in step S3 first passes the video representation $\bar{F}$ and the text query representation $Q$ through video-guided query decoding to obtain the video-guided query representation $\tilde{Q} = \{\tilde{g}, \tilde{W}\}$, where $\tilde{g}$ denotes the global-level video-guided query representation and $\tilde{W}$ denotes the word-level video-guided query representation, and then combines the video-guided query representation $\tilde{Q}$ with the video modality representation through cascaded cross-modal fusion to obtain the final enhanced video representation.
7. The cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to claim 6, characterized in that the video-guided query decoding and the cascaded cross-modal fusion in step S3 are implemented as follows:
Step S3.1: the video-guided query decoding employs a set of cross-modal decoding blocks; the text query representation $Q$ is fed into the first cross-modal decoding block as the initial representation, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the video-guided query representation $\tilde{Q}$; the internal operation of the cross-modal decoding block in step S3.1 is as follows:
the input text query representation $Q^{n-1}$ is passed through a multi-head self-attention module to obtain the text query representation $\hat{Q}^{n}$; taking $\hat{Q}^{n}$ as the query and the video representation $\bar{F}$ as the keys and values, a multi-head cross-attention module yields the text query representation $\check{Q}^{n}$; $\check{Q}^{n}$ is passed through a conventional feed-forward network, and the resulting text query representation $Q^{n}$ is the output of the block, where $n$ indexes the $n$-th cross-modal decoding block;
Step S3.2: cascaded cross-modal fusion: first, the global-level video-guided query representation $\tilde{g}$ and the video modality representation $\bar{F}$ are fused at the coarse-grained level by element-wise multiplication to obtain the coarse-grained fused video representation $F^{c}$; then the word-level video-guided query representation $\tilde{W}$ and the coarse-grained fused video representation $F^{c}$ are fused at the fine-grained level through another set of cross-modal decoding blocks: $F^{c}$ is fed into the first cross-modal decoding block as the initial representation, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the enhanced video representation $F^{e}$; the internal operation of the cross-modal decoding block in step S3.2 is as follows:
the input video representation $M^{k-1}$ is passed through a multi-head self-attention module to obtain the video representation $\hat{M}^{k}$; taking $\hat{M}^{k}$ as the query and the word-level video-guided query representation $\tilde{W}$ as the keys and values, a multi-head cross-attention module yields the video representation $\check{M}^{k}$; $\check{M}^{k}$ is passed through a conventional feed-forward network, and the resulting video representation $M^{k}$ is the output of the block, where $k$ indexes the $k$-th cross-modal decoding block.
8. The cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to claim 1, characterized in that the attention-based temporal position regression module in step S4 passes the enhanced video representation $F^{e}$ output by the multi-granularity cascade interaction through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention scores $a$ of the video; the enhanced video representation $F^{e}$ and the temporal attention scores $a$ are then combined by an attention pooling layer to obtain the representation $r$ of the target segment; finally, the representation $r$ of the target segment is fed into a multi-layer perceptron, which directly regresses the normalized temporal center coordinate $\hat{c}$ and the segment duration $\hat{d}$ of the target segment.
9. The cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to claim 1, characterized in that the training of the model in step S5 comprises the following steps:
Step S5.1: compute the attention alignment loss $\mathcal{L}_{att}$: the logarithm of the temporal attention score of the $i$-th frame is multiplied by the indicator value $\delta_{i}$, the products are accumulated over the sampled frames, and the accumulated result is divided by the accumulation of $\delta_{i}$ over the sampled frames to obtain the loss $\mathcal{L}_{att}$; $\delta_{i} = 1$ indicates that the $i$-th frame of the video lies within the annotated temporal segment, and $\delta_{i} = 0$ otherwise; the specific calculation of the attention alignment loss $\mathcal{L}_{att}$ can be expressed as:

$$\mathcal{L}_{att} = -\frac{\sum_{i=1}^{T} \delta_{i} \log a_{i}}{\sum_{i=1}^{T} \delta_{i}}$$

Step S5.2: compute the boundary loss $\mathcal{L}_{b}$ by combining a smooth $L_1$ loss $\mathcal{L}_{reg}$ and a temporal generalized IoU loss $\mathcal{L}_{gIoU}$: a first smooth $L_1$ loss is computed on the difference between the normalized temporal center coordinate $\hat{c}$ of the predicted segment and the normalized temporal center coordinate $c$ of the annotated segment, a second smooth $L_1$ loss is computed on the difference between the predicted segment duration $\hat{d}$ and the annotated segment duration $d$, and the sum of the first and second smooth $L_1$ losses is taken as the loss $\mathcal{L}_{reg}$; the generalized IoU between the regressed segment $\hat{\tau}$ and the corresponding annotated segment $\tau$ is computed, and one plus the negative of this generalized IoU is taken as the temporal generalized IoU loss $\mathcal{L}_{gIoU}$; the sum of $\mathcal{L}_{reg}$ and the temporal generalized IoU loss $\mathcal{L}_{gIoU}$ is taken as the boundary loss $\mathcal{L}_{b}$; the specific calculation of the boundary loss $\mathcal{L}_{b}$ can be expressed as follows:

$$\mathcal{L}_{reg} = S_{L1}(c - \hat{c}) + S_{L1}(d - \hat{d})$$

$$\mathcal{L}_{gIoU} = 1 - \Big(\mathrm{IoU}(\hat{\tau}, \tau) - \frac{|C \setminus (\hat{\tau} \cup \tau)|}{|C|}\Big)$$

$$\mathcal{L}_{b} = \mathcal{L}_{reg} + \mathcal{L}_{gIoU}$$

where $S_{L1}(\cdot)$ denotes the smooth $L_1$ loss function, $\mathrm{IoU}(\hat{\tau}, \tau)$ denotes the intersection-over-union of the two segments, and $C$ denotes the smallest temporal box covering the regressed segment $\hat{\tau}$ and the corresponding annotated segment $\tau$;
Step S5.3: the weighted sum of the attention alignment loss $\mathcal{L}_{att}$ and the boundary loss $\mathcal{L}_{b}$ is taken as the total loss for model training, and an optimizer is used to update the model parameters.
10. A cross-modal temporal behavior localization apparatus of a multi-granularity cascade interaction network, characterized by comprising one or more processors configured to implement the cross-modal temporal behavior localization method of a multi-granularity cascade interaction network according to any one of claims 1-9.
CN202210052687.8A 2022-01-18 2022-01-18 Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network Active CN114064967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210052687.8A CN114064967B (en) 2022-01-18 2022-01-18 Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210052687.8A CN114064967B (en) 2022-01-18 2022-01-18 Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network

Publications (2)

Publication Number Publication Date
CN114064967A true CN114064967A (en) 2022-02-18
CN114064967B CN114064967B (en) 2022-05-06

Family

ID=80231249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210052687.8A Active CN114064967B (en) 2022-01-18 2022-01-18 Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network

Country Status (1)

Country Link
CN (1) CN114064967B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114357124A (en) * 2022-03-18 2022-04-15 成都考拉悠然科技有限公司 Video paragraph positioning method based on language reconstruction and graph mechanism
CN114581821A (en) * 2022-02-23 2022-06-03 腾讯科技(深圳)有限公司 Video detection method, system, storage medium and server
CN114792424A (en) * 2022-05-30 2022-07-26 北京百度网讯科技有限公司 Document image processing method and device and electronic equipment
CN114896451A (en) * 2022-05-25 2022-08-12 云从科技集团股份有限公司 Video clip positioning method, system, control device and readable storage medium
CN114925232A (en) * 2022-05-31 2022-08-19 杭州电子科技大学 A cross-modal temporal video localization method under the framework of text question answering
CN115131655A (en) * 2022-09-01 2022-09-30 浙江啄云智能科技有限公司 Training method and device of target detection model and target detection method
CN115187783A (en) * 2022-09-09 2022-10-14 之江实验室 Multi-task hybrid supervised medical image segmentation method and system based on federated learning
CN115223086A (en) * 2022-09-20 2022-10-21 之江实验室 Cross-modal action positioning method and system based on interactive attention guidance and correction
CN115238130A (en) * 2022-09-21 2022-10-25 之江实验室 Temporal language localization method and device based on modal customization collaborative attention interaction
CN116246213A (en) * 2023-05-08 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN116385070A (en) * 2023-01-18 2023-07-04 中国科学技术大学 E-commerce short video advertisement multi-objective estimation method, system, device and storage medium
CN117076712A (en) * 2023-10-16 2023-11-17 中国科学技术大学 Video retrieval method, system, device and storage medium
CN116824461B (en) * 2023-08-30 2023-12-08 山东建筑大学 Question understanding guiding video question answering method and system
CN117609553A (en) * 2024-01-23 2024-02-27 江南大学 Video retrieval method and system based on local feature enhancement and modal interaction
CN117724153A (en) * 2023-12-25 2024-03-19 北京孚梅森石油科技有限公司 Lithology recognition method based on multi-window cascading interaction
CN117876929A (en) * 2024-01-12 2024-04-12 天津大学 A temporal object localization method based on progressive multi-scale context learning
CN118897905A (en) * 2024-10-08 2024-11-05 山东大学 A video clip positioning method and system based on fine-grained spatiotemporal correlation modeling
CN119152337A (en) * 2024-11-20 2024-12-17 合肥工业大学 Audiovisual event localization system and method based on cross-modal consistency and temporal multi-granularity collaboration

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107346328A (en) * 2017-05-25 2017-11-14 北京大学 A kind of cross-module state association learning method based on more granularity hierarchical networks
CN109858032A (en) * 2019-02-14 2019-06-07 程淑玉 Merge more granularity sentences interaction natural language inference model of Attention mechanism
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111310676A (en) * 2020-02-21 2020-06-19 重庆邮电大学 Video action recognition method based on CNN-LSTM and attention
CN111782871A (en) * 2020-06-18 2020-10-16 湖南大学 Cross-modal video moment location method based on spatiotemporal reinforcement learning
CN111930999A (en) * 2020-07-21 2020-11-13 山东省人工智能研究院 Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation
CN112115849A (en) * 2020-09-16 2020-12-22 中国石油大学(华东) Video scene recognition method based on multi-granularity video information and attention mechanism
CN113111837A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Intelligent monitoring video early warning method based on multimedia semantic analysis
CN113204675A (en) * 2021-07-07 2021-08-03 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
EP3933686A2 (en) * 2020-11-27 2022-01-05 Beijing Baidu Netcom Science Technology Co., Ltd. Video processing method, apparatus, electronic device, storage medium, and program product
CN113934887A (en) * 2021-12-20 2022-01-14 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107346328A (en) * 2017-05-25 2017-11-14 北京大学 A kind of cross-module state association learning method based on more granularity hierarchical networks
CN109858032A (en) * 2019-02-14 2019-06-07 程淑玉 Merge more granularity sentences interaction natural language inference model of Attention mechanism
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111310676A (en) * 2020-02-21 2020-06-19 重庆邮电大学 Video action recognition method based on CNN-LSTM and attention
CN111782871A (en) * 2020-06-18 2020-10-16 湖南大学 Cross-modal video moment location method based on spatiotemporal reinforcement learning
CN111930999A (en) * 2020-07-21 2020-11-13 山东省人工智能研究院 Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation
CN112115849A (en) * 2020-09-16 2020-12-22 中国石油大学(华东) Video scene recognition method based on multi-granularity video information and attention mechanism
EP3933686A2 (en) * 2020-11-27 2022-01-05 Beijing Baidu Netcom Science Technology Co., Ltd. Video processing method, apparatus, electronic device, storage medium, and program product
CN113111837A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Intelligent monitoring video early warning method based on multimedia semantic analysis
CN113204675A (en) * 2021-07-07 2021-08-03 成都考拉悠然科技有限公司 Cross-modal video moment retrieval method based on a cross-modal object inference network
CN113934887A (en) * 2021-12-20 2022-01-14 成都考拉悠然科技有限公司 Proposal-free temporal language positioning method based on semantic decoupling

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JONGHWAN MUN: "Local-Global Video-Text Interactions for Temporal Grounding", arXiv *
SHIZHE CHEN: "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning", arXiv *
ZHENZHI WANG: "Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding", arXiv *
DAI SIDA: "Research on Deep Multimodal Fusion Technology and Time-Series Analysis Algorithms", China Master's Theses Full-text Database *
ZHAO CAIRONG, QI DING et al.: "Key Technologies of Intelligent Video Surveillance: A Survey of Person Re-identification", Scientia Sinica Informationis *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581821A (en) * 2022-02-23 2022-06-03 腾讯科技(深圳)有限公司 Video detection method, system, storage medium and server
CN114581821B (en) * 2022-02-23 2024-11-08 腾讯科技(深圳)有限公司 Video detection method, system, storage medium and server
CN114357124A (en) * 2022-03-18 2022-04-15 成都考拉悠然科技有限公司 Video paragraph positioning method based on language reconstruction and graph mechanism
CN114896451A (en) * 2022-05-25 2022-08-12 云从科技集团股份有限公司 Video clip positioning method, system, control device and readable storage medium
CN114792424A (en) * 2022-05-30 2022-07-26 北京百度网讯科技有限公司 Document image processing method and device and electronic equipment
CN114925232A (en) * 2022-05-31 2022-08-19 杭州电子科技大学 A cross-modal temporal video localization method under the framework of text question answering
CN115131655A (en) * 2022-09-01 2022-09-30 浙江啄云智能科技有限公司 Training method and device of target detection model and target detection method
CN115187783A (en) * 2022-09-09 2022-10-14 之江实验室 Multi-task hybrid supervised medical image segmentation method and system based on federated learning
CN115223086A (en) * 2022-09-20 2022-10-21 之江实验室 Cross-modal action positioning method and system based on interactive attention guidance and correction
CN115223086B (en) * 2022-09-20 2022-12-06 之江实验室 Cross-modal action localization method and system based on interactive attention guidance and correction
CN115238130A (en) * 2022-09-21 2022-10-25 之江实验室 Temporal language localization method and device based on modal customization collaborative attention interaction
CN115238130B (en) * 2022-09-21 2022-12-06 之江实验室 Time sequence language positioning method and device based on modal customization collaborative attention interaction
CN116385070A (en) * 2023-01-18 2023-07-04 中国科学技术大学 E-commerce short video advertisement multi-objective estimation method, system, device and storage medium
CN116385070B (en) * 2023-01-18 2023-10-03 中国科学技术大学 E-commerce short video advertising multi-target prediction methods, systems, equipment and storage media
CN116246213A (en) * 2023-05-08 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN116824461B (en) * 2023-08-30 2023-12-08 山东建筑大学 Question-understanding-guided video question answering method and system
CN117076712A (en) * 2023-10-16 2023-11-17 中国科学技术大学 Video retrieval method, system, device and storage medium
CN117076712B (en) * 2023-10-16 2024-02-23 中国科学技术大学 Video retrieval method, system, device and storage medium
CN117724153B (en) * 2023-12-25 2024-05-14 北京孚梅森石油科技有限公司 Lithology recognition method based on multi-window cascading interaction
CN117724153A (en) * 2023-12-25 2024-03-19 北京孚梅森石油科技有限公司 Lithology recognition method based on multi-window cascading interaction
CN117876929B (en) * 2024-01-12 2024-06-21 天津大学 A temporal object localization method based on progressive multi-scale context learning
CN117876929A (en) * 2024-01-12 2024-04-12 天津大学 A temporal object localization method based on progressive multi-scale context learning
CN117609553B (en) * 2024-01-23 2024-03-22 江南大学 Video retrieval method and system based on local feature enhancement and modal interaction
CN117609553A (en) * 2024-01-23 2024-02-27 江南大学 Video retrieval method and system based on local feature enhancement and modal interaction
CN118897905A (en) * 2024-10-08 2024-11-05 山东大学 A video clip positioning method and system based on fine-grained spatiotemporal correlation modeling
CN119152337A (en) * 2024-11-20 2024-12-17 合肥工业大学 Audiovisual event localization system and method based on cross-modal consistency and temporal multi-granularity collaboration
CN119152337B (en) * 2024-11-20 2025-02-11 合肥工业大学 Audio-visual event positioning system and method based on cross-modal consistency and time sequence multi-granularity collaboration

Also Published As

Publication number Publication date
CN114064967B (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN114064967B (en) Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network
CN110209836B (en) Method and device for remote supervision relationship extraction
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN110059160B (en) End-to-end context-based knowledge base question-answering method and device
CN106919652B (en) Automatic short-video annotation method and system based on multi-source multi-view transductive learning
Tang et al. Comprehensive instructional video analysis: The coin dataset and performance evaluation
CN111414845B (en) Multi-form sentence video positioning method based on space-time diagram inference network
CN114743143A (en) A video description generation method and storage medium based on multi-concept knowledge mining
CN115223086A (en) Cross-modal action positioning method and system based on interactive attention guidance and correction
CN113963304B (en) Cross-modal video temporal action localization method and system based on temporal-spatial graph
CN116186328A (en) Video text cross-modal retrieval method based on pre-clustering guidance
CN116935274A (en) Weakly supervised cross-modal video positioning method based on modal feature alignment
CN114925232A (en) A cross-modal temporal video localization method under the framework of text question answering
WO2023092719A1 (en) Information extraction method for medical record data, and terminal device and readable storage medium
CN116127132A (en) A Temporal Language Localization Approach Based on Cross-Modal Text-Related Attention
US20230326178A1 (en) Concept disambiguation using multimodal embeddings
Huang Multi-modal video summarization
CN113688871B (en) Transformer-based video multi-label action identification method
CN114339403A (en) A method, system, device and readable storage medium for generating video action clips
CN117152669B (en) Cross-modal temporal video positioning method and system
Hao et al. What matters: Attentive and relational feature aggregation network for video-text retrieval
CN115238130B (en) Time sequence language positioning method and device based on modal customization collaborative attention interaction
CN116935292A (en) A short video scene classification method and system based on self-attention model
CN114282537B (en) Social text-oriented cascading linear entity relation extraction method
Pan et al. A Multiple Utterances based Neural Network Model for Joint Intent Detection and Slot Filling.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant