CN114064967A - Cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network
- Publication number: CN114064967A (application number CN202210052687.8A)
- Authority: CN (China)
- Prior art keywords: video, representation, cross-modal, query
- Legal status: Granted
Classifications
- G06F16/735: Information retrieval of video data; querying; filtering based on additional data, e.g. user or group profiles
- G06F16/7844: Retrieval of video data characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or a transcript of audio data
- G06F16/7867: Retrieval of video data characterised by metadata using information manually generated, e.g. tags, keywords, comments, title and artist information, time, location and usage information, user ratings
- G06N3/044: Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/08: Neural network learning methods
- H04N19/149: Digital video coding; data rate or code amount at the encoder output estimated by means of a model, e.g. a mathematical or statistical model
- H04N19/21: Digital video coding using video object coding with binary alpha-plane coding for video objects, e.g. context-based arithmetic encoding [CAE]
Abstract
The invention discloses a cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network, used to localize temporal behavior in untrimmed video given a text query. The invention implements a new multi-granularity cascaded cross-modal interaction network that performs cascaded cross-modal interaction in a coarse-to-fine manner, improving the cross-modal alignment ability of the model. In addition, the invention introduces a local-global context-aware video encoder that strengthens the video encoder's ability to model contextual temporal dependencies. The method is simple to implement and flexible, improves visual-language cross-modal alignment accuracy, and the trained model significantly improves temporal localization accuracy on paired video-query test data.
Description
Technical Field
The present invention relates to the field of visual-language cross-modal learning, and in particular to a cross-modal temporal behavior localization method and device.
Background
With the rapid development of multimedia and network technologies and the growing adoption of large-scale video surveillance in places such as transportation hubs, campuses, and shopping malls, video data is growing at an explosive rate, and video understanding has become an important and pressing problem. Temporal behavior localization is a fundamental component of video understanding. Research on temporal behavior localization from the visual modality alone restricts the behaviors to be localized to a predefined set; in the real world, however, behaviors are complex and diverse, and a predefined set cannot meet practical needs. As shown in Figure 1, the visual-language cross-modal temporal behavior localization task takes a textual description of a behavior in a video as the query and temporally localizes the corresponding behavior segment in the video. It is a very natural form of human-computer interaction and has broad application prospects in short-video content retrieval and production, intelligent video surveillance, and human-computer interaction.
Driven by deep learning, visual-language cross-modal temporal behavior localization has attracted extensive attention from both industry and academia. Because of the significant semantic gap between the heterogeneous text and visual modalities, achieving semantic alignment between the modalities is a core issue when localizing temporal behavior from a text query. Existing methods fall into three main categories: methods based on candidate segment proposals, proposal-free methods, and methods based on sequential decision making. Visual-language cross-modal alignment is an indispensable step in all three. However, existing methods do not fully exploit multi-granularity text query information during visual-language cross-modal interaction, and they do not fully model the local contextual temporal dependencies of the video during video representation encoding.
Summary of the Invention
To remedy these deficiencies of the prior art and improve visual-language cross-modal alignment accuracy in the visual-language cross-modal temporal behavior localization task, the present invention adopts the following technical solution:
A cross-modal temporal behavior localization method based on a multi-granularity cascaded interaction network, comprising the following steps:
Step S1: Given an untrimmed video sample, use a visual pre-training model to perform a preliminary extraction of the video representation, and apply local-global, context-aware temporal-dependency encoding to the preliminary representation to obtain the final video representation, thereby improving the ability of the video representation to model contextual temporal dependencies.
Step S2: For the text query associated with the untrimmed video, use a pre-trained word embedding model to initialize the word embedding of each word in the query, and then apply a multi-layer bidirectional long short-term memory network for context encoding to obtain the word-level and global-level representations of the text query.
Step S3: For the extracted video representation and text query representation, use the multi-granularity cascaded interaction network to perform interaction between the video modality and the text query modality and obtain a query-guided enhanced video representation, thereby improving cross-modal alignment accuracy.
Step S4: For the video representation obtained after the multi-granularity cascaded interaction, use an attention-based temporal location regression module to predict the temporal location of the target video segment corresponding to the text query.
Step S5: Train the cross-modal temporal behavior localization model composed of steps S1-S4 on a training sample set. The total loss function used in training comprises an attention alignment loss and a boundary loss, where the boundary loss comprises a smooth L1 loss and a temporal generalized-IoU loss, so as to better match the evaluation criteria of the temporal localization task. The training sample set consists of {video, query, target-segment temporal annotation} triplet samples.
Further, in step S1, video frame features are extracted offline with the visual pre-training model and T frames are sampled uniformly; a linear transformation layer then yields a set of video representations $V = \{v_i\}_{i=1}^{T}$, where $v_i$ is the representation of the i-th frame, and context-aware temporal-dependency encoding is applied to $V$ in a local-global manner.
Further, the local-global context-aware encoding in step S1 first applies local context-aware encoding to the video representation $V$ to obtain the representation $V^{L}$, and then applies global context-aware encoding to $V^{L}$ to obtain the representation $V^{G}$.
Further, the local context-aware encoding and the global context-aware encoding in step S1 are implemented as follows:
Step S1.1: Local context-aware encoding uses a stack of successive local transformer blocks equipped with 1D shifted windows. The video representation $V$ is fed into the first block as the initial representation, its output is fed into the second block, and so on; the output of the last block is the video representation $V^{L}$ produced by the local context-aware encoding. Inside each block, the input representation is layer-normalized and passed through a 1D window multi-head self-attention module, and the result is added to the block input; the sum is layer-normalized and passed through a multi-layer perceptron, with another residual addition; the result is layer-normalized and passed through a 1D shifted-window multi-head self-attention module, again with a residual addition; finally it is layer-normalized and passed through a multi-layer perceptron, and the last residual sum is the output of the $l$-th local transformer block with 1D shifted windows.
Specifically, the $l$-th local transformer block with 1D shifted windows can be written as:

$$\hat{H}^{l} = \text{1D-W-MSA}(\text{LN}(H^{l-1})) + H^{l-1}$$
$$\tilde{H}^{l} = \text{MLP}(\text{LN}(\hat{H}^{l})) + \hat{H}^{l}$$
$$\bar{H}^{l} = \text{1D-SW-MSA}(\text{LN}(\tilde{H}^{l})) + \tilde{H}^{l}$$
$$H^{l} = \text{MLP}(\text{LN}(\bar{H}^{l})) + \bar{H}^{l}$$

where $l = 1, \dots, L_{1}$, $H^{0} = V$, and $V^{L} = H^{L_{1}}$; $\text{LN}(\cdot)$ denotes layer normalization, $\text{1D-W-MSA}(\cdot)$ the 1D window multi-head self-attention module, $\text{1D-SW-MSA}(\cdot)$ the 1D shifted-window multi-head self-attention module, and $\text{MLP}(\cdot)$ the multi-layer perceptron.
Step S1.2: Global context-aware encoding consists of a stack of regular transformer blocks. The representation $V^{L}$ is fed into the first block as the initial representation, its output into the second block, and so on; the output of the last block is the video representation $V^{G}$ produced by the global context-aware encoding. Inside each block, the input passes through a regular multi-head self-attention module, the result is added to the block input and layer-normalized; the sum then passes through a multi-layer perceptron, is added to its input, and is layer-normalized, giving the output of the $l$-th regular transformer block.
Specifically, the $l$-th regular transformer block can be written as:

$$\hat{G}^{l} = \text{LN}(\text{MSA}(G^{l-1}) + G^{l-1})$$
$$G^{l} = \text{LN}(\text{MLP}(\hat{G}^{l}) + \hat{G}^{l})$$

where $l = 1, \dots, L_{2}$, $G^{0} = V^{L}$, and $V^{G} = G^{L_{2}}$; $\text{MSA}(\cdot)$ denotes the regular multi-head self-attention module, $\text{LN}(\cdot)$ layer normalization, and $\text{MLP}(\cdot)$ the multi-layer perceptron.
Further, in step S2, the learnable word embedding of each word in the query text is initialized with a pre-trained word embedding model, giving the embedding sequence $E = \{e_i\}_{i=1}^{N}$ of the text query, where $e_i$ is the embedding of the i-th word of the query. A multi-layer bidirectional long short-term memory network (BLSTM) then context-encodes $E$ to obtain the word-level text query representation $Q^{w} = \{q_i\}_{i=1}^{N}$; the global-level text query representation $q^{g}$ is obtained by concatenating the forward hidden state vector of the last word with the backward hidden state vector of the first word, and the final text query representation is $Q = \{q^{g}, Q^{w}\}$.
The specific implementation is as follows:

$$\overrightarrow{h}_i, \overleftarrow{h}_i = \text{BLSTM}(e_i), \quad q_i = [\overrightarrow{h}_i; \overleftarrow{h}_i], \quad q^{g} = [\overrightarrow{h}_N; \overleftarrow{h}_1]$$

where $[\cdot;\cdot]$ denotes the concatenation of the forward hidden state vector and the backward hidden state vector.
Further, the multi-granularity cascaded interaction network in step S3 first decodes the text query representation $Q$ under the guidance of the video representation $V^{G}$ to obtain the video-guided query representation $\tilde{Q} = \{\tilde{q}^{g}, \tilde{Q}^{w}\}$, where $\tilde{q}^{g}$ is the global-level video-guided query representation and $\tilde{Q}^{w}$ the word-level video-guided query representation. It then fuses the video-guided query representation with the video modality representation $V^{G}$ through cascaded cross-modal fusion to obtain the final enhanced video representation. Video-guided query decoding narrows the semantic gap between the video representation and the text query representation.
Further, step S3 comprises the following steps:
Step S3.1: Video-guided query decoding uses a stack of cross-modal decoding blocks. The text query representation $Q$ is fed into the first block as the initial representation, its output into the second block, and so on; the output of the last block is the video-guided query representation $\tilde{Q}$. Inside each block, the query representation first passes through a multi-head self-attention module; the result is then used as the query of a multi-head cross-attention module whose keys and values are the video representation $V^{G}$; finally, a regular feed-forward network produces the output of the $l$-th cross-modal decoding block.
Specifically, the $l$-th cross-modal decoding block can be written as:

$$\hat{S}^{l} = \text{MSA}(S^{l-1}), \quad \bar{S}^{l} = \text{MCA}(\hat{S}^{l}, V^{G}, V^{G}), \quad S^{l} = \text{FFN}(\bar{S}^{l})$$

where $l = 1, \dots, L_{3}$, $S^{0} = Q$, and $\tilde{Q} = S^{L_{3}}$; $\text{MSA}(\cdot)$ and $\text{MCA}(\cdot)$ denote the multi-head self-attention module and the multi-head cross-attention module (query, key, value), respectively, and $\text{FFN}(\cdot)$ is a regular feed-forward network.
Step S3.2: Cascaded cross-modal fusion first fuses the global-level video-guided query representation $\tilde{q}^{g}$ with the video modality representation $V^{G}$ at the coarse-grained level by element-wise multiplication, giving the coarsely fused video representation $M$. The word-level video-guided query representation $\tilde{Q}^{w}$ is then fused with $M$ at the fine-grained level through another stack of cross-modal decoding blocks: $M$ is fed into the first block as the initial representation, its output into the second block, and so on, and the output of the last block is the enhanced video representation $\tilde{V}$. Inside each block, the video representation first passes through a multi-head self-attention module; the result is then used as the query of a multi-head cross-attention module whose keys and values are the word-level video-guided query representation $\tilde{Q}^{w}$; finally, a regular feed-forward network produces the output of the $l$-th cross-modal decoding block. The coarse-grained cross-modal fusion suppresses background video frames and emphasizes foreground video frames, and can be written as $M = V^{G} \odot \tilde{q}^{g}$, where $\odot$ denotes element-wise multiplication.
The $l$-th cross-modal decoding block of the fine-grained fusion can be written as:

$$\hat{R}^{l} = \text{MSA}(R^{l-1}), \quad \bar{R}^{l} = \text{MCA}(\hat{R}^{l}, \tilde{Q}^{w}, \tilde{Q}^{w}), \quad R^{l} = \text{FFN}(\bar{R}^{l})$$

where $l = 1, \dots, L_{4}$, $R^{0} = M$, and $\tilde{V} = R^{L_{4}}$; $\text{MSA}(\cdot)$ and $\text{MCA}(\cdot)$ denote the multi-head self-attention module and the multi-head cross-attention module, respectively, and $\text{FFN}(\cdot)$ is a regular feed-forward network.
Further, the attention-based temporal location regression module in step S4 passes the video sequence representation $\tilde{V}$ obtained from the multi-granularity cascaded interaction through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention scores $a = \{a_i\}_{i=1}^{T}$; the enhanced video representation $\tilde{V}$ and the attention scores are then combined by an attention pooling layer to obtain the target segment representation $f^{att}$; finally, a multi-layer perceptron directly regresses the normalized temporal center coordinate $\hat{c}$ and the segment duration $\hat{w}$ of the target segment from $f^{att}$.
Specifically, the attention-based temporal location regression can be written as:

$$a = \text{SoftMax}(\text{MLP}(\tilde{V})), \quad f^{att} = \sum_{i=1}^{T} a_i \tilde{v}_i, \quad (\hat{c}, \hat{w}) = \text{MLP}(f^{att})$$

where $\tilde{V} = \{\tilde{v}_i\}_{i=1}^{T}$ is the enhanced video representation, i.e. the video sequence representation output after the multi-granularity cascaded interaction, and the attention pooling layer aggregates the video sequence representation.
Further, the training of the model in step S5 comprises the following steps:
Step S5.1: Compute the attention alignment loss $\mathcal{L}_{att}$. The logarithm of the temporal attention score of the i-th frame is multiplied by an indicator value $y_i$, the products are accumulated over the sampled frames, and the loss is obtained by dividing this sum by the indicator values accumulated over the sampled frames, where $y_i = 1$ indicates that the i-th frame lies inside the annotated temporal segment and $y_i = 0$ otherwise. The attention alignment loss encourages video frames inside the annotated segment to receive higher attention scores, and is computed as

$$\mathcal{L}_{att} = -\frac{\sum_{i=1}^{T} y_i \log a_i}{\sum_{i=1}^{T} y_i}$$

where $T$ is the number of sampled frames and $a_i$ is the temporal attention score of the i-th frame.
Step S5.2: Compute the boundary loss $\mathcal{L}_{b}$, which combines a smooth L1 loss and a temporal generalized-IoU loss. A first smooth L1 loss is computed on the difference between the normalized temporal center coordinate $\hat{c}$ of the predicted segment and the normalized temporal center coordinate $c$ of the annotated segment; a second smooth L1 loss is computed on the difference between the predicted segment duration $\hat{w}$ and the annotated segment duration $w$; their sum is the regression loss $\mathcal{L}_{sl1}$. The generalized IoU between the regressed segment $\hat{p}$ and the corresponding annotated segment $p$ is computed, and one plus its negative gives the temporal generalized-IoU loss $\mathcal{L}_{giou}$. The boundary loss is the sum of the two, computed as

$$\mathcal{L}_{sl1} = \text{SmoothL1}(\hat{c} - c) + \text{SmoothL1}(\hat{w} - w)$$
$$\mathcal{L}_{giou} = 1 - \left(\text{IoU}(\hat{p}, p) - \frac{|C \setminus (\hat{p} \cup p)|}{|C|}\right)$$
$$\mathcal{L}_{b} = \mathcal{L}_{sl1} + \mathcal{L}_{giou}$$

where $\text{SmoothL1}(\cdot)$ is the smooth L1 loss function, $\text{IoU}(\hat{p}, p)$ is the temporal intersection-over-union of the two segments, and $C$ is the smallest temporal box covering the regressed segment $\hat{p}$ and the annotated segment $p$.
Step S5.3: The total training loss is the weighted sum of the attention alignment loss $\mathcal{L}_{att}$ and the boundary loss $\mathcal{L}_{b}$.
Specifically, the total loss function is

$$\mathcal{L} = \alpha \mathcal{L}_{att} + \beta \mathcal{L}_{b}$$

where $\alpha$ and $\beta$ are weight hyperparameters, and an optimizer updates the model parameters during the training phase.
A cross-modal temporal behavior localization device of the multi-granularity cascaded interaction network comprises one or more processors configured to implement the above cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network.
Advantages and beneficial effects of the invention:
The cross-modal temporal behavior localization method and device of the multi-granularity cascaded interaction network make full use of multi-granularity text query information in a coarse-to-fine manner during visual-language cross-modal interaction, and fully model the local-global contextual temporal dependencies of the video during video representation encoding, in order to solve text-query-based temporal behavior localization in untrimmed video. For a given untrimmed video and text query, the invention improves visual-language cross-modal alignment accuracy and thereby improves the localization accuracy of the cross-modal temporal behavior localization task.
Description of Drawings
Figure 1 is an example of the visual-language cross-modal temporal behavior localization task.
Figure 2 is a block diagram of the cross-modal temporal behavior localization flow of the multi-granularity cascaded interaction network of the present invention.
Figure 3 is a flowchart of the cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network of the present invention.
Figure 4 is a structural diagram of the cross-modal temporal behavior localization device of the multi-granularity cascaded interaction network of the present invention.
Detailed Description
The specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only intended to illustrate and explain the present invention, not to limit it.
The invention discloses a cross-modal temporal behavior localization method and device based on a multi-granularity cascaded interaction network, i.e. visual-language cross-modal temporal behavior localization based on a multi-granularity cascaded interaction network, used to localize temporal behavior in untrimmed video given a text query. The method proposes a simple and effective multi-granularity cascaded cross-modal interaction network to improve the cross-modal alignment ability of the model. In addition, the invention introduces a local-global context-aware video encoder to strengthen the video encoder's ability to model contextual temporal dependencies. The trained model therefore significantly improves temporal localization accuracy on paired video-query test data.
The cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network was implemented with the PyTorch framework. Video frame features are extracted offline with a pre-trained C3D network, videos are uniformly sampled to 256 frames, and the number of heads of all self-attention and cross-attention sub-modules is set to 8. During training, the Adam optimizer is used with a fixed learning rate of 0.0004, and each batch consists of 100 video-query pairs. Performance is measured with the "R@n, IoU=m" criterion, i.e. the percentage of queries in the evaluation set that are correctly localized, where a query is considered correctly localized if the maximum temporal intersection-over-union (IoU) between its n most confident predicted segments and the ground-truth annotation exceeds m.
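As an illustration of this criterion, the following sketch (not taken from the patent; function names are illustrative) computes R@1 at a given IoU threshold m for lists of predicted and annotated segments:

```python
def temporal_iou(pred, gt):
    """pred, gt: (start, end) pairs in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, m=0.5):
    """predictions, ground_truths: lists of (start, end) pairs, one per query."""
    hits = sum(temporal_iou(p, g) >= m for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```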
In a specific embodiment, given an untrimmed video $V$, uniformly sampled into a frame sequence $\{v_i\}_{i=1}^{T}$, and a text description $Q$ of a behavior segment in the video, the visual-language cross-modal temporal behavior localization task is to predict the start time $\tau_s$ and the end time $\tau_e$ of the video segment corresponding to the text description. The training data set of this task can be defined as $\mathcal{D} = \{(V, Q, \tau_s, \tau_e)\}$, where $\tau_s$ and $\tau_e$ are the ground-truth start and end times of the target video segment.
As shown in Figures 2 and 3, the cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network comprises the following steps:
Step S1: Given an untrimmed video sample, use a visual pre-training model to perform a preliminary extraction of the video representation, and apply local-global, context-aware temporal-dependency encoding to the preliminary representation to obtain the final video representation, thereby improving the ability of the video representation to model contextual temporal dependencies.
In step S1, video frame features are extracted offline with the visual pre-training model and T frames are sampled uniformly; a linear transformation layer then yields a set of video representations $V = \{v_i\}_{i=1}^{T}$, where $v_i$ is the representation of the i-th frame, and context-aware temporal-dependency encoding is applied to $V$ in a local-global manner.
The local-global context-aware encoding in S1 first applies local context-aware encoding to the video representation $V$ to obtain the representation $V^{L}$, and then applies global context-aware encoding to $V^{L}$ to obtain the representation $V^{G}$.
The local context-aware encoding and the global context-aware encoding in step S1 are implemented as follows:
Step S1.1: Local context-aware encoding uses a stack of successive local transformer blocks equipped with 1D shifted windows. The video representation $V$ is fed into the first block as the initial representation, its output is fed into the second block, and so on; the output of the last block is the video representation $V^{L}$ produced by the local context-aware encoding. Inside each block, the input representation is layer-normalized and passed through a 1D window multi-head self-attention module, and the result is added to the block input; the sum is layer-normalized and passed through a multi-layer perceptron, with another residual addition; the result is layer-normalized and passed through a 1D shifted-window multi-head self-attention module, again with a residual addition; finally it is layer-normalized and passed through a multi-layer perceptron, and the last residual sum is the output of the $l$-th local transformer block with 1D shifted windows.
Specifically, the $l$-th local transformer block with 1D shifted windows can be written as:

$$\hat{H}^{l} = \text{1D-W-MSA}(\text{LN}(H^{l-1})) + H^{l-1}$$
$$\tilde{H}^{l} = \text{MLP}(\text{LN}(\hat{H}^{l})) + \hat{H}^{l}$$
$$\bar{H}^{l} = \text{1D-SW-MSA}(\text{LN}(\tilde{H}^{l})) + \tilde{H}^{l}$$
$$H^{l} = \text{MLP}(\text{LN}(\bar{H}^{l})) + \bar{H}^{l}$$

where $l = 1, \dots, L_{1}$, $H^{0} = V$, and $V^{L} = H^{L_{1}}$; $\text{LN}(\cdot)$ denotes layer normalization, $\text{1D-W-MSA}(\cdot)$ the 1D window multi-head self-attention module, $\text{1D-SW-MSA}(\cdot)$ the 1D shifted-window multi-head self-attention module, and $\text{MLP}(\cdot)$ the multi-layer perceptron.
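The following is a minimal PyTorch sketch of such a 1D shifted-window local transformer block. It is an illustration only: the module and parameter names (Window1DAttention, LocalTransformerBlock, window_size) are not from the patent, the sequence length is assumed to be divisible by the window size, and the attention mask that Swin-style models apply to cyclically shifted windows is omitted for brevity.

```python
import torch
import torch.nn as nn

class Window1DAttention(nn.Module):
    """Multi-head self-attention applied independently inside 1D windows."""
    def __init__(self, dim, num_heads=8, window_size=16, shift=False):
        super().__init__()
        self.window_size, self.shift = window_size, shift
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, D), T % window_size == 0
        if self.shift:                         # 1D shifted window: cyclic shift
            x = torch.roll(x, -self.window_size // 2, dims=1)
        B, T, D = x.shape
        w = x.reshape(B * T // self.window_size, self.window_size, D)
        w, _ = self.attn(w, w, w)              # attention restricted to each window
        x = w.reshape(B, T, D)
        if self.shift:                         # undo the cyclic shift
            x = torch.roll(x, self.window_size // 2, dims=1)
        return x

class LocalTransformerBlock(nn.Module):
    """One local transformer block: W-MSA and SW-MSA sub-blocks, each followed by
    an MLP, with pre-layer-norm and residual connections."""
    def __init__(self, dim, num_heads=8, window_size=16):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.wmsa = Window1DAttention(dim, num_heads, window_size, shift=False)
        self.swmsa = Window1DAttention(dim, num_heads, window_size, shift=True)
        self.mlp1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.wmsa(self.norms[0](x))    # 1D window MSA + residual
        x = x + self.mlp1(self.norms[1](x))    # MLP + residual
        x = x + self.swmsa(self.norms[2](x))   # 1D shifted-window MSA + residual
        x = x + self.mlp2(self.norms[3](x))    # MLP + residual
        return x
```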
Step S1.2: Global context-aware encoding consists of a stack of regular transformer blocks. The representation $V^{L}$ is fed into the first block as the initial representation, its output into the second block, and so on; the output of the last block is the video representation $V^{G}$ produced by the global context-aware encoding. Inside each block, the input passes through a regular multi-head self-attention module, the result is added to the block input and layer-normalized; the sum then passes through a multi-layer perceptron, is added to its input, and is layer-normalized, giving the output of the $l$-th regular transformer block.
Specifically, the $l$-th regular transformer block can be written as:

$$\hat{G}^{l} = \text{LN}(\text{MSA}(G^{l-1}) + G^{l-1})$$
$$G^{l} = \text{LN}(\text{MLP}(\hat{G}^{l}) + \hat{G}^{l})$$

where $l = 1, \dots, L_{2}$, $G^{0} = V^{L}$, and $V^{G} = G^{L_{2}}$; $\text{MSA}(\cdot)$ denotes the regular multi-head self-attention module, $\text{LN}(\cdot)$ layer normalization, and $\text{MLP}(\cdot)$ the multi-layer perceptron.
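A minimal PyTorch sketch of the regular (global) transformer block described above, using the post-normalization order given in the text; module names and the MLP expansion ratio are illustrative assumptions.

```python
import torch.nn as nn

class GlobalTransformerBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, T, D)
        a, _ = self.attn(x, x, x)               # global multi-head self-attention
        x = self.norm1(x + a)                   # residual add, then layer norm
        x = self.norm2(x + self.mlp(x))         # feed-forward, residual, layer norm
        return x
```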
Step S2: For the text query associated with the untrimmed video, use a pre-trained word embedding model to initialize the word embedding of each word in the query, and then apply a multi-layer bidirectional long short-term memory network for context encoding to obtain the word-level and global-level representations of the text query.
In step S2, the learnable word embedding of each word in the query text is initialized with a pre-trained word embedding model, giving the embedding sequence $E = \{e_i\}_{i=1}^{N}$ of the text query, where $e_i$ is the embedding of the i-th word of the query. A multi-layer bidirectional long short-term memory network (BLSTM) then context-encodes $E$ to obtain the word-level text query representation $Q^{w} = \{q_i\}_{i=1}^{N}$; the global-level text query representation $q^{g}$ is obtained by concatenating the forward hidden state vector of the last word with the backward hidden state vector of the first word, and the final text query representation is $Q = \{q^{g}, Q^{w}\}$.
The specific implementation is as follows:

$$\overrightarrow{h}_i, \overleftarrow{h}_i = \text{BLSTM}(e_i), \quad q_i = [\overrightarrow{h}_i; \overleftarrow{h}_i], \quad q^{g} = [\overrightarrow{h}_N; \overleftarrow{h}_1]$$

where $[\cdot;\cdot]$ denotes the concatenation of the forward hidden state vector and the backward hidden state vector.
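A minimal PyTorch sketch of this query encoder, assuming the pre-trained word embeddings are supplied as a tensor; class names and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    def __init__(self, pretrained_embeddings, hidden=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
        self.blstm = nn.LSTM(pretrained_embeddings.size(1), hidden, num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, word_ids):                # word_ids: (B, N)
        e = self.embed(word_ids)                # (B, N, E) word embeddings
        h, _ = self.blstm(e)                    # (B, N, 2*hidden) word-level representation
        fwd_last = h[:, -1, :h.size(2) // 2]    # forward hidden state of the last word
        bwd_first = h[:, 0, h.size(2) // 2:]    # backward hidden state of the first word
        q_global = torch.cat([fwd_last, bwd_first], dim=-1)   # global-level representation
        return h, q_global
```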
Step S3: For the extracted video representation and text query representation, use the multi-granularity cascaded interaction network to perform interaction between the video modality and the text query modality and obtain a query-guided enhanced video representation, thereby improving cross-modal alignment accuracy.
The multi-granularity cascaded interaction network in step S3 first decodes the text query representation $Q$ under the guidance of the video representation $V^{G}$ to obtain the video-guided query representation $\tilde{Q} = \{\tilde{q}^{g}, \tilde{Q}^{w}\}$, where $\tilde{q}^{g}$ is the global-level video-guided query representation and $\tilde{Q}^{w}$ the word-level video-guided query representation. It then fuses the video-guided query representation with the video modality representation $V^{G}$ through cascaded cross-modal fusion to obtain the final enhanced video representation. Video-guided query decoding narrows the semantic gap between the video representation and the text query representation.
Step S3 specifically comprises the following steps:
Step S3.1: Video-guided query decoding uses a stack of cross-modal decoding blocks. The text query representation $Q$ is fed into the first block as the initial representation, its output into the second block, and so on; the output of the last block is the video-guided query representation $\tilde{Q}$. Inside each block, the query representation first passes through a multi-head self-attention module; the result is then used as the query of a multi-head cross-attention module whose keys and values are the video representation $V^{G}$; finally, a regular feed-forward network produces the output of the $l$-th cross-modal decoding block.
Specifically, the $l$-th cross-modal decoding block can be written as:

$$\hat{S}^{l} = \text{MSA}(S^{l-1}), \quad \bar{S}^{l} = \text{MCA}(\hat{S}^{l}, V^{G}, V^{G}), \quad S^{l} = \text{FFN}(\bar{S}^{l})$$

where $l = 1, \dots, L_{3}$, $S^{0} = Q$, and $\tilde{Q} = S^{L_{3}}$; $\text{MSA}(\cdot)$ and $\text{MCA}(\cdot)$ denote the multi-head self-attention module and the multi-head cross-attention module (query, key, value), respectively, and $\text{FFN}(\cdot)$ is a regular feed-forward network.
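A minimal PyTorch sketch of one such cross-modal decoding block. The residual connections and layer normalization, which are standard in transformer decoders but not spelled out in the text above, are included as assumptions; class names are illustrative.

```python
import torch.nn as nn

class CrossModalDecoderBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])

    def forward(self, query_repr, memory):               # (B, N, D), (B, T, D)
        s, _ = self.self_attn(query_repr, query_repr, query_repr)
        x = self.norms[0](query_repr + s)
        c, _ = self.cross_attn(x, memory, memory)         # attend to the other modality
        x = self.norms[1](x + c)
        return self.norms[2](x + self.ffn(x))             # feed-forward network
```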
Step S3.2: Cascaded cross-modal fusion first fuses the global-level video-guided query representation $\tilde{q}^{g}$ with the video modality representation $V^{G}$ at the coarse-grained level by element-wise multiplication, giving the coarsely fused video representation $M$. The word-level video-guided query representation $\tilde{Q}^{w}$ is then fused with $M$ at the fine-grained level through another stack of cross-modal decoding blocks: $M$ is fed into the first block as the initial representation, its output into the second block, and so on, and the output of the last block is the enhanced video representation $\tilde{V}$. Inside each block, the video representation first passes through a multi-head self-attention module; the result is then used as the query of a multi-head cross-attention module whose keys and values are the word-level video-guided query representation $\tilde{Q}^{w}$; finally, a regular feed-forward network produces the output of the $l$-th cross-modal decoding block. The coarse-grained cross-modal fusion suppresses background video frames and emphasizes foreground video frames, and can be written as $M = V^{G} \odot \tilde{q}^{g}$, where $\odot$ denotes element-wise multiplication.
The $l$-th cross-modal decoding block of the fine-grained fusion can be written as:

$$\hat{R}^{l} = \text{MSA}(R^{l-1}), \quad \bar{R}^{l} = \text{MCA}(\hat{R}^{l}, \tilde{Q}^{w}, \tilde{Q}^{w}), \quad R^{l} = \text{FFN}(\bar{R}^{l})$$

where $l = 1, \dots, L_{4}$, $R^{0} = M$, and $\tilde{V} = R^{L_{4}}$; $\text{MSA}(\cdot)$ and $\text{MCA}(\cdot)$ denote the multi-head self-attention module and the multi-head cross-attention module, respectively, and $\text{FFN}(\cdot)$ is a regular feed-forward network.
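A minimal sketch of the cascaded fusion step, reusing the CrossModalDecoderBlock sketched above (an assumption for illustration; the patent gives no code): coarse-grained fusion multiplies each frame feature element-wise by the video-guided global query representation, and fine-grained fusion passes the result through cross-modal decoding blocks that attend to the word-level query representation.

```python
import torch.nn as nn

class CascadedCrossModalFusion(nn.Module):
    def __init__(self, dim, num_blocks=2, num_heads=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [CrossModalDecoderBlock(dim, num_heads) for _ in range(num_blocks)])

    def forward(self, video_repr, q_global, q_words):
        # Coarse-grained fusion: suppress background frames, emphasise foreground frames.
        x = video_repr * q_global.unsqueeze(1)            # (B, T, D) * (B, 1, D)
        # Fine-grained fusion: video frames attend to word-level query representations.
        for block in self.blocks:
            x = block(x, q_words)
        return x                                          # enhanced video representation
```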
Step S4: For the video representation obtained after the multi-granularity cascaded interaction, use an attention-based temporal location regression module to predict the temporal location of the target video segment corresponding to the text query.
The attention-based temporal location regression module in step S4 passes the video sequence representation $\tilde{V}$ obtained from the multi-granularity cascaded interaction through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention scores $a = \{a_i\}_{i=1}^{T}$; the enhanced video representation $\tilde{V}$ and the attention scores are then combined by an attention pooling layer to obtain the target segment representation $f^{att}$; finally, a multi-layer perceptron directly regresses the normalized temporal center coordinate $\hat{c}$ and the segment duration $\hat{w}$ of the target segment from $f^{att}$.
Specifically, the attention-based temporal location regression can be written as:

$$a = \text{SoftMax}(\text{MLP}(\tilde{V})), \quad f^{att} = \sum_{i=1}^{T} a_i \tilde{v}_i, \quad (\hat{c}, \hat{w}) = \text{MLP}(f^{att})$$

where $\tilde{V} = \{\tilde{v}_i\}_{i=1}^{T}$ is the enhanced video representation, i.e. the video sequence representation output after the multi-granularity cascaded interaction, and the attention pooling layer aggregates the video sequence representation.
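A minimal PyTorch sketch of this attention-based temporal location regression head; the sigmoid used to keep the regressed center and duration in [0, 1] is an assumption, and names are illustrative.

```python
import torch
import torch.nn as nn

class AttentionLocationRegressor(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.score_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.reg_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, enhanced_video):                    # (B, T, D)
        attn = torch.softmax(self.score_mlp(enhanced_video).squeeze(-1), dim=1)   # (B, T)
        pooled = torch.bmm(attn.unsqueeze(1), enhanced_video).squeeze(1)          # attention pooling, (B, D)
        center, width = torch.sigmoid(self.reg_mlp(pooled)).unbind(-1)            # normalised centre and duration
        return attn, center, width
```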
Step S5: Train the cross-modal temporal behavior localization model composed of steps S1-S4 on a training sample set. The total loss function used in training comprises an attention alignment loss and a boundary loss, where the boundary loss comprises a smooth L1 loss and a temporal generalized-IoU loss, so as to better match the evaluation criteria of the temporal localization task. The training sample set consists of {video, query, target-segment temporal annotation} triplet samples.
The training of the model in step S5 comprises the following steps:
Step S5.1: Compute the attention alignment loss $\mathcal{L}_{att}$. The logarithm of the temporal attention score of the i-th frame is multiplied by an indicator value $y_i$, the products are accumulated over the sampled frames, and the loss is obtained by dividing this sum by the indicator values accumulated over the sampled frames, where $y_i = 1$ indicates that the i-th frame lies inside the annotated temporal segment and $y_i = 0$ otherwise. The attention alignment loss encourages video frames inside the annotated segment to receive higher attention scores, and is computed as

$$\mathcal{L}_{att} = -\frac{\sum_{i=1}^{T} y_i \log a_i}{\sum_{i=1}^{T} y_i}$$

where $T$ is the number of sampled frames and $a_i$ is the temporal attention score of the i-th frame.
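A minimal sketch of this attention alignment loss, assuming the per-frame attention scores and an inside-segment indicator mask are available as tensors; names are illustrative.

```python
import torch

def attention_alignment_loss(attn, inside_mask, eps=1e-8):
    """attn: (B, T) temporal attention scores; inside_mask: (B, T) with 1 for frames
    inside the annotated segment and 0 otherwise. Returns a per-sample loss."""
    log_attn = torch.log(attn + eps)
    return -(inside_mask * log_attn).sum(dim=1) / (inside_mask.sum(dim=1) + eps)
```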
Step S5.2: Compute the boundary loss $\mathcal{L}_{b}$, which combines a smooth L1 loss and a temporal generalized-IoU loss. A first smooth L1 loss is computed on the difference between the normalized temporal center coordinate $\hat{c}$ of the predicted segment and the normalized temporal center coordinate $c$ of the annotated segment; a second smooth L1 loss is computed on the difference between the predicted segment duration $\hat{w}$ and the annotated segment duration $w$; their sum is the regression loss $\mathcal{L}_{sl1}$. The generalized IoU between the regressed segment $\hat{p}$ and the corresponding annotated segment $p$ is computed, and one plus its negative gives the temporal generalized-IoU loss $\mathcal{L}_{giou}$. The boundary loss is the sum of the two, computed as

$$\mathcal{L}_{sl1} = \text{SmoothL1}(\hat{c} - c) + \text{SmoothL1}(\hat{w} - w)$$
$$\mathcal{L}_{giou} = 1 - \left(\text{IoU}(\hat{p}, p) - \frac{|C \setminus (\hat{p} \cup p)|}{|C|}\right)$$
$$\mathcal{L}_{b} = \mathcal{L}_{sl1} + \mathcal{L}_{giou}$$

where $\text{SmoothL1}(\cdot)$ is the smooth L1 loss function, $\text{IoU}(\hat{p}, p)$ is the temporal intersection-over-union of the two segments, and $C$ is the smallest temporal box covering the regressed segment $\hat{p}$ and the annotated segment $p$.
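A minimal sketch of the boundary loss, combining smooth L1 regression on the normalized center and duration with a 1D generalized-IoU term as described above. The conversion from (center, width) to (start, end) boundaries is an assumption; names are illustrative.

```python
import torch
import torch.nn.functional as F

def boundary_loss(pred_c, pred_w, gt_c, gt_w, eps=1e-8):
    # Smooth L1 regression on the centre coordinate and the segment duration.
    l_reg = F.smooth_l1_loss(pred_c, gt_c) + F.smooth_l1_loss(pred_w, gt_w)

    # Convert (centre, width) to segment boundaries.
    p_s, p_e = pred_c - pred_w / 2, pred_c + pred_w / 2
    g_s, g_e = gt_c - gt_w / 2, gt_c + gt_w / 2

    inter = (torch.min(p_e, g_e) - torch.max(p_s, g_s)).clamp(min=0)
    union = (p_e - p_s) + (g_e - g_s) - inter
    enclose = torch.max(p_e, g_e) - torch.min(p_s, g_s)      # smallest covering temporal box
    giou = inter / (union + eps) - (enclose - union) / (enclose + eps)
    l_giou = (1 - giou).mean()
    return l_reg + l_giou
```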
Step S5.3: The total training loss is the weighted sum of the attention alignment loss $\mathcal{L}_{att}$ and the boundary loss $\mathcal{L}_{b}$.
Specifically, the total loss function is

$$\mathcal{L} = \alpha \mathcal{L}_{att} + \beta \mathcal{L}_{b}$$

where $\alpha$ and $\beta$ are weight hyperparameters, and an optimizer updates the model parameters during the training phase.
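A minimal sketch of one training step combining the two losses with the weight hyperparameters and the Adam optimizer, reusing the loss functions sketched above; the assumption that the model returns the attention scores together with the regressed center and width is illustrative.

```python
def training_step(model, optimizer, video_feats, word_ids, inside_mask, gt_c, gt_w,
                  alpha=1.0, beta=1.0):
    attn, pred_c, pred_w = model(video_feats, word_ids)
    loss = alpha * attention_alignment_loss(attn, inside_mask).mean() \
         + beta * boundary_loss(pred_c, pred_w, gt_c, gt_w)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```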
Table 1 compares the accuracy of the proposed method with other representative existing methods on the TACoS test set, using the "R@n, IoU=m" criterion with n=1 and m={0.1, 0.3, 0.5}.
Table 1
Corresponding to the foregoing embodiments of the cross-modal temporal behavior localization method, the present invention also provides embodiments of a cross-modal temporal behavior localization device of the multi-granularity cascaded interaction network.
Referring to Figure 4, the cross-modal temporal behavior localization device of the multi-granularity cascaded interaction network provided by an embodiment of the present invention comprises one or more processors configured to implement the cross-modal temporal behavior localization method of the multi-granularity cascaded interaction network of the above embodiments.
本发明多粒度级联交互网络的跨模态时序行为定位装置的实施例可以应用在任意具备数据处理能力的设备上,该任意具备数据处理能力的设备可以为诸如计算机等设备或装置。装置实施例可以通过软件实现,也可以通过硬件或者软硬件结合的方式实现。以软件实现为例,作为一个逻辑意义上的装置,是通过其所在任意具备数据处理能力的设备的处理器将非易失性存储器中对应的计算机程序指令读取到内存中运行形成的。从硬件层面而言,如图4所示,为本发明多粒度级联交互网络的跨模态时序行为定位装置所在任意具备数据处理能力的设备的一种硬件结构图,除了图4所示的处理器、内存、网络接口、以及非易失性存储器之外,实施例中装置所在的任意具备数据处理能力的设备通常根据该任意具备数据处理能力的设备的实际功能,还可以包括其他硬件,对此不再赘述。The embodiment of the device for locating cross-modal timing behavior in a multi-granularity cascade interaction network of the present invention can be applied to any device with data processing capability, which can be a device or device such as a computer. The apparatus embodiment may be implemented by software, or may be implemented by hardware or a combination of software and hardware. Taking software implementation as an example, a device in a logical sense is formed by reading the corresponding computer program instructions in the non-volatile memory into the memory through the processor of any device with data processing capability where it is located. From the perspective of hardware, as shown in FIG. 4 , it is a hardware structure diagram of any device with data processing capability where the cross-modal timing behavior positioning device of the multi-granularity cascading interaction network of the present invention is located, except that shown in FIG. 4 In addition to the processor, memory, network interface, and non-volatile memory, any device with data processing capability where the apparatus in the embodiment is located may also include other hardware, usually according to the actual function of any device with data processing capability, This will not be repeated here.
For details of the implementation of the functions and effects of each unit in the above apparatus, refer to the implementation of the corresponding steps in the above method, which will not be repeated here.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the relevant descriptions of the method embodiments for the related parts. The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement this without creative effort.
An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, it implements the cross-modal temporal behavior localization method of the multi-granularity cascade interaction network in the foregoing embodiment.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of such a device, for example a plug-in hard disk, a smart media card (SMC), an SD card, or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device with data processing capability. The computer-readable storage medium is used to store the computer program and other programs and data required by the device, and may also be used to temporarily store data that has been or is to be output.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210052687.8A CN114064967B (en) | 2022-01-18 | 2022-01-18 | Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210052687.8A CN114064967B (en) | 2022-01-18 | 2022-01-18 | Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114064967A true CN114064967A (en) | 2022-02-18 |
CN114064967B CN114064967B (en) | 2022-05-06 |
Family
ID=80231249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210052687.8A Active CN114064967B (en) | 2022-01-18 | 2022-01-18 | Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114064967B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107346328A (en) * | 2017-05-25 | 2017-11-14 | 北京大学 | A kind of cross-module state association learning method based on more granularity hierarchical networks |
CN109858032A (en) * | 2019-02-14 | 2019-06-07 | 程淑玉 | Merge more granularity sentences interaction natural language inference model of Attention mechanism |
CN111309971A (en) * | 2020-01-19 | 2020-06-19 | 浙江工商大学 | Multi-level coding-based text-to-video cross-modal retrieval method |
CN111310676A (en) * | 2020-02-21 | 2020-06-19 | 重庆邮电大学 | Video action recognition method based on CNN-LSTM and attention |
CN111782871A (en) * | 2020-06-18 | 2020-10-16 | 湖南大学 | Cross-modal video moment location method based on spatiotemporal reinforcement learning |
CN111930999A (en) * | 2020-07-21 | 2020-11-13 | 山东省人工智能研究院 | Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation |
CN112115849A (en) * | 2020-09-16 | 2020-12-22 | 中国石油大学(华东) | Video scene recognition method based on multi-granularity video information and attention mechanism |
EP3933686A2 (en) * | 2020-11-27 | 2022-01-05 | Beijing Baidu Netcom Science Technology Co., Ltd. | Video processing method, apparatus, electronic device, storage medium, and program product |
CN113111837A (en) * | 2021-04-25 | 2021-07-13 | 山东省人工智能研究院 | Intelligent monitoring video early warning method based on multimedia semantic analysis |
CN113204675A (en) * | 2021-07-07 | 2021-08-03 | 成都考拉悠然科技有限公司 | Cross-modal video time retrieval method based on cross-modal object inference network |
CN113934887A (en) * | 2021-12-20 | 2022-01-14 | 成都考拉悠然科技有限公司 | No-proposal time sequence language positioning method based on semantic decoupling |
Non-Patent Citations (5)
Title |
---|
JONGHWAN MUN: "Local-Global Video-Text Interactions for Temporal Grounding", arXiv *
SHIZHE CHEN: "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning", arXiv *
ZHENZHI WANG: "Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding", arXiv *
DAI SIDA: "Research on Deep Multimodal Fusion Technology and Time-Series Analysis Algorithms", China Master's Theses Full-text Database *
ZHAO CAIRONG, QI DING, et al.: "Key Technologies of Intelligent Video Surveillance: A Survey of Person Re-identification", SCIENTIA SINICA Informationis *
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114581821A (en) * | 2022-02-23 | 2022-06-03 | 腾讯科技(深圳)有限公司 | Video detection method, system, storage medium and server |
CN114581821B (en) * | 2022-02-23 | 2024-11-08 | 腾讯科技(深圳)有限公司 | Video detection method, system, storage medium and server |
CN114357124A (en) * | 2022-03-18 | 2022-04-15 | 成都考拉悠然科技有限公司 | Video paragraph positioning method based on language reconstruction and graph mechanism |
CN114896451A (en) * | 2022-05-25 | 2022-08-12 | 云从科技集团股份有限公司 | Video clip positioning method, system, control device and readable storage medium |
CN114792424A (en) * | 2022-05-30 | 2022-07-26 | 北京百度网讯科技有限公司 | Document image processing method and device and electronic equipment |
CN114925232A (en) * | 2022-05-31 | 2022-08-19 | 杭州电子科技大学 | A cross-modal temporal video localization method under the framework of text question answering |
CN115131655A (en) * | 2022-09-01 | 2022-09-30 | 浙江啄云智能科技有限公司 | Training method and device of target detection model and target detection method |
CN115187783A (en) * | 2022-09-09 | 2022-10-14 | 之江实验室 | Multi-task hybrid supervised medical image segmentation method and system based on federated learning |
CN115223086A (en) * | 2022-09-20 | 2022-10-21 | 之江实验室 | Cross-modal action positioning method and system based on interactive attention guidance and correction |
CN115223086B (en) * | 2022-09-20 | 2022-12-06 | 之江实验室 | Cross-modal action localization method and system based on interactive attention guidance and correction |
CN115238130A (en) * | 2022-09-21 | 2022-10-25 | 之江实验室 | Temporal language localization method and device based on modal customization collaborative attention interaction |
CN115238130B (en) * | 2022-09-21 | 2022-12-06 | 之江实验室 | Time sequence language positioning method and device based on modal customization collaborative attention interaction |
CN116385070A (en) * | 2023-01-18 | 2023-07-04 | 中国科学技术大学 | E-commerce short video advertisement multi-objective estimation method, system, device and storage medium |
CN116385070B (en) * | 2023-01-18 | 2023-10-03 | 中国科学技术大学 | E-commerce short video advertising multi-target prediction methods, systems, equipment and storage media |
CN116246213A (en) * | 2023-05-08 | 2023-06-09 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and medium |
CN116824461B (en) * | 2023-08-30 | 2023-12-08 | 山东建筑大学 | Question understanding guiding video question answering method and system |
CN117076712A (en) * | 2023-10-16 | 2023-11-17 | 中国科学技术大学 | Video retrieval method, system, device and storage medium |
CN117076712B (en) * | 2023-10-16 | 2024-02-23 | 中国科学技术大学 | Video retrieval method, system, device and storage medium |
CN117724153B (en) * | 2023-12-25 | 2024-05-14 | 北京孚梅森石油科技有限公司 | Lithology recognition method based on multi-window cascading interaction |
CN117724153A (en) * | 2023-12-25 | 2024-03-19 | 北京孚梅森石油科技有限公司 | Lithology recognition method based on multi-window cascading interaction |
CN117876929B (en) * | 2024-01-12 | 2024-06-21 | 天津大学 | A temporal object localization method based on progressive multi-scale context learning |
CN117876929A (en) * | 2024-01-12 | 2024-04-12 | 天津大学 | A temporal object localization method based on progressive multi-scale context learning |
CN117609553B (en) * | 2024-01-23 | 2024-03-22 | 江南大学 | Video retrieval method and system based on local feature enhancement and modal interaction |
CN117609553A (en) * | 2024-01-23 | 2024-02-27 | 江南大学 | Video retrieval method and system based on local feature enhancement and modal interaction |
CN118897905A (en) * | 2024-10-08 | 2024-11-05 | 山东大学 | A video clip positioning method and system based on fine-grained spatiotemporal correlation modeling |
CN119152337A (en) * | 2024-11-20 | 2024-12-17 | 合肥工业大学 | Audiovisual event localization system and method based on cross-modal consistency and temporal multi-granularity collaboration |
CN119152337B (en) * | 2024-11-20 | 2025-02-11 | 合肥工业大学 | Audio-visual event positioning system and method based on cross-modal consistency and time sequence multi-granularity collaboration |
Also Published As
Publication number | Publication date |
---|---|
CN114064967B (en) | 2022-05-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114064967B (en) | Cross-modal timing behavior localization method and device for multi-granularity cascade interaction network | |
CN110209836B (en) | Method and device for remote supervision relationship extraction | |
WO2021121198A1 (en) | Semantic similarity-based entity relation extraction method and apparatus, device and medium | |
CN110059160B (en) | End-to-end context-based knowledge base question-answering method and device | |
CN106919652B (en) | Short video automatic labeling method and system based on multi-source multi-view transductive learning | |
Tang et al. | Comprehensive instructional video analysis: The coin dataset and performance evaluation | |
CN111414845B (en) | Multi-form sentence video positioning method based on space-time diagram inference network | |
CN114743143A (en) | A video description generation method and storage medium based on multi-concept knowledge mining | |
CN115223086A (en) | Cross-modal action positioning method and system based on interactive attention guidance and correction | |
CN113963304B (en) | Cross-modal video timing action localization method and system based on timing-spatial graph | |
CN116186328A (en) | Video text cross-modal retrieval method based on pre-clustering guidance | |
CN116935274A (en) | Weak supervision cross-mode video positioning method based on modal feature alignment | |
CN114925232A (en) | A cross-modal temporal video localization method under the framework of text question answering | |
WO2023092719A1 (en) | Information extraction method for medical record data, and terminal device and readable storage medium | |
CN116127132A (en) | A Temporal Language Localization Approach Based on Cross-Modal Text-Related Attention | |
US20230326178A1 (en) | Concept disambiguation using multimodal embeddings | |
Huang | Multi-modal video summarization | |
CN113688871B (en) | Transformer-based video multi-label action identification method | |
CN114339403A (en) | A method, system, device and readable storage medium for generating video action clips | |
CN117152669B (en) | Cross-mode time domain video positioning method and system | |
Hao et al. | What matters: Attentive and relational feature aggregation network for video-text retrieval | |
CN115238130B (en) | Time sequence language positioning method and device based on modal customization collaborative attention interaction | |
CN116935292A (en) | A short video scene classification method and system based on self-attention model | |
CN114282537B (en) | Social text-oriented cascading linear entity relation extraction method | |
Pan et al. | A Multiple Utterances based Neural Network Model for Joint Intent Detection and Slot Filling. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |