CN115098728A - Video retrieval method and device - Google Patents
- Publication number: CN115098728A (application CN202210628871.2A)
- Authority: CN (China)
- Legal status: Pending (an assumption, not a legal conclusion)
Classifications
- G06F16/783—Retrieval characterised by using metadata automatically derived from the content
- G06F16/7867—Retrieval characterised by using metadata manually generated, e.g. tags, keywords, comments, title and artist information
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
- G06V10/761—Proximity, similarity or dissimilarity measures
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
Description
Technical Field

The present invention relates to the technical field of image processing, and in particular to a video retrieval method and device.
Background Art

With the widespread popularity of video sharing, people increasingly expect to find the video content they are interested in efficiently and accurately. Research on video retrieval arose to meet this demand, and researchers have developed a variety of retrieval methods whose goal is to find, within a candidate set, the videos most relevant to the query information. In practical applications, however, videos are not provided as pre-trimmed clips the way they are in experimental datasets; long, unedited videos are everywhere. Such videos usually contain complex content, most of which is irrelevant to the given query, and only a small part satisfies the query description. Another topic, video moment retrieval, has therefore become an emerging research focus in recent years.

Video moment retrieval adds temporal localization: it finds the exact start and end points of the query-relevant segment within a long, unedited video. Unlike retrieving whole videos from a candidate set, video moment retrieval provides finer-grained temporal localization within long videos and spares the user a manual search.

At present, video moment retrieval methods are mainly driven by natural-language queries: they aim to locate, in a long video, the start and end times of the segment related to a given natural-language description. Using text as the query modality, however, limits the richness and complexity of the information a query can carry. Because video has an advantage in information expression, containing both the visual content of each frame and the semantic content across frames, video-query video moment retrieval (VQ-VMR) has gradually become an emerging research topic.

Given a query video clip, the VQ-VMR task aims to find, within a long reference video, the segment that semantically corresponds to the query video. Precisely retrieving the target moment requires a measure of semantic correlation between the query video and candidate segments of the reference video. Since the information contained in videos is comprehensive and complex, this correlation is difficult to measure, and a substantial technical gap remains in building a framework that measures it effectively.
Summary of the Invention

The present invention aims to solve at least one of the technical problems in the prior art. To this end, the present invention provides a video retrieval method and device.

Specifically, the present invention provides the following technical solutions.

In a first aspect, an embodiment of the present invention provides a video retrieval method, including:

splitting a long reference video to obtain multiple candidate intervals; and

inputting a query video and the multiple candidate intervals into a video retrieval model to obtain, among the multiple candidate intervals, the candidate interval that matches the query video;

wherein the video retrieval model is obtained by training an initial machine learning model that takes a sample query video and multiple sample candidate intervals, obtained by splitting a sample long reference video, as input, and outputs the sample candidate interval that matches the sample query video; the video retrieval model is trained on both the local and the global semantic similarity between the sample query video and the sample candidate intervals.
Further, inputting the query video and the multiple candidate intervals into the video retrieval model to obtain the matching candidate interval includes:

video-encoding the query video and the multiple candidate intervals before inputting them into the video retrieval model;

wherein the video encoding process includes:

extracting features from the query video and from each candidate interval with a C3D model; and

aggregating the extracted features with two long short-term memory (LSTM) units to obtain the encodings of the query video and the candidate intervals.
Further, the video retrieval model includes a local semantic similarity module that performs local semantic similarity processing on the query video and the candidate intervals. The processing includes:

matching the features of the query video and the candidate intervals with a Hadamard product operation; and

using an attention mechanism to enhance matched feature-representation pairs and suppress mismatched ones.
Further, the video retrieval model includes a global semantic similarity module that performs global semantic similarity processing on the query video and the candidate intervals. The processing includes:

encoding all pairwise embeddings of the query video and a candidate interval into a three-dimensional tensor; and

taking the three-dimensional tensor as input and, accounting for the temporal dimension of the video, learning the semantic similarity between the query video and the candidate interval from a global perspective.
Further, the global semantic similarity module combines attention layers with convolution layers:

the three-dimensional tensor is passed through a stack of convolution-and-attention layer combinations whose convolution kernels progressively find matching embeddings at each layer, yielding a global matching result; and

the global matching result is globally average-pooled to obtain a single vector representing the inter-video match.
Further, the training process of the video retrieval model includes:

training the video retrieval model with a triplet loss and a regression loss as the loss functions;

wherein the triplet loss converges toward assigning, for a triplet consisting of a query video, a positive candidate-interval sample, and a negative candidate-interval sample, a higher matching score to the matching (positive) video pair and a lower matching score to the negative pair; and

the regression loss compares the predicted relative offsets of a candidate interval's center point and length with the true relative offsets.
In a second aspect, an embodiment of the present invention further provides a video retrieval device, including:

a splitting module, configured to split a long reference video to obtain multiple candidate intervals; and

a retrieval module, configured to input a query video and the multiple candidate intervals into a video retrieval model to obtain, among the multiple candidate intervals, the candidate interval that matches the query video;

wherein the video retrieval model is obtained by training an initial machine learning model that takes a sample query video and multiple sample candidate intervals, obtained by splitting a sample long reference video, as input, and outputs the sample candidate interval that matches the sample query video; the video retrieval model is trained on both the local and the global semantic similarity between the sample query video and the sample candidate intervals.
Further, the retrieval module is specifically configured to:

video-encode the query video and the multiple candidate intervals before inputting them into the video retrieval model;

wherein the video encoding process includes:

extracting features from the query video and from each candidate interval with a C3D model; and

aggregating the extracted features with two long short-term memory units to obtain the encodings of the query video and the candidate intervals.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the video retrieval method of the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the video retrieval method of the first aspect.
As the above technical solutions show, the video retrieval method and device provided in the embodiments of the present invention let the retrieval model consider the spatial and temporal dimensions of video simultaneously, computing and learning the semantic similarity between videos at both the local frame level and the global level, which substantially improves the accuracy of segment localization in video retrieval tasks.

It should be noted that additional aspects and advantages of the present invention will be set forth in part in the following description; in part they will become apparent from that description, or may be learned through practice of the invention.
Brief Description of the Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings required in their description are briefly introduced below. The drawings described below are clearly only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a video retrieval method provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the implementation principle of a video retrieval method provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of experimental results of a video retrieval method provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of the implementation process of a video retrieval method provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description

To make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are clearly only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 shows a flowchart of a video retrieval method provided by an embodiment of the present invention. Referring to Fig. 1, the method includes:

Step 101: splitting a long reference video to obtain multiple candidate intervals.

Step 102: inputting a query video and the multiple candidate intervals into a video retrieval model to obtain, among the multiple candidate intervals, the candidate interval that matches the query video.

Here the video retrieval model is obtained by training an initial machine learning model that takes a sample query video and multiple sample candidate intervals, obtained by splitting a sample long reference video, as input, and outputs the sample candidate interval that matches the sample query video; it is trained on both the local and the global semantic similarity between the sample query video and the sample candidate intervals.

As this technical solution shows, the video retrieval method provided by the embodiment of the present invention lets the retrieval model consider the spatial and temporal dimensions of video simultaneously, computing and learning the semantic similarity between videos at both the local frame level and the global level, which substantially improves the accuracy of segment localization in video retrieval tasks.
Based on the above embodiment, in this embodiment inputting the query video and the multiple candidate intervals into the video retrieval model to obtain the matching candidate interval includes:

video-encoding the query video and the multiple candidate intervals before inputting them into the video retrieval model;

wherein the video encoding process includes:

extracting features from the query video and from each candidate interval with a C3D model; and

aggregating the extracted features with two long short-term memory units to obtain the encodings of the query video and the candidate intervals.
Based on the above embodiment, in this embodiment the video retrieval model includes a local semantic similarity module that performs local semantic similarity processing on the query video and the candidate intervals. The processing includes:

matching the features of the query video and the candidate intervals with a Hadamard product operation; and

using an attention mechanism to enhance matched feature-representation pairs and suppress mismatched ones.
Based on the above embodiment, in this embodiment the video retrieval model includes a global semantic similarity module that performs global semantic similarity processing on the query video and the candidate intervals. The processing includes:

encoding all pairwise embeddings of the query video and a candidate interval into a three-dimensional tensor; and

taking the three-dimensional tensor as input and, accounting for the temporal dimension of the video, learning the semantic similarity between the query video and the candidate interval from a global perspective.
Based on the above embodiment, in this embodiment the global semantic similarity module combines attention layers with convolution layers:

the three-dimensional tensor is passed through a stack of convolution-and-attention layer combinations whose convolution kernels progressively find matching embeddings at each layer, yielding a global matching result; and

the global matching result is globally average-pooled to obtain a single vector representing the inter-video match.
Based on the above embodiment, in this embodiment the training process of the video retrieval model includes:

training the model with a triplet loss and a regression loss as the loss functions;

wherein the triplet loss converges toward assigning, for a triplet consisting of a query video, a positive candidate-interval sample, and a negative candidate-interval sample, a higher matching score to the matching (positive) video pair and a lower score to the negative pair; and

the regression loss compares the predicted relative offsets of a candidate interval's center point and length with the true relative offsets.
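A minimal sketch of such a combined objective in PyTorch. The margin, the loss weight `lam`, and the use of smooth L1 for the regression term are assumptions; the patent does not give concrete values or forms.

```python
import torch
import torch.nn.functional as F

# Sketch of the combined objective: a triplet term that pushes the
# query/positive-interval score above the query/negative-interval score by a
# margin, plus a regression term between predicted and ground-truth
# (center, length) offsets. margin, lam, and smooth L1 are assumptions.
def total_loss(s_pos, s_neg, pred_off, gt_off, margin=0.5, lam=1.0):
    trip = F.relu(margin - (s_pos - s_neg)).mean()  # higher score for the match
    reg = F.smooth_l1_loss(pred_off, gt_off)        # offset regression
    return trip + lam * reg

loss = total_loss(torch.tensor([0.9]), torch.tensor([0.2]),
                  torch.zeros(1, 2), torch.tensor([[0.1, -0.05]]))
```

With a positive score of 0.9 and negative score of 0.2, the triplet term is already satisfied and only the offset regression contributes.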
Considering the shortcomings of existing VQ-VMR techniques in effectively measuring the correlation between videos, the embodiments of the present invention propose a new framework for measuring the semantic correlation between a query video and reference video segments. By exploiting convolutional structures and attention mechanisms, the model considers the spatial and temporal dimensions of video simultaneously and computes and learns the semantic similarity between videos at both the local frame level and the global level, greatly improving the accuracy of segment localization for the VQ-VMR task. The video retrieval method provided by an embodiment of the present invention is explained below with reference to the schematic diagram in Fig. 2.
Step 1: video encoding.

First, the TAG proposal-generation algorithm splits the long reference video R into N candidate interval segments, denoted P = {p_1, p_2, ..., p_N}. The query video Q and an arbitrary candidate interval p_k then form one input pair of the model, and each is video-encoded in two steps.
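As an illustration only: the patent relies on the TAG proposal-generation algorithm, whose details are not reproduced here; a simple multi-scale sliding window is a hypothetical stand-in that shows the shape of this step's output, a list of N candidate intervals (start_frame, end_frame). Window sizes and stride are assumptions.

```python
# Hypothetical stand-in for the proposal step: a multi-scale sliding window
# over a reference video of num_frames frames, emitting (start, end) pairs.
def generate_candidate_intervals(num_frames, window_sizes=(32, 64, 128),
                                 stride_ratio=0.5):
    intervals = []
    for w in window_sizes:
        stride = max(1, int(w * stride_ratio))
        for start in range(0, max(1, num_frames - w + 1), stride):
            intervals.append((start, min(start + w, num_frames)))
    return intervals

proposals = generate_candidate_intervals(300)  # candidate intervals p_1..p_N
```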
(1) A C3D model extracts features from the query video Q and the candidate interval p_k, giving the feature sequences {q_1, ..., q_m} and {f_1, ..., f_n}, where q_i denotes the i-th feature of the query video, f_j the j-th feature of the candidate interval, and m and n the total numbers of features extracted from the query video and the candidate interval, respectively.

(2) Because these features only capture short-range video characteristics, the embodiment of the present invention uses two long short-term memory (LSTM) units to further aggregate the extracted features.

For simplicity, the aggregated features are still written q_i and f_j in what follows.
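The encoding stage might be sketched as follows in PyTorch, assuming the C3D features have already been extracted (random placeholders below stand in for them; the 4096-dimensional feature size matches C3D's fully connected layers, while the 512-dimensional hidden size and sequence lengths are assumptions).

```python
import torch
import torch.nn as nn

# Two LSTMs, one for the query and one for the candidate interval, aggregate
# the clip-level C3D features over time. Sizes are illustrative assumptions.
feat_dim, hidden = 4096, 512
query_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
cand_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

q_feats = torch.randn(1, 8, feat_dim)    # m = 8 query features q_1..q_m
p_feats = torch.randn(1, 12, feat_dim)   # n = 12 candidate features f_1..f_n

q_enc, _ = query_lstm(q_feats)   # aggregated query encoding
p_enc, _ = cand_lstm(p_feats)    # aggregated candidate encoding
```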
Step 2: inter-video semantic similarity at the local frame level.

(1) Feature matching. Based on the encoded video features, feature matching is performed with the Hadamard product. For any pair of video features, their correlation, denoted γ, is computed as:

γ_{i,j} = D(q_i ⊙ f_j)

where ⊙ denotes the Hadamard product and D a fully connected layer.

(2) Attention mechanism. An attention mechanism is introduced to effectively encourage matched feature-representation pairs while suppressing mismatched pairs. For each pair of video features, an attention weight α is computed:

α_{i,j} = σ(W^T D_4(γ_{i,j}))

where σ is the sigmoid function, D_4 a fully connected layer, and W a learnable parameter. Concretely, the higher the correlation between two video features, the larger α should be; the lower the correlation, the smaller α.

The correlation of a pair of video features is therefore finally expressed as:

t_{i,j} = α_{i,j} γ_{i,j}

(3) Three-dimensional tensor. All pairwise embeddings between the two video sequences are encoded into a three-dimensional tensor T = [t_{i,j}], with i = 1..m and j = 1..n.
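A minimal PyTorch sketch of this local matching step under assumed dimensions; the linear layers `D` and `w` below are illustrative stand-ins for the patent's fully connected layer and learnable weight W, and the feature counts are placeholders.

```python
import torch
import torch.nn as nn

# Local frame-level matching: Hadamard product of every (q_i, f_j) pair,
# a fully connected layer, a sigmoid attention gate, and the 3-D tensor T.
d = 512                       # assumed feature dimension after LSTM encoding
D = nn.Linear(d, d)           # fully connected layer applied to q_i ⊙ f_j
w = nn.Linear(d, 1)           # computes the attention logit for each pair

q_enc = torch.randn(8, d)     # m = 8 encoded query features
p_enc = torch.randn(12, d)    # n = 12 encoded candidate-interval features

pairs = q_enc.unsqueeze(1) * p_enc.unsqueeze(0)   # Hadamard products, (m, n, d)
gamma = D(pairs)                                  # pairwise correlations γ_ij
alpha = torch.sigmoid(w(gamma))                   # attention weights α_ij in (0, 1)
T = alpha * gamma                                 # t_ij = α_ij · γ_ij, shape (m, n, d)
```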
步骤三:全局视频间语义相似度计算。Step 3: Calculation of global semantic similarity between videos.
将获得的三维张量T作为输入,从全局的角度学习查询视频和候选区间之间的语义相似度。上一节中得到的匹配结果只在局部帧级计算查询视频与候选区间之间的关系,没有考虑视频的时间维度。在VQ-VMR任务中,匹配的视频应该在连续的时间范围内共享语义相似的帧。多个孤立的高匹配点不能代表视频整体间的高匹配度。因此,在本节中,本发明实施例引入了另一种注意机制来消除孤立的高匹配值,以获得更好的相关性度量。Taking the obtained 3D tensor T as input, the semantic similarity between the query video and the candidate interval is learned from a global perspective. The matching results obtained in the previous section only calculate the relationship between the query video and the candidate interval at the local frame level, without considering the temporal dimension of the video. In the VQ-VMR task, matched videos should share semantically similar frames in contiguous time frames. Multiple isolated high matching points cannot represent a high degree of matching between the videos as a whole. Therefore, in this section, embodiments of the present invention introduce another attention mechanism to eliminate isolated high matching values to obtain better correlation metrics.
本发明实施例提出了一种特殊的卷积层,称为注意力层。它是一种单通道卷积层,用Att(T)表示。本发明实施例使用卷积层和注意层的组合,并将其表示为:The embodiment of the present invention proposes a special convolutional layer called an attention layer. It is a single-channel convolutional layer, denoted by Att(T). The embodiment of the present invention uses a combination of convolutional layer and attention layer, and expresses it as:
Conv(T)·σ(ATT(T))Conv(T) σ(ATT(T))
其中T表示输入的张量。在本发明实施例的组合中,注意力层的宽度和高度与其前一层卷积层相同。注意层通过sigmoid函数后,与前卷积层相乘(如公式所示),相当于为前卷积层的每个位置分配相应的注意力权重。where T represents the input tensor. In the combination of embodiments of the present invention, the width and height of the attention layer are the same as that of the previous convolutional layer. After the attention layer passes the sigmoid function, it is multiplied by the previous convolutional layer (as shown in the formula), which is equivalent to assigning the corresponding attention weight to each position of the previous convolutional layer.
First, the input tensor T is passed through multiple convolution-attention combinations, whose learnable convolution kernels progressively identify matching embeddings at each layer. Specifically, three such combinations are applied, expressed as:
Tk = Conv(k)(Tk-1) · σ(Att(k)(Tk-1))
where k ∈ {1, 2, 3}, T0 is initialized to the tensor T obtained in the previous step, Conv(k) denotes the k-th convolutional layer, and Att(k) denotes the k-th attention layer.
The last feature map Tk is then globally average-pooled to obtain a single vector as the inter-video matching representation, denoted Tout.
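A minimal sketch of the global average pooling that collapses the final feature map into the vector Tout (the channel-first layout below is an assumption):

```python
import numpy as np

def global_average_pool(feature_map):
    """Average a (C, H, W) feature map over its spatial axes to obtain a
    single C-dimensional vector (T_out)."""
    return feature_map.mean(axis=(1, 2))

tk = np.arange(24, dtype=float).reshape(2, 3, 4)  # 2 channels, 3x4 map
tout = global_average_pool(tk)
```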
Step 4: Similarity score and location regression.
Taking the resulting vector Tout as input, the regression network has two sibling output layers. The first output is the semantic similarity score between the query video and the candidate interval, denoted S. tanh is used as the activation function, so the final score S generated by the network lies in the range [-1, 1].
The second output yields the position regression offsets of the candidate interval, designed as (Tc, Tl), where Tc and Tl denote the offsets of the candidate interval's center point and length, respectively. Since a candidate interval may be longer or shorter than the ground-truth interval, this regression helps locate a better position.
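A hedged sketch of the two sibling output layers on Tout; the weight arrays below are random placeholders, not the trained parameters of the embodiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_heads(t_out, w_score, w_offset):
    """Hypothetical sibling heads on T_out: a tanh-activated score head giving
    S in [-1, 1], and a linear head giving the offsets (T_c, T_l)."""
    s = np.tanh(w_score @ t_out)   # scalar similarity score in [-1, 1]
    t_c, t_l = w_offset @ t_out    # center and length offsets
    return float(s), float(t_c), float(t_l)

t_out = rng.standard_normal(512)            # pooled matching vector
w_score = rng.standard_normal(512) * 0.01   # placeholder score weights
w_offset = rng.standard_normal((2, 512)) * 0.01  # placeholder offset weights
s, t_c, t_l = regression_heads(t_out, w_score, w_offset)
```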
Step 5: Training.
The proposed model structure contains two sibling outputs, one for the relevance measure and the other for boundary regression. A multi-task loss L is designed to train the model jointly. The loss function is expressed as:
L = Ltr + μ · Lreg
where Ltr denotes the triplet loss, Lreg denotes the regression loss, and μ is a weight balancing the contributions of the two losses.
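The multi-task combination itself is a one-liner; this sketch only mirrors the stated formula L = Ltr + μ · Lreg:

```python
def multi_task_loss(l_tr, l_reg, mu=0.5):
    """L = L_tr + mu * L_reg, where mu balances the two loss terms
    (the experiments described below set mu = 0.5)."""
    return l_tr + mu * l_reg
```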
① Triplet loss Ltr. The training samples for the triplet loss include positive and negative candidate-interval samples. The training dataset is reorganized into video triplets (q, p, n), where q, p, and n denote the query video, a positive candidate-interval sample, and a negative candidate-interval sample, respectively. A video segment is randomly selected from the training set as the query video, and candidate intervals obtained from videos sharing the query video's label serve as positive samples. Likewise, candidate intervals obtained from videos with labels different from the query video's are treated as negative samples.
Ltr encourages the model to assign a higher matching score to a positive video pair (q, p) and a lower matching score to a negative video pair (q, n).
where γ is a margin parameter ensuring a sufficiently large gap between the matching scores of positive pairs (q, p) and negative pairs (q, n); λ is a regularization parameter that prevents overfitting; and N is the number of triplets in a batch.
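The patent's exact triplet-loss formula is not reproduced in this text, so the following is a conventional margin-based sketch over similarity scores, with margin γ and an optional L2 weight penalty scaled by λ; it is an assumption, not the claimed formula:

```python
def triplet_loss(pos_scores, neg_scores, gamma=1.5, lam=0.0005, weights=None):
    """Margin-based triplet loss: pos_scores[i] = S(q_i, p_i) and
    neg_scores[i] = S(q_i, n_i) for a batch of N triplets. The hinge enforces
    a score gap of at least gamma; lam scales an L2 penalty on model weights."""
    n = len(pos_scores)
    hinge = sum(max(0.0, gamma - s_p + s_n)
                for s_p, s_n in zip(pos_scores, neg_scores)) / n
    reg = lam * sum(w * w for w in (weights or []))
    return hinge + reg

# A positive pair scoring above the negative pair by more than the margin
# incurs zero hinge loss.
zero_loss = triplet_loss([0.9], [-0.9])
```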
② Regression loss Lreg.
Only positive candidate-interval samples are used to train the regression loss. The regression loss Lreg is expressed as:
where L1 denotes the smooth L1 function; Tc and Tl are the predicted relative offsets of the candidate interval's center point and length, and their ground-truth counterparts are the true relative offsets of the center point and length.
where loci and leni denote the center coordinate and length of the i-th candidate interval, and their counterparts denote the corresponding ground-truth center coordinate and length.
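The regression formulas themselves are elided in this text, so the sketch below uses an R-CNN-style parameterization (t_c as a length-normalized center shift, t_l as a log length ratio) as an assumption, combined with the smooth L1 function:

```python
import math

def smooth_l1(x):
    """Smooth L1 function: quadratic near zero, linear beyond |x| = 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def offset_targets(loc_i, len_i, gt_loc, gt_len):
    """Hypothetical ground-truth offsets for the i-th candidate interval
    (center loc_i, length len_i) against the true interval (gt_loc, gt_len)."""
    return (gt_loc - loc_i) / len_i, math.log(gt_len / len_i)

def regression_loss(pred, target):
    """L_reg as the summed smooth L1 distance over the (T_c, T_l) pair."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))

target = offset_targets(10.0, 4.0, 11.0, 8.0)
loss = regression_loss((0.25, math.log(2.0)), target)  # perfect prediction
```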
Step 6: Prediction.
After training, the query video and the candidate intervals obtained from the reference video are fed as input into the proposed framework to predict the location of the retrieved segment. The candidate interval with the highest similarity score is selected:
i = arg max_i S(q, p_i)
where p_i denotes the i-th candidate interval.
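The argmax selection can be sketched directly:

```python
def best_candidate(scores):
    """i = argmax_i S(q, p_i): index of the candidate interval with the
    highest similarity score against the query video."""
    return max(range(len(scores)), key=lambda i: scores[i])

idx = best_candidate([0.1, 0.8, 0.3])  # the middle candidate wins
```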
After obtaining the candidate interval with the highest matching score, its boundary is refined according to the regression offsets, as follows:
where Tc(i) and Tl(i) denote the center-point and length offsets of the selected i-th candidate interval, and s and e denote the start and end points of the final localized segment.
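The refinement formula itself is elided in this text; a common inverse of an R-CNN-style center/length offset parameterization (an assumption, not the patented formula) would be:

```python
import math

def refine_boundary(loc_i, len_i, t_c, t_l):
    """Shift the selected candidate's center by t_c * len_i, rescale its
    length by exp(t_l), and convert (center, length) to (start s, end e)."""
    center = loc_i + t_c * len_i
    length = len_i * math.exp(t_l)
    return center - length / 2.0, center + length / 2.0

s, e = refine_boundary(10.0, 4.0, 0.0, 0.0)  # zero offsets keep the interval
```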
Step 7: Validation.
The patented video-clip-based video moment retrieval technique is validated as follows. For the generated 3D tensor T, its width and height should be kept consistent so that it can be fed to the subsequent convolutional layers for further relevance learning. However, the query video and a candidate interval may differ in length, yielding different numbers of extracted features. To address this, the maximum number of features per video is limited to 40; if a video is too long and produces more than 40 features, 40 equidistant features are selected.
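The 40-feature cap can be sketched as equidistant index selection; the rounding scheme below is one plausible choice, not necessarily the embodiment's:

```python
def equidistant_indices(n_features, limit=40):
    """Keep all features when there are at most `limit`; otherwise pick
    `limit` equally spaced feature indices from the sequence."""
    if n_features <= limit:
        return list(range(n_features))
    step = n_features / limit
    return [int(i * step) for i in range(limit)]

idx = equidistant_indices(100)  # a long video with 100 extracted features
```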
The experiments train the model with the Adam optimizer, an initial learning rate of 10^-4, and a batch size of 32. For regularization, batch normalization (BN) is applied to every fully connected layer. The triplet loss and the regression loss are assumed to contribute equally, so the weight μ in the multi-task loss L is set to 0.5. The fully connected layers D1, D2, D3, and D4 all have size 512. The settings of the different convolutional and attention layers in the model are detailed in Table 1. The remaining parameters are set to λ = 0.0005 and γ = 1.5.
The proposed model is evaluated mainly on videos from the Thumos14 and ActivityNet datasets, used as training and test samples. Tables 2 and 3 show quantitative comparisons between the proposed method and other methods on Thumos14 and ActivityNet, respectively, and Figure 3 shows qualitative results. The results show better localization accuracy than other methods. The proposed model also adapts better across datasets, performing the retrieval task well on both.
Table 1: Settings of the convolutional and attention layers.
Table 2: Comparison of the objective metric mAP under different tIoU thresholds on the Thumos14 dataset.
Table 3: Comparison of the objective metric mAP under different tIoU thresholds on the ActivityNet dataset.
As shown in FIG. 4, an embodiment of the present invention proposes a new framework to measure the semantic relevance between a query video and reference video segments. By exploiting a convolutional structure and an attention mechanism, the model takes the spatiotemporal dimensions of video into account and computes and learns the semantic similarity between videos at both the local frame level and the global level, which effectively improves segment-localization accuracy on the VQ-VMR task.
Based on the same inventive concept, another embodiment of the present invention provides a video retrieval apparatus, comprising:
a splitting module, configured to split a long reference video into multiple candidate intervals;
a retrieval module, configured to input a query video and the multiple candidate intervals into a video retrieval model and obtain, among the multiple candidate intervals, the candidate interval that matches the query video;
wherein the video retrieval model is obtained by training an initial machine learning model whose input is a sample query video together with multiple sample candidate intervals obtained by splitting a sample long reference video, and whose output is the sample candidate interval, among those intervals, that matches the sample query video; the video retrieval model is trained based on the local semantic similarity and the global semantic similarity between the sample query video and the multiple sample candidate intervals obtained by splitting the sample long reference video.
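The splitting module's scheme is not specified in this text; a hypothetical sliding-window split of the long reference video into candidate intervals could look like:

```python
def split_into_candidates(video_len, window, stride):
    """Slide a fixed-size window over the reference video timeline and emit
    (start, end) candidate intervals; window and stride are illustrative
    knobs, not parameters taken from the patent."""
    candidates = []
    start = 0
    while start + window <= video_len:
        candidates.append((start, start + window))
        start += stride
    return candidates

cands = split_into_candidates(100, 40, 20)  # overlapping 40-unit windows
```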
Based on the foregoing embodiment, in this embodiment the retrieval module is specifically configured to:
encode the query video and the multiple candidate intervals and then input them into the video retrieval model to obtain the candidate interval, among the multiple candidate intervals, that matches the query video;
wherein the video encoding process for the query video and the multiple candidate intervals comprises:
extracting features from the query video and from each of the multiple candidate intervals with a C3D model; and
aggregating the extracted features with two long short-term memory (LSTM) units to obtain the video encoding results of the query video and the multiple candidate intervals.
Since the video retrieval apparatus provided in this embodiment can be used to execute the video retrieval method described in the foregoing embodiment, its working principle and beneficial effects are similar; for details, refer to the foregoing embodiment, which are not repeated here.
Based on the same inventive concept, yet another embodiment of the present invention provides an electronic device. Referring to FIG. 5, the electronic device comprises: a processor 301, a memory 302, a communication interface 303, and a communication bus 304;
wherein the processor 301, the memory 302, and the communication interface 303 communicate with one another via the communication bus 304; the communication interface 303 is used to implement transmission between the related devices; the processor 301 is used to invoke a computer program in the memory 302 and, when executing the computer program, implements all steps of the above video retrieval method, for example: splitting a long reference video into multiple candidate intervals; inputting a query video and the multiple candidate intervals into a video retrieval model to obtain the candidate interval, among the multiple candidate intervals, that matches the query video; wherein the video retrieval model is obtained by training an initial machine learning model whose input is a sample query video together with multiple sample candidate intervals obtained by splitting a sample long reference video, and whose output is the sample candidate interval that matches the sample query video; the video retrieval model is trained based on the local semantic similarity and the global semantic similarity between the sample query video and the multiple sample candidate intervals obtained by splitting the sample long reference video.
Based on the same inventive concept, yet another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements all steps of the above video retrieval method, for example: splitting a long reference video into multiple candidate intervals; inputting a query video and the multiple candidate intervals into a video retrieval model to obtain the candidate interval, among the multiple candidate intervals, that matches the query video; wherein the video retrieval model is obtained by training an initial machine learning model whose input is a sample query video together with multiple sample candidate intervals obtained by splitting a sample long reference video, and whose output is the sample candidate interval that matches the sample query video; the video retrieval model is trained based on the local semantic similarity and the global semantic similarity between the sample query video and the multiple sample candidate intervals obtained by splitting the sample long reference video.
In addition, the above logic instructions in the memory may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments of the present invention. Those of ordinary skill in the art can understand and implement them without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solutions, or the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the video retrieval method described in each embodiment or in certain parts of an embodiment.
In the present invention, terms such as "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise explicitly and specifically defined.
Furthermore, in the present invention, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.
In addition, in the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine different embodiments or examples described in this specification, and their features, provided they do not contradict one another.
Finally, it should be noted that the above embodiments are used only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210628871.2A CN115098728A (en) | 2022-06-06 | 2022-06-06 | Video retrieval method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115098728A true CN115098728A (en) | 2022-09-23 |
Family
ID=83289476
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115495677A (en) * | 2022-11-21 | 2022-12-20 | 阿里巴巴(中国)有限公司 | Method and storage medium for spatio-temporal localization of video |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106682108A (en) * | 2016-12-06 | 2017-05-17 | 浙江大学 | Video retrieval method based on multi-modal convolutional neural network |
| CN109697236A (en) * | 2018-11-06 | 2019-04-30 | 建湖云飞数据科技有限公司 | A kind of multi-medium data match information processing method |
| CN111814922A (en) * | 2020-09-07 | 2020-10-23 | 成都索贝数码科技股份有限公司 | A deep learning-based video clip content matching method |
| US20210272599A1 (en) * | 2020-03-02 | 2021-09-02 | Geneviève Patterson | Systems and methods for automating video editing |
| CN114495170A (en) * | 2022-01-27 | 2022-05-13 | 重庆大学 | Pedestrian re-identification method and system based on local self-attention inhibition |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |



















