CN117076712B - Video retrieval method, system, device and storage medium - Google Patents

Video retrieval method, system, device and storage medium Download PDF

Info

Publication number
CN117076712B
CN117076712B · CN202311331941A
Authority
CN
China
Prior art keywords
video
embedded
text
granularity
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311331941.9A
Other languages
Chinese (zh)
Other versions
CN117076712A (en)
Inventor
陈恩红
徐童
殷述康
赵思蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202311331941.9A priority Critical patent/CN117076712B/en
Publication of CN117076712A publication Critical patent/CN117076712A/en
Application granted granted Critical
Publication of CN117076712B publication Critical patent/CN117076712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video retrieval method, system, device and storage medium, which correspond to one another as implementations of the same scheme. The scheme fully utilizes the correlation between the text information and the video clips to extract video-level features, and uses the similarity between the video-level features and the text to accurately retrieve videos that are partially relevant to the query text. Experiments show that the invention achieves better recall on partially relevant video retrieval.

Description

Video retrieval method, system, device and storage medium
Technical Field
The present invention relates to the field of multi-modal retrieval, and in particular, to a video retrieval method, system, device, and storage medium.
Background
With the rapid development of social media, multimodal social data has grown explosively, and users need efficient retrieval tools to cope with information overload. In this context, how to efficiently retrieve semantically relevant videos from text content, especially partially relevant videos (i.e., videos related to part of the information in the query and unrelated, or largely unrelated, to the rest), has attracted wide attention; related research has important applications in search engines, the retrieval systems of video platforms, and other fields.
Early text-video retrieval studies generally used simple heuristics, such as greedy search strategies, to match relevant segments. However, such approaches treat each query instance as independent during model learning and ignore the relationships between query texts, so the matched video segments are suboptimal.
In view of this, the present invention has been made.
Disclosure of Invention
The invention aims to provide a video retrieval method, a system, equipment and a storage medium, which can improve the accuracy of video retrieval.
The invention aims at realizing the following technical scheme:
a video retrieval method comprising:
step 1, extracting characteristics of a query text to obtain text characteristics;
step 2, processing all video segment characteristics and the text characteristics of each video in the video library with an attention mechanism, to obtain the embedded characterizations of all video segments of each video and the text embedded characterization;
step 3, for each video, selecting matched video segments through bipartite graph matching between the embedded characterizations of all its video segments and the text embedded characterization, wherein the embedded characterization of a matched video segment is called the coarse-granularity video embedded characterization; using the coarse-granularity video embedded characterization as a guide, the embedded characterizations of all video segments are combined through an attention mechanism to obtain a video embedded characterization, called the fine-granularity video embedded characterization;
step 4, for each video, calculating the similarity between each video and the query text by utilizing coarse-granularity video embedding characterization and fine-granularity video embedding characterization;
and 5, generating a video retrieval result according to the similarity between each video and the query text.
A video retrieval system, comprising:
the text feature extraction module is used for extracting features of the query text to obtain text features;
the feature coding module is used for processing all video segment features and text features of each video in the video library by adopting an attention mechanism respectively to obtain embedded characterization of all video segments and text embedded characterization of each video;
the video level feature extraction module is used for selecting matched video fragments by utilizing the embedded features of all video fragments of each video and the text embedded features in a bipartite graph matching mode, wherein the embedded features of the matched video fragments are called coarse-granularity video embedded features, the coarse-granularity video embedded features are used as guidance, and the embedded features of all video fragments are combined through an attention mechanism to obtain video embedded features which are called fine-granularity video embedded features;
the similarity calculation module is used for calculating the similarity between each video and the query text by utilizing the coarse-granularity video embedding representation and the fine-granularity video embedding representation for each video;
and the video search result generation module is used for generating a video search result according to the similarity between each video and the query text.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods described previously.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical scheme provided by the invention, the correlation between the text information and the video clips is fully utilized to extract video-level features, and videos that are partially relevant to the text query can be accurately retrieved by using the similarity between the video-level features and the text. Experiments show that the invention achieves better recall on partially relevant video retrieval.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a video retrieval method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a bipartite graph matching module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of obtaining fine-grained video embedded characterization according to an embodiment of the invention;
fig. 4 is a schematic diagram of a video retrieval method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a video retrieval system according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The following describes in detail a video retrieval method, system, device and storage medium provided by the present invention. What is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. The specific conditions are not noted in the examples of the present invention and are carried out according to the conditions conventional in the art or suggested by the manufacturer.
Example 1
The embodiment of the invention provides a video retrieval method, which is a partially relevant video retrieval method based on a bipartite graph matching network; as shown in fig. 1, it mainly comprises the following steps:
and step 1, extracting features of the query text to obtain text features.
In the embodiment of the invention, the features of the text information can be extracted with a pre-trained language model; the specific implementation follows conventional techniques and is not repeated here.
In the embodiment of the invention, the number of the query texts is one or more, and each query text is independently used for extracting text characteristics.
And 2, processing all video segment characteristics and the text characteristics of each video in the video library with an attention mechanism, to obtain the embedded characterizations of all video segments of each video and the text embedded characterization.
In the embodiment of the invention, the text feature and the video clip feature are further embedded by using an embedding module based on an attention mechanism, and the method is specifically as follows: and respectively adding position codes into all video segment characteristics and text characteristics of each video, and carrying out embedding processing by combining an attention mechanism to obtain embedded characterization of all video segments and text embedded characterization.
Exemplary: this part of the embedding process in combination with the attention mechanism may be implemented by a Transformer-based encoder.
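By way of illustration only, a minimal PyTorch sketch of such an embedding module is given below; the module name FeatureEmbedder, the layer sizes and the use of learned positional embeddings are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class FeatureEmbedder(nn.Module):
    """Hypothetical sketch: add positional encodings to clip (or text token)
    features and refine them with a Transformer encoder."""
    def __init__(self, dim=512, max_len=256, heads=8, layers=1):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)  # learned position codes (assumption)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, feats):                  # feats: (batch, seq_len, dim)
        idx = torch.arange(feats.size(1), device=feats.device)
        return self.encoder(feats + self.pos(idx))

# usage sketch: embed 32 clip features of one video
clip_embeds = FeatureEmbedder()(torch.randn(1, 32, 512))  # (1, 32, 512)
```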
In the embodiment of the invention, the characterizations of the video segments can be extracted at equal intervals through a pre-trained neural network. Specifically, each video is segmented with a sliding window to obtain a plurality of video clips, and features of all the video clips are then extracted with the pre-trained neural network to obtain the video clip features.
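For illustration, a sketch of such sliding-window segmentation over frame-level features follows; the window length, stride and mean-pooling are assumptions.

```python
import numpy as np

def sliding_window_clips(frame_feats, win=16, stride=8):
    """Hypothetical sketch: split frame-level features (T, D) into overlapping
    windows and mean-pool each window into one video clip feature."""
    clips = []
    for start in range(0, max(len(frame_feats) - win + 1, 1), stride):
        clips.append(frame_feats[start:start + win].mean(axis=0))
    return np.stack(clips)                      # (num_clips, D)

frame_feats = np.random.randn(120, 512)         # e.g. 120 frames of pre-extracted features
clip_feats = sliding_window_clips(frame_feats)  # (14, 512)
```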
And 3, for each video, selecting matched video segments through bipartite graph matching between the embedded characterizations of all its video segments and the text embedded characterization, wherein the embedded characterization of a matched video segment is called the coarse-granularity video embedded characterization; using the coarse-granularity video embedded characterization as a guide, the embedded characterizations of all video segments are combined through an attention mechanism to obtain a video embedded characterization, called the fine-granularity video embedded characterization.
According to the embodiment of the invention, bipartite graph matching between the embedded characterizations of all video clips and the text embedded characterization can be performed with the Hungarian algorithm to obtain the coarse-granularity video embedded characterization; then, with the coarse-granularity video embedded characterization as a guide, the video embeddings are aggregated with an attention mechanism to obtain the fine-granularity video embedded characterization. A preferred embodiment of this part is as follows:
(1) The coarse-granularity video embedding is obtained by a bipartite graph matching module. As shown in fig. 2, the cosine similarity between the text embedded characterization and the embedded characterization of each video segment is computed pairwise and negated to obtain a loss matrix; the two axes of the loss matrix index the query texts and the video segments respectively, and the element in the ith row and jth column represents the loss of matching the ith query text with the jth video segment. The loss matrix is sent to a solver to obtain the video segments matched with the query texts, and the embedded characterizations of the matched video segments are the coarse-granularity video embedded characterizations.
In the embodiment of the invention, the solver is implemented with the Hungarian algorithm; its solving process follows conventional techniques and is not repeated here.
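A minimal sketch of this matching step is shown below, assuming SciPy's linear_sum_assignment as the assignment-problem solver and row-normalized embeddings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_clips(text_embeds, clip_embeds):
    """Hypothetical sketch: negated pairwise cosine similarity forms the loss
    matrix; the assignment solver picks one matched clip per query text."""
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    c = clip_embeds / np.linalg.norm(clip_embeds, axis=1, keepdims=True)
    loss_matrix = -(t @ c.T)            # rows: query texts, columns: video clips
    _, cols = linear_sum_assignment(loss_matrix)
    return cols                         # cols[i] = index of the clip matched to query text i

text_embeds = np.random.randn(3, 512)   # 3 query texts
clip_embeds = np.random.randn(32, 512)  # 32 clips of one video
coarse_embeds = clip_embeds[match_queries_to_clips(text_embeds, clip_embeds)]
```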
(2) As shown in fig. 3, the embedded characterizations of all video clips are processed through a fully connected layer to obtain the processed embedded characterizations of all video clips; attention values are calculated from the processed embedded characterizations and the coarse-granularity video embedded characterization, and the processed embedded characterizations of all video clips are aggregated with these attention values to obtain the fine-granularity video embedded characterization.
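A sketch of this aggregation step follows, assuming scaled dot-product attention with the coarse-granularity embedding as the query; the fully connected layers and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedAggregator(nn.Module):
    """Hypothetical sketch: project clip embeddings with a fully connected layer,
    score them against the coarse-granularity embedding, and aggregate."""
    def __init__(self, dim=512):
        super().__init__()
        self.key_fc = nn.Linear(dim, dim)    # processes the clip embeddings
        self.value_fc = nn.Linear(dim, dim)  # second layer used for aggregation

    def forward(self, coarse, clip_embeds):  # coarse: (dim,), clip_embeds: (n_clips, dim)
        keys = self.key_fc(clip_embeds)
        attn = F.softmax(keys @ coarse / keys.size(-1) ** 0.5, dim=0)  # attention values
        return attn @ self.value_fc(clip_embeds)   # fine-granularity video embedding

clips = torch.randn(32, 512)
fine = FineGrainedAggregator()(clips[5], clips)    # clips[5] stands in for the matched clip
```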
And 4, for each video, calculating the similarity between each video and the query text by utilizing the coarse-granularity video embedding characterization and the fine-granularity video embedding characterization.
In the embodiment of the invention, for each video, the similarity between the coarse-granularity video embedded characterization and the text embedded characterization, and the similarity between the fine-granularity video embedded characterization and the text embedded characterization, are calculated; the two similarities are then weighted and summed to obtain the similarity between the video and the query text.
When a plurality of query texts are input in the training stage, the similarity between each video and each query text can be calculated in this manner.
And 5, generating a video retrieval result according to the similarity between each video and the query text.
Through the above steps, the similarity between each video and the query text is obtained. The videos are then ranked by similarity, with higher similarity ranked earlier, and the top K videos are selected according to the set list length K to generate the video retrieval list, which is the video retrieval result.
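A sketch of this scoring and ranking step follows; the weight w and the list length are assumed hyperparameters.

```python
import numpy as np

def rank_videos(sim_coarse, sim_fine, w=0.5, k=10):
    """Hypothetical sketch: weighted sum of the coarse- and fine-granularity
    similarities per video, then the top-K videos form the retrieval list."""
    sim = w * sim_coarse + (1.0 - w) * sim_fine
    order = np.argsort(-sim)             # higher similarity ranks earlier
    return order[:k], sim[order[:k]]

sim_coarse = np.random.rand(1000)        # similarity of each video's coarse embedding to the query
sim_fine = np.random.rand(1000)          # similarity of each video's fine embedding to the query
top_ids, top_scores = rank_videos(sim_coarse, sim_fine)
```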
To intuitively demonstrate the retrieval effect of the scheme of the invention, retrieval experiments were carried out on the public TVR, ActivityNet Captions and Charades-STA datasets. The experimental results are shown in Table 1; the recall (R values) is far higher than that of current retrieval schemes.
Table 1: experimental results
Validation dataset       R@1     R@5     R@10
TVR                      14.1    34.7    45.9
ActivityNet Captions      7.3    23.7    35.8
Charades-STA              2.1     7.2    11.8
In order to more clearly demonstrate the technical scheme and the technical effects provided by the invention, the following detailed description of the embodiments of the invention is given by way of specific examples.
1. And constructing a network model.
In the embodiment of the invention, the step 2 is realized by a feature encoding module, the step 3 is realized by a video level feature extraction module, the step 4 is realized by a similarity calculation module, and the feature encoding module, the video level feature extraction module and the similarity calculation module form a network model.
2. The network model is trained.
In the embodiment of the invention, the network model is trained in advance. The training set comprises a plurality of videos and the text data (query texts) corresponding to each video. For the current video and its corresponding query texts, text features and the embedded characterizations of all video clips are obtained through the processing of step 1 and the feature encoding module. The video-level feature extraction module then selects the video clip matched with each query text, takes the embedded characterization of the matched video clip as the coarse-granularity video embedded characterization, and obtains the video embedded characterization, namely the fine-granularity video embedded characterization; thus each query text obtains a corresponding coarse-granularity video embedded characterization and fine-granularity video embedded characterization. Each query text together with its corresponding coarse-granularity and fine-granularity video embedded characterizations is called a matched triplet, and each query text together with the coarse-granularity and fine-granularity video embedded characterizations corresponding to other query texts is called an unmatched triplet. The similarity calculation module calculates, for each matched and unmatched triplet, the similarity between the query text and the coarse-granularity and fine-granularity video embedded characterizations, as well as the similarity between different query texts and between different matched video clips. Combining the matching relation between the query texts and the video clips, the triplet ranking loss, the contrastive loss and the L1 regularization loss are calculated, and the network model is trained with these three types of losses. A preferred embodiment of the training process is described below.
1. A training set is prepared.
The training set comprises a plurality of videos, and each video is provided with a plurality of corresponding text description labels.
2. And (5) preprocessing data.
In preprocessing, the whole video is used as input. Firstly, the complete video is segmented with a sliding window to obtain a plurality of clips, and features are extracted with a pre-trained neural network (video encoder) to obtain the corresponding characterizations of the video clips (namely the video clip features). For the text labels (namely the text data) corresponding to the video, feature extraction is carried out with a pre-trained language model (text encoder) to obtain the text features. The processed video clip features and text features are the preprocessed data; the text labels are texts describing the content of the video, and the matching relation between texts and video clips is determined through the bipartite graph matching described later.
For example, for a complete video, a sequence of video frames may be sampled at 3 frames per second, and the video clip features extracted with a pre-trained I3D network (Inflated 3D ConvNet). A pre-trained RoBERTa network (a language model) is then used to extract the text features corresponding to the text labels of the video. These video clip features and corresponding text features are the preprocessed data.
3. And (5) processing a network model.
The input of the network model is the preprocessed data, and the main processing procedure of the network model is as follows:
(1) And (5) feature coding.
This is realized by the feature encoding module: position codes are added to the video clip features and text features obtained by preprocessing, and a Transformer-based encoder is used for further embedding, yielding the embedded characterizations of the texts and video clips.
(2) Video level feature extraction.
This is realized by the video-level feature extraction module, which takes the embedded characterizations of the texts and video clips produced by the feature encoding module as input and outputs video-level feature vectors, namely the coarse-granularity and fine-granularity video embedded characterizations. The main steps are as follows:
firstly, cosine similarity is calculated and inverted for each query text and each video segment according to the embedded representation, and a loss matrix is constructed, wherein row indexes of the matrix correspond to each query text, and column indexes correspond to each video segment. And carrying out bipartite graph matching on the loss matrix by using a Hungary algorithm to obtain video segments matched with each query text, wherein the embedded representation of the matched video segments is coarse-granularity video level representation.
Then, the embedded characterizations of the video segments are passed through a fully connected layer and combined with the coarse-granularity video embedding to calculate attention values, and the video embeddings are aggregated through another fully connected layer using these attention values to obtain the fine-granularity video embedding. Specifically, the coarse-granularity video-level characterization serves as the query (Q), the embedded characterizations of all video segments serve as the keys (K) and values (V), and the attention mechanism yields the fine-granularity video-level characterization.
In this stage, the video segment matched with each query text can be determined; because this is the training stage and each query text is a text label of a certain video, matching is only carried out within the video it belongs to, yielding the corresponding matched video segment. For each query text, the embedded characterization of its matched video segment is called the coarse-granularity video embedded characterization, which is also used as a guide to obtain the video embedded characterization called the fine-granularity video embedded characterization. Each query text thus has a matched video segment and corresponding coarse-granularity and fine-granularity video embedded characterizations. Each query text and its matched video segment form a matched text and video segment pair, which, together with the corresponding coarse-granularity and fine-granularity video embedded characterizations, forms a matched triplet; a query text paired with a video segment matched to another query text forms an unmatched text and video segment pair, which, together with the coarse-granularity and fine-granularity video embedded characterizations corresponding to that other query text, forms an unmatched triplet.
Exemplarily, for two query texts q1 and q2, the above scheme yields the matched triplets (q1, m1, k1) and (q2, m2, k2); the unmatched triplets are then (q1, m2, k2) and (q2, m1, k1). Here m1 and k1 are the coarse-granularity and fine-granularity video embedded characterizations corresponding to query text q1, and m2 and k2 are the coarse-granularity and fine-granularity video embedded characterizations corresponding to query text q2.
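For illustration, a small sketch of assembling matched and unmatched triplets from per-query results is given below; the list layout is an assumption.

```python
def build_triplets(queries, coarse, fine):
    """Hypothetical sketch: queries[i] is matched with coarse[i] and fine[i];
    swapping in another query's embeddings yields the unmatched triplets."""
    matched = [(q, coarse[i], fine[i]) for i, q in enumerate(queries)]
    unmatched = [(q, coarse[j], fine[j])
                 for i, q in enumerate(queries)
                 for j in range(len(queries)) if j != i]
    return matched, unmatched

# e.g. two query texts q1, q2 with embeddings m1, k1 and m2, k2
matched, unmatched = build_triplets(["q1", "q2"], ["m1", "m2"], ["k1", "k2"])
# matched   -> [("q1", "m1", "k1"), ("q2", "m2", "k2")]
# unmatched -> [("q1", "m2", "k2"), ("q2", "m1", "k1")]
```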
(3) And (5) similarity calculation.
This is realized by the similarity calculation module, which calculates, for each matched triplet and each unmatched triplet, the similarity between the query text embedded characterization and the coarse-granularity video embedded characterization and the similarity between the query text embedded characterization and the fine-granularity video embedded characterization. In addition, the similarity between different query texts (calculated from the text embedded characterizations) and the similarity between different matched video clips are also calculated. Note that the text content on the left side of fig. 4 is merely an example, and is not limiting.
4. Loss calculation and network optimization.
In an embodiment of the present invention, the calculated losses include the triplet ranking loss, the contrastive loss and the L1 regularization loss. The triplet ranking loss and the contrastive loss are obtained by calculating, for each triplet, the similarity between the query text and the coarse-granularity and fine-granularity video embedded characterizations; the L1 regularization loss is obtained by calculating the similarity between different query texts and the similarity between different matched video segments.
(1) The expression of the triplet ranking loss is:

L_trip = max(0, α + s(q, v′) − s(q, v)) + max(0, α + s(q′, v) − s(q, v))

wherein (q, v) denotes any query text q paired with a video clip v, and three cases are involved: the pair of a query text q and its matched video clip v, i.e. a matched text and video clip pair, denoted (q, v); the pair of the query text q and a non-matching video clip v′, denoted (q, v′); and the pair of another query text q′ with the video clip v matched to q, denoted (q′, v). Both (q, v′) and (q′, v) are unmatched text and video clip pairs, α is the margin of the triplet ranking loss, and s(·,·) denotes cosine similarity.

The triplet ranking loss comprises two parts. In the first part, the similarity between the text embedded characterization of the query text and the coarse-granularity video embedded characterization is substituted into the formula: s(q, v) is the similarity between the query text and the coarse-granularity video embedded characterization of the matched triplet, while s(q, v′) and s(q′, v) are the similarities between query texts and the coarse-granularity video embedded characterizations of unmatched triplets. In the second part, the similarity between the text embedded characterization of the query text and the fine-granularity video embedded characterization is substituted in the same way: s(q, v) is computed for the matched triplet and s(q, v′), s(q′, v) for the unmatched triplets, using the fine-granularity video embedded characterizations. The two parts are combined to obtain the triplet ranking loss.
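A PyTorch sketch of one part of this loss follows, using the bidirectional hinge form written above; the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(q, v_pos, v_neg, q_neg, margin=0.2):
    """Hypothetical sketch: hinge loss over cosine similarities for one granularity;
    it would be applied once with coarse and once with fine video embeddings."""
    s_pos = F.cosine_similarity(q, v_pos, dim=-1)
    s_neg_v = F.cosine_similarity(q, v_neg, dim=-1)      # query with a non-matching video
    s_neg_q = F.cosine_similarity(q_neg, v_pos, dim=-1)  # another query with the matched video
    return (torch.relu(margin + s_neg_v - s_pos) +
            torch.relu(margin + s_neg_q - s_pos)).mean()

q, v_pos, v_neg, q_neg = (torch.randn(8, 512) for _ in range(4))
loss = triplet_ranking_loss(q, v_pos, v_neg, q_neg)
```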
(2) The expression of the contrastive loss is:

L_con = − Σ_{(q,v)∈P} log( exp(s(q, v)) / ( exp(s(q, v)) + Σ_{(t,v′)∈N} exp(s(t, v′)) ) )

wherein q and t both denote query texts, v and v′ both denote videos, (q, v) denotes a matched text and video clip pair, (t, v′) denotes an unmatched text and video clip pair, P denotes the set of matched text and video clip pairs, and N denotes the set of unmatched text and video clip pairs.

Here, the unmatched text and video clip pairs comprise the two cases (q, v′) and (q′, v) described above.

Similarly to the triplet ranking loss, the contrastive loss also comprises two parts. When calculating the first part, the similarity between the text embedded characterization of the query text and the coarse-granularity video embedded characterization is substituted into the expression: the matched term is the similarity between the query text and the coarse-granularity video embedded characterization of the matched triplet, and the unmatched terms are the similarities between query texts and the coarse-granularity video embedded characterizations of unmatched triplets. When calculating the second part, the similarity between the text embedded characterization of the query text and the fine-granularity video embedded characterization is substituted in the same way. The two parts are combined to obtain the contrastive loss.
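A sketch of an InfoNCE-style implementation consistent with the description follows; the in-batch construction of unmatched pairs and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, v, temperature=0.07):
    """Hypothetical sketch: within a batch of matched (q_i, v_i) pairs, every other
    pairing serves as an unmatched pair; applied per granularity and summed."""
    q = F.normalize(q, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = q @ v.t() / temperature          # pairwise similarity matrix
    targets = torch.arange(q.size(0))         # diagonal entries are the matched pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

q = torch.randn(8, 512)  # text embedded characterizations
v = torch.randn(8, 512)  # coarse- or fine-granularity video embedded characterizations
loss = contrastive_loss(q, v)
```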
(3) The L1 regularization loss expression is:
L_reg = Σ_{i≠j} ‖ s(q_i, q_j) − s(v_i, v_j) ‖_1

wherein ‖·‖_1 is the L1 norm, s(q_i, q_j) denotes the similarity between two query texts q_i and q_j, s(v_i, v_j) denotes the similarity between two video clips v_i and v_j, (q_i, v_i) and (q_j, v_j) are matched text and video clip pairs, and i and j are the indices of matched text and video clip pairs.

Similarly, the L1 regularization loss also comprises two parts. When calculating the first part, the similarity between the two video clips is calculated from their coarse-granularity video embedded characterizations and combined with the similarity between the two query texts q_i and q_j to obtain the first part of the loss, i.e. s(v_i, v_j) is computed from the coarse-granularity video embedded characterizations of v_i and v_j. When calculating the second part, the similarity between the two video clips is calculated from their corresponding fine-granularity video embedded characterizations (i.e. obtained through the attention mechanism described above) and combined with the similarity between the two query texts in the same way. The two parts are combined to obtain the L1 regularization loss.
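A sketch of this regularizer follows, assuming cosine similarity for both the text-text and video-video terms.

```python
import torch
import torch.nn.functional as F

def l1_regularization_loss(q, v):
    """Hypothetical sketch: for matched pairs (q_i, v_i), penalize the L1 difference
    between the text-text and video-video similarity structures."""
    q = F.normalize(q, dim=-1)
    v = F.normalize(v, dim=-1)
    sim_q = q @ q.t()                          # similarity between query texts
    sim_v = v @ v.t()                          # similarity between matched video clips
    off_diag = ~torch.eye(q.size(0), dtype=torch.bool)
    return (sim_q - sim_v).abs()[off_diag].mean()

q = torch.randn(8, 512)  # text embedded characterizations of matched pairs
v = torch.randn(8, 512)  # coarse- or fine-granularity video embedded characterizations
loss = l1_regularization_loss(q, v)
```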
A stochastic gradient descent algorithm may then be used to optimize the triplet ranking loss, the contrastive loss and the L1 regularization loss; the optimizer may be Adam (adaptive moment estimation). Exemplarily, the batch size is 32, the initial learning rate is set to 0.00025, and a cosine learning-rate schedule with warm start is used.
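A sketch of this optimization setup under the stated hyperparameters follows; the number of epochs, the omission of the warm-start detail, and the names model, train_loader and compute_losses are assumptions or placeholders.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, compute_losses, epochs=100):
    """Hypothetical sketch: Adam with initial learning rate 0.00025 (batches of 32
    would come from train_loader) and a cosine learning-rate schedule."""
    optimizer = Adam(model.parameters(), lr=2.5e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for batch in train_loader:
            loss_trip, loss_con, loss_reg = compute_losses(model, batch)
            loss = loss_trip + loss_con + loss_reg   # equal weighting is an assumption
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```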
3. And (5) video retrieval.
After the network model is trained, as shown in fig. 4, for the input query text, text features are obtained by using the step 1, and embedded representations of all video segments of each video and text embedded representations are obtained by using the step 2; selecting matched video clips from each video through the step 3, and further obtaining coarse-granularity video embedded characterization and fine-granularity video embedded characterization; through step 4, for each video, calculating the similarity (similarity 1) between the coarse-granularity video embedded representation and the text embedded representation, and the similarity (similarity 2) between the fine-granularity video embedded representation and the text embedded representation, and weighting the calculated two similarities to obtain the similarity between each video and the text. And generating a video retrieval result through the step 5.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
Example two
The present invention also provides a video retrieval system, which is mainly used for implementing the method provided in the foregoing embodiment, as shown in fig. 5, and the system mainly includes:
the text feature extraction module is used for extracting features of the query text to obtain text features;
the feature coding module is used for processing all video segment features and text features of each video in the video library by adopting an attention mechanism respectively to obtain embedded characterization of all video segments and text embedded characterization of each video;
the video level feature extraction module is used for selecting matched video fragments by utilizing the embedded features of all video fragments of each video and the text embedded features in a bipartite graph matching mode, wherein the embedded features of the matched video fragments are called coarse-granularity video embedded features, the coarse-granularity video embedded features are used as guidance, and the embedded features of all video fragments are combined through an attention mechanism to obtain video embedded features which are called fine-granularity video embedded features;
the similarity calculation module is used for calculating the similarity between each video and the query text by utilizing the coarse-granularity video embedding representation and the fine-granularity video embedding representation for each video;
and the video search result generation module is used for generating a video search result according to the similarity between each video and the query text.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example III
The present invention also provides a processing apparatus, as shown in fig. 6, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Example IV
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer readable storage medium in the aforementioned processing apparatus, for example, as a memory in the processing apparatus. The readable storage medium may be any of various media capable of storing a program code, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (9)

1. A video retrieval method, comprising:
step 1, extracting characteristics of a query text to obtain text characteristics;
step 2, processing all video segment characteristics and text characteristics of each video in the video library by adopting an attention mechanism respectively to obtain embedded characterization and text embedded characterization of all video segments of each video;
step 3, for each video, selecting matched video fragments by utilizing the embedded characterization of all video fragments of the video and adopting a bipartite graph matching mode for text embedded characterization, namely, the embedded characterization of the matched video fragments is called as coarse-granularity video embedded characterization, the coarse-granularity video embedded characterization is used as a guide, and the embedded characterization of all video fragments is combined through an attention mechanism to obtain video embedded characterization which is called as fine-granularity video embedded characterization; the embedded representation of all the video clips is processed through a full connection layer, so that the embedded representation of all the processed video clips is obtained; calculating attention values by utilizing the processed embedded characterizations of all video clips and the coarse-granularity video embedded characterizations, and aggregating the processed embedded characterizations of all video clips by utilizing the attention values to obtain fine-granularity video embedded characterizations;
step 4, for each video, calculating the similarity between each video and the query text by using the coarse-granularity video embedded representation and the fine-granularity video embedded representation, including: for each video, calculating the similarity of the coarse-granularity video embedded representation and the text embedded representation and the similarity of the fine-granularity video embedded representation and the text embedded representation respectively; the calculated two similarities are weighted and summed to obtain the similarity of each video and the query text;
and 5, generating a video retrieval result according to the similarity between each video and the query text.
2. The method of claim 1, wherein for each video, using the embedded representations of all of its video segments, and the text embedded representations using bipartite graph matching, selecting matching video segments comprises:
calculating cosine similarity of the text embedded representation and the embedded representation of each video segment one to one respectively, taking a negative value to obtain a loss matrix, wherein two axes of the loss matrix respectively represent a query text index and a video segment index, and elements corresponding to the ith row and the jth column of the matrix represent losses for matching the ith query text with the jth video segment; and sending the loss matrix into a solver to obtain the video clips matched with the query text.
3. The video retrieval method according to claim 1, wherein the step 2 is implemented by a feature encoding module, the step 3 is implemented by a video level feature extraction module, the step 4 is implemented by a similarity calculation module, and the feature encoding module, the video level feature extraction module and the similarity calculation module form a network model, and the network model is trained in advance;
the training set comprises a plurality of videos and query texts corresponding to the videos; for the current video and the query text corresponding to the current video, obtaining text characteristics and embedded characterization of all video fragments through the processing and characteristic coding module in the step 1; selecting video fragments matched with each query text through a video level feature extraction module, using the embedded characterization of the matched video fragments as coarse-granularity video embedded characterization, and obtaining video embedded characterization, namely fine-granularity video embedded characterization; finally, each query text obtains a corresponding coarse-granularity video embedded representation and a corresponding fine-granularity video embedded representation;
each query text has a matched video segment, and a corresponding coarse-granularity video embedding representation and a corresponding fine-granularity video embedding representation, each query text and the matched video segment form a matched text and video segment pair, the matched text and video segment pair is formed with the corresponding coarse-granularity video embedding representation and the corresponding fine-granularity video embedding representation, the matched text and video segment pair is formed with the video segments matched with other query texts, and the coarse-granularity video embedding representation and the fine-granularity video embedding representation corresponding to the other query texts form a non-matched triplet;
and respectively calculating the similarity of the query text and the coarse-granularity video embedded representation and the fine-granularity video embedded representation in each matching triplet and each non-matching triplet, the similarity between different query texts and the similarity between different matching video fragments by a similarity calculation module, calculating triplet ordering loss, comparison loss and L1 regularization loss by combining the matching relation between the query texts and the video fragments, and training the network model by using the calculated three types of losses.
4. A video retrieval method according to claim 3, wherein the expression of the triplet ordering penalty is:
L_trip = max(0, α + s(q, v′) − s(q, v)) + max(0, α + s(q′, v) − s(q, v))

wherein (q, v) denotes any query text q paired with a video clip v, and three cases are involved: the pair of a query text q and its matched video clip v, i.e. a matched text and video clip pair, denoted (q, v); the pair of the query text q and a non-matching video clip v′, denoted (q, v′); and the pair of another query text q′ with the video clip v matched to q, denoted (q′, v); (q, v′) and (q′, v) are both unmatched text and video clip pairs, α is the margin value of the triplet ordering loss, and s(·,·) denotes cosine similarity;

the triplet ordering loss comprises two parts of loss: in the first part, the similarity between the text embedded characterization of the query text and the coarse-granularity video embedded characterization is substituted into the expression of the triplet ordering loss, i.e. s(q, v) is calculated between the query text and the coarse-granularity video embedded characterization of the matched triplet, while s(q, v′) and s(q′, v) are calculated between query texts and the coarse-granularity video embedded characterizations of unmatched triplets; in the second part, the similarity between the text embedded characterization of the query text and the fine-granularity video embedded characterization is substituted into the expression in the same way; the two parts of loss are combined to obtain the triplet ordering loss.
5. A video retrieval method according to claim 3, wherein the contrast loss expression is:
L_con = − Σ_{(q,v)∈P} log( exp(s(q, v)) / ( exp(s(q, v)) + Σ_{(t,v′)∈N} exp(s(t, v′)) ) )

wherein q and t both denote query texts, v and v′ both denote videos, (q, v) denotes a matched text and video clip pair, (t, v′) denotes an unmatched text and video clip pair, P denotes the set of matched text and video clip pairs, and N denotes the set of unmatched text and video clip pairs;

the contrast loss comprises two parts of loss: when calculating the first part, the similarity between the text embedded characterization of the query text and the coarse-granularity video embedded characterization is substituted into the expression of the contrast loss, i.e. the matched term is calculated between the query text and the coarse-granularity video embedded characterization of the matched triplet, and the unmatched terms are calculated between query texts and the coarse-granularity video embedded characterizations of unmatched triplets; when calculating the second part, the similarity between the text embedded characterization of the query text and the fine-granularity video embedded characterization is substituted into the expression in the same way; the two parts of loss are combined to obtain the contrast loss.
6. A video retrieval method according to claim 3, wherein the expression of L1 regularization loss is:
L_reg = Σ_{i≠j} ‖ s(q_i, q_j) − s(v_i, v_j) ‖_1

wherein ‖·‖_1 is the L1 norm, s(q_i, q_j) denotes the similarity between two query texts q_i and q_j, s(v_i, v_j) denotes the similarity between two video clips v_i and v_j, (q_i, v_i) and (q_j, v_j) are matched text and video clip pairs, and i and j are the sequence numbers of matched text and video clip pairs;

the L1 regularization loss comprises two parts of loss: when calculating the first part, the similarity between the two video clips is calculated from their coarse-granularity video embedded characterizations and combined with the similarity between the two query texts q_i and q_j to obtain the first part of the loss; when calculating the second part, the similarity between the two video clips is calculated from their corresponding fine-granularity video embedded characterizations and combined with the similarity between the two query texts in the same way; the two parts of loss are integrated to obtain the L1 regularization loss.
7. A video retrieval system, comprising:
the text feature extraction module is used for extracting features of the query text to obtain text features;
the feature coding module is used for processing all video segment features and text features of each video in the video library by adopting an attention mechanism respectively to obtain embedded characterization of all video segments and text embedded characterization of each video;
the video level feature extraction module is used for selecting matched video fragments by utilizing the embedded features of all video fragments of each video and the text embedded features in a bipartite graph matching mode, wherein the embedded features of the matched video fragments are called coarse-granularity video embedded features, the coarse-granularity video embedded features are used as guidance, and the embedded features of all video fragments are combined through an attention mechanism to obtain video embedded features which are called fine-granularity video embedded features; the embedded representation of all the video clips is processed through a full connection layer, so that the embedded representation of all the processed video clips is obtained; calculating attention values by utilizing the processed embedded characterizations of all video clips and the coarse-granularity video embedded characterizations, and aggregating the processed embedded characterizations of all video clips by utilizing the attention values to obtain fine-granularity video embedded characterizations;
the similarity calculation module is configured to calculate, for each video, similarity between each video and the query text by using the coarse-granularity video embedding representation and the fine-granularity video embedding representation, and includes: for each video, calculating the similarity of the coarse-granularity video embedded representation and the text embedded representation and the similarity of the fine-granularity video embedded representation and the text embedded representation respectively; the calculated two similarities are weighted and summed to obtain the similarity of each video and the query text;
and the video search result generation module is used for generating a video search result according to the similarity between each video and the query text.
8. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
9. A readable storage medium storing a computer program, which when executed by a processor implements the method according to any one of claims 1-6.
CN202311331941.9A 2023-10-16 2023-10-16 Video retrieval method, system, device and storage medium Active CN117076712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311331941.9A CN117076712B (en) 2023-10-16 2023-10-16 Video retrieval method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311331941.9A CN117076712B (en) 2023-10-16 2023-10-16 Video retrieval method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN117076712A CN117076712A (en) 2023-11-17
CN117076712B true CN117076712B (en) 2024-02-23

Family

ID=88717578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311331941.9A Active CN117076712B (en) 2023-10-16 2023-10-16 Video retrieval method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN117076712B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890700A (en) * 2012-07-04 2013-01-23 北京航空航天大学 Method for retrieving similar video clips based on sports competition videos
CN109905772A (en) * 2019-03-12 2019-06-18 腾讯科技(深圳)有限公司 Video clip querying method, device, computer equipment and storage medium
CN112800979A (en) * 2021-02-01 2021-05-14 南京邮电大学 Dynamic expression recognition method and system based on characterization flow embedded network
WO2021092632A2 (en) * 2021-02-26 2021-05-14 Innopeak Technology, Inc. Weakly-supervised text-based video moment retrieval via cross attention modeling
CN113784199A (en) * 2021-09-10 2021-12-10 中国科学院计算技术研究所 System and method for generating video description text
CN113822135A (en) * 2021-07-21 2021-12-21 腾讯科技(深圳)有限公司 Video processing method, device and equipment based on artificial intelligence and storage medium
CN114003770A (en) * 2021-09-15 2022-02-01 之江实验室 Cross-modal video retrieval method inspired by reading strategy
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN114064967A (en) * 2022-01-18 2022-02-18 之江实验室 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
CN114417056A (en) * 2022-01-20 2022-04-29 山东大学 Video time retrieval method and system based on double-stream Transformer
CN115408558A (en) * 2022-08-23 2022-11-29 浙江工商大学 Long video retrieval method and device based on multi-scale multi-example similarity learning
CN115687687A (en) * 2023-01-05 2023-02-03 山东建筑大学 Video segment searching method and system for open domain query
CN116052054A (en) * 2023-01-31 2023-05-02 上海科技大学 Weak supervision video representation learning method without aligned text in sequence video
CN116226452A (en) * 2023-03-03 2023-06-06 浙江工商大学 Cross-modal video retrieval method and device based on double-branch dynamic distillation learning
CN116226443A (en) * 2023-05-11 2023-06-06 山东建筑大学 Weak supervision video clip positioning method and system based on large-scale video corpus
CN116385946A (en) * 2023-06-06 2023-07-04 山东大学 Video-oriented target fragment positioning method, system, storage medium and equipment
CN116450883A (en) * 2023-04-24 2023-07-18 西安电子科技大学 Video moment retrieval method based on video content fine granularity information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11568247B2 (en) * 2019-03-22 2023-01-31 Nec Corporation Efficient and fine-grained video retrieval


Also Published As

Publication number Publication date
CN117076712A (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN112765306B (en) Intelligent question-answering method, intelligent question-answering device, computer equipment and storage medium
US11238093B2 (en) Video retrieval based on encoding temporal relationships among video frames
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN101449271B (en) Annotated by search
CN109508414B (en) Synonym mining method and device
CN108280114B (en) Deep learning-based user literature reading interest analysis method
US8577882B2 (en) Method and system for searching multilingual documents
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN115563287B (en) Data processing system for obtaining associated object
CN111291177A (en) Information processing method and device and computer storage medium
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN105488077A (en) Content tag generation method and apparatus
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN112989120B (en) Video clip query system and video clip query method
US20180276244A1 (en) Method and system for searching for similar images that is nearly independent of the scale of the collection of images
CN113360646A (en) Text generation method and equipment based on dynamic weight and storage medium
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
CN113220864B (en) Intelligent question-answering data processing system
CN111182364A (en) Short video copyright detection method and system
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN118113815B (en) Content searching method, related device and medium
CN113326701A (en) Nested entity recognition method and device, computer equipment and storage medium
CN107193916B (en) Personalized and diversified query recommendation method and system
CN111475711A (en) Information pushing method and device, electronic equipment and computer readable medium
Meng et al. Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant