CN117725261A - Cross-modal retrieval method, device, equipment and medium for video text - Google Patents

Cross-modal retrieval method, device, equipment and medium for video text

Info

Publication number
CN117725261A
CN117725261A (application CN202311654788.3A)
Authority
CN
China
Prior art keywords
text
video
loss
module
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311654788.3A
Other languages
Chinese (zh)
Inventor
赵山
马文涛
袁鹏飞
辜丽川
吴晓倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202311654788.3A priority Critical patent/CN117725261A/en
Publication of CN117725261A publication Critical patent/CN117725261A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal retrieval method, device, equipment and medium for video text, relating to the technical field of cross-modal retrieval and comprising the following steps: obtaining a video feature representation and a text feature representation; establishing a multi-modal contrastive knowledge distillation model, wherein the multi-modal contrastive knowledge distillation model comprises a multi-modal contrastive loss teacher module and a pair-wise ranking loss student module; inputting the video feature representation and the text feature representation into the multi-modal contrastive loss teacher module for training, so as to obtain a common representation space containing the feature representations of the multiple modalities and a soft-label similarity matrix; inputting the video feature representation and the text feature representation into the pair-wise ranking loss student module for training; and performing cross-modal retrieval with the pair-wise ranking loss student module. According to the invention, through the collaborative training based on the soft-label similarity matrix, the multi-modal contrastive loss and the pair-wise ranking loss student module, robust information is transferred to the pair-wise ranking loss student module, thereby improving the retrieval accuracy of the student module.

Description

Cross-modal retrieval method, device, equipment and medium for video text
Technical Field
The invention relates to the technical field of cross-modal retrieval, in particular to a cross-modal retrieval method, device, equipment and medium for video text.
Background
With the rapid development of mobile internet and digital media, multimedia data using video as a carrier is continuously generated in network space. Video text cross-modal retrieval has received extensive attention in academia and industry as a promising data management technique.
Contrastive learning is a self-supervised learning framework that can learn highly discriminative feature representations to improve downstream task performance, and it has achieved remarkable results in multi-modal applications such as cross-modal retrieval. While existing approaches achieve good results in image-text matching through a contrastive loss, learning the semantic alignment between video and text is more challenging because video contains more complex spatio-temporal information than images. Knowledge distillation is essentially a model-agnostic compression strategy for producing efficient models within a teacher-student paradigm while maintaining performance, i.e., the knowledge extracted from a large model is passed to another model as a supervisory signal. It has been widely used in fields including recommendation systems and cross-modal retrieval.
In a video-text retrieval task, one video may be described by multiple text descriptions, and one text description may also correspond to multiple different videos. Existing video-text cross-modal retrieval methods tend to learn a common representation space in which samples from different modalities are compared under a similarity metric through a pair-wise ranking loss: matched video-text pairs are pulled towards each other, while negative pairs are pushed apart. The pair-wise ranking loss focuses on the distance between the global feature representations of video and text.
However, semantic concepts in videos tend to be complex: one video may contain multiple objects, scenes and actions, and one text description may also involve multiple aspects. This leads to multiple possible representations in the common representation space and makes that space unstable, so the cross-modal information a student module learns from it is not robust, and the retrieval results obtained when using the student module for cross-modal retrieval are inaccurate.
Disclosure of Invention
The embodiment of the invention provides a cross-modal retrieval method, device, equipment and medium for video text, which can solve the problem in the prior art that the retrieval results of the student module in a knowledge distillation model are inaccurate because the common representation space is unstable.
The embodiment of the invention provides a cross-modal retrieval method for video text, which comprises the following steps:
obtaining video feature representations of video data and text feature representations of text data describing the video data;
establishing a multi-modal contrastive knowledge distillation model comprising a multi-modal contrastive loss teacher module and a pair-wise ranking loss student module;
inputting the video feature representations and the text feature representations into the multi-modal contrastive loss teacher module, and training it with a probability-based multi-modal contrastive loss function, wherein the teacher module comprises a shared neural network layer for mapping the video feature representations and the text feature representations into a common representation space;
obtaining the trained common representation space from the trained multi-modal contrastive loss teacher module, and outputting a soft-label similarity matrix from the trained common representation space;
inputting the video feature representations and the text feature representations into the pair-wise ranking loss student module, and training the student module by minimizing the difference between its output and the corresponding soft-label similarity matrix;
and performing cross-modal retrieval of video text with the trained pair-wise ranking loss student module.
Further, the multi-modal contrastive loss function used in training the multi-modal contrastive loss teacher module is defined over a mini-batch, wherein $x_j^i$ denotes the data sample of the $i$-th modality of the $j$-th instance, $z_j^i$ denotes the feature representation vector of that data sample, $n$ denotes the number of instance samples in one batch-size, $m$ denotes the number of modalities, $\tau$ is a temperature hyper-parameter, and $P(x_j^i)$ is the probability that $x_j^i$ belongs to instance $j$ among the instance samples containing the $m$ modalities.
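The loss formula itself appears only as an image in the original filing. Based on the surrounding definitions (a mini-batch of $n$ instances, $m$ modalities, temperature $\tau$, and $P(x_j^i)$ being the probability that $x_j^i$ belongs to instance $j$), one standard instance-level formulation consistent with that description is sketched below; it is a reconstruction under these assumptions, not the exact equation of the patent.

```latex
% Plausible reconstruction of the instance-level multi-modal contrastive loss
% (an assumption based on the surrounding text, not the filing's exact formula).
% z_j^i = g_i(x_j^i) \in \mathbb{R}^L is the embedded feature of modality i of instance j,
% and S(\cdot,\cdot) is the cosine similarity used elsewhere in the method.
\begin{aligned}
P\!\left(x_j^i\right) &=
  \frac{\sum_{k \neq i} \exp\!\left(S\!\left(z_j^i, z_j^k\right)/\tau\right)}
       {\sum_{(l,k) \neq (j,i)} \exp\!\left(S\!\left(z_j^i, z_l^k\right)/\tau\right)}, \\[4pt]
L_{mc} &= -\frac{1}{nm}\sum_{j=1}^{n}\sum_{i=1}^{m}\log P\!\left(x_j^i\right).
\end{aligned}
```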
Further, the loss function used in training the pair-wise ranking loss student module is as follows:

$L = \alpha L_{rank} + (1-\alpha) L_{mc}$

$L_{rank} = [\Delta + S(V^+, T^-) - S(V^+, T^+)] + [\Delta + S(T^+, V^-) - S(T^+, V^+)]$

wherein the loss function $L$ measures the difference between the output of the pair-wise ranking loss student module and the corresponding soft-label similarity matrix, $L_{rank}$ is the pair-wise ranking loss, and $\alpha$ is a balancing parameter; $V$ and $T$ denote the video feature representation and the text feature representation respectively, $(V^+, T^+)$ and $(T^+, V^+)$ denote matched positive pairs of video and text feature representations, $(V^+, T^-)$ and $(T^+, V^-)$ denote mismatched negative pairs, $S(\cdot,\cdot)$ denotes the similarity metric (cosine distance), and $\Delta$ is a predefined margin (boundary threshold).
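A minimal PyTorch sketch of these two formulas is given below, assuming cosine similarity, hard negatives mined within the mini-batch, and the usual hinge clamp on each ranking term; the tensor names and the negative-sampling choice are illustrative assumptions, and $L_{mc}$ is passed in as a value computed by the teacher.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(v, t, delta=0.2):
    """Triplet-style ranking loss over a batch of matched (video, text) pairs.

    v, t: (n, d) embeddings where row i of v matches row i of t.
    delta: the predefined margin (boundary threshold) from L_rank.
    Negatives are taken as the hardest non-matching sample in the batch
    (an illustrative choice; the patent does not fix the sampling strategy).
    """
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    sim = v @ t.t()                          # cosine similarity matrix S(V, T)
    pos = sim.diag()                         # S(V+, T+) = S(T+, V+)

    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_t = sim.masked_fill(mask, -1.0).max(dim=1).values   # hardest S(V+, T-)
    neg_v = sim.masked_fill(mask, -1.0).max(dim=0).values   # hardest S(T+, V-)

    loss = F.relu(delta + neg_t - pos) + F.relu(delta + neg_v - pos)
    return loss.mean()

def total_loss(l_rank, l_mc, alpha=0.5):
    # L = alpha * L_rank + (1 - alpha) * L_mc
    return alpha * l_rank + (1 - alpha) * l_mc
```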
Further, the feature representation vector $z_j^i$ of the data sample of the $i$-th modality of the $j$-th instance is obtained as

$z_j^i = g_i(x_j^i) \in \mathbb{R}^L$

where $L$ is the dimension of the common representation space and $g_i$ is the modality-specific embedding function of the $i$-th modality.
Further, obtaining the video feature representation corresponding to the video data and the text feature representation corresponding to the text data comprises:
encoding the video data with a pre-trained ResNet as the backbone network to obtain the video feature representation;
initializing a text encoder with pre-trained GloVe, and encoding the text data with the initialized text encoder to obtain the text feature representation.
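For illustration, the sketch below encodes video frames with torchvision's ResNet-50 and texts with a GloVe-initialised 300-d embedding layer; the ResNet depth, the mean-pooling over frames and over words, and the tensor layouts are assumptions, since the patent only fixes the ResNet/GloVe backbones and the 300-d word embedding size mentioned later in this specification.

```python
import torch
import torch.nn as nn
from torchvision import models

class VideoEncoder(nn.Module):
    """Frame-level ResNet features, mean-pooled over time (the pooling is an assumed choice)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head

    def forward(self, frames):                  # frames: (batch, n_frames, 3, H, W)
        b, f = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)           # (b*f, 2048)
        return feats.view(b, f, -1).mean(dim=1)                     # (b, 2048)

class TextEncoder(nn.Module):
    """GloVe-initialised 300-d word embeddings, mean-pooled into a sentence vector."""
    def __init__(self, glove_weights):          # glove_weights: (vocab_size, 300) tensor
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)

    def forward(self, token_ids, mask):         # token_ids, mask: (batch, seq_len)
        emb = self.embed(token_ids) * mask.unsqueeze(-1)
        return emb.sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)
```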
Further, the optimization algorithm used in training the multi-modal contrastive loss teacher module and in training the pair-wise ranking loss student module is the Adam optimizer.
A video text cross-modality retrieval device comprising:
the data acquisition module is used for acquiring video feature representations of video data and text feature representations of text data describing the video data;
the model training module is used for:
establishing a multi-modal contrastive knowledge distillation model comprising a multi-modal contrastive loss teacher module and a pair-wise ranking loss student module;
inputting the video feature representations and the text feature representations into the multi-modal contrastive loss teacher module, and training it with a probability-based multi-modal contrastive loss function, wherein the teacher module comprises a shared neural network layer for mapping the video feature representations and the text feature representations into a common representation space;
obtaining the trained common representation space from the trained multi-modal contrastive loss teacher module, and outputting a soft-label similarity matrix from the trained common representation space;
inputting the video feature representations and the text feature representations into the pair-wise ranking loss student module, and training the student module by minimizing the difference between its output and the corresponding soft-label similarity matrix;
and the cross-modal retrieval module is used for performing cross-modal retrieval of video text with the trained pair-wise ranking loss student module.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the video text cross-modality retrieval method described above when the program is executed.
A computer readable storage medium storing a computer program which when executed by a processor implements the video text cross-modality retrieval method described above.
The embodiment of the invention provides a cross-modal retrieval method for video texts, which has the following beneficial effects compared with the prior art:
in the process of training the multi-modal contrast loss teacher module, the invention maximizes the distances between different modalities in the vision-text joint semantic public characterization space through the designed multi-modal contrast loss function introducing probability, obtains a stable public characterization space, then obtains a soft tag similarity matrix based on the stable public characterization space, and transmits robust information to the pairing ordering loss student module based on the soft tag similarity matrix, the multi-modal contrast loss and the pairing ordering loss student module for collaborative training, thereby improving the retrieval accuracy of the ordering loss student module.
Drawings
FIG. 1 is a schematic diagram of the multi-modal contrastive knowledge distillation model;
FIG. 2 is a Pearson correlation visualization on the MSR-VTT dataset;
FIG. 3 is a similarity distribution on the MSR-VTT dataset;
FIG. 4 is a graph showing the effect of different α values on the R@1 performance of the MCKD model.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit of the invention, whereby the invention is not limited to the specific embodiments disclosed below.
Examples
Pair-wise ranking loss (Pair-wise Ranking Loss) is a widely used objective function in video-text cross-modal retrieval tasks; it forces the distance between positive sample pairs to be smaller than the distance between negative sample pairs by a predefined boundary threshold. Existing methods adopt the pair-wise ranking loss and focus on the distance between the global feature representations of video and text. However, using only the pair-wise ranking loss for global feature alignment is challenging, and it is sometimes not suitable for video-text cross-modal retrieval in real-world practice. This is because semantic concepts in video tend to be complex: in common benchmark datasets for video-text retrieval, multiple text sentences can describe the same video from different views. Multi-view textual descriptions therefore pose a unique challenge to the pair-wise ranking loss: the semantic boundaries between texts describing different videos may be ambiguous. The reason is that if a text is only partially aligned with the concepts of its corresponding video, false associations can be created between the text and non-corresponding videos. That is, the text of the corresponding video is not always a reliable positive sample pair for the pair-wise ranking loss. As a result, existing solutions that rely on unreliable positive sample pairs can lead to unstable optimization and to collapse of the common representation space for video-text retrieval. In addition, the pair-wise ranking loss employs a predefined hard similarity (hard similarity) to determine positive and negative pairs; however, hard similarity discards intra-instance and inter-instance dependencies.
According to the above analysis, in order to maintain a stable and discriminative visual-text joint semantic common representation space, it is necessary to capture the semantic relationships between text feature representations and their corresponding video feature representations so as to better screen reliable positive and negative sample pairs.
In particular, there is ambiguity in the semantic boundaries between texts describing different videos, which is clearly undesirable for an instance-level video-text cross-modal retrieval task. To suppress such unreliable text representations, the proposed MCKD model groups each video and its corresponding texts into a "video/text group" based on a strong assumption: each "video/text group" is a distinct instance category. Then, considering that multi-modal data essentially consists of multiple modalities, feature representations are learned by maximizing the consistency between the different modalities in the visual-text joint semantic common representation space, which is a self-supervised way of realizing the classification of "video/text groups". Notably, the "video/text group" can avoid the risk of collapse of the common representation space, but the multiple texts of the same video (corresponding to different views) become indistinguishable under multi-modal contrastive learning; for example, "two dogs are chasing on the lawn" is treated as semantically equivalent to "two dogs on the lawn". Therefore, a pair-wise ranking loss still needs to be employed to ensure the variability between texts within an instance. In addition, the pair-wise ranking loss employs a predefined hard similarity (hard similarity) to determine positive and negative pairs; however, hard similarity discards intra-instance and inter-instance dependencies. The present invention therefore proposes a "soft-label" similarity (soft similarity) for the video-text cross-modal retrieval task. Briefly, a knowledge distillation module is utilized to combine the advantages of the multi-modal contrastive loss and the pair-wise ranking loss: the multi-modal contrastive loss narrows the heterogeneous and semantic gaps between video and text, and then provides "soft-label" supervision information for the pair-wise ranking loss to ensure the variability between texts within an instance.
Referring to FIG. 1, the present disclosure provides a cross-modal retrieval method for video text, comprising the following steps:
1. Establishing a multi-modal contrastive knowledge distillation model for capturing the semantic relationships between text feature representations and their corresponding video feature representations, thereby constructing a stable and discriminative visual-text joint semantic common representation space, wherein the multi-modal contrastive knowledge distillation model comprises:
a multi-modal contrastive loss teacher module, which adopts multi-modal contrastive learning to bridge the semantic gap between video and text, so as to disperse samples across modalities to the maximum extent while compressing samples within each modality in the common representation space, thereby eliminating unreliable text representations; the multi-modal contrastive loss teacher module comprises a shared neural network layer for mapping the video feature representations and text feature representations into the common representation space, and it outputs an inter-instance mutual-information similarity matrix S that provides soft-label supervision information for the pair-wise ranking loss;
and a pair-wise ranking loss student module, which uses the soft-label supervision information transferred from the teacher model and combines the multi-modal contrastive loss with the pair-wise ranking loss to regularize the cross-modal joint semantic common representation space, so as to ensure the variability between texts within an instance.
2. For video encoding, a pre-trained ResNet is used as the backbone network; for text encoding, the word embedding size is set to 300 and the encoder is initialized with pre-trained GloVe. Specifically:
the video data is encoded with a pre-trained ResNet as the backbone network to obtain the video feature representation; a text encoder is initialized with pre-trained GloVe, and the text data is encoded with the initialized text encoder to obtain the text feature representation.
3. Training the multi-modal contrastive knowledge distillation model by combining the pair-wise ranking loss and the multi-modal contrastive loss through the knowledge distillation module, wherein the multi-modal contrastive loss function is defined over a mini-batch: $n$ denotes the number of instance samples within one batch-size, $m$ denotes the number of modalities, $x_j^i$ denotes the data sample (video, text, audio, etc.) of the $i$-th modality of the $j$-th instance, and $P(x_j^i)$ denotes the probability that $x_j^i$ belongs to instance $j$ among the instance samples containing the $m$ modalities.
The feature representation vector of the data sample of the $i$-th modality of the $j$-th instance is obtained as $z_j^i = g_i(x_j^i) \in \mathbb{R}^L$, where $L$ is the dimension of the common representation space and $g_i$ is the modality-specific embedding function of the $i$-th modality.
The pair-wise ranking loss is:

$L_{rank} = [\Delta + S(V^+, T^-) - S(V^+, T^+)] + [\Delta + S(T^+, V^-) - S(T^+, V^+)]$

where $V$ and $T$ denote the input feature representations of video and text respectively, and $S(\cdot,\cdot)$ denotes the similarity metric (cosine distance). For a given quadruple input $(V^+, T^+, V^-, T^-)$, $(V^+, T^+)$ and $(T^+, V^+)$ denote matched positive pairs of video and text feature representations, while $(V^+, T^-)$ and $(T^+, V^-)$ denote mismatched negative pairs. The positive pairs $(V^+, T^+)$ and $(T^+, V^+)$ are pulled towards each other, while the harder negative pairs $(V^+, T^-)$ and $(T^+, V^-)$ are pushed farther apart than the predefined boundary threshold $\Delta$.
4. Performing cross-modal retrieval of video text with the trained multi-modal contrastive knowledge distillation model.
The method specifically comprises the following steps:
pairing order loss student module: the conventional pairing ordering penalty (Pair-wise Ranking Loss) is such that the distance between positive pairs of samples is smaller than the distance between negative pairs of samples by a predefined boundary threshold. However, hard similarity may discard intra-instance and inter-instance dependencies. The invention adopts the soft label similarity for the video-text cross-mode retrieval task, namely the multi-mode contrast loss is realized by reducing the heterogeneous gap and the semantic gap between the video and the text and outputting a similarity matrix S, and then provides soft label supervision information for the pairing sorting loss so as to ensure the difference between the texts in the examples. Specifically, in one batch-size, V and T represent input feature characterizations of video and text, respectively. For a given quaternary input (V + ,T + ,V - ,T - ) Which contains feature characterization vectors of video and text, facing (V + ,T + ) Sum (T) + ,V + ) Will be pulled toward each other, while the more difficult negative pair (V + ,T - ) Sum (T) + ,V - ) Then it is pushed farther than the predefined boundary threshold delta. Namely (V) + ,T + ,V - ,T - ) The sorting penalty for a set of input pairs can be described as:
L rank =[Δ+S(V + ,T - )-S(V + ,T + )]+[Δ+S(T + ,V - )-S(T + ,V + )] (1)
where S (·, ·) represents the metric, cosine distance is employed in the present invention. For a query V + As "video anchor", the corresponding text description should have higher similarity. At the same time, for a query T + As "text anchor," its semantically related video ranking should be higher. Pairing ordering loss is a basic matching strategy that, although widely used, focuses on the distance between global feature characterizations of video and text. Therefore, it is sometimes not applicable in real-world video-text retrieval. For example, given a few video frames that differ slightly in semantics, the model may output similar feature characterizations, resulting in false associations with non-corresponding text.
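The soft-label supervision described above can be sketched as follows: the teacher exposes a similarity matrix S computed in the common representation space, and the student is trained to match it. The patent states that the student minimizes the difference between its output and the soft-label similarity matrix but does not fix how that difference is measured, so the temperature-softened KL divergence below, like the function names, is an assumption.

```python
import torch
import torch.nn.functional as F

def soft_label_matrix(zv_teacher, zt_teacher):
    """Teacher-side soft-label similarity matrix S over a mini-batch,
    computed with cosine similarity in the common representation space."""
    zv = F.normalize(zv_teacher, dim=-1)
    zt = F.normalize(zt_teacher, dim=-1)
    return zv @ zt.t()                                   # (n, n)

def distillation_loss(student_sim, teacher_sim, tau=0.07):
    """Row-wise KL divergence between temperature-softened student and teacher
    similarity distributions -- one assumed way to 'minimize the difference'."""
    p_teacher = F.softmax(teacher_sim.detach() / tau, dim=1)
    log_p_student = F.log_softmax(student_sim / tau, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```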
Therefore, in order to maintain a stable and discriminative visual-text joint semantic common representation space, and inspired by knowledge distillation and contrastive learning that are widely applied in cross-modal retrieval tasks, the invention proposes a novel video-text matching model MCKD, which transfers robust and structured inter-instance mutual information through knowledge distillation and in parallel regularizes the cross-modal joint semantic common representation space through the pair-wise ranking loss.
Multi-modal contrastive loss teacher module: contrastive training on single-modality data is usually realized through data augmentation. In contrast, multi-modal data is naturally composed of multiple modalities, so the data of the multiple modalities of each instance sample can be used very naturally to maximize mutual information. The invention provides an instance-level multi-modal contrastive loss, which explicitly and fully considers the intra-modality and inter-modality distributions so as to increase the mutual information and suppress unreliable semantic text representations within an instance. Specifically, based on the "video/text group" assumption (i.e., each "video/text group" is a distinct instance category, each containing data of the video and text modalities), the probability that $x_j^i$ belongs to instance $j$ among the instance samples containing the $m$ modalities is defined, where $n$ denotes the number of instance samples within one batch-size, $m$ denotes the number of modalities, $x_j^i$ denotes the data sample (video, text, audio, etc.) of the $i$-th modality of the $j$-th instance, $j \in [1, n]$, $i \in [1, m]$, and the feature representation vector of that data sample is obtained as $z_j^i = g_i(x_j^i) \in \mathbb{R}^L$, where $L$ is the dimension of the common representation space and $g_i$ is the modality-specific embedding function of the $i$-th modality.
The multi-modal contrastive loss $L_{mc}$ is then defined over these probabilities (equation (4)). By minimizing equation (4), semantically related positive samples in the common representation space are pulled together (i.e., data regarded as belonging to the same instance: for $z_j^i$, the representations $z_j^k$ of the other modalities of the same instance $j$), while negative samples are pushed apart (i.e., data regarded as not belonging to the same instance: for $z_j^i$, the representations $z_l^k$ with $l \neq j$).
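For the two-modality case (video, text), this instance-level pull-together/push-apart behaviour can be sketched as below; the projection heads standing in for $g_i$ and the symmetric cross-entropy formulation are implementation assumptions consistent with the description rather than the patent's exact construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherModule(nn.Module):
    """Projection heads g_i mapping each modality into the L-dim common space
    (the exact architecture of the shared layer is an assumption)."""
    def __init__(self, d_video, d_text, common_dim=1024):
        super().__init__()
        self.g_video = nn.Linear(d_video, common_dim)   # g_1 for the video modality
        self.g_text = nn.Linear(d_text, common_dim)     # g_2 for the text modality

    def forward(self, v, t):
        zv = F.normalize(self.g_video(v), dim=-1)
        zt = F.normalize(self.g_text(t), dim=-1)
        return zv, zt

def multimodal_contrastive_loss(zv, zt, tau=0.07):
    """Instance-level contrastive loss for m = 2 modalities: the positive for each
    sample is the other modality of the same instance; all other samples in the
    mini-batch act as negatives (symmetric cross-entropy over similarities)."""
    logits = zv @ zt.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```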
The overall loss function of the MCKD model can be described as:
$L = \alpha L_{rank} + (1-\alpha) L_{mc}$
wherein the hyper-parameter α balances the impact of each type of constraint.
The MCKD model uses the stochastic gradient-based Adam optimizer to minimize the joint loss function $L$ in mini-batches. In this way, the MCKD model maximizes the consistency between the intrinsically co-occurring modalities to bridge the gaps between different modalities and mines intra-instance and inter-instance differences, thereby maintaining a stable and discriminative visual-text joint semantic common representation space.
The present specification employs the general-purpose Adam optimizer to train the MCKD model and adopts a two-stage training strategy:
Stage I: first, the backbone network weights of the pre-trained video-text two-tower structure are frozen, and only the proposed multi-modal contrastive loss $L_{mc}$ is used to fine-tune the parameters of the remaining layers; the main purpose of this stage is to suppress ambiguous textual semantic representations.
Stage II: next, once Stage I has converged, video-text matching is trained by cross-distillation (i.e., with the overall loss $L$ combining $L_{mc}$ and $L_{rank}$); that is, the knowledge distillation module is employed to transfer robust and structured inter-instance information.
according to the multi-mode comparison knowledge-based distillation video text cross-mode retrieval method provided by the specification, the common knowledge of a self-supervision teacher model is adopted to correct ambiguity of semantic boundaries. Specifically, the teacher model generates a stable visual-text joint semantic representation space by maximizing inter-modal information in multi-modal contrast learning. Then, the mutual information between the robust and structured examples is transmitted to the student model through the cooperative training of the pairing sorting loss, so that the matching performance of the model is improved.
The specification also provides experimental results:
data set: extensive experiments were performed to evaluate the effect of the multimodal comparative knowledge distillation model. Is trained and tested on video Text retrieval dataset MSR-VTT, TGIF, VATEX and Youtube2 Text. Wherein the MSR-VTT data set is composed of 10k videos, each video being 10-30 seconds long and corresponding to 20 natural text descriptions. 6573 videos are used for training, 497 and 2990 videos are used for verification and testing, respectively. The TGIF dataset consists of GIF formatted videos, each corresponding to 1-3 natural text descriptions. 79451 videos are used for training, 10651 and 11310 videos are used for verification and testing, respectively. The vat data set contains 34991 videos, each containing 10 english and 10 chinese natural text descriptions. 25991 video was used for training, 3000 and 6000 videos were used for verification and testing, respectively. It is noted that this experiment uses only english text descriptions for each video.
The Youtube2Text dataset contains 1970 videos, each with 40 natural-language text descriptions; 1200 videos are used for training, and 100 and 670 videos are used for validation and testing. Only the test set of Youtube2Text is used to evaluate the generalization performance of the proposed MCKD model.
Evaluation metrics: three commonly used evaluation metrics are used to measure the performance of the MCKD model, namely recall (R@K), median rank (MedR) and mean rank (MnR). R@K denotes the proportion of queries for which a correct match appears within the top K positions of the ranked retrieval list, following the conventional settings K = 1, 5 and 10. MedR and MnR denote the median and mean rank of the ground-truth result, respectively; lower is better. Meanwhile, to evaluate the overall performance of the MCKD model, the "Rsum" metric, the sum of all R@K values, is also used.
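For reference, these metrics can be computed from a batch similarity matrix as in the sketch below, assuming one ground-truth gallery item per query placed at the matching index; this is an illustrative helper, not code from the patent.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, Rsum, MedR and MnR from an (n_queries, n_gallery) similarity
    matrix where the ground-truth item of query i sits at gallery index i."""
    order = np.argsort(-sim, axis=1)                        # best match first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1     # 1-based rank of the ground truth
                      for i in range(sim.shape[0])])
    metrics = {f"R@{k}": float(np.mean(ranks <= k)) * 100 for k in ks}
    metrics["Rsum"] = sum(metrics.values())
    metrics["MedR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics
```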
Experimental results: the proposed MCKD model is compared with models of two paradigms (non-graph-based and graph-based) to evaluate its superiority in the video-text cross-modal retrieval task. The two paradigms cover 13 SoTA baseline methods, including VSE, VSE++, W2VV, DualEn, S²Bin, PSM and T2VLAD for the non-graph-based paradigm, and ViSERN, HGR, HANet, HCGC and QAMF for the graph-based paradigm. For fairness of comparison, the released code and feature representations of some methods are used, and the original experimental results reported in some papers are cited directly where appropriate.
Table 1 shows the comparison between the MCKD model and the 13 SoTA baseline methods on the MSR-VTT dataset, from which two observations can be made: 1) The MCKD model achieves the best performance compared with all baseline methods (including traditional and state-of-the-art video-text retrieval methods). MCKD is also superior to the two representative baseline methods HGR and HCGC, even though both of them implement hierarchical reasoning for fine-grained video-text matching. In particular, whereas the HGR method has difficulty exploring video-text hierarchical matching strategies, the MCKD model uses the multi-modal contrastive loss to model the invariance between the instance samples of multi-modal data; HCGC incorporates multi-graph consistency learning into video-text cross-modal matching, and its improved performance clearly demonstrates the advantage of relying on intra-modal and inter-modal interactions. 2) The performance of the MCKD model also exceeds the SoTA competitors DualEn, PSM and T2VLAD on all criteria (including R@1, R@5, R@10 and Rsum). In particular, for the Rsum metric reflecting the overall retrieval quality, the MCKD model achieves relatively large gains of +21.5%, +14.7% and +6.9%, respectively.
Table 1: Performance comparison of the MCKD model on the MSR-VTT dataset
Table 2 compares the performance of the MCKD model with the SoTA baseline methods on the TGIF and VATEX datasets. From the data in the table it can be seen that MCKD consistently performs better on the TGIF dataset than the other SoTA baseline methods. Notably, the same methods achieve lower performance than in Table 1, which indicates that the TGIF dataset is more complex than MSR-VTT. Even so, MCKD still achieves 6.8%, 18.7% and 25.6% on the R@K (K = 1, 5 and 10) metrics, respectively. On the VATEX dataset, MCKD again exceeds all the listed methods and reaches 39.6%, 77.4% and 85.4% on the R@K (K = 1, 5 and 10) metrics, compared with only 36.8%, 73.6% and 83.7% for DualEn.
Table 2: Performance comparison of the MCKD model on the TGIF and VATEX datasets
In conclusion, compared with other SoTA baseline methods, the MCKD has obvious advantages, and the effectiveness of the proposed video text cross-modal retrieval method is shown.
Ablation experiments: a series of ablation experiments were performed on the MCKD model on the MSR-VTT dataset to explore the impact of the different components (i.e., the two-stage training strategy, $L_{rank}$ and $L_{mc}$) on model performance. The experimental results are shown in Table 3, from which the following two conclusions can be drawn:
training strategies: respectively adopt L mc And L rank The pre-trained video-text dual-tower backbone network weights were frozen at Stage I to evaluate the performance of the model. As can be seen from the first two rows of Table 3, L mc Better results were obtained. Due to L rank Of interest is the distance between the video and the global feature representation of the text, where the text of the different video descriptions is ambiguous on semantic boundaries, which may lead to false associations between the text and its non-corresponding video. And L is mc The consistency of mutual information between the intrinsic symbiotic modalities can be maximized to bridge the semantic gap between video text. In Stage II, only L is used rank Or L mc The performance of the MCKD model can continue to be better than Stage I. Even more than some SoTA baseline methods (e.g., dualEn, S 2 Bin and HGR) are better. Furthermore, rather than using only L rank Or L mc The complete MCKD combined with both loss functions has higher performance compared to the MCKD model variant of (a), which suggests that the multimodal contrast loss can remain stable and discriminative vision-text joint semantic public token space and pass robust and structured inter-instance mutual information.
Dual loss functions: the distributions of the video-text feature representations learned with $L_{rank}$ and with $L_{mc}$ are compared to explore whether $L_{mc}$ can learn discriminative feature representations within a modality while robust and structured inter-instance information is transferred to $L_{rank}$. As shown in FIG. 2 and FIG. 3, two observations can be made. During Stage II training, 100 video-text pairs are randomly selected from the MSR-VTT dataset and their features are extracted with $L_{rank}$ and with $L_{mc}$, respectively. Meanwhile, as shown in FIG. 2, a Pearson correlation visualization is performed on the feature representations of the video and text modalities: the lower the Pearson correlation between the two modality representations, the higher their orthogonality. Because $L_{rank}$ explicitly considers the sample distances between instances, it can be observed that after training with $L_{rank}$ the Pearson correlation between the two modality representations is small; in effect, $L_{rank}$ encourages the model to find fine-grained detail information to distinguish semantically similar "video/text groups". To illustrate the distribution of positive and negative sample pairs in the semantic space produced by the different loss functions, the invention also follows previous work and quantitatively visualizes the intra-instance similarity distribution P and the inter-instance similarity distribution Q on the MSR-VTT dataset. Because of the ambiguous semantic boundaries between texts describing different videos, false associations between texts and non-corresponding videos may arise; therefore, as shown in FIG. 3(a), using only $L_{rank}$ leaves a large overlap between the positive and negative pairs (i.e., there are many "difficult" negative samples of high similarity in the visual-text joint semantic common representation space). A quantitative Area metric (the lower the better) is used to score the models trained with only $L_{rank}$, with only $L_{mc}$, and the full MCKD model (using both $L_{rank}$ and $L_{mc}$): Area($L_{rank}$) = 0.3246, Area($L_{mc}$) = 0.2714 and Area(MCKD) = 0.1673. That is, the separability of the features in the different embedded common representation spaces follows MCKD > $L_{mc}$ > $L_{rank}$ (i.e., Area(MCKD) < Area($L_{mc}$) < Area($L_{rank}$)). Thus, the MCKD model provides a stable and discriminative joint semantic common representation space for video-text bi-directional retrieval.
Generalization ability: the proposed MCKD model is also evaluated for its generalization ability on an unseen dataset, i.e., the zero-shot task on the Youtube2Text dataset. Existing state-of-the-art video-text retrieval methods are mainly evaluated on test sets derived from the original training dataset; however, generalizing a trained model to out-of-domain (never seen) data is also an important indicator of performance in practical scenarios. Therefore, the model is trained on the MSR-VTT dataset and then tested on the Youtube2Text test set. As shown in Table 4, the MCKD model still achieves good performance on the Youtube2Text dataset. Compared with the results in Table 1, DualEn and VSE++ achieve good performance on MSR-VTT but have great difficulty generalizing well to the new dataset; the HGR and HCGC models show a similar phenomenon. Compared with the other baseline methods, the MCKD model produces consistent performance gains across different datasets (both in-domain and out-of-domain).
Table 3: Ablation experiments of the MCKD model on the MSR-VTT dataset
Furthermore, α is a key hyper-parameter that balances the pair-wise ranking loss and the multi-modal contrastive loss. Therefore, in the Stage II training of the experiment, the weight ratio between $L_{rank}$ and $L_{mc}$ was manually adjusted in steps of 0.1 (e.g., 0.1 and 0.9, 0.2 and 0.8, etc.) to evaluate the effect of this hyper-parameter. As shown in FIG. 4, the MCKD model achieves stable performance on the MSR-VTT dataset within a relatively dense range (i.e., 0.4 and 0.6, 0.5 and 0.5, 0.6 and 0.4). Therefore, a 1:1 weight ratio is adopted by default in the experiments of the invention.
The specification also provides a generalization performance evaluation of 6 methods on the Youtube2Text dataset, as shown in Table 4.
Table 4: Generalization performance evaluation on the Youtube2Text dataset
For specific limitations on the video text cross-modal retrieval device, reference may be made to the above limitation on the video text cross-modal retrieval method, and no further description is given here. The various modules in the video text cross-modality retrieval device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
The present specification also provides a computer readable storage medium storing a computer program operable to perform the video text cross-modality retrieval method described above.
The present specification also provides a computer device that includes, at a hardware level, a processor, an internal bus, a network interface, a memory, and a non-volatile storage, although it may include hardware required for other services. The processor reads the corresponding computer program from the nonvolatile memory to the memory and then runs the computer program to realize the cross-modal retrieval method of the video text.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (9)

1. A cross-modal retrieval method for video text, comprising the following steps:
obtaining video feature representations of video data and text feature representations of text data describing the video data;
establishing a multi-modal contrastive knowledge distillation model comprising a multi-modal contrastive loss teacher module and a pair-wise ranking loss student module;
inputting the video feature representations and the text feature representations into the multi-modal contrastive loss teacher module, and training it with a probability-based multi-modal contrastive loss function, wherein the teacher module comprises a shared neural network layer for mapping the video feature representations and the text feature representations into a common representation space;
obtaining the trained common representation space from the trained multi-modal contrastive loss teacher module, and outputting a soft-label similarity matrix from the trained common representation space;
inputting the video feature representations and the text feature representations into the pair-wise ranking loss student module, and training the student module by minimizing the difference between its output and the corresponding soft-label similarity matrix;
and performing cross-modal retrieval of video text with the trained pair-wise ranking loss student module.
2. The video text cross-modal retrieval method according to claim 1, wherein the multi-modal contrastive loss function used in training the multi-modal contrastive loss teacher module is defined over a mini-batch, wherein $x_j^i$ denotes the data sample of the $i$-th modality of the $j$-th instance, $z_j^i$ denotes the feature representation vector of that data sample, $n$ denotes the number of instance samples in one batch-size, $m$ denotes the number of modalities, $\tau$ is a temperature hyper-parameter, and $P(x_j^i)$ is the probability that $x_j^i$ belongs to instance $j$ among the instance samples containing the $m$ modalities.
3. The video text cross-modal retrieval method according to claim 2, wherein the loss function used in training the pair-wise ranking loss student module is:

$L = \alpha L_{rank} + (1-\alpha) L_{mc}$

$L_{rank} = [\Delta + S(V^+, T^-) - S(V^+, T^+)] + [\Delta + S(T^+, V^-) - S(T^+, V^+)]$

wherein the loss function $L$ measures the difference between the output of the pair-wise ranking loss student module and the corresponding soft-label similarity matrix, $L_{rank}$ is the pair-wise ranking loss, and $\alpha$ is a balancing parameter; $V$ and $T$ denote the video feature representation and the text feature representation respectively, $(V^+, T^+)$ and $(T^+, V^+)$ denote matched positive pairs of video and text feature representations, $(V^+, T^-)$ and $(T^+, V^-)$ denote mismatched negative pairs, $S(\cdot,\cdot)$ denotes the similarity metric (cosine distance), and $\Delta$ is a predefined margin (boundary threshold).
4. The video text cross-modal retrieval method according to claim 2, wherein the feature representation vector $z_j^i$ of the data sample of the $i$-th modality of the $j$-th instance is obtained as $z_j^i = g_i(x_j^i) \in \mathbb{R}^L$, where $L$ is the dimension of the common representation space and $g_i$ is the modality-specific embedding function of the $i$-th modality.
5. The video text cross-modal retrieval method according to claim 1, wherein obtaining the video feature representation corresponding to the video data and the text feature representation corresponding to the text data comprises:
encoding the video data with a pre-trained ResNet as the backbone network to obtain the video feature representation;
initializing a text encoder with pre-trained GloVe, and encoding the text data with the initialized text encoder to obtain the text feature representation.
6. The video text cross-modal retrieval method according to claim 1, wherein the optimization algorithm used in training the multi-modal contrastive loss teacher module and in training the pair-wise ranking loss student module is the Adam optimizer.
7. A video text cross-modality retrieval device, comprising:
the data acquisition module is used for acquiring video feature representations of video data and text feature representations of text data describing the video data;
the model training module is used for:
establishing a multi-modal contrastive knowledge distillation model comprising a multi-modal contrastive loss teacher module and a pair-wise ranking loss student module;
inputting the video feature representations and the text feature representations into the multi-modal contrastive loss teacher module, and training it with a probability-based multi-modal contrastive loss function, wherein the teacher module comprises a shared neural network layer for mapping the video feature representations and the text feature representations into a common representation space;
obtaining the trained common representation space from the trained multi-modal contrastive loss teacher module, and outputting a soft-label similarity matrix from the trained common representation space;
inputting the video feature representations and the text feature representations into the pair-wise ranking loss student module, and training the student module by minimizing the difference between its output and the corresponding soft-label similarity matrix;
and the cross-modal retrieval module is used for performing cross-modal retrieval of video text with the trained pair-wise ranking loss student module.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding claims 1-6 when executing the program.
9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-6.
CN202311654788.3A 2023-12-05 2023-12-05 Cross-modal retrieval method, device, equipment and medium for video text Pending CN117725261A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311654788.3A CN117725261A (en) 2023-12-05 2023-12-05 Cross-modal retrieval method, device, equipment and medium for video text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311654788.3A CN117725261A (en) 2023-12-05 2023-12-05 Cross-modal retrieval method, device, equipment and medium for video text

Publications (1)

Publication Number Publication Date
CN117725261A true CN117725261A (en) 2024-03-19

Family

ID=90199037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311654788.3A Pending CN117725261A (en) 2023-12-05 2023-12-05 Cross-modal retrieval method, device, equipment and medium for video text

Country Status (1)

Country Link
CN (1) CN117725261A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118051653A (en) * 2024-04-16 2024-05-17 广州云趣信息科技有限公司 Multi-mode data retrieval method, system and medium based on semantic association
CN118227831A (en) * 2024-05-23 2024-06-21 中国科学院自动化研究所 Cross-modal video retrieval method and device and electronic equipment


Similar Documents

Publication Publication Date Title
Xie et al. Representation learning of knowledge graphs with entity descriptions
Wang et al. Transferring deep object and scene representations for event recognition in still images
CN112966127A (en) Cross-modal retrieval method based on multilayer semantic alignment
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN117725261A (en) Cross-modal retrieval method, device, equipment and medium for video text
CN111985538A (en) Small sample picture classification model and method based on semantic auxiliary attention mechanism
CN114239585A (en) Biomedical nested named entity recognition method
Jin et al. Cold-start active learning for image classification
Tian et al. Multi-scale hierarchical residual network for dense captioning
Ma et al. Adaptive multi-feature fusion via cross-entropy normalization for effective image retrieval
CN111563378A (en) Multi-document reading understanding realization method for combined learning
CN114661951A (en) Video processing method and device, computer equipment and storage medium
CN117235605A (en) Sensitive information classification method and device based on multi-mode attention fusion
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
Liu et al. Learning implicit labeling-importance and label correlation for multi-label feature selection with streaming labels
Tian et al. Automatic image annotation with real-world community contributed data set
Ou et al. Improving person re-identification by multi-task learning
Tsai et al. Hierarchical image feature extraction and classification
CN116049434A (en) Construction method and device of power construction safety knowledge graph and electronic equipment
El Amouri et al. Constrained DTW preserving shapelets for explainable time-Series clustering
Xu et al. Multi-modal multi-concept-based deep neural network for automatic image annotation
Tian et al. Exploration of image search results quality assessment
CN115310606A (en) Deep learning model depolarization method and device based on data set sensitive attribute reconstruction
US20170293863A1 (en) Data analysis system, and control method, program, and recording medium therefor
TW202004519A (en) Method for automatically classifying images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination