CN112883229A - Video-text cross-modal retrieval method and device based on multi-feature-map attention network model - Google Patents

Video-text cross-modal retrieval method and device based on multi-feature-map attention network model

Info

Publication number
CN112883229A
CN112883229A (application CN202110256218.3A)
Authority
CN
China
Prior art keywords
video
feature
text
constraint
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110256218.3A
Other languages
Chinese (zh)
Other versions
CN112883229B (en)
Inventor
吴大衍
郝孝帅
周玉灿
李波
王伟平
孟丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202110256218.3A priority Critical patent/CN112883229B/en
Publication of CN112883229A publication Critical patent/CN112883229A/en
Application granted granted Critical
Publication of CN112883229B publication Critical patent/CN112883229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a video-text cross-modal retrieval method and device based on a multi-feature-map attention network model. The method comprises the following steps: establishing a multi-feature map attention network model that mines the structural relationships among the different modal features of a video and obtains an efficient video feature representation through the exchange of high-level semantic information among the different video features; training the multi-feature map attention network model with a dual-constraint ordering loss function, which comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data, so that texts and videos with similar semantics are close in the embedding space while the original structural characteristics are preserved there; and performing cross-modal retrieval between videos and texts with the trained multi-feature map attention network model. The invention significantly improves the retrieval performance of video-text retrieval.

Description

Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
Technical Field
The invention belongs to the technical field of information, and particularly relates to a video-text cross-modal retrieval method and device based on a multi-feature-map attention network model.
Background
With the rapid development of the Internet, intelligent mobile devices, social media and instant messaging, multimedia data have grown explosively. In recent years, increasing attention has been paid to how users can quickly and accurately find the content they need in massive multimedia data. Video-text cross-modal retrieval is an important retrieval task involving the video and text modalities in multimedia retrieval: given a text query, the task aims to retrieve the corresponding videos, and given a video query, to retrieve the corresponding texts. Conventional methods generally learn the similarity between cross-modal data by constraining, in a joint embedding space, the distance between positive sample pairs from different modalities to be smaller than the distance between negative sample pairs. Owing to the complexity of video data, research on such methods tends to focus on feature representation learning for video and on how to preserve, in the joint embedding space, the similarity relations that videos/texts have in their original feature spaces.
The main shortcomings of the prior art are as follows:
1. Feature representation learning for video: most existing methods use only the visual features of a video and ignore the rich information it contains, such as actions, faces, sound and subtitles. In recent years, methods that aggregate multi-modal features of a video have significantly improved the performance of video-text cross-modal retrieval. However, previous methods often ignore the high-dimensional and heterogeneous nature of the multi-modal features in video and do not take the inherent structural relationships between the features into account, so the fusion efficiency is low and video representation remains a great challenge.
2. Loss function design for cross-modal retrieval: existing video-text retrieval methods generally train the network with a bidirectional max-margin ranking loss based on hard negatives, which mines video-text sample pairs that are difficult to retrieve in the joint embedding space of video and text features, pulls matched sample pairs closer and pushes unmatched sample pairs apart to update the network parameters. However, this ranking loss only considers the semantic relationship between text and video and ignores the semantic relationships within the video/text modality. Exploiting the semantic similarity relationships within videos/texts helps to enhance the representation of each individual video/text sample.
Disclosure of Invention
The invention aims to design a multi-feature map attention network model for video-text cross-modal retrieval. Specifically, the invention provides a multi-feature map attention module that fully mines the structural relationships between the different modal features of a video and obtains a more efficient video feature representation through the exchange of high-level semantic information between the different video features. In addition, the invention designs a new dual-constraint ordering loss function that simultaneously considers the ordering constraint between video-text pairs and the structural constraint inside each single type of data (video/text). This function not only brings texts and videos with similar semantics close together in the embedding space, but also preserves the original structural characteristics in that space.
The technical scheme adopted by the invention is as follows:
a video-text cross-modal retrieval method based on a multi-feature map attention network model comprises the following steps:
establishing a multi-feature map attention network model for mining the structural relationship among different modal features of the video and obtaining efficient video feature representation through high-level semantic information exchange among different video features;
training the multi-feature map attention network model by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and performing cross-modal retrieval of the video and the text by using the trained multi-feature map attention network model.
Further, the multi-feature-map attention network model includes:
the video coding module is responsible for extracting a plurality of features of a video, aggregating them along the time dimension into fixed-length video feature vectors, and realizing the interaction of high-level semantic information among the features through the multi-feature map attention module, finally forming an effective video feature representation;
the text coding module is responsible for coding the query statement into a single vector representation and then projecting the single vector representation into a subspace aiming at each video feature;
and the joint embedding space module is responsible for optimizing joint embedding representation of the video and the text by utilizing the ordering constraint between the video-text pairs and the structural constraint inside the single-type data.
Further, the plurality of features of the video extracted by the video coding module include: object features, motion features, audio features, speech features, caption features, and face features.
Further, the multi-feature map attention module forms the effective video feature representation through the following steps:
the multi-feature map attention module performs a self-attention calculation using a shared attention mechanism, as follows, where e_ij represents the importance of node j to node i and h_i, h_j represent the ith and jth video feature vectors:
e_ij = a(h_i, h_j)
the coefficients e_ij are normalized with a softmax function, as follows, where N_i is the set of neighboring nodes of node i in the graph and α_ij denotes the attention coefficient between node i and node j:
α_ij = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)
feature updating is performed using the obtained attention coefficients and the features of each node to generate the final output feature of each node, according to the following formula, where σ denotes the ReLU nonlinear activation function:
h'_i = σ( Σ_{j∈N_i} α_ij h_j )
finally, the output video features are connected to a single fixed length vector to form an effective video feature representation.
Further, given N text-video pairs, N pairs of embedded features (V_i, T_i) are obtained, where V_i and T_i are the features of the video and the text of the ith text-video pair in the embedding space; for the ordering constraint, two types of triplets (V_i, T_i, T_j) and (T_i, V_i, V_j) are constructed, where i ≠ j; for the structural constraint, two triplets (V_i, V_j, V_k) and (T_i, T_j, T_k) are likewise sampled, where i ≠ j ≠ k; the dual-constraint ordering loss function is then:
[dual-constraint ordering loss; the formula is given as an image in the original document]
where λ balances the influence of the structural constraint of each type of data, and C_ijk(x) is defined as follows:
[definition of C_ijk(x); the formula is given as an image in the original document]
where x_i, x_j, x_k denote trainable feature codes in the joint embedding space, their counterparts in the original space are also used, and sign(x) is a sign function that returns 1 if x is positive, 0 if x is zero, and -1 if x is negative.
A video-text cross-modal retrieval device based on a multi-feature map attention network model adopting the method comprises the following steps:
the model training module is used for establishing a multi-feature map attention network model and training it by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and the cross-modal retrieval module is used for performing cross-modal retrieval of the video-text by utilizing the trained multi-feature map attention network model.
Compared with existing methods, the invention provides a multi-feature graph attention network to aggregate the multiple features in a video. The framework contains a multi-feature map attention module, which makes full use of the exchange of high-level semantic information among the multiple features, enriches the representation of each feature, and finally forms an effective video feature representation. Meanwhile, the invention designs a new dual-constraint ordering loss function that simultaneously considers the ordering constraint between video-text pairs and the structural constraint inside each single type of data (video/text). Experiments show that the retrieval performance of the method is significantly improved on existing video-text retrieval datasets.
Drawings
FIG. 1 is a schematic diagram of the multi-feature map attention network structure.
FIG. 2 is a schematic diagram of the dual-constraint ordering loss function.
FIG. 3 shows the effectiveness of the multi-feature map attention module, where (a) is the MSR-VTT dataset and (b) is the MSVD dataset.
FIG. 4 shows the effect of the hyper-parameter λ of the dual-constraint ordering loss function, where (a) is the MSR-VTT dataset and (b) is the MSVD dataset.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
As shown in FIG. 1, the multi-feature graph attention network model proposed by the invention is composed of three parts. 1) Video coding module: a plurality of features are first extracted from the video, fixed-length video feature vectors are obtained through a temporal aggregation module, the interaction of high-level semantic information among the features is realized through the multi-feature graph attention module to enrich the representation of each feature, and an effective video feature representation is finally formed. 2) Text encoding module: the query sentence is encoded into a single vector representation and then projected into a subspace for each video feature. 3) Joint embedding space module: the joint embedded representation of video and text is optimized using the ordering constraint between video-text pairs and the structural constraint inside each single type of data (video/text).
(I) Video encoding module:
characteristic extraction: to fully utilize various information in video, we use a set of pre-trained models to extract different video features. Specifically, the method uses a SEnet-154 model pre-trained on an ImageNet data set to extract the object features of the video; extracting motion characteristics of the video by using an I3D model trained on a Kinetics data set; extracting the human face features in the video by using a ResNet50 model trained on a VGGFace2 data set; extracting audio features in the video by using a VGGish model trained on a YouTube-8m data set; extracting voice features in the video by using a voice-to-text API of Google cloud; the trained model on the Synth90K data set was used to extract the caption features in the video. Mapping video to M sets of video features
Figure BDA0002967354200000041
Figure BDA0002967354200000042
Represents the ith video feature (subscript var represents the variable length output of the video frame sequence) in a set of video features, wherein the video features refer to objects, motion, audio, speech, subtitles, and facial features in the video. Then, each element of the video feature set is aggregated along the time dimension, and a video feature vector { I with a fixed length is generated for each video(1),...,IM}。
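For illustration, a minimal sketch of the temporal aggregation step is given below. Mean-pooling over the time dimension is used as the aggregation operator, and the expert names, sequence lengths and feature dimensions are assumptions for the example; the patent itself only requires that each variable-length feature sequence be aggregated into a fixed-length vector.

```python
import torch

def aggregate_over_time(feature_sequences):
    """Collapse each variable-length feature sequence F_var^(i) of shape (T_i, D_i)
    into a fixed-length vector I^(i) by mean-pooling over the time dimension.
    Mean-pooling is only one possible choice of aggregation operator."""
    return [seq.mean(dim=0) for seq in feature_sequences]

# Illustrative expert outputs with different lengths and dimensions:
# object (SENet-154 per frame), motion (I3D per clip), audio (VGGish per segment).
experts = [torch.randn(30, 2048), torch.randn(16, 1024), torch.randn(10, 128)]
fixed_vectors = aggregate_over_time(experts)
print([tuple(v.shape) for v in fixed_vectors])  # [(2048,), (1024,), (128,)]
```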
The multi-feature graph attention module: after obtaining the temporally aggregated video feature vectors, we transform them into a feature space of the same dimension using linear projections. The transformed video feature vectors are denoted as:
H = {h_1, h_2, ..., h_M},   (1)
where h_i denotes the ith video feature vector, h_i ∈ R^F, F is the feature dimension, and R denotes the real numbers.
To aggregate these features, we first construct a multi-feature graph for each video. Specifically, we assume that each video is represented by a set of nodes, i.e. one video is represented as H = {h_1, h_2, ..., h_M}, where each node represents a high-level video feature. To enrich the representation of each feature, a Multi-Feature Graph ATtention module (MFGAT) is proposed to realize the interaction and enhancement of high-level semantic information among the multiple features.
After obtaining the multi-feature graph, the multi-feature graph attention module uses a shared attention mechanism a: R^F × R^F → R for the self-attention calculation, where R denotes the real numbers. The formula is as follows:
e_ij = a(h_i, h_j),   (2)
where e_ij represents the importance of node j to node i; any two nodes are related to each other in the multi-feature graph attention module.
To make the coefficients comparable across different nodes, we normalize them (i.e., e_ij) with the softmax function:
α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik),   (3)
where N_i is the set of neighboring nodes of node i in the graph and α_ij denotes the attention coefficient between node i and node j.
In the model of the present invention, the attention mechanism a is a single-layer feed-forward neural network with a parameter vector a ∈ R^{2F}, where R denotes the real numbers, and it uses the LeakyReLU nonlinear activation function. The attention coefficient computed by the attention mechanism can therefore be expressed as:
α_ij = exp( LeakyReLU( a^T [h_i ∥ h_j] ) ) / Σ_{k∈N_i} exp( LeakyReLU( a^T [h_i ∥ h_k] ) ),   (4)
where ∥ is the concatenation operation.
Then, feature updating is carried out using the obtained attention coefficients and the features of each node, so as to generate the final output feature of each node. The feature update formula is:
h'_i = σ( Σ_{j∈N_i} α_ij h_j ),   (5)
where σ denotes the ReLU nonlinear activation function.
We thus obtain the new video features V = {h'_1, h'_2, ..., h'_M}. Finally, the output video features are concatenated into a single fixed-length vector to form the effective video feature representation.
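As a concrete illustration of Equations (1)-(5), the sketch below implements a multi-feature graph attention layer over the M aggregated video features, treating them as nodes of a fully connected graph. The class name, projection dimensions and LeakyReLU slope are assumptions; the layer follows the standard graph-attention formulation that the equations describe and is not claimed to reproduce the patented implementation exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureGraphAttention(nn.Module):
    """Minimal multi-feature graph attention layer (a sketch of Eqs. (1)-(5))."""
    def __init__(self, in_dims, common_dim):
        super().__init__()
        # Eq. (1): one linear projection per expert feature into a shared F-dim space.
        self.proj = nn.ModuleList([nn.Linear(d, common_dim) for d in in_dims])
        # Eq. (4): shared single-layer attention a acting on [h_i || h_j] in R^{2F}.
        self.a = nn.Linear(2 * common_dim, 1, bias=False)
        self.leaky_relu = nn.LeakyReLU(0.2)  # slope 0.2 is an assumption

    def forward(self, expert_vectors):
        # expert_vectors: list of M fixed-length vectors I^(1), ..., I^(M)
        h = torch.stack([p(v) for p, v in zip(self.proj, expert_vectors)])    # (M, F)
        m = h.size(0)
        hi = h.unsqueeze(1).expand(m, m, -1)
        hj = h.unsqueeze(0).expand(m, m, -1)
        e = self.leaky_relu(self.a(torch.cat([hi, hj], dim=-1))).squeeze(-1)  # Eqs. (2)/(4)
        alpha = F.softmax(e, dim=-1)      # Eq. (3): normalize over the neighbours of i
        h_new = F.relu(alpha @ h)         # Eq. (5): sigma = ReLU
        return h_new.flatten()            # concatenate into one fixed-length vector

# Example with three experts of dimensions 2048, 1024 and 128 (assumed values).
mfgat = MultiFeatureGraphAttention([2048, 1024, 128], common_dim=512)
video_repr = mfgat([torch.randn(2048), torch.randn(1024), torch.randn(128)])
```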
(II) Text encoding module:
given a query sentence, each word is firstly input into a pre-training word2vec model published by google to obtain a word code. And then, embedding all word codes through a pre-trained OpenAI-GPT model, and extracting word embedding characteristics considering the context. These word-embedded features are then aggregated into a single feature vector using a NetVLAD aggregation module, resulting in an entire sentence feature representation. Then, we project the entire sentence feature representation into separate subspaces of each video feature using a Gated Encoding Module (GEM). The text representation consists of M subspaces:
Figure BDA0002967354200000061
wherein psiiRepresenting the ith subspace text feature vector, and M representing the number of subspaces. The word2vec model, the OpenAI-GPT model, the netVLAD aggregation module and the gating coding module can be realized by adopting the prior art.
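To make the per-subspace projection concrete, the following sketch uses one common form of gated encoding unit: a linear projection followed by a sigmoid context gate and L2 normalization, in the spirit of Miech et al.'s gated embedding module. Whether the patent's GEM is exactly this variant, as well as the dimensions (a 768-dimensional sentence vector, 512-dimensional subspaces, M = 6 experts), are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEncodingUnit(nn.Module):
    """One common form of gated encoding: linear projection, sigmoid context
    gate, then L2 normalization. This is an assumed variant, not necessarily
    the exact GEM used by the patent."""
    def __init__(self, sentence_dim, subspace_dim):
        super().__init__()
        self.fc1 = nn.Linear(sentence_dim, subspace_dim)
        self.fc2 = nn.Linear(subspace_dim, subspace_dim)

    def forward(self, s):
        y = self.fc1(s)
        y = y * torch.sigmoid(self.fc2(y))   # element-wise context gating
        return F.normalize(y, dim=-1)

# One GEM per video feature subspace, all fed the same aggregated sentence vector,
# giving the text representation psi = {psi_1, ..., psi_M}.
gems = nn.ModuleList([GatedEncodingUnit(768, 512) for _ in range(6)])
sentence = torch.randn(768)               # e.g. NetVLAD-aggregated word features
psi = [gem(sentence) for gem in gems]     # M subspace text vectors
```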
(III) Joint embedding space module:
joint embedding space learning is a common method in cross-modal retrieval at present. It expects heterogeneous video, text data to be metric learning in a unified space, namely named joint embedding space. In the joint embedding space, the distance between the positive sample pairs of the data of different modes is constrained to be closer than that between the negative sample pairs, so that the similarity among the data of the cross-mode is learned.
The double-constraint ranking loss (DCRL) is described in detail below; it takes into account both the ordering constraint between video-text pairs and the structural constraint within each single type of data (video/text).
Ordering constraint function: the cross-modal ordering constraint is a loss function commonly used in cross-modal retrieval; it aims to bring texts and videos with similar semantics closer together and push dissimilar texts and videos away from each other. Given a video input, the expression is:
d(V_i, T_i) + m < d(V_i, T_j),   (6)
where V_i (the anchor) and T_i (the positive sample) are the multi-modal embedding-space features of the ith video and text, T_j (the negative sample) is the jth text feature, d(V, T) denotes the distance between the two features in the embedding space, and m is a margin constant. Similarly, given a text input, we set the cross-modal ordering constraint as follows:
d(T_i, V_i) + m < d(T_i, V_j).   (7)
In the ordering constraint function, we use the bi-directional hard-negative ranking loss (Bi-HNRL) (Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In British Machine Vision Conference), which considers only the penalty given by the hardest negative samples.
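A compact sketch of the bi-directional hard-negative ranking loss over a training batch is given below. It operates on a similarity matrix rather than on distances (setting s = -d turns Eqs. (6)-(7) into the hinge terms used here), assumes the matched pairs lie on the diagonal, and uses an assumed margin of 0.2.

```python
import torch

def bi_hard_negative_ranking_loss(sim, margin=0.2):
    """Bi-directional hard-negative ranking loss (VSE++ style).

    sim: (N, N) similarity matrix between N videos (rows) and their paired
    texts (columns); sim[i, i] scores the matched pair. Only the hardest
    in-batch negative contributes for each query."""
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # video -> text: hinge against every negative text, keep the hardest one
    cost_v2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    # text -> video: hinge against every negative video, keep the hardest one
    cost_t2v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)

    return cost_v2t.max(dim=1)[0].mean() + cost_t2v.max(dim=0)[0].mean()
```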
Structural constraint function: if only the cross-modal ordering constraint is used to train the whole embedding network, the structural properties inside each type of data are lost. To solve this problem, we design a structural constraint function. Given any three samples (videos or texts), we can use the video coding module or the text coding module to extract their features. Since these features have not yet entered the joint embedding network, they can be used to measure the similarity within a single type of data. As shown in FIG. 2, by introducing this structural constraint inside each single type of data, the videos/texts keep their inherent structural characteristics in the joint embedding space. For video data, the structural constraint expression is:
[Equation (8): structural constraint for video data, given as an image in the original document]
where V_i, V_j, V_k are the features of the ith, jth and kth videos in the joint embedding space, and the corresponding features in the original video space are also used in the constraint. Similarly, given a text input, we set the intra-modal structural constraint as follows:
[Equation (9): structural constraint for text data, given as an image in the original document]
where T_i, T_j, T_k are the features of the ith, jth and kth texts in the joint embedding space, and the corresponding features in the original text space are also used in the constraint.
Double-constraint ranking loss (DCRL): we combine the ordering constraint with the structural constraint to construct a new ranking loss. Given N text-video pairs, we obtain N pairs of embedded features (V_i, T_i), where V_i and T_i are the features of the video and the text of the ith text-video pair in the embedding space. For the ordering constraint, we build two types of triplets, (V_i, T_i, T_j) and (T_i, V_i, V_j), where i ≠ j. For the structural constraint, we likewise sample two triplets, (V_i, V_j, V_k) and (T_i, T_j, T_k), where i ≠ j ≠ k. The double-constraint ranking loss (DCRL) can then be written as:
[Equation (10): the DCRL loss, given as an image in the original document]
In the equation, λ balances the effect of the structural constraint of each type of data. C_ijk(x) in Equation (10) is defined as follows:
[Equation (11): definition of C_ijk(x), given as an image in the original document]
where x_i, x_j, x_k denote trainable feature codes in the joint embedding space, their counterparts in the original space are also used, and sign(x) is a sign function that returns 1 if x is positive, 0 if x is zero, and -1 if x is negative.
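Equations (8)-(11) appear only as images in this text, so the sketch below is one plausible reading of the structural constraint and of how it is combined with the ranking term: for each sampled triplet, the sign of the distance gap measured in the original feature space should be preserved by the joint embedding, and violations are penalized and weighted by λ. The function names and the exact penalty form are assumptions, not the patent's definition.

```python
import torch

def structural_constraint(emb, orig, idx):
    """Plausible reading of C_ijk: for the triplet (i, j, k), penalize the
    embedded distance gap when its sign disagrees with the sign of the
    original-space gap (sign(x) as described in the text). This is an
    assumption, not the patent's exact formula."""
    i, j, k = idx
    d_emb = torch.dist(emb[i], emb[j]) - torch.dist(emb[i], emb[k])
    d_orig = torch.dist(orig[i], orig[j]) - torch.dist(orig[i], orig[k])
    return torch.relu(-torch.sign(d_orig) * d_emb)

def dual_constraint_ranking_loss(rank_loss, video_terms, text_terms, lam=0.1):
    # DCRL = ranking term + lambda * (video structural terms + text structural terms)
    return rank_loss + lam * (sum(video_terms) + sum(text_terms))
```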
(IV) Performing video-text cross-modal retrieval with the multi-feature map attention network
The method for performing video-text cross-modal retrieval by adopting the multi-feature map attention network model comprises the following steps:
1) Training stage:
Train the multi-feature map attention network model with the dual-constraint ranking loss until the loss has not decreased for 10 consecutive epochs, at which point training ends; a minimal training loop is sketched below.
2) Retrieval stage:
2.1) Input a query text, compute its similarity to all videos in the video library with the trained multi-feature map attention network model, return a similarity list, and sort the videos by similarity in descending order to obtain the retrieval result, i.e., the corresponding videos.
2.2) Input a query video, compute its similarity to all texts in the text library with the trained multi-feature map attention network model, return a similarity list, and sort the texts by similarity in descending order to obtain the retrieval result, i.e., the corresponding texts.
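The ranking procedure of steps 2.1) and 2.2) can be sketched as follows: embeddings are compared with a similarity measure and candidates are returned in descending order of similarity. Cosine similarity is an assumption (the patent only requires a similarity score), and the same function covers both directions, i.e. a text query against video embeddings or a video query against text embeddings.

```python
import torch

def retrieve(query_emb, candidate_embs, candidate_ids):
    """Rank candidates by cosine similarity to the query and return
    (id, score) pairs from most to least similar."""
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    c = torch.nn.functional.normalize(candidate_embs, dim=-1)
    scores = c @ q                                   # similarity to every candidate
    order = torch.argsort(scores, descending=True)   # descending similarity
    return [(candidate_ids[i], scores[i].item()) for i in order.tolist()]
```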
The key points of the invention are as follows:
1. A multi-feature map attention network model is proposed for video-text cross-modal retrieval. Specifically, a multi-feature map attention module is designed to fully mine the structural relationships among the various features of a video, and an efficient video feature representation is obtained by realizing the interaction of high-level semantic information among the video features.
2. A new dual-constraint ordering loss function is designed, which simultaneously considers the ordering constraint between video-text pairs and the structural constraint inside each single type of data (video/text). This function not only brings texts and videos with similar semantics close together in the embedding space, but also preserves the original structural characteristics in that space.
The experimental effect of the invention is as follows:
data set: we performed extensive experiments to evaluate the effect of a Multi-Feature Graph ATtention Network Model (MFGATN). Our model was trained and tested on video-text search datasets MSR-VTT and MSVD. Wherein the MSR-VTT dataset consists of 10,000 video segments and 200,000 texts describing the visual content of the video segments, the number of texts per video segment being 20. Wherein 6,513 video clips were used for training, 497 video clips were used for verification, and 2,990 video clips were used for testing. The MSVD data set contains 1970 Youtube videos, each containing about 40 texts. Where 1200 videos were used for training, 100 videos were used for verification, 670 videos were used for testing, noting that we used all text for training and testing.
Evaluation metrics: we use the standard criteria commonly used in retrieval: recall (R@K) and median rank (MedR). R@K denotes the recall among the top-K ranked retrieval results, with K typically set to 1, 5 and 10; higher recall means more correct samples appear in the results, so higher is better. MedR is the median rank of the correct sample in the retrieval list, so lower is better. In addition, to reflect the overall retrieval performance, the R@K values are summed to obtain the Rsum metric.
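For reference, the sketch below computes R@K, MedR and Rsum from an N×N similarity matrix whose diagonal holds the matched query-target pairs; that convention, like the code itself, is an assumption about the evaluation setup rather than part of the original disclosure.

```python
import torch

def recall_and_medr(sim):
    """Compute R@1/5/10, MedR and Rsum from an (N, N) similarity matrix
    with matched pairs on the diagonal."""
    n = sim.size(0)
    order = torch.argsort(sim, dim=1, descending=True)
    # 1-based rank of the correct target for each query
    ranks = (order == torch.arange(n, device=sim.device).view(n, 1)).nonzero()[:, 1] + 1
    metrics = {f"R@{k}": (ranks <= k).float().mean().item() * 100 for k in (1, 5, 10)}
    metrics["MedR"] = ranks.float().median().item()
    metrics["Rsum"] = sum(metrics[f"R@{k}"] for k in (1, 5, 10))
    return metrics
```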
Experimental results: under the same setup and data partitioning, we compare the proposed multi-feature graph attention network (MFGATN) with some of the most advanced existing methods to verify its effectiveness. According to the video features used, video-text cross-modal retrieval methods can be divided into single-feature and multi-feature methods. For the single-feature methods, we compare with VSE, VSE++, W2VV, Dual Encoding, HGR, LJRV and ST. In addition, we compare with several multi-feature methods such as JEMC, Simple localization, MoEE and CE. For the single-feature methods, we directly cite the results of the corresponding papers. For the multi-feature aggregation methods, to ensure a fair comparison, we improve them in two respects: first, the same video features are used; second, the bidirectional max-margin hard-negative ranking loss (Bi-HNRL) is used in training.
Tables 1 and 2 show the overall performance of MFGATN and of all baseline methods on the MSR-VTT and MSVD datasets, respectively. As shown in Table 1, on the MSR-VTT dataset the proposed MFGATN achieves relative improvements of 21%, 14.2% and 11.9% in R@1, R@5 and R@10, respectively, over the prior state-of-the-art CE method. Likewise, as shown in Table 2, on the MSVD dataset MFGATN achieves improvements of 17.6%, 16.7% and 10.7% in R@1, R@5 and R@10, respectively, over the CE method.
In summary, MFGATN has significant advantages over other multi-feature aggregation methods, indicating the effectiveness of our proposed multi-feature graph attention network.
Ablation experiment: we performed extensive ablation experiments to explore the importance of the different components (multi-feature attention module and different training loss functions).
Effectiveness of the multi-feature graph attention module: to further investigate the effectiveness of the multi-feature graph attention module, we designed an ablated variant MFGATN (w/o MFGAT), which removes the MFGAT module from the complete MFGATN. As shown in FIG. 3(a), on the MSR-VTT dataset the complete MFGATN improves by 27.4%, 17.5% and 13.2% in R@1, R@5 and R@10, respectively, compared to MFGATN (w/o MFGAT). Likewise, as shown in FIG. 3(b), on the MSVD dataset the complete MFGATN improves by 14.1%, 9.5% and 6.9% in R@1, R@5 and R@10, respectively. The significant improvement of the complete MFGATN over MFGATN (w/o MFGAT) indicates that the MFGAT module plays an important role in video-text retrieval.
Loss function: we compare the proposed dual-constraint ranking loss function with an existing ranking loss function to verify its effectiveness: (1) the bi-directional hard-negative ranking loss (Bi-HNRL), which only considers the ordering constraint between videos and texts, and (2) the dual-constraint ranking loss (DCRL), which considers both the ordering constraint and the structural constraint. The results, shown in Tables 3 and 4, illustrate the advantage of considering both constraints. Specifically, on the MSR-VTT dataset the proposed DCRL improves over Bi-HNRL by 2.5%, 2.4% and 2.3% in R@1, R@5 and R@10, respectively. Similarly, on the MSVD dataset DCRL achieves relative improvements of 2.7%, 3.8% and 2.97% in R@1, R@5 and R@10.
In addition, λ is a key hyper-parameter that balances the ordering constraint and the structural constraint. We therefore varied this hyper-parameter from 0.1 to 0.5 to explore its effect. FIG. 4 shows the effect of this hyper-parameter on the MSR-VTT and MSVD datasets. It can be seen that the best performance is achieved with λ = 0.1 on the MSR-VTT dataset and with λ = 0.3 on the MSVD dataset.
TABLE 1 MSR-VTT dataset performance comparison [table given as an image in the original document]
TABLE 2 MSVD dataset performance comparison [table given as an image in the original document]
TABLE 3 Ablation experiment loss function comparison (MSR-VTT dataset) [table given as an image in the original document]
TABLE 4 Ablation experiment loss function comparison (MSVD dataset) [table given as an image in the original document]
Based on the same inventive concept, another embodiment of the present invention provides a video-text cross-modal retrieval apparatus based on the multi-feature map attention network model, which uses the above method and includes:
the model training module is used for establishing a multi-feature map attention network model and training it by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and the cross-modal retrieval module is used for performing cross-modal retrieval of the video-text by utilizing the trained multi-feature map attention network model.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
The particular embodiments of the present invention disclosed above are illustrative only and are not intended to be limiting, since various alternatives, modifications, and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The invention should not be limited to the disclosure of the embodiments in the present specification, but the scope of the invention is defined by the appended claims.

Claims (10)

1. A video-text cross-modal retrieval method based on a multi-feature map attention network model is characterized by comprising the following steps:
establishing a multi-feature map attention network model for mining the structural relationship among different modal features of the video and obtaining efficient video feature representation through high-level semantic information exchange among different video features;
training the multi-feature map attention network model by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and performing cross-modal retrieval of the video and the text by using the trained multi-feature map attention network model.
2. The method of claim 1, wherein the multi-feature graph attention network model comprises:
the video coding module is responsible for extracting a plurality of features of a video, aggregating them along the time dimension into fixed-length video feature vectors, and realizing the interaction of high-level semantic information among the features through the multi-feature map attention module, finally forming an effective video feature representation;
the text coding module is responsible for coding the query statement into a single vector representation and then projecting the single vector representation into a subspace aiming at each video feature;
and the joint embedding space module is responsible for optimizing joint embedding representation of the video and the text by utilizing the ordering constraint between the video-text pairs and the structural constraint inside the single-type data.
3. The method of claim 2, wherein the plurality of features of the video extracted by the video encoding module comprises: object features, motion features, audio features, speech features, caption features, and face features.
4. The method of claim 2, wherein the multi-feature map attention module forms the effective video feature representation by:
the multi-feature map attention module performs a self-attention calculation using a shared attention mechanism, as follows, where e_ij represents the importance of node j to node i and h_i, h_j represent the ith and jth video feature vectors:
e_ij = a(h_i, h_j)
the coefficients e_ij are normalized with a softmax function, as follows, where N_i is the set of neighboring nodes of node i in the graph and α_ij denotes the attention coefficient between node i and node j:
α_ij = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)
feature updating is performed using the obtained attention coefficients and the features of each node to generate the final output feature of each node, according to the following formula, where σ denotes the ReLU nonlinear activation function:
h'_i = σ( Σ_{j∈N_i} α_ij h_j )
finally, the output video features are connected to a single fixed length vector to form an effective video feature representation.
5. The method of claim 1, wherein the ordering constraint function is expressed as:
given a video input, the ordering constraint function has the expression:
d(V_i, T_i) + m < d(V_i, T_j)
where the anchor V_i and the positive sample T_i are the multi-modal embedding-space features of the ith video and text, the negative sample T_j denotes the jth text feature, d(V, T) denotes the distance between the two features in the embedding space, and m denotes a margin constant;
similarly, given a text input, the ordering constraint function has the expression:
d(T_i, V_i) + m < d(T_i, V_j)
a bi-directional hard-negative ranking loss is used in the ordering constraint function, which considers only the penalty incurred by the hardest negative samples.
6. The method of claim 5, wherein the structural constraint function is expressed as:
given a video input, the structural constraint function has the expression:
[structural constraint for video data; the formula is given as an image in the original document]
where V_i, V_j, V_k are the features of the ith, jth and kth videos in the joint embedding space, and the corresponding features in the original video space are also used in the constraint;
similarly, given a text input, the structural constraint function is expressed as:
[structural constraint for text data; the formula is given as an image in the original document]
where T_i, T_j, T_k are the features of the ith, jth and kth texts in the joint embedding space, and the corresponding features in the original text space are also used in the constraint.
7. The method according to claim 6, characterized in that, given N text-video pairs, N pairs of embedded features (V_i, T_i) are obtained, where V_i and T_i are the features of the video and the text of the ith text-video pair in the embedding space; for the ordering constraint, two types of triplets (V_i, T_i, T_j) and (T_i, V_i, V_j) are constructed, where i ≠ j; for the structural constraint, two triplets (V_i, V_j, V_k) and (T_i, T_j, T_k) are likewise sampled, where i ≠ j ≠ k; the dual-constraint ordering loss function is then:
[dual-constraint ordering loss; the formula is given as an image in the original document]
where λ balances the influence of the structural constraint of each type of data, and C_ijk(x) is defined as follows:
[definition of C_ijk(x); the formula is given as an image in the original document]
where x_i, x_j, x_k denote trainable feature codes in the joint embedding space, their counterparts in the original space are also used, and sign(x) is a sign function that returns 1 if x is positive, 0 if x is zero, and -1 if x is negative.
8. A multi-feature-map attention network model-based cross-modality video-text retrieval device adopting the method of any one of claims 1 to 7, comprising:
the model training module is used for establishing a multi-feature map attention network model and training it by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and the cross-modal retrieval module is used for performing cross-modal retrieval of the video-text by utilizing the trained multi-feature map attention network model.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
CN202110256218.3A 2021-03-09 2021-03-09 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model Active CN112883229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110256218.3A CN112883229B (en) 2021-03-09 2021-03-09 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110256218.3A CN112883229B (en) 2021-03-09 2021-03-09 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model

Publications (2)

Publication Number Publication Date
CN112883229A true CN112883229A (en) 2021-06-01
CN112883229B CN112883229B (en) 2022-11-15

Family

ID=76053929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110256218.3A Active CN112883229B (en) 2021-03-09 2021-03-09 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model

Country Status (1)

Country Link
CN (1) CN112883229B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868442A (en) * 2021-08-26 2021-12-31 北京中知智慧科技有限公司 Joint retrieval method and device
CN114154587A (en) * 2021-12-10 2022-03-08 北京航空航天大学 Multi-mode event detection method based on complementary content perception
CN114154587B (en) * 2021-12-10 2024-07-05 北京航空航天大学 Multi-mode event detection method based on complementary content perception

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN111753116A (en) * 2019-05-20 2020-10-09 北京京东尚科信息技术有限公司 Image retrieval method, device, equipment and readable storage medium
CN111897913A (en) * 2020-07-16 2020-11-06 浙江工商大学 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN111753116A (en) * 2019-05-20 2020-10-09 北京京东尚科信息技术有限公司 Image retrieval method, device, equipment and readable storage medium
CN111897913A (en) * 2020-07-16 2020-11-06 浙江工商大学 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KEVINCK: "Cross-modal paper: Content-Based Video-Music Retrieval", HTTPS://ZHUANLAN.ZHIHU.COM/P/249004077 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868442A (en) * 2021-08-26 2021-12-31 北京中知智慧科技有限公司 Joint retrieval method and device
CN114154587A (en) * 2021-12-10 2022-03-08 北京航空航天大学 Multi-mode event detection method based on complementary content perception
CN114154587B (en) * 2021-12-10 2024-07-05 北京航空航天大学 Multi-mode event detection method based on complementary content perception

Also Published As

Publication number Publication date
CN112883229B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
Hua et al. Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN109783655A (en) A kind of cross-module state search method, device, computer equipment and storage medium
Shi et al. Learning-to-rank for real-time high-precision hashtag recommendation for streaming news
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
US20140324879A1 (en) Content based search engine for processing unstructured digital data
KR102264899B1 (en) A natural language processing system, a learning method for the same and computer-readable recording medium with program
JP2015162244A (en) Methods, programs and computation processing systems for ranking spoken words
CN113032552B (en) Text abstract-based policy key point extraction method and system
CN113360646A (en) Text generation method and equipment based on dynamic weight and storage medium
CN113111836A (en) Video analysis method based on cross-modal Hash learning
CN109271624A (en) A kind of target word determines method, apparatus and storage medium
CN115408558A (en) Long video retrieval method and device based on multi-scale multi-example similarity learning
Duan et al. Multimodal Matching Transformer for Live Commenting.
CN112883229B (en) Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
CN116977701A (en) Video classification model training method, video classification method and device
Zhuang et al. Synthesis and generation for 3D architecture volume with generative modeling
Xia et al. Content-irrelevant tag cleansing via bi-layer clustering and peer cooperation
Jiang Web-scale multimedia search for internet video content
Tian et al. Multimedia integrated annotation based on common space learning
CN113657116B (en) Social media popularity prediction method and device based on visual semantic relationship
Yu et al. A comprehensive survey of 3d dense captioning: Localizing and describing objects in 3d scenes
Alqhtani et al. A multiple kernel learning based fusion for earthquake detection from multimedia twitter data
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
US20240185629A1 (en) Method, electronic device and computer program product for data processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant