CN112883229A - Video-text cross-modal retrieval method and device based on multi-feature-map attention network model - Google Patents

Video-text cross-modal retrieval method and device based on multi-feature-map attention network model

Info

Publication number
CN112883229A
CN112883229A (application CN202110256218.3A)
Authority
CN
China
Prior art keywords
video
feature
text
constraint
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110256218.3A
Other languages
Chinese (zh)
Other versions
CN112883229B (en)
Inventor
吴大衍
郝孝帅
周玉灿
李波
王伟平
孟丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202110256218.3A priority Critical patent/CN112883229B/en
Publication of CN112883229A publication Critical patent/CN112883229A/en
Application granted granted Critical
Publication of CN112883229B publication Critical patent/CN112883229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a video-text cross-modal retrieval method and device based on a multi-feature-map attention network model. The method comprises the following steps: establishing a multi-feature map attention network model that mines the structural relationships among the different modal features of a video and obtains an efficient video feature representation through the exchange of high-level semantic information among the different video features; training the multi-feature map attention network model with a dual-constraint ordering loss function, which comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data, so that texts and videos with similar semantics are close in the embedding space while the original structural characteristics are preserved there; and performing cross-modal retrieval between videos and texts with the trained multi-feature map attention network model. The invention significantly improves the retrieval performance of video-text retrieval.

Description

Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
Technical Field
The invention belongs to the technical field of information, and particularly relates to a video-text cross-modal retrieval method and device based on a multi-feature-map attention network model.
Background
With the rapid development of the Internet, intelligent mobile devices, social media and instant messaging, multimedia data have grown explosively. In recent years, increasing attention has been paid to how users can quickly and accurately find the content they need in massive multimedia data. Video-text cross-modal retrieval is an important retrieval task involving the video and text modalities in multimedia retrieval: given a text query, the task aims to retrieve the corresponding videos, and given a video query, to retrieve the corresponding texts. Conventional methods generally learn the similarity between cross-modal data by constraining, in a joint embedding space, the distance between positive sample pairs from different modalities to be smaller than the distance between negative sample pairs. Owing to the complexity of video data, research on such methods tends to focus on feature representation learning for video and on how to preserve, in the joint embedding space, the similarity relations that videos/texts have in their original feature spaces.
The main shortcomings of the prior art are as follows:
1. Feature representation learning for video: most existing methods use only the visual features of a video and ignore the rich information it contains, such as actions, faces, sound and subtitles. In recent years, methods that aggregate multi-modal features of a video have significantly improved the performance of video-text cross-modal retrieval. However, previous methods often ignore the high-dimensional and heterogeneous nature of the multi-modal features in video and do not take the inherent structural relationships between the features into account, so the fusion efficiency is low and video representation remains a great challenge.
2. Loss function design for cross-modal retrieval: existing video-text retrieval methods generally train the network with a bidirectional max-margin ranking loss based on hard negatives, which mines video-text sample pairs that are difficult to retrieve in the joint embedding space of video and text features, pulls matched sample pairs closer and pushes unmatched sample pairs apart to update the network parameters. However, this ranking loss only considers the semantic relationship between text and video and ignores the semantic relationships within the video/text modality. Exploiting the semantic similarity relationships within videos/texts helps to enhance the representation of each individual video/text sample.
Disclosure of Invention
The invention aims to design a multi-feature map attention network model for video-text cross-modal retrieval. Specifically, the invention provides a multi-feature map attention module that fully mines the structural relationships between the different modal features of a video and obtains a more efficient video feature representation through the exchange of high-level semantic information between the different video features. In addition, the invention designs a new dual-constraint ordering loss function that simultaneously considers the ordering constraint between video-text pairs and the structural constraint inside each single type of data (video/text). This function not only brings texts and videos with similar semantics close together in the embedding space, but also preserves the original structural characteristics in that space.
The technical scheme adopted by the invention is as follows:
a video-text cross-modal retrieval method based on a multi-feature map attention network model comprises the following steps:
establishing a multi-feature map attention network model for mining the structural relationship among different modal features of the video and obtaining efficient video feature representation through high-level semantic information exchange among different video features;
training the multi-feature map attention network model by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and performing cross-modal retrieval of the video and the text by using the trained multi-feature map attention network model.
Further, the multi-feature-map attention network model includes:
the video coding module is responsible for extracting a plurality of features of a video, aggregating them along the time dimension into fixed-length video feature vectors, and realizing the interaction of high-level semantic information among the features through the multi-feature map attention module, finally forming an effective video feature representation;
the text coding module is responsible for coding the query statement into a single vector representation and then projecting the single vector representation into a subspace aiming at each video feature;
and the joint embedding space module is responsible for optimizing joint embedding representation of the video and the text by utilizing the ordering constraint between the video-text pairs and the structural constraint inside the single-type data.
Further, the plurality of features of the video extracted by the video coding module include: object features, motion features, audio features, speech features, caption features, and face features.
Further, the multi-feature map attention module forms the effective video feature representation through the following steps:
the multi-feature map attention module performs a self-attention calculation using a shared attention mechanism, as follows, where e_ij represents the importance of node j to node i and h_i, h_j represent the ith and jth video feature vectors:
e_ij = a(h_i, h_j)
the coefficients e_ij are normalized with a softmax function, as follows, where N_i is the set of neighboring nodes of node i in the graph and α_ij denotes the attention coefficient between node i and node j:
α_ij = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)
feature updating is performed using the obtained attention coefficients and the features of each node to generate the final output feature of each node, according to the following formula, where σ denotes the ReLU nonlinear activation function:
h'_i = σ( Σ_{j∈N_i} α_ij h_j )
finally, the output video features are connected to a single fixed length vector to form an effective video feature representation.
Further, given N text-video pairs, N pairs of embedded features (V_i, T_i) are obtained, where V_i and T_i are the features of the video and the text of the ith text-video pair in the embedding space; for the ordering constraint, two types of triplets (V_i, T_i, T_j) and (T_i, V_i, V_j) are constructed, where i ≠ j; for the structural constraint, two triplets (V_i, V_j, V_k) and (T_i, T_j, T_k) are likewise sampled, where i ≠ j ≠ k; the dual-constraint ordering loss function is then:
[dual-constraint ordering loss; the formula is given as an image in the original document]
where λ balances the influence of the structural constraint of each type of data, and C_ijk(x) is defined as follows:
[definition of C_ijk(x); the formula is given as an image in the original document]
where x_i, x_j, x_k denote trainable feature codes in the joint embedding space, their counterparts in the original space are also used, and sign(x) is a sign function that returns 1 if x is positive, 0 if x is zero, and -1 if x is negative.
A video-text cross-modal retrieval device based on a multi-feature map attention network model adopting the method comprises the following steps:
the model training module is used for establishing a multi-feature map attention network model and training it by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and the cross-modal retrieval module is used for performing cross-modal retrieval of the video-text by utilizing the trained multi-feature map attention network model.
Compared with existing methods, the invention provides a multi-feature graph attention network to aggregate the multiple features in a video. The framework contains a multi-feature map attention module, which makes full use of the exchange of high-level semantic information among the multiple features, enriches the representation of each feature, and finally forms an effective video feature representation. Meanwhile, the invention designs a new dual-constraint ordering loss function that simultaneously considers the ordering constraint between video-text pairs and the structural constraint inside each single type of data (video/text). Experiments show that the retrieval performance of the method is significantly improved on existing video-text retrieval datasets.
Drawings
FIG. 1 is a schematic diagram of the multi-feature map attention network structure.
FIG. 2 is a schematic diagram of the dual-constraint ordering loss function.
FIG. 3 shows the effectiveness of the multi-feature map attention module, where (a) is the MSR-VTT dataset and (b) is the MSVD dataset.
FIG. 4 shows the effect of the hyper-parameter λ of the dual-constraint ordering loss function, where (a) is the MSR-VTT dataset and (b) is the MSVD dataset.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
As shown in FIG. 1, the multi-feature graph attention network model proposed by the invention is composed of three parts. 1) Video coding module: a plurality of features are first extracted from the video, fixed-length video feature vectors are obtained through a temporal aggregation module, the interaction of high-level semantic information among the features is realized through the multi-feature graph attention module to enrich the representation of each feature, and an effective video feature representation is finally formed. 2) Text encoding module: the query sentence is encoded into a single vector representation and then projected into a subspace for each video feature. 3) Joint embedding space module: the joint embedded representation of video and text is optimized using the ordering constraint between video-text pairs and the structural constraint inside each single type of data (video/text).
(I) Video encoding module:
characteristic extraction: to fully utilize various information in video, we use a set of pre-trained models to extract different video features. Specifically, the method uses a SEnet-154 model pre-trained on an ImageNet data set to extract the object features of the video; extracting motion characteristics of the video by using an I3D model trained on a Kinetics data set; extracting the human face features in the video by using a ResNet50 model trained on a VGGFace2 data set; extracting audio features in the video by using a VGGish model trained on a YouTube-8m data set; extracting voice features in the video by using a voice-to-text API of Google cloud; the trained model on the Synth90K data set was used to extract the caption features in the video. Mapping video to M sets of video features
Figure BDA0002967354200000041
Figure BDA0002967354200000042
Represents the ith video feature (subscript var represents the variable length output of the video frame sequence) in a set of video features, wherein the video features refer to objects, motion, audio, speech, subtitles, and facial features in the video. Then, each element of the video feature set is aggregated along the time dimension, and a video feature vector { I with a fixed length is generated for each video(1),...,IM}。
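For illustration, a minimal sketch of the temporal aggregation step is given below. Mean-pooling over the time dimension is used as the aggregation operator, and the expert names, sequence lengths and feature dimensions are assumptions for the example; the patent itself only requires that each variable-length feature sequence be aggregated into a fixed-length vector.

```python
import torch

def aggregate_over_time(feature_sequences):
    """Collapse each variable-length feature sequence F_var^(i) of shape (T_i, D_i)
    into a fixed-length vector I^(i) by mean-pooling over the time dimension.
    Mean-pooling is only one possible choice of aggregation operator."""
    return [seq.mean(dim=0) for seq in feature_sequences]

# Illustrative expert outputs with different lengths and dimensions:
# object (SENet-154 per frame), motion (I3D per clip), audio (VGGish per segment).
experts = [torch.randn(30, 2048), torch.randn(16, 1024), torch.randn(10, 128)]
fixed_vectors = aggregate_over_time(experts)
print([tuple(v.shape) for v in fixed_vectors])  # [(2048,), (1024,), (128,)]
```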
The multi-feature graph attention module: after obtaining the temporally aggregated video feature vectors, we transform them into a feature space of the same dimension using linear projections. The transformed video feature vectors are denoted as:
H = {h_1, h_2, ..., h_M},   (1)
where h_i denotes the ith video feature vector, h_i ∈ R^F, F is the feature dimension, and R denotes the real numbers.
To aggregate these features, we first construct a multi-feature graph for each video. Specifically, we assume that each video is represented by a set of nodes, i.e. one video is represented as H = {h_1, h_2, ..., h_M}, where each node represents a high-level video feature. To enrich the representation of each feature, a Multi-Feature Graph ATtention module (MFGAT) is proposed to realize the interaction and enhancement of high-level semantic information among the multiple features.
After obtaining the multi-feature graph, the multi-feature graph attention module uses a shared attention mechanism a: R^F × R^F → R for the self-attention calculation, where R denotes the real numbers. The formula is as follows:
e_ij = a(h_i, h_j),   (2)
where e_ij represents the importance of node j to node i; any two nodes are related to each other in the multi-feature graph attention module.
To make the coefficients comparable across different nodes, we normalize them (i.e., e_ij) with the softmax function:
α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik),   (3)
where N_i is the set of neighboring nodes of node i in the graph and α_ij denotes the attention coefficient between node i and node j.
In the model of the present invention, the attention mechanism a is a single-layer feed-forward neural network with a parameter vector a ∈ R^{2F}, where R denotes the real numbers, and it uses the LeakyReLU nonlinear activation function. The attention coefficient computed by the attention mechanism can therefore be expressed as:
α_ij = exp( LeakyReLU( a^T [h_i ∥ h_j] ) ) / Σ_{k∈N_i} exp( LeakyReLU( a^T [h_i ∥ h_k] ) ),   (4)
where ∥ is the concatenation operation.
Then, feature updating is carried out using the obtained attention coefficients and the features of each node, so as to generate the final output feature of each node. The feature update formula is:
h'_i = σ( Σ_{j∈N_i} α_ij h_j ),   (5)
where σ denotes the ReLU nonlinear activation function.
We thus obtain the new video features V = {h'_1, h'_2, ..., h'_M}. Finally, the output video features are concatenated into a single fixed-length vector to form the effective video feature representation.
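As a concrete illustration of Equations (1)-(5), the sketch below implements a multi-feature graph attention layer over the M aggregated video features, treating them as nodes of a fully connected graph. The class name, projection dimensions and LeakyReLU slope are assumptions; the layer follows the standard graph-attention formulation that the equations describe and is not claimed to reproduce the patented implementation exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureGraphAttention(nn.Module):
    """Minimal multi-feature graph attention layer (a sketch of Eqs. (1)-(5))."""
    def __init__(self, in_dims, common_dim):
        super().__init__()
        # Eq. (1): one linear projection per expert feature into a shared F-dim space.
        self.proj = nn.ModuleList([nn.Linear(d, common_dim) for d in in_dims])
        # Eq. (4): shared single-layer attention a acting on [h_i || h_j] in R^{2F}.
        self.a = nn.Linear(2 * common_dim, 1, bias=False)
        self.leaky_relu = nn.LeakyReLU(0.2)  # slope 0.2 is an assumption

    def forward(self, expert_vectors):
        # expert_vectors: list of M fixed-length vectors I^(1), ..., I^(M)
        h = torch.stack([p(v) for p, v in zip(self.proj, expert_vectors)])    # (M, F)
        m = h.size(0)
        hi = h.unsqueeze(1).expand(m, m, -1)
        hj = h.unsqueeze(0).expand(m, m, -1)
        e = self.leaky_relu(self.a(torch.cat([hi, hj], dim=-1))).squeeze(-1)  # Eqs. (2)/(4)
        alpha = F.softmax(e, dim=-1)      # Eq. (3): normalize over the neighbours of i
        h_new = F.relu(alpha @ h)         # Eq. (5): sigma = ReLU
        return h_new.flatten()            # concatenate into one fixed-length vector

# Example with three experts of dimensions 2048, 1024 and 128 (assumed values).
mfgat = MultiFeatureGraphAttention([2048, 1024, 128], common_dim=512)
video_repr = mfgat([torch.randn(2048), torch.randn(1024), torch.randn(128)])
```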
(II) Text encoding module:
given a query sentence, each word is firstly input into a pre-training word2vec model published by google to obtain a word code. And then, embedding all word codes through a pre-trained OpenAI-GPT model, and extracting word embedding characteristics considering the context. These word-embedded features are then aggregated into a single feature vector using a NetVLAD aggregation module, resulting in an entire sentence feature representation. Then, we project the entire sentence feature representation into separate subspaces of each video feature using a Gated Encoding Module (GEM). The text representation consists of M subspaces:
Figure BDA0002967354200000061
wherein psiiRepresenting the ith subspace text feature vector, and M representing the number of subspaces. The word2vec model, the OpenAI-GPT model, the netVLAD aggregation module and the gating coding module can be realized by adopting the prior art.
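To make the per-subspace projection concrete, the following sketch uses one common form of gated encoding unit: a linear projection followed by a sigmoid context gate and L2 normalization, in the spirit of Miech et al.'s gated embedding module. Whether the patent's GEM is exactly this variant, as well as the dimensions (a 768-dimensional sentence vector, 512-dimensional subspaces, M = 6 experts), are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedEncodingUnit(nn.Module):
    """One common form of gated encoding: linear projection, sigmoid context
    gate, then L2 normalization. This is an assumed variant, not necessarily
    the exact GEM used by the patent."""
    def __init__(self, sentence_dim, subspace_dim):
        super().__init__()
        self.fc1 = nn.Linear(sentence_dim, subspace_dim)
        self.fc2 = nn.Linear(subspace_dim, subspace_dim)

    def forward(self, s):
        y = self.fc1(s)
        y = y * torch.sigmoid(self.fc2(y))   # element-wise context gating
        return F.normalize(y, dim=-1)

# One GEM per video feature subspace, all fed the same aggregated sentence vector,
# giving the text representation psi = {psi_1, ..., psi_M}.
gems = nn.ModuleList([GatedEncodingUnit(768, 512) for _ in range(6)])
sentence = torch.randn(768)               # e.g. NetVLAD-aggregated word features
psi = [gem(sentence) for gem in gems]     # M subspace text vectors
```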
(III) Joint embedding space module:
joint embedding space learning is a common method in cross-modal retrieval at present. It expects heterogeneous video, text data to be metric learning in a unified space, namely named joint embedding space. In the joint embedding space, the distance between the positive sample pairs of the data of different modes is constrained to be closer than that between the negative sample pairs, so that the similarity among the data of the cross-mode is learned.
The double-constraint ranking loss (DCRL) is described in detail below; it takes into account both the ordering constraint between video-text pairs and the structural constraint within each single type of data (video/text).
Ordering constraint function: the cross-modal ordering constraint is a loss function commonly used in cross-modal retrieval; it aims to bring texts and videos with similar semantics closer together and push dissimilar texts and videos away from each other. Given a video input, the expression is:
d(V_i, T_i) + m < d(V_i, T_j),   (6)
where V_i (the anchor) and T_i (the positive sample) are the multi-modal embedding-space features of the ith video and text, T_j (the negative sample) is the jth text feature, d(V, T) denotes the distance between the two features in the embedding space, and m is a margin constant. Similarly, given a text input, we set the cross-modal ordering constraint as follows:
d(T_i, V_i) + m < d(T_i, V_j).   (7)
In the ordering constraint function, we use the bi-directional hard-negative ranking loss (Bi-HNRL) (Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In British Machine Vision Conference), which considers only the penalty given by the hardest negative samples.
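A compact sketch of the bi-directional hard-negative ranking loss over a training batch is given below. It operates on a similarity matrix rather than on distances (setting s = -d turns Eqs. (6)-(7) into the hinge terms used here), assumes the matched pairs lie on the diagonal, and uses an assumed margin of 0.2.

```python
import torch

def bi_hard_negative_ranking_loss(sim, margin=0.2):
    """Bi-directional hard-negative ranking loss (VSE++ style).

    sim: (N, N) similarity matrix between N videos (rows) and their paired
    texts (columns); sim[i, i] scores the matched pair. Only the hardest
    in-batch negative contributes for each query."""
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # video -> text: hinge against every negative text, keep the hardest one
    cost_v2t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    # text -> video: hinge against every negative video, keep the hardest one
    cost_t2v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)

    return cost_v2t.max(dim=1)[0].mean() + cost_t2v.max(dim=0)[0].mean()
```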
Structural constraint function: if only the cross-modal ordering constraint is used to train the whole embedding network, the structural properties inside each type of data are lost. To solve this problem, we design a structural constraint function. Given any three samples (videos or texts), we can use the video coding module or the text coding module to extract their features. Since these features have not yet entered the joint embedding network, they can be used to measure the similarity within a single type of data. As shown in FIG. 2, by introducing this structural constraint inside each single type of data, the videos/texts keep their inherent structural characteristics in the joint embedding space. For video data, the structural constraint expression is:
[Equation (8): structural constraint for video data, given as an image in the original document]
where V_i, V_j, V_k are the features of the ith, jth and kth videos in the joint embedding space, and the corresponding features in the original video space are also used in the constraint. Similarly, given a text input, we set the intra-modal structural constraint as follows:
[Equation (9): structural constraint for text data, given as an image in the original document]
where T_i, T_j, T_k are the features of the ith, jth and kth texts in the joint embedding space, and the corresponding features in the original text space are also used in the constraint.
Double-constraint ranking loss (DCRL): we combine the ordering constraint with the structural constraint to construct a new ranking loss. Given N text-video pairs, we obtain N pairs of embedded features (V_i, T_i), where V_i and T_i are the features of the video and the text of the ith text-video pair in the embedding space. For the ordering constraint, we build two types of triplets, (V_i, T_i, T_j) and (T_i, V_i, V_j), where i ≠ j. For the structural constraint, we likewise sample two triplets, (V_i, V_j, V_k) and (T_i, T_j, T_k), where i ≠ j ≠ k. The double-constraint ranking loss (DCRL) can then be written as:
[Equation (10): the DCRL loss, given as an image in the original document]
In the equation, λ balances the effect of the structural constraint of each type of data. C_ijk(x) in Equation (10) is defined as follows:
[Equation (11): definition of C_ijk(x), given as an image in the original document]
where x_i, x_j, x_k denote trainable feature codes in the joint embedding space, their counterparts in the original space are also used, and sign(x) is a sign function that returns 1 if x is positive, 0 if x is zero, and -1 if x is negative.
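Equations (8)-(11) appear only as images in this text, so the sketch below is one plausible reading of the structural constraint and of how it is combined with the ranking term: for each sampled triplet, the sign of the distance gap measured in the original feature space should be preserved by the joint embedding, and violations are penalized and weighted by λ. The function names and the exact penalty form are assumptions, not the patent's definition.

```python
import torch

def structural_constraint(emb, orig, idx):
    """Plausible reading of C_ijk: for the triplet (i, j, k), penalize the
    embedded distance gap when its sign disagrees with the sign of the
    original-space gap (sign(x) as described in the text). This is an
    assumption, not the patent's exact formula."""
    i, j, k = idx
    d_emb = torch.dist(emb[i], emb[j]) - torch.dist(emb[i], emb[k])
    d_orig = torch.dist(orig[i], orig[j]) - torch.dist(orig[i], orig[k])
    return torch.relu(-torch.sign(d_orig) * d_emb)

def dual_constraint_ranking_loss(rank_loss, video_terms, text_terms, lam=0.1):
    # DCRL = ranking term + lambda * (video structural terms + text structural terms)
    return rank_loss + lam * (sum(video_terms) + sum(text_terms))
```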
(IV) Performing video-text cross-modal retrieval with the multi-feature map attention network
The method for performing video-text cross-modal retrieval by adopting the multi-feature map attention network model comprises the following steps:
1) Training stage:
Train the multi-feature map attention network model with the dual-constraint ranking loss until the loss has not decreased for 10 consecutive epochs, at which point training ends; a minimal training loop is sketched below.
2) Retrieval stage:
2.1) Input a query text, compute its similarity to all videos in the video library with the trained multi-feature map attention network model, return a similarity list, and sort the videos by similarity in descending order to obtain the retrieval result, i.e., the corresponding videos.
2.2) Input a query video, compute its similarity to all texts in the text library with the trained multi-feature map attention network model, return a similarity list, and sort the texts by similarity in descending order to obtain the retrieval result, i.e., the corresponding texts.
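The ranking procedure of steps 2.1) and 2.2) can be sketched as follows: embeddings are compared with a similarity measure and candidates are returned in descending order of similarity. Cosine similarity is an assumption (the patent only requires a similarity score), and the same function covers both directions, i.e. a text query against video embeddings or a video query against text embeddings.

```python
import torch

def retrieve(query_emb, candidate_embs, candidate_ids):
    """Rank candidates by cosine similarity to the query and return
    (id, score) pairs from most to least similar."""
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    c = torch.nn.functional.normalize(candidate_embs, dim=-1)
    scores = c @ q                                   # similarity to every candidate
    order = torch.argsort(scores, descending=True)   # descending similarity
    return [(candidate_ids[i], scores[i].item()) for i in order.tolist()]
```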
The key points of the invention are as follows:
1. A multi-feature map attention network model is proposed for video-text cross-modal retrieval. Specifically, a multi-feature map attention module is designed to fully mine the structural relationships among the various features of a video, and an efficient video feature representation is obtained by realizing the interaction of high-level semantic information among the video features.
2. A new dual-constraint ordering loss function is designed, which simultaneously considers the ordering constraint between video-text pairs and the structural constraint inside each single type of data (video/text). This function not only brings texts and videos with similar semantics close together in the embedding space, but also preserves the original structural characteristics in that space.
The experimental effect of the invention is as follows:
data set: we performed extensive experiments to evaluate the effect of a Multi-Feature Graph ATtention Network Model (MFGATN). Our model was trained and tested on video-text search datasets MSR-VTT and MSVD. Wherein the MSR-VTT dataset consists of 10,000 video segments and 200,000 texts describing the visual content of the video segments, the number of texts per video segment being 20. Wherein 6,513 video clips were used for training, 497 video clips were used for verification, and 2,990 video clips were used for testing. The MSVD data set contains 1970 Youtube videos, each containing about 40 texts. Where 1200 videos were used for training, 100 videos were used for verification, 670 videos were used for testing, noting that we used all text for training and testing.
Evaluation metrics: we use the standard criteria commonly used in retrieval: recall (R@K) and median rank (MedR). R@K denotes the recall among the top-K ranked retrieval results, with K typically set to 1, 5 and 10; higher recall means more correct samples appear in the results, so higher is better. MedR is the median rank of the correct sample in the retrieval list, so lower is better. In addition, to reflect the overall retrieval performance, the R@K values are summed to obtain the Rsum metric.
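For reference, the sketch below computes R@K, MedR and Rsum from an N×N similarity matrix whose diagonal holds the matched query-target pairs; that convention, like the code itself, is an assumption about the evaluation setup rather than part of the original disclosure.

```python
import torch

def recall_and_medr(sim):
    """Compute R@1/5/10, MedR and Rsum from an (N, N) similarity matrix
    with matched pairs on the diagonal."""
    n = sim.size(0)
    order = torch.argsort(sim, dim=1, descending=True)
    # 1-based rank of the correct target for each query
    ranks = (order == torch.arange(n, device=sim.device).view(n, 1)).nonzero()[:, 1] + 1
    metrics = {f"R@{k}": (ranks <= k).float().mean().item() * 100 for k in (1, 5, 10)}
    metrics["MedR"] = ranks.float().median().item()
    metrics["Rsum"] = sum(metrics[f"R@{k}"] for k in (1, 5, 10))
    return metrics
```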
Experimental results: under the same setup and data partitioning, we compare the proposed multi-feature graph attention network (MFGATN) with some of the most advanced existing methods to verify its effectiveness. According to the video features used, video-text cross-modal retrieval methods can be divided into single-feature and multi-feature methods. For the single-feature methods, we compare with VSE, VSE++, W2VV, Dual Encoding, HGR, LJRV and ST. In addition, we compare with several multi-feature methods such as JEMC, Simple localization, MoEE and CE. For the single-feature methods, we directly cite the results of the corresponding papers. For the multi-feature aggregation methods, to ensure a fair comparison, we improve them in two respects: first, the same video features are used; second, the bidirectional max-margin hard-negative ranking loss (Bi-HNRL) is used in training.
Tables 1 and 2 show the overall performance of MFGATN and of all baseline methods on the MSR-VTT and MSVD datasets, respectively. As shown in Table 1, on the MSR-VTT dataset the proposed MFGATN achieves relative improvements of 21%, 14.2% and 11.9% in R@1, R@5 and R@10, respectively, over the prior state-of-the-art CE method. Likewise, as shown in Table 2, on the MSVD dataset MFGATN achieves improvements of 17.6%, 16.7% and 10.7% in R@1, R@5 and R@10, respectively, over the CE method.
In summary, MFGATN has significant advantages over other multi-feature aggregation methods, indicating the effectiveness of our proposed multi-feature graph attention network.
Ablation experiment: we performed extensive ablation experiments to explore the importance of the different components (multi-feature attention module and different training loss functions).
Effectiveness of the multi-feature graph attention module: to further investigate the effectiveness of the multi-feature graph attention module, we designed an ablated variant MFGATN (w/o MFGAT), which removes the MFGAT module from the complete MFGATN. As shown in FIG. 3(a), on the MSR-VTT dataset the complete MFGATN improves by 27.4%, 17.5% and 13.2% in R@1, R@5 and R@10, respectively, compared to MFGATN (w/o MFGAT). Likewise, as shown in FIG. 3(b), on the MSVD dataset the complete MFGATN improves by 14.1%, 9.5% and 6.9% in R@1, R@5 and R@10, respectively. The significant improvement of the complete MFGATN over MFGATN (w/o MFGAT) indicates that the MFGAT module plays an important role in video-text retrieval.
Loss function: we compare the proposed dual-constraint ranking loss function with an existing ranking loss function to verify its effectiveness: (1) the bi-directional hard-negative ranking loss (Bi-HNRL), which only considers the ordering constraint between videos and texts, and (2) the dual-constraint ranking loss (DCRL), which considers both the ordering constraint and the structural constraint. The results, shown in Tables 3 and 4, illustrate the advantage of considering both constraints. Specifically, on the MSR-VTT dataset the proposed DCRL improves over Bi-HNRL by 2.5%, 2.4% and 2.3% in R@1, R@5 and R@10, respectively. Similarly, on the MSVD dataset DCRL achieves relative improvements of 2.7%, 3.8% and 2.97% in R@1, R@5 and R@10.
In addition, λ is a key hyper-parameter that balances the ordering constraint and the structural constraint. We therefore varied this hyper-parameter from 0.1 to 0.5 to explore its effect. FIG. 4 shows the effect of this hyper-parameter on the MSR-VTT and MSVD datasets. It can be seen that the best performance is achieved with λ = 0.1 on the MSR-VTT dataset and with λ = 0.3 on the MSVD dataset.
TABLE 1 MSR-VTT dataset performance comparison [table given as an image in the original document]
TABLE 2 MSVD dataset performance comparison [table given as an image in the original document]
TABLE 3 Ablation experiment loss function comparison (MSR-VTT dataset) [table given as an image in the original document]
TABLE 4 Ablation experiment loss function comparison (MSVD dataset) [table given as an image in the original document]
Based on the same inventive concept, another embodiment of the present invention provides a video-text cross-modal retrieval apparatus based on the multi-feature map attention network model, which uses the above method and includes:
the model training module is used for establishing a multi-feature map attention network model and training it by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and the cross-modal retrieval module is used for performing cross-modal retrieval of the video-text by utilizing the trained multi-feature map attention network model.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
The particular embodiments of the present invention disclosed above are illustrative only and are not intended to be limiting, since various alternatives, modifications, and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The invention should not be limited to the disclosure of the embodiments in the present specification, but the scope of the invention is defined by the appended claims.

Claims (10)

1. A video-text cross-modal retrieval method based on a multi-feature map attention network model is characterized by comprising the following steps:
establishing a multi-feature map attention network model for mining the structural relationship among different modal features of the video and obtaining efficient video feature representation through high-level semantic information exchange among different video features;
training the multi-feature map attention network model by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and performing cross-modal retrieval of the video and the text by using the trained multi-feature map attention network model.
2. The method of claim 1, wherein the multi-feature graph attention network model comprises:
the video coding module is responsible for extracting a plurality of features of a video, aggregating them along the time dimension into fixed-length video feature vectors, and realizing the interaction of high-level semantic information among the features through the multi-feature map attention module, finally forming an effective video feature representation;
the text coding module is responsible for coding the query statement into a single vector representation and then projecting the single vector representation into a subspace aiming at each video feature;
and the joint embedding space module is responsible for optimizing joint embedding representation of the video and the text by utilizing the ordering constraint between the video-text pairs and the structural constraint inside the single-type data.
3. The method of claim 2, wherein the plurality of features of the video extracted by the video encoding module comprises: object features, motion features, audio features, speech features, caption features, and face features.
4. The method of claim 2, wherein the multi-feature map attention module forms the effective video feature representation by:
the multi-feature map attention module performs a self-attention calculation using a shared attention mechanism, as follows, where e_ij represents the importance of node j to node i and h_i, h_j represent the ith and jth video feature vectors:
e_ij = a(h_i, h_j)
the coefficients e_ij are normalized with a softmax function, as follows, where N_i is the set of neighboring nodes of node i in the graph and α_ij denotes the attention coefficient between node i and node j:
α_ij = exp(e_ij) / Σ_{k∈N_i} exp(e_ik)
feature updating is performed using the obtained attention coefficients and the features of each node to generate the final output feature of each node, according to the following formula, where σ denotes the ReLU nonlinear activation function:
h'_i = σ( Σ_{j∈N_i} α_ij h_j )
finally, the output video features are connected to a single fixed length vector to form an effective video feature representation.
5. The method of claim 1, wherein the ordering constraint function is expressed as:
given a video input, the ordering constraint function has the expression:
d(V_i, T_i) + m < d(V_i, T_j)
where the anchor V_i and the positive sample T_i are the multi-modal embedding-space features of the ith video and text, the negative sample T_j denotes the jth text feature, d(V, T) denotes the distance between the two features in the embedding space, and m denotes a margin constant;
similarly, given a text input, the ordering constraint function has the expression:
d(T_i, V_i) + m < d(T_i, V_j)
a bi-directional hard-negative ranking loss is used in the ordering constraint function, which considers only the penalty incurred by the hardest negative samples.
6. The method of claim 5, wherein the structural constraint function is expressed as:
given a video input, the structural constraint function has the expression:
[structural constraint for video data; the formula is given as an image in the original document]
where V_i, V_j, V_k are the features of the ith, jth and kth videos in the joint embedding space, and the corresponding features in the original video space are also used in the constraint;
similarly, given a text input, the structural constraint function is expressed as:
[structural constraint for text data; the formula is given as an image in the original document]
where T_i, T_j, T_k are the features of the ith, jth and kth texts in the joint embedding space, and the corresponding features in the original text space are also used in the constraint.
7. The method according to claim 6, characterized in that, given N text-video pairs, N pairs of embedded features (V_i, T_i) are obtained, where V_i and T_i are the features of the video and the text of the ith text-video pair in the embedding space; for the ordering constraint, two types of triplets (V_i, T_i, T_j) and (T_i, V_i, V_j) are constructed, where i ≠ j; for the structural constraint, two triplets (V_i, V_j, V_k) and (T_i, T_j, T_k) are likewise sampled, where i ≠ j ≠ k; the dual-constraint ordering loss function is then:
[dual-constraint ordering loss; the formula is given as an image in the original document]
where λ balances the influence of the structural constraint of each type of data, and C_ijk(x) is defined as follows:
[definition of C_ijk(x); the formula is given as an image in the original document]
where x_i, x_j, x_k denote trainable feature codes in the joint embedding space, their counterparts in the original space are also used, and sign(x) is a sign function that returns 1 if x is positive, 0 if x is zero, and -1 if x is negative.
8. A multi-feature-map attention network model-based cross-modality video-text retrieval device adopting the method of any one of claims 1 to 7, comprising:
the model training module is used for establishing a multi-feature map attention network model and training it by adopting a dual-constraint ordering loss function, wherein the dual-constraint ordering loss function comprises an ordering constraint function between video-text pairs and a structural constraint function inside single-class data;
and the cross-modal retrieval module is used for performing cross-modal retrieval of the video-text by utilizing the trained multi-feature map attention network model.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
CN202110256218.3A 2021-03-09 2021-03-09 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model Active CN112883229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110256218.3A CN112883229B (en) 2021-03-09 2021-03-09 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110256218.3A CN112883229B (en) 2021-03-09 2021-03-09 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model

Publications (2)

Publication Number Publication Date
CN112883229A true CN112883229A (en) 2021-06-01
CN112883229B CN112883229B (en) 2022-11-15

Family

ID=76053929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110256218.3A Active CN112883229B (en) 2021-03-09 2021-03-09 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model

Country Status (1)

Country Link
CN (1) CN112883229B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868442A (en) * 2021-08-26 2021-12-31 北京中知智慧科技有限公司 Joint retrieval method and device
CN114154587A (en) * 2021-12-10 2022-03-08 北京航空航天大学 Multi-mode event detection method based on complementary content perception
CN114154587B (en) * 2021-12-10 2024-07-05 北京航空航天大学 Multi-mode event detection method based on complementary content perception

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN111753116A (en) * 2019-05-20 2020-10-09 北京京东尚科信息技术有限公司 Image retrieval method, device, equipment and readable storage medium
CN111897913A (en) * 2020-07-16 2020-11-06 浙江工商大学 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN111753116A (en) * 2019-05-20 2020-10-09 北京京东尚科信息技术有限公司 Image retrieval method, device, equipment and readable storage medium
CN111897913A (en) * 2020-07-16 2020-11-06 浙江工商大学 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KEVINCK: "Cross-modal paper: Content-Based Video-Music Retrieval", HTTPS://ZHUANLAN.ZHIHU.COM/P/249004077 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868442A (en) * 2021-08-26 2021-12-31 北京中知智慧科技有限公司 Joint retrieval method and device
CN114154587A (en) * 2021-12-10 2022-03-08 北京航空航天大学 Multi-mode event detection method based on complementary content perception
CN114154587B (en) * 2021-12-10 2024-07-05 北京航空航天大学 Multi-mode event detection method based on complementary content perception

Also Published As

Publication number Publication date
CN112883229B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
Hua et al. Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN109783655A (en) A kind of cross-module state search method, device, computer equipment and storage medium
Shi et al. Learning-to-rank for real-time high-precision hashtag recommendation for streaming news
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
US20140324879A1 (en) Content based search engine for processing unstructured digital data
KR102264899B1 (en) A natural language processing system, a learning method for the same and computer-readable recording medium with program
JP2015162244A (en) Methods, programs and computation processing systems for ranking spoken words
CN113032552B (en) Text abstract-based policy key point extraction method and system
CN113360646A (en) Text generation method and equipment based on dynamic weight and storage medium
CN113111836A (en) Video analysis method based on cross-modal Hash learning
CN109271624A (en) A kind of target word determines method, apparatus and storage medium
CN115408558A (en) Long video retrieval method and device based on multi-scale multi-example similarity learning
Duan et al. Multimodal Matching Transformer for Live Commenting.
CN112883229B (en) Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
CN116977701A (en) Video classification model training method, video classification method and device
Zhuang et al. Synthesis and generation for 3D architecture volume with generative modeling
Xia et al. Content-irrelevant tag cleansing via bi-layer clustering and peer cooperation
Jiang Web-scale multimedia search for internet video content
Tian et al. Multimedia integrated annotation based on common space learning
CN113657116B (en) Social media popularity prediction method and device based on visual semantic relationship
Yu et al. A comprehensive survey of 3d dense captioning: Localizing and describing objects in 3d scenes
Alqhtani et al. A multiple kernel learning based fusion for earthquake detection from multimedia twitter data
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
US20240185629A1 (en) Method, electronic device and computer program product for data processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant