CN114817627A - Text-to-video cross-modal retrieval method based on multi-face video representation learning - Google Patents

Text-to-video cross-modal retrieval method based on multi-face video representation learning

Info

Publication number
CN114817627A
Authority
CN
China
Prior art keywords
video
text
features
coding
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210425802.1A
Other languages
Chinese (zh)
Inventor
董建锋
陈先客
王勋
刘宝龙
包翠竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202210425802.1A priority Critical patent/CN114817627A/en
Publication of CN114817627A publication Critical patent/CN114817627A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/785Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a text-to-video cross-modal retrieval method based on multi-faceted video representation learning, which comprises the following steps: acquiring preliminary video and text features; grouping the initial video frames according to their different scenes with a video shot-segmentation tool and feeding them into an explicit coding branch for explicit coding, obtaining explicit multi-faceted representations of the different scenes of the video; feeding the initial video features into an implicit coding branch, where a leading-feature multi-attention network encodes them implicitly, obtaining implicit multi-faceted representations expressing different semantic contents of the video; fusing the multi-faceted codes of the two branches to obtain the multi-faceted video feature representation; mapping the multi-faceted video feature representation and the text features into a common space, learning the correlation between the two modalities with a common-space learning algorithm, and training the model in an end-to-end manner, thereby realizing text-to-video cross-modal retrieval. By exploiting the idea of multi-faceted video representation, the method improves retrieval performance.

Description

Text-to-video cross-modal retrieval method based on multi-face video representation learning
Technical Field
The invention relates to the technical field of video cross-modal retrieval, in particular to a text-to-video cross-modal retrieval method based on multi-face video representation learning.
Background
In recent years, with the popularization of the Internet and mobile smart devices and the rapid development of communication and multimedia technologies, a huge amount of multimedia data is created and uploaded to the Internet every day. Data of different modalities, such as text, images and videos, is growing explosively, and multimedia data has become the main source of information for modern users. This is especially true for video data, as people increasingly upload and share videos they create themselves; how to quickly and accurately retrieve the videos a user wants from this mass of videos is a difficult challenge. Text-to-video cross-modal retrieval is one of the key technologies for addressing this challenge.
Existing text-to-video cross-modal retrieval assumes that the videos carry no text labels: the user describes the query intent with a natural-language sentence, and the retrieval model returns the videos most relevant to the query by computing the cross-modal relevance between the text and the videos. The core of this retrieval paradigm is therefore computing the text-video cross-modal relevance. The model structure of existing text-to-video cross-modal retrieval methods is mainly the efficient two-tower form: the video and the corresponding query text are encoded into a video vector and a text vector by their respective feature encoders and then mapped into a common space for representation learning. However, this conventional encoding scheme has the following drawback. Owing to the nature of videos and texts, a video may contain several different scenes as the camera moves or the viewpoint switches during shooting, while a query text may describe only part of the content of the corresponding video, i.e., the query text and the video are only partially relevant. If the video is represented as a single feature vector, the multi-scene information in the video may be blurred, making the video representation inaccurate and ultimately degrading the accuracy of the text-video retrieval results.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a text-to-video cross-modal retrieval method based on multi-face video representation learning.
The purpose of the invention is realized by the following technical scheme: a text-to-video cross-modal retrieval method based on multi-face video representation learning comprises the following steps:
(1) respectively performing feature pre-extraction on the text and the video to obtain the initial features of the two modalities, namely the text and the video;
(2) performing explicit multi-faceted representation coding on the initial video features obtained in step (1), which comprises: grouping the initial video frames according to different scenes using a video shot-segmentation tool, and feeding the grouped initial video features into an explicit coding branch for explicit coding, obtaining explicit multi-faceted representations of the different scenes of the video;
(3) performing implicit multi-faceted representation coding on the initial video features obtained in step (1), which comprises: feeding the initial video features into an implicit coding branch, where they are implicitly encoded by a leading-feature multi-attention network, obtaining implicit multi-faceted representations expressing different semantic contents of the video;
(4) interactively encoding the explicit multi-faceted representations obtained in step (2) and the implicit multi-faceted representations obtained in step (3) to obtain the multi-faceted video feature representation;
(5) encoding the initial text features in a parallel manner to obtain the text features;
(6) respectively mapping the multi-faceted video feature representation obtained in step (4) and the text features obtained in step (5) into a common space, learning the correlation between the two modalities with a common-space learning algorithm, and finally training the model in an end-to-end manner;
(7) realizing text-to-video cross-modal retrieval using the model trained in step (6).
Further, the method for extracting video and text features in step (1) comprises:
(1-1) extracting visual features of an input video frame by using a pre-trained deep convolutional neural network to obtain initial video features;
(1-2) encoding each word of the text by its index in the vocabulary that comes with the BERT model, and using these indices as the initial text features.
Further, in step (2), since the input of the explicit coding branch consists of video groups already segmented by the shot-segmentation tool, only the feature representation within and across these groups needs to be modeled, and the branch does not need to learn complex video segmentation itself; this step comprises the following substeps:
(2-1) using the group index of each frame in the video, applying a group-type embedding to each frame so as to distinguish the scene it belongs to;
(2-2) feeding the type-embedded video frames into a Transformer to model the mutual information among the frames within each video group and among the video groups;
(2-3) aggregating the Transformer outputs group by group using the group index of each frame, obtaining the explicit multi-faceted representation after explicit coding.
Further, the step (3) is specifically:
in the prior art, the global feature is usually obtained by simply performing maximum or average pooling operation on initial video features in the aspect of obtaining video global information, but the video is composed of image sequences and has a time sequence, so that it is very important to add time sequence information into the global feature. And (2) initially encoding the video by using the bi-directional LSTM (bi-LSTM) through the initial video characteristics obtained in the step (1) to obtain hidden states of the bi-directional LSTM at each moment, performing maximum pooling operation on the hidden states to obtain global video characteristics, and simultaneously keeping the hidden states at each moment as video time sequence characteristics.
For a video, it is desirable that the features encoded in the video represent different scenes of the video, which are distinguished from each other. Assuming that the number of output segments of one video setting is n, it is subjected to the leading feature attention coding n times. In each coding process, firstly, a full connection layer (Fc) is used for coding the video global characteristics obtained in the step to obtain specific global attention guide characteristics, corresponding weights are calculated by combining the output characteristics of the previous leading characteristic attention coding to carry out weighted sum on the video time sequence characteristics obtained in the step, and the current section characteristics are output.
And taking the multi-segment output features obtained by attention coding the multiple leading features of the video as the final implicit multi-surface representation of the input video.
Further, in step (3-2), for the i-th encoding, the video global feature q and the video feature e_{i-1} output by the (i-1)-th encoding are concatenated, and a fully connected layer Fc_i with a ReLU activation is used to generate the global attention guiding feature g_i, namely:
g_i = ReLU(Fc_i([q, e_{i-1}]))
A fully connected layer Fc_g reduces the dimension of the global attention guiding feature g_i, giving the feature g'_i; a fully connected layer Fc_v reduces the dimension of the video temporal features f_v, giving features f'_v of the same dimension as g'_i; g'_i is then added to f'_v along the row (frame) dimension, namely:
g'_i = Fc_g(g_i)
f'_v = Fc_v(f_v)
a_i = f'_v ⊕ g'_i
where ⊕ denotes adding g'_i to each frame-level row of f'_v. A tanh activation and a fully connected layer Fc_a are applied to a_i to obtain the aggregation weight α_{i,t} of each video frame, and a weighted sum of the video temporal features f_v gives the i-th output e_i of the leading-feature attention encoding, namely:
α_i = Fc_a(tanh(a_i))
e_i = Σ_{t=1}^{j} α_{i,t} f_v^t
further, the step (4) is specifically as follows: although the expicity coding branch has strong interpretability to the video multi-surface coding, the expicity coding branch is attached with strong subjective information, and unexplainable implicit coding which is learnt by the implicit coding branch is needed to complete information, so an interactive coding module is designed, each implicit characteristic output by the implicit coding branch and all explicit characteristics output by the explicit coding branch calculate cosine similarity, the weight of each explicit characteristic to the current implicit characteristic is obtained, all explicit characteristics are weighted and added with the current implicit characteristic, and multi-surface video characteristic representation is obtained.
Further, step (5) is specifically: for the initial text features obtained in step (1), position encoding (positional encoding) is first applied, and the features are fed into an existing, directly called BERT (Bidirectional Encoder Representations from Transformers) text encoding model to obtain word-level outputs. Meanwhile, given the characteristics of the BERT model, the head ([CLS]) of the text features output by BERT contains the semantics of the whole sentence, so [CLS] is selected as the output feature of the text side.
Further, in the step (6), the method for learning the correlation between the two modalities and training the model by using the common space learning algorithm is as follows:
(6-1) respectively mapping the multi-faceted video feature representation obtained in step (4) and the text feature obtained in step (5) into a unified common space through fully connected layers, with a Batch Normalization (BN) layer used after each fully connected layer;
(6-2) training the model in an end-to-end manner through a triplet ranking loss, so that the model automatically learns the correlation between the two modalities.
Further, the step (7) is specifically:
(7-1) mapping the input text query and all candidate videos into the common space through the trained model;
(7-2) calculating the similarities between the text query and all candidate videos in the common space; since each video outputs multiple features, the maximum similarity between the query text and all output features of the current video is selected as that video's similarity; the candidate videos are then ranked by similarity and the retrieval results are returned.
The invention has the beneficial effects that: the method represents a video from both an explicit and an implicit aspect. For the explicit aspect, an open-source video shot-segmentation tool is used to divide the video into several groups corresponding to different scenes, which are then encoded, achieving a multi-faceted representation. For the implicit aspect, a leading-feature multi-attention network is proposed, applying a video coding network with multi-segment feature output capability to video feature learning for the first time. After the explicit and implicit multi-faceted representations of the video are obtained, the two are interactively encoded by a coding network to obtain the final multi-segment video features; these and the corresponding text features are then mapped into a common space, their correlations in the common space are computed, and the maximum value is taken as the final text-video correlation, realizing text-to-video cross-modal retrieval. By exploiting the idea of multi-faceted video representation, the retrieval performance is greatly improved compared with a generic deep-learning video retrieval model.
Drawings
FIG. 1 is a schematic diagram of an explicit coding network structure of a multi-faceted representation of a video according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implicit coding network structure of a multi-faceted representation of a video according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a common space learning model based on multi-faceted video representation learning according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
In order to solve the problem of cross-modal retrieval from text to video, the invention provides a text-to-video cross-modal retrieval method based on multi-face video representation learning, which comprises the following specific steps in one embodiment:
(1) Features of the two modalities, video and text, are extracted with different feature extraction methods.
(1-1) For a given video, j video frames are uniformly extracted, one every 0.5 seconds, as specified in advance. Depth features are then extracted for each frame using a convolutional neural network (CNN) model pre-trained on the ImageNet dataset, such as a ResNet model. The video can thus be represented by a sequence of feature vectors {v_1, v_2, ..., v_t, ..., v_j}, where v_t denotes the feature vector of the t-th frame. In addition, when the pre-extracted frame feature vectors are fed to the model, the idea of data augmentation is applied and 20% of the frame feature vectors are randomly dropped, further improving the robustness of the model to different video features.
(1-2) While all frame features of a video are being extracted, a video shot-segmentation tool is used to group the initial frames. This embodiment uses the content-aware shot-detection algorithm of the online open-source shot-segmentation tool PySceneDetect. It first converts the RGB values of each video frame into HSV values (the HSV color space consists of hue, saturation and value; although more complex than RGB, it makes objects easier to track and segment), then computes the differences of the HSV values between all pixels of adjacent frames and averages them, and finally decides whether the two frames belong to the same scene by checking whether the computed average difference exceeds a preset threshold. After all video frames have been traversed, each frame has been assigned to its scene, and the scene-group indices are retained. The initial frames of the video are fed into the shot-segmentation tool with the inter-frame HSV difference threshold manually set to 27, yielding s scenes of the video, each scene owning the corresponding subset of frame features. A dictionary is created to store, as key-value pairs, the scene index corresponding to each frame: labels = {v_1: 1, v_2: 1, ..., v_j: s}.
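The content-aware split can be illustrated with a minimal sketch in Python, assuming the frames are available as OpenCV BGR arrays; in practice the PySceneDetect tool mentioned above can be called directly, and its content-aware detector implements a more refined, channel-weighted version of this simple threshold rule.

```python
import cv2
import numpy as np

def split_scenes(frames_bgr, threshold=27.0):
    """Group video frames into scenes by thresholding the mean HSV
    difference between adjacent frames (content-aware shot detection).
    frames_bgr: list of HxWx3 uint8 frames; returns one scene index
    (1-based) per frame."""
    scene_ids = [1]
    prev_hsv = cv2.cvtColor(frames_bgr[0], cv2.COLOR_BGR2HSV).astype(np.float32)
    for frame in frames_bgr[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
        # average absolute HSV difference over all pixels and channels
        diff = np.mean(np.abs(hsv - prev_hsv))
        if diff > threshold:                 # scene cut detected
            scene_ids.append(scene_ids[-1] + 1)
        else:
            scene_ids.append(scene_ids[-1])
        prev_hsv = hsv
    return scene_ids

# labels = {v_1: 1, v_2: 1, ..., v_j: s}, keyed here by frame index
# labels = {t: sid for t, sid in enumerate(split_scenes(frames))}
```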
(1-3) Given a sentence of length l, since the BERT model comes with its own vocabulary, this embodiment encodes each word by its index in that vocabulary. A sequence of vocabulary-index codes {w_1, w_2, ..., w_t, ..., w_l} is thus generated, where w_t denotes the index of the t-th word in the vocabulary. The initial features of the text are thereby extracted.
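A short sketch of this vocabulary-index encoding with the Hugging Face transformers tokenizer; the particular checkpoint name "bert-base-uncased" is an assumption, since the embodiment only requires the vocabulary that ships with BERT.

```python
from transformers import BertTokenizer

# checkpoint name is an assumption; the patent only requires BERT's own vocabulary
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "a man is playing guitar on stage"
# special tokens are added so that the downstream BERT encoder can later use the
# first position ([CLS]) as the sentence-level feature, as described in step (4)
word_indices = tokenizer.encode(sentence, add_special_tokens=True)
print(word_indices)   # e.g. [101, 1037, 2158, ..., 102] -- the sequence {w_1, ..., w_l}
```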
Through the feature extraction above, the initial features of the video and of the text are obtained. These features are merely extracted by the CNN model and vocabulary indexing plus some simple pre-processing; the multi-faceted video feature representation is produced next by the multi-faceted video coding network.
(2) The multi-faceted video coding network performs explicit and implicit multi-faceted coding on the video visual features obtained in step (1), and then interactively encodes the explicit and implicit multi-faceted representations to obtain the multi-faceted video feature representation. The explicit and implicit multi-faceted coding steps are as follows:
(2-1) Explicit multi-faceted coding. Fig. 1 is a schematic diagram of the explicit coding network. For the input video frames {v_1, v_2, ..., v_j} of the explicit coding branch, a type embedding (type-embedding) is first added to each frame feature according to its saved scene index, so as to distinguish the scene each frame belongs to, namely:
v'_t = v_t + TypeEmbed(label_t), t = 1, ..., j
Second, before the video features are fed into the Transformer, since prior work has reported that features perform better when encoded by a Transformer network at dimension 768 or 1024, the features are first reduced in dimension by a linear layer, which also reduces the number of Transformer parameters; the Transformer network used here is configured with 2 layers of 4-head self-attention, namely:
{u_1, ..., u_j} = Transformer(Linear({v'_1, ..., v'_j}))
Then the frames corresponding to each scene are aggregated by average pooling, giving the video multi-faceted representation of the explicit coding branch, one feature c_i per scene, namely:
c_i = AvgPool({u_t | label_t = i}), i = 1, ..., s
so that the explicit branch outputs C = {c_1, ..., c_s}.
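A hedged PyTorch sketch of this explicit branch (scene-type embedding, linear dimension reduction, 2-layer 4-head Transformer, per-scene average pooling); the 2048-dimensional input, the 1024-dimensional Transformer width and the cap on the number of scenes are assumptions made for a runnable example.

```python
import torch
import torch.nn as nn

class ExplicitBranch(nn.Module):
    """Sketch of the explicit multi-faceted coding branch."""
    def __init__(self, in_dim=2048, d_model=1024, max_scenes=32):
        super().__init__()
        self.type_embed = nn.Embedding(max_scenes, in_dim)   # scene-type embedding
        self.proj = nn.Linear(in_dim, d_model)               # dimension reduction
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames, scene_ids):
        # frames: (j, in_dim) CNN frame features; scene_ids: (j,) long scene index per frame
        x = frames + self.type_embed(scene_ids)                    # type embedding
        x = self.encoder(self.proj(x).unsqueeze(0)).squeeze(0)     # inter-frame self-attention
        # average-pool the frames of each scene -> one feature c_i per scene
        scenes = [x[scene_ids == sid].mean(dim=0)
                  for sid in scene_ids.unique(sorted=True)]
        return torch.stack(scenes)                                 # (s, d_model)
```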
(2-2) Implicit multi-faceted coding. Fig. 2 is a schematic diagram of the implicit coding network. Since the features of each video frame have already been extracted with the pre-trained CNN model in step (1), their temporal information is encoded first. A bidirectional recurrent neural network is known to exploit both the past and the future context of a given sequence effectively, so it is used to model the video temporal information; specifically, a bidirectional LSTM (bi-LSTM) network is used. The bi-LSTM consists of two independent LSTM layers, a forward LSTM and a backward LSTM: the forward LSTM encodes the video frame features in their normal order, i.e. from front to back, while the backward LSTM encodes them in reverse order. Let h_t^f and h_t^b denote the respective hidden states at time step t = 1, 2, ..., j; two hidden states are generated:
h_t^f = LSTM_f(v_t, h_{t-1}^f)
h_t^b = LSTM_b(v_t, h_{t+1}^b)
where LSTM_f and LSTM_b denote the forward and backward LSTM, whose information from the previous time step is carried by h_{t-1}^f and h_{t+1}^b, respectively. Concatenating the two hidden states of the current time step, h_t^f and h_t^b, gives the output of the bidirectional LSTM at time t:
h_t = [h_t^f, h_t^b]
The hidden-state size of the forward and backward LSTM is set to 1024 dimensions, so h_t is 2048-dimensional. Stacking all outputs gives the feature map H = {h_1, h_2, ..., h_j} of dimension 2048 × j. This bi-LSTM-based encoding is denoted f_v and serves as the temporal (time-series) encoding feature. At the same time, the video global information feature q is obtained by applying a max-pooling operation on H along the row dimension, i.e.
q = max-pooling(h_1, h_2, ..., h_j)
where j is the number of video frames and h_t is the hidden state at time step t.
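A minimal PyTorch sketch of this temporal encoding, assuming 2048-dimensional CNN frame features and a 1024-dimensional hidden state per direction as stated above.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Sketch of the bi-LSTM temporal encoding of the implicit branch."""
    def __init__(self, in_dim=2048, hidden=1024):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):
        # frames: (1, j, in_dim) -- a single video for simplicity
        f_v, _ = self.bilstm(frames)      # (1, j, 2048): temporal features f_v (the map H)
        q = f_v.max(dim=1).values         # (1, 2048): global feature q (max pool over frames)
        return f_v, q
```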
Once the video global information feature q and the temporal features f_v have been obtained, the leading-feature attention encoding can be performed. As shown in Fig. 2, if n features are to be output for a video, n rounds of leading-feature attention encoding are required; the specific encoding steps are as follows:
Taking the i-th encoding as an example: the video global information feature q is concatenated with the video feature e_{i-1} output by the (i-1)-th encoding (for the first encoding, e_0 is an all-zero vector), and a fully connected layer Fc_i (parameters not shared across encodings) with a ReLU activation generates the global information guiding vector g_i, namely:
g_i = ReLU(Fc_i([q, e_{i-1}]))
g_i can be regarded as simultaneously carrying the global information of one scene within a certain time span of the video and the information of all scenes encoded before that span, and it can be combined with the video temporal features f_v to generate the feature of the corresponding video scene segment. First, a fully connected layer Fc_g (parameters shared across encodings) reduces the original 2048-dimensional feature g_i to the lower-dimensional feature g'_i; for the temporal features f_v, a fully connected layer Fc_v (parameters shared across encodings) likewise produces the reduced features f'_v. Then g'_i is added to every frame-level feature contained in f'_v, so that the frames of the targeted scene are highlighted and the frames irrelevant to the current scene are suppressed, namely:
g'_i = Fc_g(g_i)
f'_v = Fc_v(f_v)
a_i = f'_v ⊕ g'_i
where ⊕ denotes adding g'_i to each frame-level row of f'_v. Next, a tanh activation and a fully connected layer Fc_a (parameters shared across encodings) are applied to a_i to obtain the aggregation weights α_i = (α_{i,1}, ..., α_{i,j}) of all video frames. Multiplying the weight α_{i,t} of each frame with the initial temporal feature f_v^t of that frame and summing gives the output e_i of the i-th leading-feature attention encoding, namely:
α_i = Fc_a(tanh(a_i))
e_i = Σ_{t=1}^{j} α_{i,t} f_v^t
After n rounds of leading-feature attention encoding, the video multi-faceted representation of the implicit coding branch is obtained:
E = {e_1, e_2, ..., e_n}
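A PyTorch sketch of the leading-feature attention encoder described above; the hidden dimension of the reduction layers and the softmax normalisation of the aggregation weights are assumptions made so that the example runs end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeadingFeatureAttention(nn.Module):
    """Sketch of the leading-feature multi-attention encoder: n rounds of
    attention over the frame features f_v, each guided by the global feature q
    and the previous round's output e_{i-1}."""
    def __init__(self, dim=2048, hidden=512, n_facets=4):
        super().__init__()
        self.n = n_facets
        # Fc_i: one layer per round, parameters NOT shared across rounds
        self.fc_guide = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(n_facets))
        # shared dimension-reduction / scoring layers (Fc_g, Fc_v, Fc_a)
        self.fc_g = nn.Linear(dim, hidden)
        self.fc_v = nn.Linear(dim, hidden)
        self.fc_a = nn.Linear(hidden, 1)

    def forward(self, f_v, q):
        # f_v: (j, dim) temporal features; q: (dim,) global feature
        e_prev = torch.zeros_like(q)                               # e_0 is an all-zero vector
        facets = []
        for i in range(self.n):
            g = F.relu(self.fc_guide[i](torch.cat([q, e_prev])))   # g_i
            a = self.fc_v(f_v) + self.fc_g(g)                      # f'_v (+) g'_i per frame
            alpha = F.softmax(self.fc_a(torch.tanh(a)), dim=0)     # frame weights (softmax assumed)
            e_prev = (alpha * f_v).sum(dim=0)                      # e_i = sum_t alpha_{i,t} f_v^t
            facets.append(e_prev)
        return torch.stack(facets)                                 # E = {e_1, ..., e_n}
```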
(3) Interactive coding of the explicit and implicit branches. Since the feature dimension output by the implicit coding branch is 2048 while that of the explicit coding branch is 1024, the explicitly coded features are first mapped to 2048 dimensions by a fully connected layer, which facilitates the subsequent feature fusion, giving the mapped explicit features c'_k. For each implicitly coded feature e_i, its cosine similarity α_{i,k} with each mapped explicit feature c'_k is computed, and the c'_k are weighted and summed according to α_{i,k} to obtain the explicit scene feature s_i related to e_i; adding e_i and s_i gives the final video multi-faceted code m_i, namely:
α_{i,k} = cos(e_i, c'_k)
s_i = Σ_k α_{i,k} c'_k
m_i = s_i + e_i
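A sketch of this interactive coding module in PyTorch; the softmax normalisation of the cosine weights is an assumption, while the 1024-to-2048 mapping and the weighted sum follow the formulas above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveFusion(nn.Module):
    """Sketch of the interactive coding module: each implicit facet attends over
    the dimension-aligned explicit scene features via cosine similarity and the
    weighted sum is added back to the implicit facet."""
    def __init__(self, explicit_dim=1024, implicit_dim=2048):
        super().__init__()
        self.map_explicit = nn.Linear(explicit_dim, implicit_dim)  # 1024 -> 2048

    def forward(self, E, C):
        # E: (n, 2048) implicit facets e_i; C: (s, 1024) explicit scene features c_k
        C = self.map_explicit(C)                                             # c'_k
        alpha = F.cosine_similarity(E.unsqueeze(1), C.unsqueeze(0), dim=-1)  # (n, s)
        alpha = F.softmax(alpha, dim=1)                                      # assumed normalisation
        S = alpha @ C                                                        # s_i = sum_k alpha_{i,k} c'_k
        return S + E                                                         # m_i = s_i + e_i
```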
(4) For the text features {w_1, w_2, ..., w_l} extracted in (1-3), the existing BERT model is called; among the pre-trained models released for BERT, BERT-base, which has fewer parameters (12 Transformer layers with 12 attention heads each), is selected. The text features are not aggregated after BERT encoding and remain word-level features. Meanwhile, considering the characteristics of the BERT model, the first position ([CLS]) of the text features output by BERT contains the semantics of the whole sentence, so [CLS] is selected as the output feature of the text side.
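A short sketch of the text side using the Hugging Face transformers BERT implementation; the "bert-base-uncased" checkpoint is an assumption, since the embodiment only specifies BERT-base (12 layers, 12 heads) and the use of the [CLS] position.

```python
import torch
from transformers import BertModel, BertTokenizer

# checkpoint name is an assumption; the patent only specifies BERT-base
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "a man is playing guitar on stage"
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS]/[SEP] and position ids
with torch.no_grad():
    outputs = bert(**inputs)
word_level = outputs.last_hidden_state               # (1, seq_len, 768) word-level features
cls_feature = word_level[:, 0]                        # [CLS]: sentence-level text feature
```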
(5) Through steps (3) and (4), the multi-faceted representation features of the video and the encoded feature of the text have been obtained. Since there is no direct correspondence between them, they cannot be compared directly: to compute the similarity between the video features and the text feature, their feature vectors must be mapped into a unified common space. A common-space learning algorithm is therefore used to learn the correlation between the two modalities, and the model is trained in an end-to-end manner so that it automatically learns the relationship between the text and video modalities, realizing text-to-video cross-modal retrieval. The steps are as follows:
(5-1) Given the encoded multi-faceted video feature vectors m_i and the [CLS] feature vector of the sentence, they are mapped into the common space through fully connected (FC) layers. In addition, a Batch Normalization (BN) layer is used after each FC layer, which helps improve model performance. Finally, the set of video multi-segment feature vectors f(v) of video v and the sentence feature vector f(s) of sentence s in the common space are:
f(v) = {BN(W_v m_i + b_v) | i = 1, ..., n}
f(s) = BN(W_s [CLS] + b_s)
where W_v and W_s are the affine matrix parameters of the FC layers and b_v and b_s are the bias terms.
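A minimal sketch of the common-space mapping (an FC layer followed by batch normalisation for each modality); the 2048-/768-dimensional inputs follow the embodiment, while the 1536-dimensional common space is an assumption.

```python
import torch
import torch.nn as nn

class CommonSpace(nn.Module):
    """Sketch of the common-space mapping for both modalities."""
    def __init__(self, video_dim=2048, text_dim=768, common_dim=1536):
        super().__init__()
        self.video_fc = nn.Sequential(nn.Linear(video_dim, common_dim),
                                      nn.BatchNorm1d(common_dim))
        self.text_fc = nn.Sequential(nn.Linear(text_dim, common_dim),
                                     nn.BatchNorm1d(common_dim))

    def forward(self, M, cls_feature):
        # M: (n, video_dim) multi-faceted video codes m_i; cls_feature: (b, text_dim)
        f_v = self.video_fc(M)            # f(v): one common-space vector per facet
        f_s = self.text_fc(cls_feature)   # f(s)
        return f_v, f_s
```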
(5-2) The coding network parameters for video and text and the common-space learning network parameters are trained together in an end-to-end manner; only the parameters of the pre-trained image convolutional network used to extract the video features are kept fixed. All trainable parameters are denoted θ, and S_θ(v, s) denotes the similarity between video v and text s. Since f(v) is a set of video feature segments, there are multiple cosine similarities between f(v) and f(s), each representing the similarity between the text and one segment of the video; the maximum of these cosine similarities is taken as S_θ(v, s).
A triplet ranking loss (margin-based ranking loss) is used, which penalizes the model with the hardest negative samples. Specifically, the loss function L(v, s; θ) for a relevant video-sentence pair is defined as:
L(v, s; θ) = max(0, α + S_θ(v, s^-) - S_θ(v, s)) + max(0, α + S_θ(v^-, s) - S_θ(v, s))
where α is a margin constant, set to 0.2, and s^- and v^- denote a sentence irrelevant to video v and a video irrelevant to sentence s, respectively. These two negative samples are not randomly sampled; instead, the sentence and the video in the current mini-batch that the model predicts to be most similar but that are actually irrelevant are selected.
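A sketch of S_θ(v, s) and the hardest-negative triplet ranking loss; the convention that matching video-sentence pairs lie on the diagonal of the batch similarity matrix is an assumption of the example.

```python
import torch
import torch.nn.functional as F

def similarity(f_v, f_s):
    """S_theta(v, s): maximum cosine similarity between the sentence vector and
    the video's facet vectors. f_v: (n, d) facets of one video; f_s: (d,)."""
    return F.cosine_similarity(f_v, f_s.unsqueeze(0), dim=-1).max()

def triplet_ranking_loss(sim_matrix, margin=0.2):
    """Hardest-negative triplet ranking loss over a mini-batch, assuming
    sim_matrix[i, j] = S_theta(video_i, sentence_j) with matches on the diagonal."""
    pos = sim_matrix.diag()                                           # S_theta(v, s)
    mask = torch.eye(sim_matrix.size(0), dtype=torch.bool, device=sim_matrix.device)
    neg_s = sim_matrix.masked_fill(mask, -1e9).max(dim=1).values      # hardest s^- per video
    neg_v = sim_matrix.masked_fill(mask, -1e9).max(dim=0).values      # hardest v^- per sentence
    loss = (margin + neg_s - pos).clamp(min=0) + (margin + neg_v - pos).clamp(min=0)
    return loss.mean()
```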
(5-3) The model is trained in an end-to-end manner by minimizing the triplet ranking loss on the training set. An Adam-based mini-batch stochastic gradient descent optimization algorithm is adopted: the mini-batch size is set to 128, the initial learning rate to 0.0001, and the maximum number of training epochs to 50. During training, if the performance on the validation set does not improve for two consecutive epochs, the learning rate is divided by 2; if the performance on the validation set does not improve for 10 consecutive training epochs, training is stopped.
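A rough training-loop sketch matching this schedule; model, train_loader (batch size 128), build_sim_matrix and validate are hypothetical helpers named only for illustration, and triplet_ranking_loss is the sketch above.

```python
import torch

# model, train_loader, build_sim_matrix and validate are assumed helpers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

best, stale = 0.0, 0
for epoch in range(50):                              # at most 50 epochs
    for f_v, f_s in train_loader:                    # common-space features per mini-batch
        loss = triplet_ranking_loss(build_sim_matrix(f_v, f_s))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    score = validate(model)                          # retrieval performance on the validation set
    if score > best:
        best, stale = score, 0
    else:
        stale += 1
        if stale % 2 == 0:                           # no improvement for two consecutive epochs
            for group in optimizer.param_groups:
                group["lr"] /= 2                     # halve the learning rate
        if stale >= 10:                              # stop after 10 epochs without improvement
            break
```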
(6) After the training in step (5), the model has learned the connections between video and text. Given a text query, the model finds the relevant videos from the candidate video set and returns them as the retrieval result. The specific steps are:
(6-1) The given text query and all candidate videos are mapped into the common space by the trained model; the text s is represented as f(s) and the video v as f(v).
(6-2) The cosine similarities between the text query and all candidate videos are computed in the common space, taking for each video the maximum similarity over its multiple feature segments as in S_θ(v, s); all candidate videos are then sorted in descending order of similarity, and the top-ranked videos are returned as the retrieval result, realizing text-to-video cross-modal retrieval.
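A small sketch of the retrieval step itself, reusing the maximum-over-facets similarity:

```python
import torch
import torch.nn.functional as F

def retrieve(f_s, video_facets, top_k=10):
    """Rank candidate videos for one text query in the common space.
    f_s: (d,) query vector; video_facets: list of (n_i, d) facet tensors,
    one tensor per candidate video. Returns the indices of the top_k videos."""
    scores = torch.stack([
        F.cosine_similarity(facets, f_s.unsqueeze(0), dim=-1).max()  # S_theta(v, s)
        for facets in video_facets
    ])
    return scores.argsort(descending=True)[:top_k]
```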
The foregoing is merely a preferred embodiment of the present invention; although the invention has been disclosed in terms of preferred embodiments, they are not intended to limit it. Those skilled in the art can, using the methods and technical content disclosed above and without departing from the scope of the technical solution of the invention, make many possible variations and modifications to the technical solution, or modify it into equivalent embodiments with equivalent variations. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the invention, still falls within the protection scope of the technical solution of the invention.

Claims (9)

1. A text-to-video cross-modal retrieval method based on multi-face video representation learning is characterized by comprising the following steps:
(1) respectively performing feature pre-extraction on the text and the video to obtain the initial features of the two modalities, namely the text and the video;
(2) performing explicit multi-faceted representation coding on the initial video features obtained in step (1), which comprises: grouping the initial video frames according to different scenes using a video shot-segmentation tool, and feeding the grouped initial video features into an explicit coding branch for explicit coding, obtaining explicit multi-faceted representations of the different scenes of the video;
(3) performing implicit multi-faceted representation coding on the initial video features obtained in step (1), which comprises: feeding the initial video features into an implicit coding branch, where they are implicitly encoded by a leading-feature multi-attention network, obtaining implicit multi-faceted representations expressing different semantic contents of the video;
(4) interactively encoding the explicit multi-faceted representations obtained in step (2) and the implicit multi-faceted representations obtained in step (3) to obtain the multi-faceted video feature representation;
(5) encoding the initial text features in a parallel manner to obtain the text features;
(6) respectively mapping the multi-faceted video feature representation obtained in step (4) and the text features obtained in step (5) into a common space, learning the correlation between the two modalities with a common-space learning algorithm, and finally training the model in an end-to-end manner;
(7) realizing text-to-video cross-modal retrieval using the model trained in step (6).
2. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein the method for extracting video and text features in step (1) comprises:
(1-1) extracting visual features of an input video frame by using a pre-trained deep convolutional neural network to obtain initial video features;
(1-2) encoding each word of the text by its index in the vocabulary that comes with the BERT model, and using these indices as the initial text features.
3. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (2) is specifically:
(2-1) using the group index of each frame in the video, applying a group-type embedding to each frame so as to distinguish the scene it belongs to;
(2-2) feeding the type-embedded video frames into a Transformer to model the mutual information among the frames within each video group and among the video groups;
(2-3) aggregating the Transformer outputs group by group using the group index of each frame, obtaining the explicit multi-faceted representation after explicit coding.
4. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (3) is specifically:
(3-1) encoding the initial video features with a bidirectional LSTM to obtain the hidden state of the bidirectional LSTM at each time step, applying a max-pooling operation over the hidden states to obtain the global video feature, and meanwhile keeping the hidden state of each time step as the video temporal features;
(3-2) assuming the number of output segments set for a video is n, performing n rounds of leading-feature attention encoding on the video; in each round, a fully connected layer encodes the global video feature into a specific global attention guiding feature, corresponding weights are computed by combining the output feature of the previous round of leading-feature attention encoding, a weighted sum of the video temporal features is formed, and the feature of the current segment is output;
(3-3) taking the multiple output features obtained from the multiple rounds of leading-feature attention encoding on the video as the final implicit multi-faceted representation of the input video.
5. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 4, wherein in step (3-2), for the i-th encoding, the video global feature q and the video feature e_{i-1} output by the (i-1)-th encoding are concatenated, and a fully connected layer Fc_i with a ReLU activation is used to generate the global attention guiding feature g_i, namely:
g_i = ReLU(Fc_i([q, e_{i-1}]))
a fully connected layer Fc_g reduces the dimension of the global attention guiding feature g_i, giving the feature g'_i; a fully connected layer Fc_v reduces the dimension of the video temporal features f_v, giving features f'_v of the same dimension as g'_i; g'_i is then added to f'_v along the row (frame) dimension, namely:
g'_i = Fc_g(g_i)
f'_v = Fc_v(f_v)
a_i = f'_v ⊕ g'_i
a tanh activation and a fully connected layer Fc_a are applied to a_i to obtain the aggregation weight α_{i,t} of each video frame, and a weighted sum of the video temporal features f_v gives the i-th output e_i of the leading-feature attention encoding, namely:
α_i = Fc_a(tanh(a_i))
e_i = Σ_{t=1}^{j} α_{i,t} f_v^t
6. the method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (4) is specifically: calculating cosine similarity between each implicit characteristic output by the implicit coding branch and all explicit characteristics output by the explicit coding branch to obtain the weight of each explicit characteristic to the current implicit characteristic, carrying out weighted sum on all the explicit characteristics, and adding the weighted sum to the current implicit characteristic to obtain multi-face video characteristic representation.
7. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (5) is specifically: carrying out position coding on the text, and inputting the text into a BERT model to obtain the output of the text word level; and selecting the first position [ CLS ] of the text features output by the BERT model as the output features of the text end.
8. The method for text-to-video cross-modal search based on multi-faceted video representation learning according to claim 1, wherein in said step (6), the method for learning the correlation between two modalities and training the model by using the common spatial learning algorithm is as follows:
(6-1) mapping the multi-faceted video feature representation obtained in step (4) and the text feature obtained in step (5) into a unified common space through fully connected layers, with a batch normalization layer used after each fully connected layer;
(6-2) training the model in an end-to-end manner through a triplet ranking loss, so that the model automatically learns the correlation between the two modalities.
9. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (7) is specifically:
(7-1) mapping the input text query and all candidate videos into the common space through the trained model;
(7-2) calculating the similarities between the text query and all candidate videos in the common space, ranking the candidate videos according to the similarity, and returning the retrieval results.
CN202210425802.1A 2022-04-21 2022-04-21 Text-to-video cross-modal retrieval method based on multi-face video representation learning Pending CN114817627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210425802.1A CN114817627A (en) 2022-04-21 2022-04-21 Text-to-video cross-modal retrieval method based on multi-face video representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210425802.1A CN114817627A (en) 2022-04-21 2022-04-21 Text-to-video cross-modal retrieval method based on multi-face video representation learning

Publications (1)

Publication Number Publication Date
CN114817627A true CN114817627A (en) 2022-07-29

Family

ID=82505449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210425802.1A Pending CN114817627A (en) 2022-04-21 2022-04-21 Text-to-video cross-modal retrieval method based on multi-face video representation learning

Country Status (1)

Country Link
CN (1) CN114817627A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493608A (en) * 2023-12-26 2024-02-02 西安邮电大学 Text video retrieval method, system and computer storage medium
CN117493608B (en) * 2023-12-26 2024-04-12 西安邮电大学 Text video retrieval method, system and computer storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination