CN114817627A - Text-to-video cross-modal retrieval method based on multi-face video representation learning - Google Patents

Text-to-video cross-modal retrieval method based on multi-face video representation learning

Info

Publication number
CN114817627A
Authority
CN
China
Prior art keywords
video
text
features
coding
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210425802.1A
Other languages
Chinese (zh)
Inventor
董建锋
陈先客
王勋
刘宝龙
包翠竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202210425802.1A priority Critical patent/CN114817627A/en
Publication of CN114817627A publication Critical patent/CN114817627A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/785Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a text-to-video cross-modal retrieval method based on multi-faceted video representation learning, which comprises the following steps: acquiring preliminary video and text features; grouping the initial video frames according to their different scenes with a video shot-segmentation tool and feeding them into an explicit coding branch for explicit coding, obtaining explicit multi-faceted representations of the different scenes of the video; feeding the initial video features into an implicit coding branch, where a leading-feature multi-attention network encodes them implicitly, obtaining implicit multi-faceted representations expressing different semantic contents of the video; fusing the multi-faceted codes of the two branches to obtain the multi-faceted video feature representation; mapping the multi-faceted video feature representation and the text features into a common space, learning the correlation between the two modalities with a common-space learning algorithm, and training the model in an end-to-end manner, thereby realizing text-to-video cross-modal retrieval. By exploiting the idea of multi-faceted video representation, the method improves retrieval performance.

Description

Text-to-video cross-modal retrieval method based on multi-face video representation learning
Technical Field
The invention relates to the technical field of video cross-modal retrieval, in particular to a text-to-video cross-modal retrieval method based on multi-face video representation learning.
Background
In recent years, with the popularization of the Internet and mobile smart devices and the rapid development of communication and multimedia technologies, a huge amount of multimedia data is created and uploaded to the Internet every day. Data of different modalities, such as text, images and videos, is growing explosively, and multimedia data has become the main source of information for modern users. This is especially true for video data, as people increasingly upload and share videos they create themselves; how to quickly and accurately retrieve the videos a user wants from this mass of videos is a difficult challenge. Text-to-video cross-modal retrieval is one of the key technologies for addressing this challenge.
Existing text-to-video cross-modal retrieval assumes that the videos carry no text labels: the user describes the query intent with a natural-language sentence, and the retrieval model returns the videos most relevant to the query by computing the cross-modal relevance between the text and the videos. The core of this retrieval paradigm is therefore computing the text-video cross-modal relevance. The model structure of existing text-to-video cross-modal retrieval methods is mainly the efficient two-tower form: the video and the corresponding query text are encoded into a video vector and a text vector by their respective feature encoders and then mapped into a common space for representation learning. However, this conventional encoding scheme has the following drawback. Owing to the nature of videos and texts, a video may contain several different scenes as the camera moves or the viewpoint switches during shooting, while a query text may describe only part of the content of the corresponding video, i.e., the query text and the video are only partially relevant. If the video is represented as a single feature vector, the multi-scene information in the video may be blurred, making the video representation inaccurate and ultimately degrading the accuracy of the text-video retrieval results.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a text-to-video cross-modal retrieval method based on multi-face video representation learning.
The purpose of the invention is realized by the following technical scheme: a text-to-video cross-modal retrieval method based on multi-face video representation learning comprises the following steps:
(1) respectively performing feature pre-extraction on the text and the video to obtain the initial features of the two modalities, namely the text and the video;
(2) performing explicit multi-faceted representation coding on the initial video features obtained in step (1), which comprises: grouping the initial video frames according to different scenes using a video shot-segmentation tool, and feeding the grouped initial video features into an explicit coding branch for explicit coding, obtaining explicit multi-faceted representations of the different scenes of the video;
(3) performing implicit multi-faceted representation coding on the initial video features obtained in step (1), which comprises: feeding the initial video features into an implicit coding branch, where they are implicitly encoded by a leading-feature multi-attention network, obtaining implicit multi-faceted representations expressing different semantic contents of the video;
(4) interactively encoding the explicit multi-faceted representations obtained in step (2) and the implicit multi-faceted representations obtained in step (3) to obtain the multi-faceted video feature representation;
(5) encoding the initial text features in a parallel manner to obtain the text features;
(6) respectively mapping the multi-faceted video feature representation obtained in step (4) and the text features obtained in step (5) into a common space, learning the correlation between the two modalities with a common-space learning algorithm, and finally training the model in an end-to-end manner;
(7) realizing text-to-video cross-modal retrieval using the model trained in step (6).
Further, the method for extracting video and text features in step (1) comprises:
(1-1) extracting visual features of an input video frame by using a pre-trained deep convolutional neural network to obtain initial video features;
(1-2) encoding each word of the text by its index in the vocabulary that comes with the BERT model, and using these indices as the initial text features.
Further, in step (2), since the input of the explicit coding branch consists of video groups already segmented by the shot-segmentation tool, only the feature representation within and across these groups needs to be modeled, and the branch does not need to learn complex video segmentation itself; this step comprises the following substeps:
(2-1) using the group index of each frame in the video, applying a group-type embedding to each frame so as to distinguish the scene it belongs to;
(2-2) feeding the type-embedded video frames into a Transformer to model the mutual information among the frames within each video group and among the video groups;
(2-3) aggregating the Transformer outputs group by group using the group index of each frame, obtaining the explicit multi-faceted representation after explicit coding.
Further, the step (3) is specifically:
in the prior art, the global feature is usually obtained by simply performing maximum or average pooling operation on initial video features in the aspect of obtaining video global information, but the video is composed of image sequences and has a time sequence, so that it is very important to add time sequence information into the global feature. And (2) initially encoding the video by using the bi-directional LSTM (bi-LSTM) through the initial video characteristics obtained in the step (1) to obtain hidden states of the bi-directional LSTM at each moment, performing maximum pooling operation on the hidden states to obtain global video characteristics, and simultaneously keeping the hidden states at each moment as video time sequence characteristics.
For a video, it is desirable that the features encoded in the video represent different scenes of the video, which are distinguished from each other. Assuming that the number of output segments of one video setting is n, it is subjected to the leading feature attention coding n times. In each coding process, firstly, a full connection layer (Fc) is used for coding the video global characteristics obtained in the step to obtain specific global attention guide characteristics, corresponding weights are calculated by combining the output characteristics of the previous leading characteristic attention coding to carry out weighted sum on the video time sequence characteristics obtained in the step, and the current section characteristics are output.
And taking the multi-segment output features obtained by attention coding the multiple leading features of the video as the final implicit multi-surface representation of the input video.
Further, in step (3-2), for the i-th encoding, the video global feature q and the video feature e_{i-1} output by the (i-1)-th encoding are concatenated, and a fully connected layer Fc_i with a ReLU activation is used to generate the global attention guiding feature g_i, namely:
g_i = ReLU(Fc_i([q, e_{i-1}]))
A fully connected layer Fc_g reduces the dimension of the global attention guiding feature g_i, giving the feature g'_i; a fully connected layer Fc_v reduces the dimension of the video temporal features f_v, giving features f'_v of the same dimension as g'_i; g'_i is then added to f'_v along the row (frame) dimension, namely:
g'_i = Fc_g(g_i)
f'_v = Fc_v(f_v)
a_i = f'_v ⊕ g'_i
where ⊕ denotes adding g'_i to each frame-level row of f'_v. A tanh activation and a fully connected layer Fc_a are applied to a_i to obtain the aggregation weight α_{i,t} of each video frame, and a weighted sum of the video temporal features f_v gives the i-th output e_i of the leading-feature attention encoding, namely:
α_i = Fc_a(tanh(a_i))
e_i = Σ_{t=1}^{j} α_{i,t} f_v^t
further, the step (4) is specifically as follows: although the expicity coding branch has strong interpretability to the video multi-surface coding, the expicity coding branch is attached with strong subjective information, and unexplainable implicit coding which is learnt by the implicit coding branch is needed to complete information, so an interactive coding module is designed, each implicit characteristic output by the implicit coding branch and all explicit characteristics output by the explicit coding branch calculate cosine similarity, the weight of each explicit characteristic to the current implicit characteristic is obtained, all explicit characteristics are weighted and added with the current implicit characteristic, and multi-surface video characteristic representation is obtained.
Further, step (5) is specifically: for the initial text features obtained in step (1), position encoding (positional encoding) is first applied, and the features are fed into an existing, directly called BERT (Bidirectional Encoder Representations from Transformers) text encoding model to obtain word-level outputs. Meanwhile, given the characteristics of the BERT model, the head ([CLS]) of the text features output by BERT contains the semantics of the whole sentence, so [CLS] is selected as the output feature of the text side.
Further, in the step (6), the method for learning the correlation between the two modalities and training the model by using the common space learning algorithm is as follows:
(6-1) respectively mapping the multi-faceted video feature representation obtained in step (4) and the text feature obtained in step (5) into a unified common space through fully connected layers, with a Batch Normalization (BN) layer used after each fully connected layer;
(6-2) training the model in an end-to-end manner through a triplet ranking loss, so that the model automatically learns the correlation between the two modalities.
Further, the step (7) is specifically:
(7-1) mapping the input text query and all candidate videos into the common space through the trained model;
(7-2) calculating the similarities between the text query and all candidate videos in the common space; since each video outputs multiple features, the maximum similarity between the query text and all output features of the current video is selected as that video's similarity; the candidate videos are then ranked by similarity and the retrieval results are returned.
The invention has the beneficial effects that: the method represents a video from both an explicit and an implicit aspect. For the explicit aspect, an open-source video shot-segmentation tool is used to divide the video into several groups corresponding to different scenes, which are then encoded, achieving a multi-faceted representation. For the implicit aspect, a leading-feature multi-attention network is proposed, applying a video coding network with multi-segment feature output capability to video feature learning for the first time. After the explicit and implicit multi-faceted representations of the video are obtained, the two are interactively encoded by a coding network to obtain the final multi-segment video features; these and the corresponding text features are then mapped into a common space, their correlations in the common space are computed, and the maximum value is taken as the final text-video correlation, realizing text-to-video cross-modal retrieval. By exploiting the idea of multi-faceted video representation, the retrieval performance is greatly improved compared with a generic deep-learning video retrieval model.
Drawings
FIG. 1 is a schematic diagram of an explicit coding network structure of a multi-faceted representation of a video according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implicit coding network structure of a multi-faceted representation of a video according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a common space learning model based on multi-faceted video representation learning according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
In order to solve the problem of cross-modal retrieval from text to video, the invention provides a text-to-video cross-modal retrieval method based on multi-face video representation learning, which comprises the following specific steps in one embodiment:
(1) Features of the two modalities, video and text, are extracted with different feature extraction methods.
(1-1) For a given video, j video frames are uniformly extracted, one every 0.5 seconds, as specified in advance. Depth features are then extracted for each frame using a convolutional neural network (CNN) model pre-trained on the ImageNet dataset, such as a ResNet model. The video can thus be represented by a sequence of feature vectors {v_1, v_2, ..., v_t, ..., v_j}, where v_t denotes the feature vector of the t-th frame. In addition, when the pre-extracted frame feature vectors are fed to the model, the idea of data augmentation is applied and 20% of the frame feature vectors are randomly dropped, further improving the robustness of the model to different video features.
(1-2) While all frame features of a video are being extracted, a video shot-segmentation tool is used to group the initial frames. This embodiment uses the content-aware shot-detection algorithm of the online open-source shot-segmentation tool PySceneDetect. It first converts the RGB values of each video frame into HSV values (the HSV color space consists of hue, saturation and value; although more complex than RGB, it makes objects easier to track and segment), then computes the differences of the HSV values between all pixels of adjacent frames and averages them, and finally decides whether the two frames belong to the same scene by checking whether the computed average difference exceeds a preset threshold. After all video frames have been traversed, each frame has been assigned to its scene, and the scene-group indices are retained. The initial frames of the video are fed into the shot-segmentation tool with the inter-frame HSV difference threshold manually set to 27, yielding s scenes of the video, each scene owning the corresponding subset of frame features. A dictionary is created to store, as key-value pairs, the scene index corresponding to each frame: labels = {v_1: 1, v_2: 1, ..., v_j: s}.
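The content-aware split can be illustrated with a minimal sketch in Python, assuming the frames are available as OpenCV BGR arrays; in practice the PySceneDetect tool mentioned above can be called directly, and its content-aware detector implements a more refined, channel-weighted version of this simple threshold rule.

```python
import cv2
import numpy as np

def split_scenes(frames_bgr, threshold=27.0):
    """Group video frames into scenes by thresholding the mean HSV
    difference between adjacent frames (content-aware shot detection).
    frames_bgr: list of HxWx3 uint8 frames; returns one scene index
    (1-based) per frame."""
    scene_ids = [1]
    prev_hsv = cv2.cvtColor(frames_bgr[0], cv2.COLOR_BGR2HSV).astype(np.float32)
    for frame in frames_bgr[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
        # average absolute HSV difference over all pixels and channels
        diff = np.mean(np.abs(hsv - prev_hsv))
        if diff > threshold:                 # scene cut detected
            scene_ids.append(scene_ids[-1] + 1)
        else:
            scene_ids.append(scene_ids[-1])
        prev_hsv = hsv
    return scene_ids

# labels = {v_1: 1, v_2: 1, ..., v_j: s}, keyed here by frame index
# labels = {t: sid for t, sid in enumerate(split_scenes(frames))}
```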
(1-3) Given a sentence of length l, since the BERT model comes with its own vocabulary, this embodiment encodes each word by its index in that vocabulary. A sequence of vocabulary-index codes {w_1, w_2, ..., w_t, ..., w_l} is thus generated, where w_t denotes the index of the t-th word in the vocabulary. The initial features of the text are thereby extracted.
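A short sketch of this vocabulary-index encoding with the Hugging Face transformers tokenizer; the particular checkpoint name "bert-base-uncased" is an assumption, since the embodiment only requires the vocabulary that ships with BERT.

```python
from transformers import BertTokenizer

# checkpoint name is an assumption; the patent only requires BERT's own vocabulary
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "a man is playing guitar on stage"
# special tokens are added so that the downstream BERT encoder can later use the
# first position ([CLS]) as the sentence-level feature, as described in step (4)
word_indices = tokenizer.encode(sentence, add_special_tokens=True)
print(word_indices)   # e.g. [101, 1037, 2158, ..., 102] -- the sequence {w_1, ..., w_l}
```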
Through the feature extraction above, the initial features of the video and of the text are obtained. These features are merely extracted by the CNN model and vocabulary indexing plus some simple pre-processing; the multi-faceted video feature representation is produced next by the multi-faceted video coding network.
(2) The multi-faceted video coding network performs explicit and implicit multi-faceted coding on the video visual features obtained in step (1), and then interactively encodes the explicit and implicit multi-faceted representations to obtain the multi-faceted video feature representation. The explicit and implicit multi-faceted coding steps are as follows:
(2-1) Explicit multi-faceted coding. Fig. 1 is a schematic diagram of the explicit coding network. For the input video frames {v_1, v_2, ..., v_j} of the explicit coding branch, a type embedding (type-embedding) is first added to each frame feature according to its saved scene index, so as to distinguish the scene each frame belongs to, namely:
v'_t = v_t + TypeEmbed(label_t), t = 1, ..., j
Second, before the video features are fed into the Transformer, since prior work has reported that features perform better when encoded by a Transformer network at dimension 768 or 1024, the features are first reduced in dimension by a linear layer, which also reduces the number of Transformer parameters; the Transformer network used here is configured with 2 layers of 4-head self-attention, namely:
{u_1, ..., u_j} = Transformer(Linear({v'_1, ..., v'_j}))
Then the frames corresponding to each scene are aggregated by average pooling, giving the video multi-faceted representation of the explicit coding branch, one feature c_i per scene, namely:
c_i = AvgPool({u_t | label_t = i}), i = 1, ..., s
so that the explicit branch outputs C = {c_1, ..., c_s}.
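A hedged PyTorch sketch of this explicit branch (scene-type embedding, linear dimension reduction, 2-layer 4-head Transformer, per-scene average pooling); the 2048-dimensional input, the 1024-dimensional Transformer width and the cap on the number of scenes are assumptions made for a runnable example.

```python
import torch
import torch.nn as nn

class ExplicitBranch(nn.Module):
    """Sketch of the explicit multi-faceted coding branch."""
    def __init__(self, in_dim=2048, d_model=1024, max_scenes=32):
        super().__init__()
        self.type_embed = nn.Embedding(max_scenes, in_dim)   # scene-type embedding
        self.proj = nn.Linear(in_dim, d_model)               # dimension reduction
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames, scene_ids):
        # frames: (j, in_dim) CNN frame features; scene_ids: (j,) long scene index per frame
        x = frames + self.type_embed(scene_ids)                    # type embedding
        x = self.encoder(self.proj(x).unsqueeze(0)).squeeze(0)     # inter-frame self-attention
        # average-pool the frames of each scene -> one feature c_i per scene
        scenes = [x[scene_ids == sid].mean(dim=0)
                  for sid in scene_ids.unique(sorted=True)]
        return torch.stack(scenes)                                 # (s, d_model)
```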
(2-2) Implicit multi-faceted coding. Fig. 2 is a schematic diagram of the implicit coding network. Since the features of each video frame have already been extracted with the pre-trained CNN model in step (1), their temporal information is encoded first. A bidirectional recurrent neural network is known to exploit both the past and the future context of a given sequence effectively, so it is used to model the video temporal information; specifically, a bidirectional LSTM (bi-LSTM) network is used. The bi-LSTM consists of two independent LSTM layers, a forward LSTM and a backward LSTM: the forward LSTM encodes the video frame features in their normal order, i.e. from front to back, while the backward LSTM encodes them in reverse order. Let h_t^f and h_t^b denote the respective hidden states at time step t = 1, 2, ..., j; two hidden states are generated:
h_t^f = LSTM_f(v_t, h_{t-1}^f)
h_t^b = LSTM_b(v_t, h_{t+1}^b)
where LSTM_f and LSTM_b denote the forward and backward LSTM, whose information from the previous time step is carried by h_{t-1}^f and h_{t+1}^b, respectively. Concatenating the two hidden states of the current time step, h_t^f and h_t^b, gives the output of the bidirectional LSTM at time t:
h_t = [h_t^f, h_t^b]
The hidden-state size of the forward and backward LSTM is set to 1024 dimensions, so h_t is 2048-dimensional. Stacking all outputs gives the feature map H = {h_1, h_2, ..., h_j} of dimension 2048 × j. This bi-LSTM-based encoding is denoted f_v and serves as the temporal (time-series) encoding feature. At the same time, the video global information feature q is obtained by applying a max-pooling operation on H along the row dimension, i.e.
q = max-pooling(h_1, h_2, ..., h_j)
where j is the number of video frames and h_t is the hidden state at time step t.
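A minimal PyTorch sketch of this temporal encoding, assuming 2048-dimensional CNN frame features and a 1024-dimensional hidden state per direction as stated above.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Sketch of the bi-LSTM temporal encoding of the implicit branch."""
    def __init__(self, in_dim=2048, hidden=1024):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):
        # frames: (1, j, in_dim) -- a single video for simplicity
        f_v, _ = self.bilstm(frames)      # (1, j, 2048): temporal features f_v (the map H)
        q = f_v.max(dim=1).values         # (1, 2048): global feature q (max pool over frames)
        return f_v, q
```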
Once the video global information feature q and the temporal features f_v have been obtained, the leading-feature attention encoding can be performed. As shown in Fig. 2, if n features are to be output for a video, n rounds of leading-feature attention encoding are required; the specific encoding steps are as follows:
Taking the i-th encoding as an example: the video global information feature q is concatenated with the video feature e_{i-1} output by the (i-1)-th encoding (for the first encoding, e_0 is an all-zero vector), and a fully connected layer Fc_i (parameters not shared across encodings) with a ReLU activation generates the global information guiding vector g_i, namely:
g_i = ReLU(Fc_i([q, e_{i-1}]))
g_i can be regarded as simultaneously carrying the global information of one scene within a certain time span of the video and the information of all scenes encoded before that span, and it can be combined with the video temporal features f_v to generate the feature of the corresponding video scene segment. First, a fully connected layer Fc_g (parameters shared across encodings) reduces the original 2048-dimensional feature g_i to the lower-dimensional feature g'_i; for the temporal features f_v, a fully connected layer Fc_v (parameters shared across encodings) likewise produces the reduced features f'_v. Then g'_i is added to every frame-level feature contained in f'_v, so that the frames of the targeted scene are highlighted and the frames irrelevant to the current scene are suppressed, namely:
g'_i = Fc_g(g_i)
f'_v = Fc_v(f_v)
a_i = f'_v ⊕ g'_i
where ⊕ denotes adding g'_i to each frame-level row of f'_v. Next, a tanh activation and a fully connected layer Fc_a (parameters shared across encodings) are applied to a_i to obtain the aggregation weights α_i = (α_{i,1}, ..., α_{i,j}) of all video frames. Multiplying the weight α_{i,t} of each frame with the initial temporal feature f_v^t of that frame and summing gives the output e_i of the i-th leading-feature attention encoding, namely:
α_i = Fc_a(tanh(a_i))
e_i = Σ_{t=1}^{j} α_{i,t} f_v^t
After n rounds of leading-feature attention encoding, the video multi-faceted representation of the implicit coding branch is obtained:
E = {e_1, e_2, ..., e_n}
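A PyTorch sketch of the leading-feature attention encoder described above; the hidden dimension of the reduction layers and the softmax normalisation of the aggregation weights are assumptions made so that the example runs end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeadingFeatureAttention(nn.Module):
    """Sketch of the leading-feature multi-attention encoder: n rounds of
    attention over the frame features f_v, each guided by the global feature q
    and the previous round's output e_{i-1}."""
    def __init__(self, dim=2048, hidden=512, n_facets=4):
        super().__init__()
        self.n = n_facets
        # Fc_i: one layer per round, parameters NOT shared across rounds
        self.fc_guide = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(n_facets))
        # shared dimension-reduction / scoring layers (Fc_g, Fc_v, Fc_a)
        self.fc_g = nn.Linear(dim, hidden)
        self.fc_v = nn.Linear(dim, hidden)
        self.fc_a = nn.Linear(hidden, 1)

    def forward(self, f_v, q):
        # f_v: (j, dim) temporal features; q: (dim,) global feature
        e_prev = torch.zeros_like(q)                               # e_0 is an all-zero vector
        facets = []
        for i in range(self.n):
            g = F.relu(self.fc_guide[i](torch.cat([q, e_prev])))   # g_i
            a = self.fc_v(f_v) + self.fc_g(g)                      # f'_v (+) g'_i per frame
            alpha = F.softmax(self.fc_a(torch.tanh(a)), dim=0)     # frame weights (softmax assumed)
            e_prev = (alpha * f_v).sum(dim=0)                      # e_i = sum_t alpha_{i,t} f_v^t
            facets.append(e_prev)
        return torch.stack(facets)                                 # E = {e_1, ..., e_n}
```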
(3) Interactive coding of the explicit and implicit branches. Since the feature dimension output by the implicit coding branch is 2048 while that of the explicit coding branch is 1024, the explicitly coded features are first mapped to 2048 dimensions by a fully connected layer, which facilitates the subsequent feature fusion, giving the mapped explicit features c'_k. For each implicitly coded feature e_i, its cosine similarity α_{i,k} with each mapped explicit feature c'_k is computed, and the c'_k are weighted and summed according to α_{i,k} to obtain the explicit scene feature s_i related to e_i; adding e_i and s_i gives the final video multi-faceted code m_i, namely:
α_{i,k} = cos(e_i, c'_k)
s_i = Σ_k α_{i,k} c'_k
m_i = s_i + e_i
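A sketch of this interactive coding module in PyTorch; the softmax normalisation of the cosine weights is an assumption, while the 1024-to-2048 mapping and the weighted sum follow the formulas above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveFusion(nn.Module):
    """Sketch of the interactive coding module: each implicit facet attends over
    the dimension-aligned explicit scene features via cosine similarity and the
    weighted sum is added back to the implicit facet."""
    def __init__(self, explicit_dim=1024, implicit_dim=2048):
        super().__init__()
        self.map_explicit = nn.Linear(explicit_dim, implicit_dim)  # 1024 -> 2048

    def forward(self, E, C):
        # E: (n, 2048) implicit facets e_i; C: (s, 1024) explicit scene features c_k
        C = self.map_explicit(C)                                             # c'_k
        alpha = F.cosine_similarity(E.unsqueeze(1), C.unsqueeze(0), dim=-1)  # (n, s)
        alpha = F.softmax(alpha, dim=1)                                      # assumed normalisation
        S = alpha @ C                                                        # s_i = sum_k alpha_{i,k} c'_k
        return S + E                                                         # m_i = s_i + e_i
```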
(4) For the text features {w_1, w_2, ..., w_l} extracted in (1-3), the existing BERT model is called; among the pre-trained models released for BERT, BERT-base, which has fewer parameters (12 Transformer layers with 12 attention heads each), is selected. The text features are not aggregated after BERT encoding and remain word-level features. Meanwhile, considering the characteristics of the BERT model, the first position ([CLS]) of the text features output by BERT contains the semantics of the whole sentence, so [CLS] is selected as the output feature of the text side.
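A short sketch of the text side using the Hugging Face transformers BERT implementation; the "bert-base-uncased" checkpoint is an assumption, since the embodiment only specifies BERT-base (12 layers, 12 heads) and the use of the [CLS] position.

```python
import torch
from transformers import BertModel, BertTokenizer

# checkpoint name is an assumption; the patent only specifies BERT-base
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "a man is playing guitar on stage"
inputs = tokenizer(sentence, return_tensors="pt")   # adds [CLS]/[SEP] and position ids
with torch.no_grad():
    outputs = bert(**inputs)
word_level = outputs.last_hidden_state               # (1, seq_len, 768) word-level features
cls_feature = word_level[:, 0]                        # [CLS]: sentence-level text feature
```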
(5) Through steps (3) and (4), the multi-faceted representation features of the video and the encoded feature of the text have been obtained. Since there is no direct correspondence between them, they cannot be compared directly: to compute the similarity between the video features and the text feature, their feature vectors must be mapped into a unified common space. A common-space learning algorithm is therefore used to learn the correlation between the two modalities, and the model is trained in an end-to-end manner so that it automatically learns the relationship between the text and video modalities, realizing text-to-video cross-modal retrieval. The steps are as follows:
(5-1) Given the encoded multi-faceted video feature vectors m_i and the [CLS] feature vector of the sentence, they are mapped into the common space through fully connected (FC) layers. In addition, a Batch Normalization (BN) layer is used after each FC layer, which helps improve model performance. Finally, the set of video multi-segment feature vectors f(v) of video v and the sentence feature vector f(s) of sentence s in the common space are:
f(v) = {BN(W_v m_i + b_v) | i = 1, ..., n}
f(s) = BN(W_s [CLS] + b_s)
where W_v and W_s are the affine matrix parameters of the FC layers and b_v and b_s are the bias terms.
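A minimal sketch of the common-space mapping (an FC layer followed by batch normalisation for each modality); the 2048-/768-dimensional inputs follow the embodiment, while the 1536-dimensional common space is an assumption.

```python
import torch
import torch.nn as nn

class CommonSpace(nn.Module):
    """Sketch of the common-space mapping for both modalities."""
    def __init__(self, video_dim=2048, text_dim=768, common_dim=1536):
        super().__init__()
        self.video_fc = nn.Sequential(nn.Linear(video_dim, common_dim),
                                      nn.BatchNorm1d(common_dim))
        self.text_fc = nn.Sequential(nn.Linear(text_dim, common_dim),
                                     nn.BatchNorm1d(common_dim))

    def forward(self, M, cls_feature):
        # M: (n, video_dim) multi-faceted video codes m_i; cls_feature: (b, text_dim)
        f_v = self.video_fc(M)            # f(v): one common-space vector per facet
        f_s = self.text_fc(cls_feature)   # f(s)
        return f_v, f_s
```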
(5-2) The coding network parameters for video and text and the common-space learning network parameters are trained together in an end-to-end manner; only the parameters of the pre-trained image convolutional network used to extract the video features are kept fixed. All trainable parameters are denoted θ, and S_θ(v, s) denotes the similarity between video v and text s. Since f(v) is a set of video feature segments, there are multiple cosine similarities between f(v) and f(s), each representing the similarity between the text and one segment of the video; the maximum of these cosine similarities is taken as S_θ(v, s).
A triplet ranking loss (margin-based ranking loss) is used, which penalizes the model with the hardest negative samples. Specifically, the loss function L(v, s; θ) for a relevant video-sentence pair is defined as:
L(v, s; θ) = max(0, α + S_θ(v, s^-) - S_θ(v, s)) + max(0, α + S_θ(v^-, s) - S_θ(v, s))
where α is a margin constant, set to 0.2, and s^- and v^- denote a sentence irrelevant to video v and a video irrelevant to sentence s, respectively. These two negative samples are not randomly sampled; instead, the sentence and the video in the current mini-batch that the model predicts to be most similar but that are actually irrelevant are selected.
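A sketch of S_θ(v, s) and the hardest-negative triplet ranking loss; the convention that matching video-sentence pairs lie on the diagonal of the batch similarity matrix is an assumption of the example.

```python
import torch
import torch.nn.functional as F

def similarity(f_v, f_s):
    """S_theta(v, s): maximum cosine similarity between the sentence vector and
    the video's facet vectors. f_v: (n, d) facets of one video; f_s: (d,)."""
    return F.cosine_similarity(f_v, f_s.unsqueeze(0), dim=-1).max()

def triplet_ranking_loss(sim_matrix, margin=0.2):
    """Hardest-negative triplet ranking loss over a mini-batch, assuming
    sim_matrix[i, j] = S_theta(video_i, sentence_j) with matches on the diagonal."""
    pos = sim_matrix.diag()                                           # S_theta(v, s)
    mask = torch.eye(sim_matrix.size(0), dtype=torch.bool, device=sim_matrix.device)
    neg_s = sim_matrix.masked_fill(mask, -1e9).max(dim=1).values      # hardest s^- per video
    neg_v = sim_matrix.masked_fill(mask, -1e9).max(dim=0).values      # hardest v^- per sentence
    loss = (margin + neg_s - pos).clamp(min=0) + (margin + neg_v - pos).clamp(min=0)
    return loss.mean()
```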
(5-3) The model is trained in an end-to-end manner by minimizing the triplet ranking loss on the training set. An Adam-based mini-batch stochastic gradient descent optimization algorithm is adopted: the mini-batch size is set to 128, the initial learning rate to 0.0001, and the maximum number of training epochs to 50. During training, if the performance on the validation set does not improve for two consecutive epochs, the learning rate is divided by 2; if the performance on the validation set does not improve for 10 consecutive training epochs, training is stopped.
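A rough training-loop sketch matching this schedule; model, train_loader (batch size 128), build_sim_matrix and validate are hypothetical helpers named only for illustration, and triplet_ranking_loss is the sketch above.

```python
import torch

# model, train_loader, build_sim_matrix and validate are assumed helpers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

best, stale = 0.0, 0
for epoch in range(50):                              # at most 50 epochs
    for f_v, f_s in train_loader:                    # common-space features per mini-batch
        loss = triplet_ranking_loss(build_sim_matrix(f_v, f_s))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    score = validate(model)                          # retrieval performance on the validation set
    if score > best:
        best, stale = score, 0
    else:
        stale += 1
        if stale % 2 == 0:                           # no improvement for two consecutive epochs
            for group in optimizer.param_groups:
                group["lr"] /= 2                     # halve the learning rate
        if stale >= 10:                              # stop after 10 epochs without improvement
            break
```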
(6) After the training in step (5), the model has learned the connections between video and text. Given a text query, the model finds the relevant videos from the candidate video set and returns them as the retrieval result. The specific steps are:
(6-1) The given text query and all candidate videos are mapped into the common space by the trained model; the text s is represented as f(s) and the video v as f(v).
(6-2) The cosine similarities between the text query and all candidate videos are computed in the common space, taking for each video the maximum similarity over its multiple feature segments as in S_θ(v, s); all candidate videos are then sorted in descending order of similarity, and the top-ranked videos are returned as the retrieval result, realizing text-to-video cross-modal retrieval.
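A small sketch of the retrieval step itself, reusing the maximum-over-facets similarity:

```python
import torch
import torch.nn.functional as F

def retrieve(f_s, video_facets, top_k=10):
    """Rank candidate videos for one text query in the common space.
    f_s: (d,) query vector; video_facets: list of (n_i, d) facet tensors,
    one tensor per candidate video. Returns the indices of the top_k videos."""
    scores = torch.stack([
        F.cosine_similarity(facets, f_s.unsqueeze(0), dim=-1).max()  # S_theta(v, s)
        for facets in video_facets
    ])
    return scores.argsort(descending=True)[:top_k]
```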
The foregoing is merely a preferred embodiment of the present invention; although the invention has been disclosed in terms of preferred embodiments, they are not intended to limit it. Those skilled in the art can, using the methods and technical content disclosed above and without departing from the scope of the technical solution of the invention, make many possible variations and modifications to the technical solution, or modify it into equivalent embodiments with equivalent variations. Therefore, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the invention, still falls within the protection scope of the technical solution of the invention.

Claims (9)

1. A text-to-video cross-modal retrieval method based on multi-face video representation learning is characterized by comprising the following steps:
(1) respectively performing feature pre-extraction on the text and the video to obtain the initial features of the two modalities, namely the text and the video;
(2) performing explicit multi-faceted representation coding on the initial video features obtained in step (1), which comprises: grouping the initial video frames according to different scenes using a video shot-segmentation tool, and feeding the grouped initial video features into an explicit coding branch for explicit coding, obtaining explicit multi-faceted representations of the different scenes of the video;
(3) performing implicit multi-faceted representation coding on the initial video features obtained in step (1), which comprises: feeding the initial video features into an implicit coding branch, where they are implicitly encoded by a leading-feature multi-attention network, obtaining implicit multi-faceted representations expressing different semantic contents of the video;
(4) interactively encoding the explicit multi-faceted representations obtained in step (2) and the implicit multi-faceted representations obtained in step (3) to obtain the multi-faceted video feature representation;
(5) encoding the initial text features in a parallel manner to obtain the text features;
(6) respectively mapping the multi-faceted video feature representation obtained in step (4) and the text features obtained in step (5) into a common space, learning the correlation between the two modalities with a common-space learning algorithm, and finally training the model in an end-to-end manner;
(7) realizing text-to-video cross-modal retrieval using the model trained in step (6).
2. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein the method for extracting video and text features in step (1) comprises:
(1-1) extracting visual features of an input video frame by using a pre-trained deep convolutional neural network to obtain initial video features;
(1-2) encoding each word of the text by its index in the vocabulary that comes with the BERT model, and using these indices as the initial text features.
3. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (2) is specifically:
(2-1) using the group index of each frame in the video, applying a group-type embedding to each frame so as to distinguish the scene it belongs to;
(2-2) feeding the type-embedded video frames into a Transformer to model the mutual information among the frames within each video group and among the video groups;
(2-3) aggregating the Transformer outputs group by group using the group index of each frame, obtaining the explicit multi-faceted representation after explicit coding.
4. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (3) is specifically:
(3-1) encoding the initial video features with a bidirectional LSTM to obtain the hidden state of the bidirectional LSTM at each time step, applying a max-pooling operation over the hidden states to obtain the global video feature, and meanwhile keeping the hidden state of each time step as the video temporal features;
(3-2) assuming the number of output segments set for a video is n, performing n rounds of leading-feature attention encoding on the video; in each round, a fully connected layer encodes the global video feature into a specific global attention guiding feature, corresponding weights are computed by combining the output feature of the previous round of leading-feature attention encoding, a weighted sum of the video temporal features is formed, and the feature of the current segment is output;
(3-3) taking the multiple output features obtained from the multiple rounds of leading-feature attention encoding on the video as the final implicit multi-faceted representation of the input video.
5. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 4, wherein in step (3-2), for the i-th encoding, the video global feature q and the video feature e_{i-1} output by the (i-1)-th encoding are concatenated, and a fully connected layer Fc_i with a ReLU activation is used to generate the global attention guiding feature g_i, namely:
g_i = ReLU(Fc_i([q, e_{i-1}]))
a fully connected layer Fc_g reduces the dimension of the global attention guiding feature g_i, giving the feature g'_i; a fully connected layer Fc_v reduces the dimension of the video temporal features f_v, giving features f'_v of the same dimension as g'_i; g'_i is then added to f'_v along the row (frame) dimension, namely:
g'_i = Fc_g(g_i)
f'_v = Fc_v(f_v)
a_i = f'_v ⊕ g'_i
a tanh activation and a fully connected layer Fc_a are applied to a_i to obtain the aggregation weight α_{i,t} of each video frame, and a weighted sum of the video temporal features f_v gives the i-th output e_i of the leading-feature attention encoding, namely:
α_i = Fc_a(tanh(a_i))
e_i = Σ_{t=1}^{j} α_{i,t} f_v^t
6. the method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (4) is specifically: calculating cosine similarity between each implicit characteristic output by the implicit coding branch and all explicit characteristics output by the explicit coding branch to obtain the weight of each explicit characteristic to the current implicit characteristic, carrying out weighted sum on all the explicit characteristics, and adding the weighted sum to the current implicit characteristic to obtain multi-face video characteristic representation.
7. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (5) is specifically: carrying out position coding on the text, and inputting the text into a BERT model to obtain the output of the text word level; and selecting the first position [ CLS ] of the text features output by the BERT model as the output features of the text end.
8. The method for text-to-video cross-modal search based on multi-faceted video representation learning according to claim 1, wherein in said step (6), the method for learning the correlation between two modalities and training the model by using the common spatial learning algorithm is as follows:
(6-1) mapping the multi-faceted video feature representation obtained in step (4) and the text feature obtained in step (5) into a unified common space through fully connected layers, with a batch normalization layer used after each fully connected layer;
(6-2) training the model in an end-to-end manner through a triplet ranking loss, so that the model automatically learns the correlation between the two modalities.
9. The method for text-to-video cross-modal retrieval based on multi-faceted video representation learning according to claim 1, wherein said step (7) is specifically:
(7-1) mapping the input text query and all candidate videos into the common space through the trained model;
(7-2) calculating the similarities between the text query and all candidate videos in the common space, ranking the candidate videos according to the similarity, and returning the retrieval results.
CN202210425802.1A 2022-04-21 2022-04-21 Text-to-video cross-modal retrieval method based on multi-face video representation learning Pending CN114817627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210425802.1A CN114817627A (en) 2022-04-21 2022-04-21 Text-to-video cross-modal retrieval method based on multi-face video representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210425802.1A CN114817627A (en) 2022-04-21 2022-04-21 Text-to-video cross-modal retrieval method based on multi-face video representation learning

Publications (1)

Publication Number Publication Date
CN114817627A true CN114817627A (en) 2022-07-29

Family

ID=82505449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210425802.1A Pending CN114817627A (en) 2022-04-21 2022-04-21 Text-to-video cross-modal retrieval method based on multi-face video representation learning

Country Status (1)

Country Link
CN (1) CN114817627A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117493608A (en) * 2023-12-26 2024-02-02 西安邮电大学 Text video retrieval method, system and computer storage medium
CN117493608B (en) * 2023-12-26 2024-04-12 西安邮电大学 Text video retrieval method, system and computer storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination