CN114612748A - Cross-modal video clip retrieval method based on feature decoupling - Google Patents

Cross-modal video clip retrieval method based on feature decoupling

Info

Publication number
CN114612748A
CN114612748A (application CN202210296716.5A)
Authority
CN
China
Prior art keywords: video, feature, matrix, adjacent, decoupling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210296716.5A
Other languages
Chinese (zh)
Inventor
杨金福
刘玉斌
闫雪
宋琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210296716.5A priority Critical patent/CN114612748A/en
Publication of CN114612748A publication Critical patent/CN114612748A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses a cross-modal video clip retrieval method based on feature decoupling, which relates to the field of cross-modal video clip retrieval and comprises the following steps: first, video features are extracted with a three-dimensional convolutional neural network (C3D) model, and query-text features are extracted with an LSTM network; then, an adjacent feature matrix is constructed from the video features and decoupled by encoders into a content feature matrix and a position feature matrix; next, the expression of the video content features is enhanced and the different position features are weighted, thereby reducing the influence of the training set's long-tail distribution effect on the model; the adjacent feature matrix is then reconstructed to learn the context information of the video; finally, the reconstructed adjacent feature matrix is fused with the text features and fed into a fully convolutional neural network to generate the retrieval result. The model uses a binary cross-entropy focal loss as the retrieval loss function and is trained with the back-propagation algorithm.

Description

Cross-modal video clip retrieval method based on feature decoupling
Technical Field
The invention relates to the field of cross-modal video clip retrieval, in particular to a cross-modal video clip retrieval method based on feature decoupling.
Background Art
Video segment retrieval based on a query text is an important research direction in the field of cross-modal video retrieval. Given a query text and a video as input, it aims to find the segment of the video that matches the text description, and it is widely applied in search engines, intelligent security and recommendation systems. However, the training sets required for this task, such as ActivityNet Captions and Charades-STA, suffer from imbalanced text annotation and exhibit a long-tail distribution effect. For example, in the training set, texts containing "turn on" and "open" are mainly annotated on the head region of the video, while texts containing "turn off" and "close" are mainly annotated on the tail region. This imbalanced annotation makes the model tend to learn a mapping between the text and the video position during training; for a query text containing "turn on", for instance, the model tends to predict the head of the video, i.e. it overfits to predictions at the video head in order to obtain higher training accuracy. During testing, because the labels of the test set are usually balanced, a model that ignores the long-tail distribution effect often fails to obtain good retrieval results.
Existing cross-modal video clip retrieval methods do not consider the influence of the long-tail distribution effect of the training set on the learner, which limits the quality of their retrieval results. For example, R. Ge, J. Gao, K. Chen, and R. Nevatia, "MAC: Mining Activity Concepts for Language-based Temporal Localization", in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Hawaii, 2019, directly fuses video features with text features and computes the retrieval results with a multi-layer perceptron; similarly, S. Zhang, H. Peng, J. Fu, and J. Luo, "Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language", in Proceedings of the AAAI Conference on Artificial Intelligence, New York, 2020, 12870-12877, directly fuses the video features and the text features and generates the retrieval result by convolution operations. Although these methods complete the retrieval task, they are affected by the long-tail distribution effect of the training set, so the learner learns the mapping between the text and the video position rather than the cross-modal matching between the text and the video content; as a result, they generalize poorly and perform badly in practical applications.
Disclosure of Invention
Aiming at the problem that the data set used in the training stage exhibits a long-tail distribution effect, the invention provides a cross-modal video segment retrieval method based on feature decoupling so as to improve the retrieval quality of the model.
A cross-modal video clip retrieval method based on feature decoupling comprises the following steps:
step S1, splitting the video into groups with a fixed number of frames, extracting video features with a three-dimensional convolutional neural network, and then dividing the video features in equal proportion into N temporally non-overlapping candidate video segment features to obtain a candidate video segment feature set;
step S2, extracting the features of the query text with a long short-term memory (LSTM) network;
step S3, expanding the candidate video segment feature set obtained in step S1 by max-pooling operations, and then arranging the features in the set according to their start and end times to construct an adjacent feature matrix;
step S4, decoupling the adjacent feature matrix into a content feature matrix and a position feature matrix with a content encoder and a position encoder, reducing the long-tail distribution effect caused by the data set by weighting the different position features and enhancing the expression of the video content features, and then reconstructing the adjacent feature matrix by matrix addition to learn the context information of the video;
step S5, fusing the video and text features by a matrix dot-product operation, feeding the result into a fully convolutional neural network to compute the scores of the candidate video segments, generating the retrieval result, and computing the retrieval loss of the model.
Splitting the video into groups with a fixed number of frames, extracting video features with a three-dimensional convolutional neural network, and dividing the video features in equal proportion into N temporally non-overlapping candidate video segment features to obtain a candidate video segment feature set specifically comprises the following steps:
video features are extracted with the three-dimensional convolutional neural network C3D model, taking 16 frames as a group; each group of features is then normalized so that the feature values are mapped to the interval [0, 1]. For the k-th group of features, the normalization formula is:
v̂_k = (v_k − min(v_k)) / (max(v_k) − min(v_k))
where v_k is the k-th group of video segment features, v̂_k is its normalized result, and max(v_k) and min(v_k) are the maximum and minimum values in v_k.
Further, the groups of video segment features are concatenated along the channel dimension to obtain the complete video feature V̄, which is then split in equal proportion into N temporally non-overlapping parts to obtain the set U = {u_1, u_2, …, u_N}, where u_i is the candidate video segment feature whose start time is (i−1)·T/N and whose end time is i·T/N, U is the candidate video segment feature set, T is the duration of the video, and N is determined by the average duration of all videos in the training set:
N = 2^⌈log₂ T̄⌉
where T̄ is the average duration of all videos in the training set, i.e. N is the smallest power of two that is greater than or equal to T̄.
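As an illustration of this step, the following is a minimal PyTorch-style sketch of the per-group min-max normalization and the equal split into N candidate segment features. The function names are hypothetical, and how each split chunk is reduced to a single vector is not specified above, so mean pooling is used here purely as an assumption.

```python
import torch

def normalize_groups(clip_feats, eps=1e-8):
    # clip_feats: (num_clips, feat_dim) C3D features, one row per 16-frame group
    mins = clip_feats.min(dim=1, keepdim=True).values
    maxs = clip_feats.max(dim=1, keepdim=True).values
    return (clip_feats - mins) / (maxs - mins + eps)   # each group mapped to [0, 1]

def split_into_candidates(video_feat, n_segments):
    # video_feat: (num_clips, feat_dim) concatenated along time; returns (N, feat_dim)
    chunks = torch.chunk(video_feat, n_segments, dim=0)          # N non-overlapping parts
    return torch.stack([c.mean(dim=0) for c in chunks], dim=0)   # assumed mean pooling per part
```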
Extracting the features of the query text with a long short-term memory (LSTM) network specifically comprises the following steps:
a GloVe word-embedding model is trained on the Wikipedia 2014 corpus and used to encode the query text; the encoded text is then fed to a three-layer LSTM network to extract the query-text features, and the output of the last layer is taken as the text feature q.
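A hedged sketch of this text branch is given below: pretrained GloVe-style embeddings followed by a three-layer LSTM, with the final hidden state of the last layer taken as the text feature q. The class name and the use of the final hidden state are assumptions; only the 300-dimensional embeddings, the three LSTM layers and a single output vector q are stated in the description.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # GloVe vectors would be loaded here
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        emb = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(emb)
        return h_n[-1]                            # (batch, hidden_dim): text feature q
```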
Expanding the candidate video segment feature set obtained in step S1 by max-pooling operations, and then arranging the features in the set according to their start and end times to construct an adjacent feature matrix, comprises the following steps:
sliding windows of sizes 1 to N are applied to the candidate video segment feature set U obtained in step S1, and the features covered by each window are max-pooled across channels to generate additional candidate video segment features; when the sliding-window size is Z (1 ≤ Z ≤ N), this is expressed as:
V_Z = P_max(S_Z(U))
where v_{i,j} denotes the candidate video segment feature whose start time is (i−1)·T/N and whose end time is j·T/N, P_max is the max-pooling function, S_Z is a sliding-window function of size Z, and V_Z is the output of the max-pooling function, containing N−Z+1 candidate video segment features.
Further, the candidate video segment features are arranged according to their start and end times to construct an adjacent feature matrix V of dimension N × N × d:
V[i, j] = v_{i,j} if i ≤ j, and V[i, j] = 0 otherwise,
where d is the channel dimension of the candidate video segment features; the upper-right elements of the matrix are the candidate video segment features, and the lower-left part is an invalid region filled with 0. In this feature matrix, the row index and the column index of a candidate video segment feature are mapped to its start time and end time in the original video, respectively.
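The construction of the adjacent feature matrix can be sketched as follows, assuming the N candidate segment features are stacked in a tensor U of shape (N, d); entry (i, j) with i ≤ j is the per-channel max over segments i..j (the sliding-window max pooling described above), and the lower triangle stays zero. The function name is illustrative.

```python
import torch

def build_adjacent_matrix(U):
    # U: (N, d) candidate segment features -> V: (N, N, d) adjacent feature matrix
    N, d = U.shape
    V = U.new_zeros(N, N, d)
    for i in range(N):
        for j in range(i, N):
            V[i, j] = U[i:j + 1].max(dim=0).values   # element-wise max over segments i..j
    return V
```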
Decoupling the adjacent feature matrix into a content feature matrix and a position feature matrix with a content encoder and a position encoder, weighting the different position features and enhancing the expression of the video content features so as to reduce the long-tail distribution effect caused by the data set, and then reconstructing the adjacent feature matrix by matrix addition to learn the context information of the video, specifically comprises the following steps:
two 1 × 1 convolutional layers are used as the content encoder and the position encoder, respectively, and the adjacent feature matrix is decoupled by convolution into a content feature matrix and a position feature matrix:
V_c = E_c(V);  V_l = E_l(V)
where V_c denotes the content feature matrix and V_l denotes the position feature matrix. To minimize the correlation between the content feature matrix and the position feature matrix, the cosine similarity is used as the feature-decoupling loss:
L_dec = (V_c · V_l) / (‖V_c‖ ‖V_l‖)
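A minimal sketch of this decoupling stage is shown below, assuming the adjacent feature matrix is laid out as a (batch, d, N, N) tensor: two 1 × 1 convolutions act as the content and position encoders, and the cosine similarity between their outputs serves as the decoupling loss to be minimized. Class and function names are illustrative, not from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoupler(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.content_enc = nn.Conv2d(dim, dim, kernel_size=1)   # E_c
        self.position_enc = nn.Conv2d(dim, dim, kernel_size=1)  # E_l

    def forward(self, V):                         # V: (batch, d, N, N)
        Vc = F.relu(self.content_enc(V))          # content feature matrix
        Vl = F.relu(self.position_enc(V))         # position feature matrix
        return Vc, Vl

def decoupling_loss(Vc, Vl):
    # cosine similarity between flattened content and position features; minimized in training
    return F.cosine_similarity(Vc.flatten(1), Vl.flatten(1), dim=1).mean()
```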
furthermore, since the starting and ending time of each candidate video clip in the video can be obtained during training, the loss is reconstructed through the position
Figure BDA0003561808750000042
To ensure VlCan be decoupled from V more efficiently, expressed as:
Figure BDA0003561808750000043
and l is a position embedding vector of the starting time and the ending time of the candidate video clip and is realized by adopting triangular position coding.
Furthermore, the expression of the video content features is enhanced and the different video position features are weighted, which reduces the learner's overfitting to regions such as the head or tail of the video and thus the influence of the training set's long-tail distribution effect on the model:
V̂_c = V_c + softmax(V_c) ⊙ V_c
V̂_l = softmax(Q·Kᵀ / √d_k) · L
where V̂_c and V̂_l are, respectively, the enhanced content feature matrix and the weighted position feature matrix, d_k is the scaling factor, Q = V_l × W_Q is the query vector, K = V_l × W_K is the key vector, L = V_l × W_L is the value vector, and W_Q, W_K and W_L are parameter matrices obtained by training.
Further, the adjacent feature matrix is reconstructed by a matrix addition operation to learn the context information of the video:
V̂ = V̂_c + V̂_l
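The weighting, enhancement and reconstruction can be sketched as below, with the content and position features flattened to (batch, N·N, d). The scaled dot-product attention follows the Q, K, L definitions above; the residual channel-attention form of the content enhancement and the use of √d_k in the scaling are assumptions based on the description, not a verbatim implementation.

```python
import math
import torch
import torch.nn as nn

class PositionWeighting(nn.Module):
    def __init__(self, dim=512, d_k=8):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_l = nn.Linear(dim, dim, bias=False)
        self.d_k = d_k

    def forward(self, Vl):                                   # Vl: (batch, N*N, dim)
        Q, K, L = self.W_q(Vl), self.W_k(Vl), self.W_l(Vl)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return attn @ L                                      # weighted position features

def enhance_and_reconstruct(Vc, Vl_weighted):
    # Vc, Vl_weighted: (batch, N*N, dim)
    # residual channel attention on content features (assumed form), then matrix addition
    Vc_hat = Vc + torch.softmax(Vc, dim=-1) * Vc
    return Vc_hat + Vl_weighted                              # reconstructed adjacent matrix
```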
the method comprises the following steps of fusing the characteristics of a video and a text through a matrix dot product operation, inputting the characteristics into a full convolution neural network to calculate the score of a candidate video segment, generating a retrieval result, and calculating the retrieval loss of a model, and specifically comprises the following steps:
using matrix dot product operation to reconstruct the text feature q obtained in step S2 and the adjacent feature matrix reconstructed in step S4
Figure BDA00035618087500000412
And fusing in channel dimension, adding text semantics to the candidate video segment characteristics as retrieval guidance information, and expressing as follows:
Figure BDA0003561808750000049
where q is a feature of the text,
Figure BDA00035618087500000410
is the reconstructed neighboring feature matrix, M is the result of feature fusion,
Figure BDA00035618087500000411
is a matrix dot product operation.
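The fusion above, together with the convolutional scoring head described in the next paragraph, can be sketched as follows. This is a minimal sketch assuming the reconstructed matrix is laid out as (batch, d, N, N) and q as (batch, d); the class name and padding choices are illustrative.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1),                       # 512 channels -> 1 score per cell
        )

    def forward(self, V_hat, q):                        # V_hat: (B, d, N, N), q: (B, d)
        fused = V_hat * q[:, :, None, None]             # channel-wise dot-product fusion
        return torch.sigmoid(self.convs(fused)).squeeze(1)   # (B, N, N) score map

# usage sketch: the (row, column) of the highest-scoring cell gives the
# predicted start/end indices of the retrieved segment
# scores = ScoreHead()(V_hat, q)[0]
# i, j = divmod(int(scores.flatten().argmax()), scores.shape[-1])
```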
Further, three 3 × 3 convolutional layers and one 1 × 1 convolutional layer are applied in sequence to the fused feature M to compute the scores of the candidate video clips, finally generating a candidate video clip score map of dimension N × N, and the scores are mapped to the interval [0, 1] with a sigmoid function; from the score map, the candidate video segment with the highest score is indexed out as the retrieval result:
P = σ(Φ_{1×1}(Φ_{3×3}(M)))
where σ is the sigmoid function, Φ_{1×1} denotes a convolutional layer with a 1 × 1 kernel, and Φ_{3×3} denotes the multi-layer convolution with 3 × 3 kernels.
Further, a binary cross-entropy focal loss L_f is computed as the retrieval loss of the fully convolutional neural network:
L_f = −(1/K) Σ_{i=1}^{K} [ y_i (1 − p_i)^r log(p_i) + (1 − y_i) p_i^r log(1 − p_i) ]
where K is the number of candidate segments, r is a balance parameter, p_i is the score of the i-th candidate video segment, and y_i is the training label.
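A minimal sketch of this binary cross-entropy focal loss follows; it assumes the standard focal-loss weighting with the balance parameter r, and the exact weighting in the patent's image formula may differ slightly.

```python
import torch

def bce_focal_loss(scores, labels, r=2.0, eps=1e-8):
    # scores, labels: tensors of shape (K,) with values in [0, 1]
    pos = labels * (1.0 - scores).pow(r) * torch.log(scores + eps)
    neg = (1.0 - labels) * scores.pow(r) * torch.log(1.0 - scores + eps)
    return -(pos + neg).mean()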
Beneficial effects:
The invention provides a cross-modal video segment retrieval method based on feature decoupling. An adjacent feature matrix is constructed from the video features and decoupled by a content encoder and a position encoder; the expression of the video content features is then enhanced and the different video position features are weighted, which reduces the learner's overfitting to regions such as the head or tail of the video and thus the influence of the training set's long-tail distribution effect on the model, making the model focus on learning the video content; the adjacent feature matrix is reconstructed to learn the context information among video segments, and the retrieval result is generated by a fully convolutional neural network. The method is mainly applied to cross-modal video clip retrieval tasks and is more robust to training sets with a long-tail distribution effect.
Drawings
FIG. 1 is a schematic diagram of the framework of the cross-modal video segment retrieval method based on feature decoupling according to an embodiment of the present invention;
FIG. 2 is a flow chart of the construction, decoupling and reconstruction of the adjacent feature matrix in an exemplary embodiment of the present invention;
FIG. 3 is a comparison of retrieval results of the present invention.
Detailed Description
The invention aims to provide a cross-modal video segment retrieval method based on feature decoupling. Features of the video and the query text are first extracted; the video features are then divided into several candidate video segment features, and an adjacent feature matrix is constructed according to the start and end times of the video segments; next, the adjacent feature matrix is decoupled with a content encoder and a position encoder, the expression of the content features is enhanced, and the different position features are weighted, thereby reducing the influence of the long-tail distribution effect on the model; the adjacent feature matrix is then reconstructed to learn the context information of the video; finally, a fully convolutional network generates the scores of the candidate video segments to obtain the retrieval result.
The present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the described embodiments are only intended to illustrate and explain the present technology and do not limit it in any way.
FIG. 1 is a flowchart of the cross-modal video segment retrieval method based on feature decoupling provided by the present invention; FIG. 2 is a flowchart of the construction, decoupling and reconstruction of the adjacent feature matrix; FIG. 3 compares retrieval results. The main aim of the invention is to enable the learner to still learn the content information of video clips on a training set with a long-tail distribution, avoiding learning a position mapping between video and text, and thereby obtain better retrieval results. As shown in FIG. 1, one core aspect of this embodiment is the use of a content encoder and a position encoder to decouple the adjacent feature matrix into a content feature matrix and a position feature matrix, where the content feature matrix contains the content information of the video and the position feature matrix contains the video position information affected by the long-tail distribution effect. The embodiment enhances the video content features and balances the video position features, so that the neural network can make a fair prediction of the score of each candidate video segment. Compared with existing methods, this greatly reduces the influence of the training set's long-tail distribution effect on the learner.
The invention provides a cross-modal video clip retrieval method based on feature decoupling, which specifically comprises the following steps:
step S1, splitting the video into groups with a fixed number of frames, extracting video features with a three-dimensional convolutional neural network, and then dividing the video features in equal proportion into N temporally non-overlapping candidate video segment features to obtain a candidate video segment feature set;
the length and width of the video frames are resized to 112 × 112, and video features are extracted with a C3D network, taking 16 frames as a group. The C3D network adopts the standard model, comprising 5 convolutional layers, 5 pooling layers, 2 fully-connected layers and 1 softmax layer; the convolution kernel size is 3 × 3, the pooling kernel size is 2 × 2, the stride is 1, and the fully-connected layer dimension is 2048. The output of the second fully-connected layer is saved as the feature of the video clip.
Further, a max-pooling layer with a pooling kernel of 4 reduces the dimensionality of the video segment features from 2048 to 512, and the features are normalized so that the feature values are mapped to the interval [0, 1]. For the k-th group of features, the normalization formula is:
v̂_k = (v_k − min(v_k)) / (max(v_k) − min(v_k))
where v_k is the k-th group of video segment features, v̂_k is its normalized result, and max(v_k) and min(v_k) are the maximum and minimum values in v_k.
Further, the groups of video segment features are concatenated along the channel dimension to obtain the complete video feature V̄, which is then split in equal proportion into N temporally non-overlapping parts to obtain the set U = {u_1, u_2, …, u_N}, where u_i is the candidate video segment feature whose start time is (i−1)·T/N and whose end time is i·T/N, U is the candidate video segment feature set, T is the duration of the video, and N is determined by the average duration of all videos in the training set:
N = 2^⌈log₂ T̄⌉
where T̄ is the average duration of all videos in the training set, i.e. N is the smallest power of two that is greater than or equal to T̄. For the ActivityNet Captions dataset N is 32, and for the Charades-STA dataset N is 16.
Step S2, extracting the features of the query text with a long short-term memory (LSTM) network;
a GloVe word-embedding model is trained on the Wikipedia 2014 corpus with the word-embedding dimension set to 300, and the query text is encoded into 300-dimensional word vectors; the features are then extracted with an LSTM network comprising 3 hidden layers of 512 neurons each, and the output of the last layer is taken as the query-text feature q, with dimensionality 512.
Step S3, expanding the candidate video segment feature set obtained in step S1 by max-pooling operations, and then arranging the features in the set according to their start and end times to construct an adjacent feature matrix;
sliding windows of sizes 1 to N are applied to the candidate video segment feature set U obtained in step S1, and the candidate video segment features covered by each window are max-pooled across channels; when the sliding-window size is Z (1 ≤ Z ≤ N), this is expressed as:
V_Z = P_max(S_Z(U))
where v_{i,j} denotes the candidate video segment feature whose start time is (i−1)·T/N and whose end time is j·T/N, P_max is the max-pooling function, S_Z is a sliding-window function of size Z, and V_Z is the output of the max-pooling function;
further, an adjacent feature matrix of dimension N × N × d is constructed from the candidate video segment features, where the first two dimensions correspond to the start and end times of the candidate video segments and the third dimension d corresponds to the channel dimension of the candidate video segment features and is set to 512. In this feature matrix, the upper-right elements are the candidate video segment features and the lower-left part is an invalid region filled with 0:
V[i, j] = v_{i,j} if i ≤ j, and V[i, j] = 0 otherwise.
step S4, decoupling the adjacent feature matrix into a content feature matrix and a position feature matrix by using a content encoder and a position encoder, reducing the long tail distribution effect caused by a data set by weighting different position features and enhancing the video content feature expression, and then reconstructing the adjacent feature matrix by using a matrix addition to learn the context information of the video;
and performing characteristic decoupling on the adjacent characteristic matrixes through the content encoder and the position encoder to obtain a content characteristic matrix and a position characteristic matrix. Where both encoders consist of convolutional layers with a convolution kernel of 1 × 1, the ReLU is used as the activation function. In order to maximally complete feature decoupling, the cosine similarity between the content features and the position features is calculated as a loss function of feature decoupling, and is expressed as:
Figure BDA0003561808750000081
further, since the start and stop time of each candidate video segment in the video can be obtained during training, the loss is reconstructed through the position
Figure BDA0003561808750000082
To ensure that V can be decoupled more effectively from V, is expressed as:
Figure BDA0003561808750000083
where l is the position embedding vector of the start and end times of the candidate video clip, implemented with triangular (sinusoidal) position coding:
l(pos, 2i) = sin(pos / 10000^(2i/d)),  l(pos, 2i+1) = cos(pos / 10000^(2i/d))
where d = 512 is the channel dimension of the position feature, pos is the start or end time of the candidate video segment, and 2i and 2i+1 index the even- and odd-numbered embedding terms.
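Assuming the standard transformer-style sinusoidal form, the triangular position encoding of a start or end time can be sketched as follows; the function name is illustrative.

```python
import torch

def triangular_position_encoding(pos, d=512):
    # pos: scalar start or end time; returns a d-dimensional embedding
    i = torch.arange(d // 2, dtype=torch.float32)
    freq = torch.pow(10000.0, 2 * i / d)
    enc = torch.zeros(d)
    enc[0::2] = torch.sin(pos / freq)      # even channels
    enc[1::2] = torch.cos(pos / freq)      # odd channels
    return enc
```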
Further, the weights of the video position features are adjusted, adaptively weighting the different position features:
V̂_l = softmax(Q·Kᵀ / √d_k) · L
Q = V_l × W_Q;  K = V_l × W_K;  L = V_l × W_L
where V̂_l denotes the weighted position feature matrix, Q is the query vector, K is the key vector, L is the value vector, W_Q, W_K and W_L are parameter matrices obtained by training, all set to 512 dimensions, and the scaling factor d_k is 8.
Further, a residual attention is used to enhance the expression of the video content features: a softmax function generates attention weights along the channel dimension, while a shortcut path applies an identity mapping to the content features V_c:
V̂_c = V_c + softmax(V_c) ⊙ V_c
further, the adjacent feature matrix is reconstructed through matrix addition operation to learn the context information of the video, which is expressed as:
Figure BDA0003561808750000089
step S5, fusing the characteristics of the video and the text through matrix dot product operation, inputting the characteristics into a full convolution neural network to calculate the score of the candidate video clip, generating a retrieval result, and calculating the retrieval loss of the model;
using matrix dot product operation to reconstruct the text feature q obtained in step S2 and the adjacent feature matrix reconstructed in step S4
Figure BDA0003561808750000095
And performing feature fusion, and adding text semantics as retrieval guide information for the candidate video segment features, wherein feature dimensions of the text features and the candidate video segments are both 512.
Further, three convolutional layers with 3 × 3 kernels are applied in sequence to the fused features to learn context information, and a 1 × 1 convolutional layer then reduces the 512-dimensional channel to 1 dimension, generating a candidate video segment score map of dimension N × N, in which the abscissa corresponds to the start time of a candidate video segment in the video and the ordinate corresponds to its end time; the scores are mapped to the interval [0, 1] with a sigmoid activation function, and the candidate video clip with the highest score in the score map is indexed out as the retrieval result:
P = σ(Φ_{1×1}(Φ_{3×3}(M)))
where σ is the sigmoid function, Φ_{1×1} denotes a convolutional layer with a 1 × 1 kernel, and Φ_{3×3} denotes the multi-layer convolution with 3 × 3 kernels.
The whole model is trained with the back-propagation algorithm, and the retrieval loss uses a binary cross-entropy focal loss function L_f:
L_f = −(1/K) Σ_{i=1}^{K} [ y_i (1 − p_i)^r log(p_i) + (1 − y_i) p_i^r log(1 − p_i) ]
where K is the number of candidate video segments, r is the balance parameter set to 2, p_i is the score of the i-th candidate video segment, and y_i is the scaled training label:
y_i = 0 if o_i ≤ o_min;  y_i = (o_i − o_min) / (o_max − o_min) if o_min < o_i < o_max;  y_i = 1 if o_i ≥ o_max
where o_min and o_max are preset label thresholds, 0.5 and 1.0 respectively, and o_i is the intersection over union (IoU) between the candidate video segment and the ground truth.
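The scaled training labels can be sketched as a clipped linear rescaling of the IoU values (a minimal sketch of the rule above; the function name is illustrative):

```python
import torch

def scaled_iou_labels(ious, o_min=0.5, o_max=1.0):
    # ious: tensor of IoU values between candidate segments and the ground truth
    y = (ious - o_min) / (o_max - o_min)
    return y.clamp(min=0.0, max=1.0)                 # 0 below o_min, 1 above o_max
```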
To verify the superiority of the invention, experiments were carried out on two data sets with a long-tail distribution effect, Charades-STA and ActivityNet Captions. The Charades-STA dataset contains 9948 indoor activity videos, with 12408 training pairs and 3720 test pairs. ActivityNet Captions consists of nearly twenty thousand videos, with 37417 training pairs, 17505 validation pairs and 17031 test pairs, and is the largest data set for the cross-modal video segment retrieval task at present. As in current mainstream methods, recall is used as the evaluation metric: {Rank@1, IoU = m}, i.e. the recall of the top-scoring predicted video segment at IoU = m, where m ∈ {0.5, 0.7}. The experimental and comparative results are shown in Table 1, with the best results in bold; the evaluation metrics of the invention are all higher than those of the existing methods.
TABLE 1 Experimental results of the present invention on the ActivityNet Captions and Charades-STA data sets
FIG. 3 visually compares some of the retrieval results. The input video takes place on a basketball court and shows a male player playing basketball, including dribbling, shooting, dunking and picking up the ball. For the query sentence "A man is shooting a basketball", the result retrieved by the method of S. Zhang, H. Peng, J. Fu, and J. Luo, "Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language" (Proceedings of the AAAI Conference on Artificial Intelligence, New York, 2020, 12870-12877) contains video segments of dribbling and dunking, whereas the invention accurately retrieves the video segment containing the shooting.
The above description is only one embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Modifications and substitutions that any person skilled in the art can readily conceive of within the technical scope disclosed by the present invention fall within its protection scope; therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A cross-modal video segment retrieval method based on feature decoupling, characterized in that the method first extracts video features with a three-dimensional convolutional neural network and constructs an adjacent feature matrix, while extracting the features of the query text with a long short-term memory (LSTM) network; it then decouples the adjacent feature matrix into a content feature matrix and a position feature matrix with a content encoder and a position encoder, weights the different position features and enhances the expression of the content features, thereby reducing the long-tail distribution effect caused by the data set; it then reconstructs the adjacent feature matrix by matrix addition to learn the context information of the video segments; and finally it fuses the reconstructed adjacent feature matrix with the text feature by a matrix dot-product operation and generates the retrieval result with a fully convolutional neural network, the method specifically comprising the following steps:
step S1, splitting the video into groups with a fixed number of frames, extracting video features with a three-dimensional convolutional neural network, and then dividing the video features in equal proportion into N temporally non-overlapping candidate video segment features to obtain a candidate video segment feature set;
step S2, extracting the features of the query text with a long short-term memory (LSTM) network;
step S3, expanding the candidate video segment feature set obtained in step S1 by max-pooling operations, and then arranging the features in the set according to their start and end times to construct an adjacent feature matrix;
step S4, decoupling the adjacent feature matrix into a content feature matrix and a position feature matrix with a content encoder and a position encoder, reducing the long-tail distribution effect caused by the data set by weighting the different position features and enhancing the expression of the video content features, and then reconstructing the adjacent feature matrix by matrix addition to learn the context information of the video;
step S5, fusing the video and text features by a matrix dot-product operation, feeding the result into a fully convolutional neural network to compute the scores of the candidate video segments, and generating the retrieval result.
2. The method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S1 is specifically as follows:
1) extracting video segment features with the three-dimensional convolutional neural network C3D model, taking 16 frames as a group, and then normalizing each group of features so that the feature values are mapped to the interval [0, 1]; for the k-th group of features, the normalization formula is:
v̂_k = (v_k − min(v_k)) / (max(v_k) − min(v_k))
where v_k is the k-th group of video segment features, v̂_k is its normalized result, and max(v_k) and min(v_k) are the maximum and minimum values in v_k;
2) concatenating the normalized video segment features extracted in step 1) along the channel dimension to obtain the complete video feature V̄;
3) splitting the complete video feature V̄ obtained in step 2) in equal proportion into N temporally non-overlapping parts to obtain the set U = {u_1, u_2, …, u_N}, where u_i is the candidate video segment feature whose start time is (i−1)·T/N and whose end time is i·T/N, U is the candidate video segment feature set, T is the duration of the video, and N is determined by the average duration of all videos in the training set:
N = 2^⌈log₂ T̄⌉
where T̄ is the average duration of all videos in the training set, i.e. N is the smallest power of two that is greater than or equal to T̄.
3. The method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S2 is specifically as follows:
training a GloVe word-embedding model on the Wikipedia 2014 corpus with the word-embedding dimension set to 300, encoding the query text, and feeding the resulting word vectors to a three-layer long short-term memory (LSTM) network with 512 neurons per layer for feature extraction, the output of the last layer being taken as the text feature q.
4. The method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S3 specifically includes:
1) applying sliding windows of sizes 1 to N to the candidate video segment feature set U obtained in step S1 and max-pooling the candidate video segment features covered by each window across channels; when the sliding-window size is Z (1 ≤ Z ≤ N), this is expressed as:
V_Z = P_max(S_Z(U))
where v_{i,j} denotes the candidate video segment feature whose start time is (i−1)·T/N and whose end time is j·T/N, P_max is the max-pooling function, S_Z is a sliding-window function of size Z, and V_Z is the output of the max-pooling function;
2) arranging the candidate video segment features obtained in step 1) according to their start and end times to form the adjacent feature matrix:
V[i, j] = v_{i,j} if i ≤ j, and V[i, j] = 0 otherwise,
where the row index and the column index of a candidate video clip feature are mapped to its start time and end time in the original video, respectively.
5. The method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S4 specifically includes:
1) performing feature decoupling on the adjacent feature matrix with a content encoder and a position encoder, respectively, to obtain the content feature matrix V_c and the position feature matrix V_l, where both the content encoder and the position encoder are 1 × 1 convolutional layers;
2) computing the cosine similarity of V_c and V_l as the feature-decoupling loss:
L_dec = (V_c · V_l) / (‖V_c‖ ‖V_l‖)
3) computing the position reconstruction loss L_rec to ensure that V_l can be decoupled from V more effectively:
L_rec = ‖V_l − l‖²
where l is the position embedding vector of the start and end times of the candidate video clips, implemented with triangular (sinusoidal) position coding;
4) adjusting the weights of the video position features and enhancing the expression of the video content features:
V̂_l = softmax(Q·Kᵀ / √d_k) · L
V̂_c = V_c + softmax(V_c) ⊙ V_c
where V̂_l and V̂_c are, respectively, the weighted position feature matrix and the enhanced content feature matrix, d_k is the scaling factor, Q = V_l × W_Q is the query vector, K = V_l × W_K is the key vector, L = V_l × W_L is the value vector, and W_Q, W_K and W_L are parameter matrices obtained by training;
5) reconstructing the adjacent feature matrix by matrix addition to learn the context information of the video:
V̂ = V̂_c + V̂_l
6. the method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S5 specifically includes:
1) fusing the text feature q obtained in step S2 and the adjacent feature matrix V̂ reconstructed in step S4 with a matrix dot-product operation;
2) applying three serial 3 × 3 convolutional layers and one 1 × 1 convolutional layer to the fused features to compute the scores of the candidate video clips, mapping the scores to the range [0, 1] with a sigmoid function, and indexing the candidate video clip with the highest score as the retrieval result.
7. The method according to claim 1, wherein in step S5 the fully convolutional neural network is trained by a back-propagation algorithm, with a binary cross-entropy focal loss used as the retrieval loss of the network, specifically:
L_f = −(1/K) Σ_{i=1}^{K} [ y_i (1 − p_i)^r log(p_i) + (1 − y_i) p_i^r log(1 − p_i) ]
where K is the number of candidate video segments, r is a balance parameter, p_i is the score of the i-th candidate video segment, and y_i is the scaled training label:
y_i = 0 if o_i ≤ o_min;  y_i = (o_i − o_min) / (o_max − o_min) if o_min < o_i < o_max;  y_i = 1 if o_i ≥ o_max
where o_min and o_max are preset label thresholds and o_i is the intersection over union (IoU) between the candidate video segment and the ground truth.
CN202210296716.5A 2022-03-24 2022-03-24 Cross-modal video clip retrieval method based on feature decoupling Pending CN114612748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210296716.5A CN114612748A (en) 2022-03-24 2022-03-24 Cross-modal video clip retrieval method based on feature decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210296716.5A CN114612748A (en) 2022-03-24 2022-03-24 Cross-modal video clip retrieval method based on feature decoupling

Publications (1)

Publication Number Publication Date
CN114612748A true CN114612748A (en) 2022-06-10

Family

ID=81865025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210296716.5A Pending CN114612748A (en) 2022-03-24 2022-03-24 Cross-modal video clip retrieval method based on feature decoupling

Country Status (1)

Country Link
CN (1) CN114612748A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187917A (en) * 2022-09-13 2022-10-14 山东建筑大学 Unmanned vehicle historical scene detection method based on video clip retrieval
CN116186329A (en) * 2023-02-10 2023-05-30 阿里巴巴(中国)有限公司 Video processing, searching and index constructing method, device, equipment and storage medium
CN116186329B (en) * 2023-02-10 2023-09-12 阿里巴巴(中国)有限公司 Video processing, searching and index constructing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination