CN114612748A - Cross-modal video clip retrieval method based on feature decoupling - Google Patents

Cross-modal video clip retrieval method based on feature decoupling

Info

Publication number
CN114612748A
CN114612748A (application CN202210296716.5A)
Authority
CN
China
Prior art keywords: video, feature, matrix, adjacent, decoupling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210296716.5A
Other languages
Chinese (zh)
Inventor
杨金福
刘玉斌
闫雪
宋琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210296716.5A priority Critical patent/CN114612748A/en
Publication of CN114612748A publication Critical patent/CN114612748A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses a cross-modal video clip retrieval method based on feature decoupling, which relates to the field of cross-modal video clip retrieval and comprises the following steps: first, video features are extracted with a three-dimensional convolutional neural network (C3D) model, and query-text features are extracted with an LSTM network; then, an adjacent feature matrix is constructed from the video features and decoupled by encoders into a content feature matrix and a position feature matrix; next, the expression of the video content features is enhanced and the different position features are weighted, thereby reducing the influence of the training set's long-tail distribution effect on the model; the adjacent feature matrix is then reconstructed to learn the context information of the video; finally, the reconstructed adjacent feature matrix is fused with the text features and fed into a fully convolutional neural network to generate the retrieval result. The model uses a binary cross-entropy focal loss as the retrieval loss function and is trained with the back-propagation algorithm.

Description

Cross-modal video clip retrieval method based on feature decoupling
Technical Field
The invention relates to the field of cross-modal video clip retrieval, in particular to a cross-modal video clip retrieval method based on feature decoupling.
Background Art
Video segment retrieval based on a query text is an important research direction in the field of cross-modal video retrieval. Given a query text and a video as input, it aims to find the segment of the video that matches the text description, and it is widely applied in search engines, intelligent security and recommendation systems. However, the training sets required for this task, such as ActivityNet Captions and Charades-STA, suffer from imbalanced text annotation and exhibit a long-tail distribution effect. For example, in the training set, texts containing "turn on" and "open" are mainly annotated on the head region of the video, while texts containing "turn off" and "close" are mainly annotated on the tail region. This imbalanced annotation makes the model tend to learn a mapping between the text and the video position during training; for a query text containing "turn on", for instance, the model tends to predict the head of the video, i.e. it overfits to predictions at the video head in order to obtain higher training accuracy. During testing, because the labels of the test set are usually balanced, a model that ignores the long-tail distribution effect often fails to obtain good retrieval results.
Existing cross-modal video clip retrieval methods do not consider the influence of the long-tail distribution effect of the training set on the learner, which limits the quality of their retrieval results. For example, R. Ge, J. Gao, K. Chen, and R. Nevatia, "MAC: Mining Activity Concepts for Language-based Temporal Localization", in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Hawaii, 2019, directly fuses video features with text features and computes the retrieval results with a multi-layer perceptron; similarly, S. Zhang, H. Peng, J. Fu, and J. Luo, "Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language", in Proceedings of the AAAI Conference on Artificial Intelligence, New York, 2020, 12870-12877, directly fuses the video features and the text features and generates the retrieval result by convolution operations. Although these methods complete the retrieval task, they are affected by the long-tail distribution effect of the training set, so the learner learns the mapping between the text and the video position rather than the cross-modal matching between the text and the video content; as a result, they generalize poorly and perform badly in practical applications.
Disclosure of Invention
Aiming at the problem that the data set used in the training stage exhibits a long-tail distribution effect, the invention provides a cross-modal video segment retrieval method based on feature decoupling so as to improve the retrieval quality of the model.
A cross-modal video clip retrieval method based on feature decoupling comprises the following steps:
step S1, splitting the video into groups with a fixed number of frames, extracting video features with a three-dimensional convolutional neural network, and then dividing the video features in equal proportion into N temporally non-overlapping candidate video segment features to obtain a candidate video segment feature set;
step S2, extracting the features of the query text with a long short-term memory (LSTM) network;
step S3, expanding the candidate video segment feature set obtained in step S1 by max-pooling operations, and then arranging the features in the set according to their start and end times to construct an adjacent feature matrix;
step S4, decoupling the adjacent feature matrix into a content feature matrix and a position feature matrix with a content encoder and a position encoder, reducing the long-tail distribution effect caused by the data set by weighting the different position features and enhancing the expression of the video content features, and then reconstructing the adjacent feature matrix by matrix addition to learn the context information of the video;
step S5, fusing the video and text features by a matrix dot-product operation, feeding the result into a fully convolutional neural network to compute the scores of the candidate video segments, generating the retrieval result, and computing the retrieval loss of the model.
Splitting the video into groups with a fixed number of frames, extracting video features with a three-dimensional convolutional neural network, and dividing the video features in equal proportion into N temporally non-overlapping candidate video segment features to obtain a candidate video segment feature set specifically comprises the following steps:
video features are extracted with the three-dimensional convolutional neural network C3D model, taking 16 frames as a group; each group of features is then normalized so that the feature values are mapped to the interval [0, 1]. For the k-th group of features, the normalization formula is:
v̂_k = (v_k − min(v_k)) / (max(v_k) − min(v_k))
where v_k is the k-th group of video segment features, v̂_k is its normalized result, and max(v_k) and min(v_k) are the maximum and minimum values in v_k.
Further, the groups of video segment features are concatenated along the channel dimension to obtain the complete video feature V̄, which is then split in equal proportion into N temporally non-overlapping parts to obtain the set U = {u_1, u_2, …, u_N}, where u_i is the candidate video segment feature whose start time is (i−1)·T/N and whose end time is i·T/N, U is the candidate video segment feature set, T is the duration of the video, and N is determined by the average duration of all videos in the training set:
N = 2^⌈log₂ T̄⌉
where T̄ is the average duration of all videos in the training set, i.e. N is the smallest power of two that is greater than or equal to T̄.
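As an illustration of this step, the following is a minimal PyTorch-style sketch of the per-group min-max normalization and the equal split into N candidate segment features. The function names are hypothetical, and how each split chunk is reduced to a single vector is not specified above, so mean pooling is used here purely as an assumption.

```python
import torch

def normalize_groups(clip_feats, eps=1e-8):
    # clip_feats: (num_clips, feat_dim) C3D features, one row per 16-frame group
    mins = clip_feats.min(dim=1, keepdim=True).values
    maxs = clip_feats.max(dim=1, keepdim=True).values
    return (clip_feats - mins) / (maxs - mins + eps)   # each group mapped to [0, 1]

def split_into_candidates(video_feat, n_segments):
    # video_feat: (num_clips, feat_dim) concatenated along time; returns (N, feat_dim)
    chunks = torch.chunk(video_feat, n_segments, dim=0)          # N non-overlapping parts
    return torch.stack([c.mean(dim=0) for c in chunks], dim=0)   # assumed mean pooling per part
```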
Extracting the features of the query text with a long short-term memory (LSTM) network specifically comprises the following steps:
a GloVe word-embedding model is trained on the Wikipedia 2014 corpus and used to encode the query text; the encoded text is then fed to a three-layer LSTM network to extract the query-text features, and the output of the last layer is taken as the text feature q.
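A hedged sketch of this text branch is given below: pretrained GloVe-style embeddings followed by a three-layer LSTM, with the final hidden state of the last layer taken as the text feature q. The class name and the use of the final hidden state are assumptions; only the 300-dimensional embeddings, the three LSTM layers and a single output vector q are stated in the description.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # GloVe vectors would be loaded here
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        emb = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(emb)
        return h_n[-1]                            # (batch, hidden_dim): text feature q
```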
Expanding the candidate video segment feature set obtained in step S1 by max-pooling operations, and then arranging the features in the set according to their start and end times to construct an adjacent feature matrix, comprises the following steps:
sliding windows of sizes 1 to N are applied to the candidate video segment feature set U obtained in step S1, and the features covered by each window are max-pooled across channels to generate additional candidate video segment features; when the sliding-window size is Z (1 ≤ Z ≤ N), this is expressed as:
V_Z = P_max(S_Z(U))
where v_{i,j} denotes the candidate video segment feature whose start time is (i−1)·T/N and whose end time is j·T/N, P_max is the max-pooling function, S_Z is a sliding-window function of size Z, and V_Z is the output of the max-pooling function, containing N−Z+1 candidate video segment features.
Further, the candidate video segment features are arranged according to their start and end times to construct an adjacent feature matrix V of dimension N × N × d:
V[i, j] = v_{i,j} if i ≤ j, and V[i, j] = 0 otherwise,
where d is the channel dimension of the candidate video segment features; the upper-right elements of the matrix are the candidate video segment features, and the lower-left part is an invalid region filled with 0. In this feature matrix, the row index and the column index of a candidate video segment feature are mapped to its start time and end time in the original video, respectively.
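The construction of the adjacent feature matrix can be sketched as follows, assuming the N candidate segment features are stacked in a tensor U of shape (N, d); entry (i, j) with i ≤ j is the per-channel max over segments i..j (the sliding-window max pooling described above), and the lower triangle stays zero. The function name is illustrative.

```python
import torch

def build_adjacent_matrix(U):
    # U: (N, d) candidate segment features -> V: (N, N, d) adjacent feature matrix
    N, d = U.shape
    V = U.new_zeros(N, N, d)
    for i in range(N):
        for j in range(i, N):
            V[i, j] = U[i:j + 1].max(dim=0).values   # element-wise max over segments i..j
    return V
```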
Decoupling the adjacent feature matrix into a content feature matrix and a position feature matrix with a content encoder and a position encoder, weighting the different position features and enhancing the expression of the video content features so as to reduce the long-tail distribution effect caused by the data set, and then reconstructing the adjacent feature matrix by matrix addition to learn the context information of the video, specifically comprises the following steps:
two 1 × 1 convolutional layers are used as the content encoder and the position encoder, respectively, and the adjacent feature matrix is decoupled by convolution into a content feature matrix and a position feature matrix:
V_c = E_c(V);  V_l = E_l(V)
where V_c denotes the content feature matrix and V_l denotes the position feature matrix. To minimize the correlation between the content feature matrix and the position feature matrix, the cosine similarity is used as the feature-decoupling loss:
L_dec = (V_c · V_l) / (‖V_c‖ ‖V_l‖)
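A minimal sketch of this decoupling stage is shown below, assuming the adjacent feature matrix is laid out as a (batch, d, N, N) tensor: two 1 × 1 convolutions act as the content and position encoders, and the cosine similarity between their outputs serves as the decoupling loss to be minimized. Class and function names are illustrative, not from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoupler(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.content_enc = nn.Conv2d(dim, dim, kernel_size=1)   # E_c
        self.position_enc = nn.Conv2d(dim, dim, kernel_size=1)  # E_l

    def forward(self, V):                         # V: (batch, d, N, N)
        Vc = F.relu(self.content_enc(V))          # content feature matrix
        Vl = F.relu(self.position_enc(V))         # position feature matrix
        return Vc, Vl

def decoupling_loss(Vc, Vl):
    # cosine similarity between flattened content and position features; minimized in training
    return F.cosine_similarity(Vc.flatten(1), Vl.flatten(1), dim=1).mean()
```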
furthermore, since the starting and ending time of each candidate video clip in the video can be obtained during training, the loss is reconstructed through the position
Figure BDA0003561808750000042
To ensure VlCan be decoupled from V more efficiently, expressed as:
Figure BDA0003561808750000043
and l is a position embedding vector of the starting time and the ending time of the candidate video clip and is realized by adopting triangular position coding.
Furthermore, the expression of the video content features is enhanced and the different video position features are weighted, which reduces the learner's overfitting to regions such as the head or tail of the video and thus the influence of the training set's long-tail distribution effect on the model:
V̂_c = V_c + softmax(V_c) ⊙ V_c
V̂_l = softmax(Q·Kᵀ / √d_k) · L
where V̂_c and V̂_l are, respectively, the enhanced content feature matrix and the weighted position feature matrix, d_k is the scaling factor, Q = V_l × W_Q is the query vector, K = V_l × W_K is the key vector, L = V_l × W_L is the value vector, and W_Q, W_K and W_L are parameter matrices obtained by training.
Further, the adjacent feature matrix is reconstructed by a matrix addition operation to learn the context information of the video:
V̂ = V̂_c + V̂_l
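The weighting, enhancement and reconstruction can be sketched as below, with the content and position features flattened to (batch, N·N, d). The scaled dot-product attention follows the Q, K, L definitions above; the residual channel-attention form of the content enhancement and the use of √d_k in the scaling are assumptions based on the description, not a verbatim implementation.

```python
import math
import torch
import torch.nn as nn

class PositionWeighting(nn.Module):
    def __init__(self, dim=512, d_k=8):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_l = nn.Linear(dim, dim, bias=False)
        self.d_k = d_k

    def forward(self, Vl):                                   # Vl: (batch, N*N, dim)
        Q, K, L = self.W_q(Vl), self.W_k(Vl), self.W_l(Vl)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return attn @ L                                      # weighted position features

def enhance_and_reconstruct(Vc, Vl_weighted):
    # Vc, Vl_weighted: (batch, N*N, dim)
    # residual channel attention on content features (assumed form), then matrix addition
    Vc_hat = Vc + torch.softmax(Vc, dim=-1) * Vc
    return Vc_hat + Vl_weighted                              # reconstructed adjacent matrix
```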
the method comprises the following steps of fusing the characteristics of a video and a text through a matrix dot product operation, inputting the characteristics into a full convolution neural network to calculate the score of a candidate video segment, generating a retrieval result, and calculating the retrieval loss of a model, and specifically comprises the following steps:
using matrix dot product operation to reconstruct the text feature q obtained in step S2 and the adjacent feature matrix reconstructed in step S4
Figure BDA00035618087500000412
And fusing in channel dimension, adding text semantics to the candidate video segment characteristics as retrieval guidance information, and expressing as follows:
Figure BDA0003561808750000049
where q is a feature of the text,
Figure BDA00035618087500000410
is the reconstructed neighboring feature matrix, M is the result of feature fusion,
Figure BDA00035618087500000411
is a matrix dot product operation.
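The fusion above, together with the convolutional scoring head described in the next paragraph, can be sketched as follows. This is a minimal sketch assuming the reconstructed matrix is laid out as (batch, d, N, N) and q as (batch, d); the class name and padding choices are illustrative.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1),                       # 512 channels -> 1 score per cell
        )

    def forward(self, V_hat, q):                        # V_hat: (B, d, N, N), q: (B, d)
        fused = V_hat * q[:, :, None, None]             # channel-wise dot-product fusion
        return torch.sigmoid(self.convs(fused)).squeeze(1)   # (B, N, N) score map

# usage sketch: the (row, column) of the highest-scoring cell gives the
# predicted start/end indices of the retrieved segment
# scores = ScoreHead()(V_hat, q)[0]
# i, j = divmod(int(scores.flatten().argmax()), scores.shape[-1])
```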
Further, three 3 × 3 convolutional layers and one 1 × 1 convolutional layer are applied in sequence to the fused feature M to compute the scores of the candidate video clips, finally generating a candidate video clip score map of dimension N × N, and the scores are mapped to the interval [0, 1] with a sigmoid function; from the score map, the candidate video segment with the highest score is indexed out as the retrieval result:
P = σ(Φ_{1×1}(Φ_{3×3}(M)))
where σ is the sigmoid function, Φ_{1×1} denotes a convolutional layer with a 1 × 1 kernel, and Φ_{3×3} denotes the multi-layer convolution with 3 × 3 kernels.
Further, a binary cross-entropy focal loss L_f is computed as the retrieval loss of the fully convolutional neural network:
L_f = −(1/K) Σ_{i=1}^{K} [ y_i (1 − p_i)^r log(p_i) + (1 − y_i) p_i^r log(1 − p_i) ]
where K is the number of candidate segments, r is a balance parameter, p_i is the score of the i-th candidate video segment, and y_i is the training label.
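A minimal sketch of this binary cross-entropy focal loss follows; it assumes the standard focal-loss weighting with the balance parameter r, and the exact weighting in the patent's image formula may differ slightly.

```python
import torch

def bce_focal_loss(scores, labels, r=2.0, eps=1e-8):
    # scores, labels: tensors of shape (K,) with values in [0, 1]
    pos = labels * (1.0 - scores).pow(r) * torch.log(scores + eps)
    neg = (1.0 - labels) * scores.pow(r) * torch.log(1.0 - scores + eps)
    return -(pos + neg).mean()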
Beneficial effects:
The invention provides a cross-modal video segment retrieval method based on feature decoupling. An adjacent feature matrix is constructed from the video features and decoupled by a content encoder and a position encoder; the expression of the video content features is then enhanced and the different video position features are weighted, which reduces the learner's overfitting to regions such as the head or tail of the video and thus the influence of the training set's long-tail distribution effect on the model, making the model focus on learning the video content; the adjacent feature matrix is reconstructed to learn the context information among video segments, and the retrieval result is generated by a fully convolutional neural network. The method is mainly applied to cross-modal video clip retrieval tasks and is more robust to training sets with a long-tail distribution effect.
Drawings
FIG. 1 is a schematic diagram of the framework of the cross-modal video segment retrieval method based on feature decoupling according to an embodiment of the present invention;
FIG. 2 is a flow chart of the construction, decoupling and reconstruction of the adjacent feature matrix in an exemplary embodiment of the present invention;
FIG. 3 is a comparison of retrieval results of the present invention.
Detailed Description
The invention aims to provide a cross-modal video segment retrieval method based on feature decoupling. Features of the video and the query text are first extracted; the video features are then divided into several candidate video segment features, and an adjacent feature matrix is constructed according to the start and end times of the video segments; next, the adjacent feature matrix is decoupled with a content encoder and a position encoder, the expression of the content features is enhanced, and the different position features are weighted, thereby reducing the influence of the long-tail distribution effect on the model; the adjacent feature matrix is then reconstructed to learn the context information of the video; finally, a fully convolutional network generates the scores of the candidate video segments to obtain the retrieval result.
The present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the described embodiments are only intended to illustrate and explain the present technology and do not limit it in any way.
FIG. 1 is a flowchart of the cross-modal video segment retrieval method based on feature decoupling provided by the present invention; FIG. 2 is a flowchart of the construction, decoupling and reconstruction of the adjacent feature matrix; FIG. 3 compares retrieval results. The main aim of the invention is to enable the learner to still learn the content information of video clips on a training set with a long-tail distribution, avoiding learning a position mapping between video and text, and thereby obtain better retrieval results. As shown in FIG. 1, one core aspect of this embodiment is the use of a content encoder and a position encoder to decouple the adjacent feature matrix into a content feature matrix and a position feature matrix, where the content feature matrix contains the content information of the video and the position feature matrix contains the video position information affected by the long-tail distribution effect. The embodiment enhances the video content features and balances the video position features, so that the neural network can make a fair prediction of the score of each candidate video segment. Compared with existing methods, this greatly reduces the influence of the training set's long-tail distribution effect on the learner.
The invention provides a cross-modal video clip retrieval method based on feature decoupling, which specifically comprises the following steps:
step S1, splitting the video into groups with a fixed number of frames, extracting video features with a three-dimensional convolutional neural network, and then dividing the video features in equal proportion into N temporally non-overlapping candidate video segment features to obtain a candidate video segment feature set;
the length and width of the video frames are resized to 112 × 112, and video features are extracted with a C3D network, taking 16 frames as a group. The C3D network adopts the standard model, comprising 5 convolutional layers, 5 pooling layers, 2 fully-connected layers and 1 softmax layer; the convolution kernel size is 3 × 3, the pooling kernel size is 2 × 2, the stride is 1, and the fully-connected layer dimension is 2048. The output of the second fully-connected layer is saved as the feature of the video clip.
Further, a max-pooling layer with a pooling kernel of 4 reduces the dimensionality of the video segment features from 2048 to 512, and the features are normalized so that the feature values are mapped to the interval [0, 1]. For the k-th group of features, the normalization formula is:
v̂_k = (v_k − min(v_k)) / (max(v_k) − min(v_k))
where v_k is the k-th group of video segment features, v̂_k is its normalized result, and max(v_k) and min(v_k) are the maximum and minimum values in v_k.
Further, the groups of video segment features are concatenated along the channel dimension to obtain the complete video feature V̄, which is then split in equal proportion into N temporally non-overlapping parts to obtain the set U = {u_1, u_2, …, u_N}, where u_i is the candidate video segment feature whose start time is (i−1)·T/N and whose end time is i·T/N, U is the candidate video segment feature set, T is the duration of the video, and N is determined by the average duration of all videos in the training set:
N = 2^⌈log₂ T̄⌉
where T̄ is the average duration of all videos in the training set, i.e. N is the smallest power of two that is greater than or equal to T̄. For the ActivityNet Captions dataset N is 32, and for the Charades-STA dataset N is 16.
Step S2, extracting the features of the query text with a long short-term memory (LSTM) network;
a GloVe word-embedding model is trained on the Wikipedia 2014 corpus with the word-embedding dimension set to 300, and the query text is encoded into 300-dimensional word vectors; the features are then extracted with an LSTM network comprising 3 hidden layers of 512 neurons each, and the output of the last layer is taken as the query-text feature q, with dimensionality 512.
Step S3, expanding the candidate video segment feature set obtained in step S1 by max-pooling operations, and then arranging the features in the set according to their start and end times to construct an adjacent feature matrix;
sliding windows of sizes 1 to N are applied to the candidate video segment feature set U obtained in step S1, and the candidate video segment features covered by each window are max-pooled across channels; when the sliding-window size is Z (1 ≤ Z ≤ N), this is expressed as:
V_Z = P_max(S_Z(U))
where v_{i,j} denotes the candidate video segment feature whose start time is (i−1)·T/N and whose end time is j·T/N, P_max is the max-pooling function, S_Z is a sliding-window function of size Z, and V_Z is the output of the max-pooling function;
further, an adjacent feature matrix of dimension N × N × d is constructed from the candidate video segment features, where the first two dimensions correspond to the start and end times of the candidate video segments and the third dimension d corresponds to the channel dimension of the candidate video segment features and is set to 512. In this feature matrix, the upper-right elements are the candidate video segment features and the lower-left part is an invalid region filled with 0:
V[i, j] = v_{i,j} if i ≤ j, and V[i, j] = 0 otherwise.
step S4, decoupling the adjacent feature matrix into a content feature matrix and a position feature matrix by using a content encoder and a position encoder, reducing the long tail distribution effect caused by a data set by weighting different position features and enhancing the video content feature expression, and then reconstructing the adjacent feature matrix by using a matrix addition to learn the context information of the video;
and performing characteristic decoupling on the adjacent characteristic matrixes through the content encoder and the position encoder to obtain a content characteristic matrix and a position characteristic matrix. Where both encoders consist of convolutional layers with a convolution kernel of 1 × 1, the ReLU is used as the activation function. In order to maximally complete feature decoupling, the cosine similarity between the content features and the position features is calculated as a loss function of feature decoupling, and is expressed as:
Figure BDA0003561808750000081
further, since the start and stop time of each candidate video segment in the video can be obtained during training, the loss is reconstructed through the position
Figure BDA0003561808750000082
To ensure that V can be decoupled more effectively from V, is expressed as:
Figure BDA0003561808750000083
where l is the position embedding vector of the start and end times of the candidate video clip, implemented with triangular (sinusoidal) position coding:
l(pos, 2i) = sin(pos / 10000^(2i/d)),  l(pos, 2i+1) = cos(pos / 10000^(2i/d))
where d = 512 is the channel dimension of the position feature, pos is the start or end time of the candidate video segment, and 2i and 2i+1 index the even- and odd-numbered embedding terms.
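Assuming the standard transformer-style sinusoidal form, the triangular position encoding of a start or end time can be sketched as follows; the function name is illustrative.

```python
import torch

def triangular_position_encoding(pos, d=512):
    # pos: scalar start or end time; returns a d-dimensional embedding
    i = torch.arange(d // 2, dtype=torch.float32)
    freq = torch.pow(10000.0, 2 * i / d)
    enc = torch.zeros(d)
    enc[0::2] = torch.sin(pos / freq)      # even channels
    enc[1::2] = torch.cos(pos / freq)      # odd channels
    return enc
```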
Further, the weights of the video position features are adjusted, adaptively weighting the different position features:
V̂_l = softmax(Q·Kᵀ / √d_k) · L
Q = V_l × W_Q;  K = V_l × W_K;  L = V_l × W_L
where V̂_l denotes the weighted position feature matrix, Q is the query vector, K is the key vector, L is the value vector, W_Q, W_K and W_L are parameter matrices obtained by training, all set to 512 dimensions, and the scaling factor d_k is 8.
Further, a residual attention is used to enhance the expression of the video content features: a softmax function generates attention weights along the channel dimension, while a shortcut path applies an identity mapping to the content features V_c:
V̂_c = V_c + softmax(V_c) ⊙ V_c
further, the adjacent feature matrix is reconstructed through matrix addition operation to learn the context information of the video, which is expressed as:
Figure BDA0003561808750000089
step S5, fusing the characteristics of the video and the text through matrix dot product operation, inputting the characteristics into a full convolution neural network to calculate the score of the candidate video clip, generating a retrieval result, and calculating the retrieval loss of the model;
using matrix dot product operation to reconstruct the text feature q obtained in step S2 and the adjacent feature matrix reconstructed in step S4
Figure BDA0003561808750000095
And performing feature fusion, and adding text semantics as retrieval guide information for the candidate video segment features, wherein feature dimensions of the text features and the candidate video segments are both 512.
Further, three convolutional layers with 3 × 3 kernels are applied in sequence to the fused features to learn context information, and a 1 × 1 convolutional layer then reduces the 512-dimensional channel to 1 dimension, generating a candidate video segment score map of dimension N × N, in which the abscissa corresponds to the start time of a candidate video segment in the video and the ordinate corresponds to its end time; the scores are mapped to the interval [0, 1] with a sigmoid activation function, and the candidate video clip with the highest score in the score map is indexed out as the retrieval result:
P = σ(Φ_{1×1}(Φ_{3×3}(M)))
where σ is the sigmoid function, Φ_{1×1} denotes a convolutional layer with a 1 × 1 kernel, and Φ_{3×3} denotes the multi-layer convolution with 3 × 3 kernels.
The whole model is trained with the back-propagation algorithm, and the retrieval loss uses a binary cross-entropy focal loss function L_f:
L_f = −(1/K) Σ_{i=1}^{K} [ y_i (1 − p_i)^r log(p_i) + (1 − y_i) p_i^r log(1 − p_i) ]
where K is the number of candidate video segments, r is the balance parameter set to 2, p_i is the score of the i-th candidate video segment, and y_i is the scaled training label:
y_i = 0 if o_i ≤ o_min;  y_i = (o_i − o_min) / (o_max − o_min) if o_min < o_i < o_max;  y_i = 1 if o_i ≥ o_max
where o_min and o_max are preset label thresholds, 0.5 and 1.0 respectively, and o_i is the intersection over union (IoU) between the candidate video segment and the ground truth.
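The scaled training labels can be sketched as a clipped linear rescaling of the IoU values (a minimal sketch of the rule above; the function name is illustrative):

```python
import torch

def scaled_iou_labels(ious, o_min=0.5, o_max=1.0):
    # ious: tensor of IoU values between candidate segments and the ground truth
    y = (ious - o_min) / (o_max - o_min)
    return y.clamp(min=0.0, max=1.0)                 # 0 below o_min, 1 above o_max
```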
To verify the superiority of the invention, experiments were carried out on two data sets with a long-tail distribution effect, Charades-STA and ActivityNet Captions. The Charades-STA dataset contains 9948 indoor activity videos, with 12408 training pairs and 3720 test pairs. ActivityNet Captions consists of nearly twenty thousand videos, with 37417 training pairs, 17505 validation pairs and 17031 test pairs, and is the largest data set for the cross-modal video segment retrieval task at present. As in current mainstream methods, recall is used as the evaluation metric: {Rank@1, IoU = m}, i.e. the recall of the top-scoring predicted video segment at IoU = m, where m ∈ {0.5, 0.7}. The experimental and comparative results are shown in Table 1, with the best results in bold; the evaluation metrics of the invention are all higher than those of the existing methods.
TABLE 1 Experimental results of the present invention on the ActivityNet Captions and Charades-STA data sets
FIG. 3 visually compares some of the retrieval results. The input video takes place on a basketball court and shows a male player playing basketball, including dribbling, shooting, dunking and picking up the ball. For the query sentence "A man is shooting a basketball", the result retrieved by the method of S. Zhang, H. Peng, J. Fu, and J. Luo, "Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language" (Proceedings of the AAAI Conference on Artificial Intelligence, New York, 2020, 12870-12877) contains video segments of dribbling and dunking, whereas the invention accurately retrieves the video segment containing the shooting.
The above description is only one embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Modifications and substitutions that any person skilled in the art can readily conceive of within the technical scope disclosed by the present invention fall within its protection scope; therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A cross-modal video segment retrieval method based on feature decoupling, characterized in that the method first extracts video features with a three-dimensional convolutional neural network and constructs an adjacent feature matrix, while extracting the features of the query text with a long short-term memory (LSTM) network; it then decouples the adjacent feature matrix into a content feature matrix and a position feature matrix with a content encoder and a position encoder, weights the different position features and enhances the expression of the content features, thereby reducing the long-tail distribution effect caused by the data set; it then reconstructs the adjacent feature matrix by matrix addition to learn the context information of the video segments; and finally it fuses the reconstructed adjacent feature matrix with the text feature by a matrix dot-product operation and generates the retrieval result with a fully convolutional neural network, the method specifically comprising the following steps:
step S1, splitting the video into groups with a fixed number of frames, extracting video features with a three-dimensional convolutional neural network, and then dividing the video features in equal proportion into N temporally non-overlapping candidate video segment features to obtain a candidate video segment feature set;
step S2, extracting the features of the query text with a long short-term memory (LSTM) network;
step S3, expanding the candidate video segment feature set obtained in step S1 by max-pooling operations, and then arranging the features in the set according to their start and end times to construct an adjacent feature matrix;
step S4, decoupling the adjacent feature matrix into a content feature matrix and a position feature matrix with a content encoder and a position encoder, reducing the long-tail distribution effect caused by the data set by weighting the different position features and enhancing the expression of the video content features, and then reconstructing the adjacent feature matrix by matrix addition to learn the context information of the video;
step S5, fusing the video and text features by a matrix dot-product operation, feeding the result into a fully convolutional neural network to compute the scores of the candidate video segments, and generating the retrieval result.
2. The method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S1 is specifically as follows:
1) extracting video segment features with the three-dimensional convolutional neural network C3D model, taking 16 frames as a group, and then normalizing each group of features so that the feature values are mapped to the interval [0, 1]; for the k-th group of features, the normalization formula is:
v̂_k = (v_k − min(v_k)) / (max(v_k) − min(v_k))
where v_k is the k-th group of video segment features, v̂_k is its normalized result, and max(v_k) and min(v_k) are the maximum and minimum values in v_k;
2) concatenating the normalized video segment features extracted in step 1) along the channel dimension to obtain the complete video feature V̄;
3) splitting the complete video feature V̄ obtained in step 2) in equal proportion into N temporally non-overlapping parts to obtain the set U = {u_1, u_2, …, u_N}, where u_i is the candidate video segment feature whose start time is (i−1)·T/N and whose end time is i·T/N, U is the candidate video segment feature set, T is the duration of the video, and N is determined by the average duration of all videos in the training set:
N = 2^⌈log₂ T̄⌉
where T̄ is the average duration of all videos in the training set, i.e. N is the smallest power of two that is greater than or equal to T̄.
3. The method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S2 is specifically as follows:
training a GloVe word-embedding model on the Wikipedia 2014 corpus with the word-embedding dimension set to 300, encoding the query text, and feeding the resulting word vectors to a three-layer long short-term memory (LSTM) network with 512 neurons per layer for feature extraction, the output of the last layer being taken as the text feature q.
4. The method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S3 specifically includes:
1) applying sliding windows of sizes 1 to N to the candidate video segment feature set U obtained in step S1 and max-pooling the candidate video segment features covered by each window across channels; when the sliding-window size is Z (1 ≤ Z ≤ N), this is expressed as:
V_Z = P_max(S_Z(U))
where v_{i,j} denotes the candidate video segment feature whose start time is (i−1)·T/N and whose end time is j·T/N, P_max is the max-pooling function, S_Z is a sliding-window function of size Z, and V_Z is the output of the max-pooling function;
2) arranging the candidate video segment features obtained in step 1) according to their start and end times to form the adjacent feature matrix:
V[i, j] = v_{i,j} if i ≤ j, and V[i, j] = 0 otherwise,
where the row index and the column index of a candidate video clip feature are mapped to its start time and end time in the original video, respectively.
5. The method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S4 specifically includes:
1) performing feature decoupling on the adjacent feature matrix with a content encoder and a position encoder, respectively, to obtain the content feature matrix V_c and the position feature matrix V_l, where both the content encoder and the position encoder are 1 × 1 convolutional layers;
2) computing the cosine similarity of V_c and V_l as the feature-decoupling loss:
L_dec = (V_c · V_l) / (‖V_c‖ ‖V_l‖)
3) computing the position reconstruction loss L_rec to ensure that V_l can be decoupled from V more effectively:
L_rec = ‖V_l − l‖²
where l is the position embedding vector of the start and end times of the candidate video clips, implemented with triangular (sinusoidal) position coding;
4) adjusting the weights of the video position features and enhancing the expression of the video content features:
V̂_l = softmax(Q·Kᵀ / √d_k) · L
V̂_c = V_c + softmax(V_c) ⊙ V_c
where V̂_l and V̂_c are, respectively, the weighted position feature matrix and the enhanced content feature matrix, d_k is the scaling factor, Q = V_l × W_Q is the query vector, K = V_l × W_K is the key vector, L = V_l × W_L is the value vector, and W_Q, W_K and W_L are parameter matrices obtained by training;
5) reconstructing the adjacent feature matrix by matrix addition to learn the context information of the video:
V̂ = V̂_c + V̂_l
6. the method for retrieving the cross-modal video clip based on the feature decoupling as claimed in claim 1, wherein the step S5 specifically includes:
1) fusing the text feature q obtained in step S2 and the adjacent feature matrix V̂ reconstructed in step S4 with a matrix dot-product operation;
2) applying three serial 3 × 3 convolutional layers and one 1 × 1 convolutional layer to the fused features to compute the scores of the candidate video clips, mapping the scores to the range [0, 1] with a sigmoid function, and indexing the candidate video clip with the highest score as the retrieval result.
7. The method according to claim 1, wherein in step S5 the fully convolutional neural network is trained by a back-propagation algorithm, with a binary cross-entropy focal loss used as the retrieval loss of the network, specifically:
L_f = −(1/K) Σ_{i=1}^{K} [ y_i (1 − p_i)^r log(p_i) + (1 − y_i) p_i^r log(1 − p_i) ]
where K is the number of candidate video segments, r is a balance parameter, p_i is the score of the i-th candidate video segment, and y_i is the scaled training label:
y_i = 0 if o_i ≤ o_min;  y_i = (o_i − o_min) / (o_max − o_min) if o_min < o_i < o_max;  y_i = 1 if o_i ≥ o_max
where o_min and o_max are preset label thresholds and o_i is the intersection over union (IoU) between the candidate video segment and the ground truth.
CN202210296716.5A 2022-03-24 2022-03-24 Cross-modal video clip retrieval method based on feature decoupling Pending CN114612748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210296716.5A CN114612748A (en) 2022-03-24 2022-03-24 Cross-modal video clip retrieval method based on feature decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210296716.5A CN114612748A (en) 2022-03-24 2022-03-24 Cross-modal video clip retrieval method based on feature decoupling

Publications (1)

Publication Number Publication Date
CN114612748A true CN114612748A (en) 2022-06-10

Family

ID=81865025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210296716.5A Pending CN114612748A (en) 2022-03-24 2022-03-24 Cross-modal video clip retrieval method based on feature decoupling

Country Status (1)

Country Link
CN (1) CN114612748A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187917A (en) * 2022-09-13 2022-10-14 山东建筑大学 Unmanned vehicle historical scene detection method based on video clip retrieval
CN116186329A (en) * 2023-02-10 2023-05-30 阿里巴巴(中国)有限公司 Video processing, searching and index constructing method, device, equipment and storage medium
CN116186329B (en) * 2023-02-10 2023-09-12 阿里巴巴(中国)有限公司 Video processing, searching and index constructing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination