Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a video text retrieval method according to an embodiment of the present invention.
As shown in fig. 1, a video text retrieval method includes the following steps:
importing a plurality of videos and a plurality of natural language text descriptions which are in one-to-one correspondence with the videos, and randomly dividing all the videos into a training set and a test set;
respectively preprocessing each video in the training set and the corresponding natural language text description to obtain a plurality of target video picture block sequences of each video in the training set;
constructing a video encoder and a visual semantic surveillance encoder, and training the video encoder by using the visual semantic surveillance encoder and the plurality of target video picture block sequences of each video in the training set to obtain the trained video encoder and the video text distance of each target video picture block sequence;
respectively coding the video text distance of each target video picture block sequence by using the trained video coder to obtain the video characteristics of each target video picture block sequence;
respectively coding each target video picture block sequence by using a text coder to obtain text characteristics of each target video picture block sequence;
performing loss function analysis according to the video characteristics and the text characteristics of each target video picture block sequence respectively to obtain a plurality of loss functions of each target video picture block sequence;
respectively updating parameters of the visual semantic surveillance encoder and the trained video encoder according to a plurality of loss functions of each target video picture block sequence to obtain an updated visual semantic surveillance encoder and an updated video encoder;
and performing video text retrieval processing on the test set by using the updated visual semantic surveillance encoder and the updated video encoder to obtain a video text retrieval result.
It should be understood that the raw data (i.e., the plurality of videos and the plurality of natural language text descriptions in one-to-one correspondence with the videos) is collected and divided into a training set and a test set.
Specifically, the original data contains at least 10000 videos together with their corresponding natural language text descriptions, with at least 20 descriptive sentences per video; the original data (i.e., the plurality of videos and the plurality of natural language text descriptions in one-to-one correspondence with the videos) is split such that 9000 videos are used for the training process and 1000 videos are used for the testing process.
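The random 9000/1000 split described above can be sketched as follows; this minimal Python sketch is illustrative only, and the integer video IDs and the use of `random.sample` are assumptions rather than part of the method.

```python
# Illustrative sketch of the 9000/1000 random train/test split described
# above. Integer video IDs and random.sample are assumptions for the sketch.
import random

video_ids = list(range(10000))                  # at least 10000 videos
random.seed(0)                                  # fixed seed, for repeatability only
train_ids = set(random.sample(video_ids, 9000))
test_ids = [v for v in video_ids if v not in train_ids]
```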
Specifically, the video encoder and the visual semantic surveillance encoder share the same architecture and are each composed of a stack of spatio-temporal self-attention modules (i.e., BERT encoders).
It should be understood that the test set is input into the trained model (i.e., the updated visual semantic surveillance encoder and the updated video encoder) to enable video text retrieval.
It should be understood that during the testing process, the visual semantic surveillance encoder (i.e. the updated visual semantic surveillance encoder) will be frozen.
It should be understood that the video coding module (i.e. the visual semantic surveillance encoder and the video encoder) is used to perform spatio-temporal information coding on the video picture block sequence (i.e. the target video picture block sequence), so as to obtain global event characteristics, local entity characteristics and action characteristics (i.e. the video characteristics) of the video.
In particular, the video encoder (i.e., the trained video encoder) applies the transformation matrices W_g, W_e, and W_a to obtain the global feature V_g, the local entity feature V_e, and the action feature V_a of the video (i.e., the video features), wherein:
V_g = V * W_g
V_e = V * W_e
V_a = V * W_a
V_g ∈ {v_g1, v_g2, …, v_gk}
V_e ∈ {v_e1, v_e2, …, v_ek}
V_a ∈ {v_a1, v_a2, …, v_ak}
where k refers to the number of frames in a segment of video.
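The three linear projections above can be sketched as follows; the frame count k = 8 and the feature dimension 16 are illustrative assumptions, and the random matrices stand in for learned parameters.

```python
import numpy as np

k, d = 8, 16                          # k frames, feature dimension d (assumed sizes)
rng = np.random.default_rng(0)
V = rng.standard_normal((k, d))       # frame-level video features
W_g, W_e, W_a = (rng.standard_normal((d, d)) for _ in range(3))

V_g = V @ W_g                         # global event features: V_g = V * W_g
V_e = V @ W_e                         # local entity features: V_e = V * W_e
V_a = V @ W_a                         # action features:       V_a = V * W_a
```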
It should be appreciated that, by updating the visual semantic surveillance encoder and calculating the loss functions, the entire model (i.e., the updated visual semantic surveillance encoder and the updated video encoder) is trained on the training set data (i.e., the training set).
In this embodiment, the videos are randomly divided into a training set and a test set, and each video and its natural language text description are preprocessed to obtain the target video picture block sequences. The video encoder is trained by using the visual semantic surveillance encoder and the target video picture block sequences to obtain the trained video encoder and the video text distances; the trained video encoder is then used to obtain the video features, while the text encoder encodes the target video picture block sequences to obtain the text features. Loss function analysis of the video features and the text features yields the loss functions, according to which the parameters of the visual semantic surveillance encoder and the trained video encoder are updated to obtain the updated visual semantic surveillance encoder and the updated video encoder, which are finally used to perform video text retrieval on the test set to obtain the video text retrieval result. This ensures the high efficiency of the encoder, effectively mines the spatio-temporal information of the video data and the context information of the text data, realizes more accurate semantic alignment, effectively improves the video text retrieval effect, and improves the reliability and stability of the model.
Optionally, as an embodiment of the present invention, the step of respectively preprocessing each video in the training set and the corresponding natural language text description to obtain a plurality of target video picture block sequences of each video includes:
using the natural language text description corresponding to each video in the training set as a preset segmentation boundary, correspondingly segmenting each video in the training set to obtain a plurality of to-be-mapped video picture block sequences for each video in the training set;
and mapping the video picture block sequences to be mapped of each video in the training set respectively to obtain a plurality of target video picture block sequences of each video.
It should be appreciated that the video of the training set is segmented and projected as a series of video picture block sequences (i.e., the target video picture block sequence).
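The segmentation of a video into a picture block (patch) sequence can be sketched as below; the frame resolution (224×224), clip length, and 16×16 patch size are assumed values not specified in the text.

```python
import numpy as np

# Hypothetical clip: 8 RGB frames of 224 x 224 (sizes assumed for illustration).
frames = np.zeros((8, 224, 224, 3))
P = 16                                          # assumed patch side length
h, w = frames.shape[1] // P, frames.shape[2] // P

# Split each frame into non-overlapping P x P patches and flatten each patch,
# yielding a per-frame sequence of picture blocks ready for projection.
patches = (frames.reshape(8, h, P, w, P, 3)
                 .swapaxes(2, 3)
                 .reshape(8, h * w, P * P * 3))
```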
In the above embodiment, each video in the training set and the corresponding natural language text description are respectively preprocessed to obtain a plurality of target video picture block sequences of each video, so that a foundation is laid for subsequent data processing, the high efficiency of the encoder is ensured, the spatio-temporal information of the video data and the context information of the text data can be effectively mined, and more accurate semantic alignment is realized.
Optionally, as an embodiment of the present invention, the process of training the video encoder by using the visual semantic surveillance encoder and the plurality of target video picture block sequences of each video in the training set to obtain the trained video encoder and the video text distance of each target video picture block sequence includes:
respectively performing mask processing on each target video picture block sequence of each video in the training set to obtain a masked video picture block sequence of each target video picture block sequence;
respectively carrying out position coding on the masked video picture block sequences of the target video picture block sequences to obtain coded video picture block sequences of the target video picture block sequences;
respectively encoding the encoded video picture block sequence of each target video picture block sequence by using the video encoder to obtain first encoded video characteristics of each target video picture block sequence;
respectively encoding each target video picture block sequence of each video by using the visual semantic surveillance encoder to obtain second encoded video characteristics of each target video picture block sequence;
based on a first equation, calculating a video text distance according to the first coded video feature and the second coded video feature of each target video picture block sequence to obtain the video text distance of each target video picture block sequence, where the first equation is:
L=||V-Q||,
wherein L is a video text distance, V is a first coded video characteristic, and Q is a second coded video characteristic;
and updating parameters of the video encoder according to the video text distances of all the target video picture block sequences to obtain the trained video encoder.
It should be understood that the sequence of video picture blocks (i.e., the sequence of target video picture blocks) is partially masked out in the spatial and temporal dimensions.
It is to be understood that the masked video picture block sequence is position-embedded in the spatial and temporal dimensions, resulting in the input video picture block sequence of the video (i.e., the coded video picture block sequence).
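The masking and spatio-temporal position encoding steps above can be sketched as below; the tensor sizes, the roughly 50% mask ratio, and the additive random position embeddings (stand-ins for learned embeddings) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.standard_normal((4, 9, 16))   # (frames, blocks per frame, dim), assumed sizes
mask = rng.random((4, 9)) < 0.5             # mask roughly half of the picture blocks

masked = patches.copy()
masked[mask] = 0.0                          # masked blocks zeroed out

# Additive temporal and spatial position embeddings (random stand-ins here
# for learned embeddings) give the coded video picture block sequence.
pos_t = rng.standard_normal((4, 1, 16))     # one embedding per frame (time)
pos_s = rng.standard_normal((1, 9, 16))     # one embedding per block position (space)
encoded = masked + pos_t + pos_s
```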
Specifically, the input token sequence (i.e., the coded video picture block sequence) is fed into the video encoder for encoding, and the video encoder automatically learns the features V of the masked video picture blocks (i.e., the first coded video features) from the visible video picture blocks of their spatial and temporal neighbors.
Specifically, the unmasked original video picture block sequence (i.e., the target video picture block sequence) is input into the visual semantic surveillance encoder to obtain the features Q of the masked video picture blocks (i.e., the second coded video features).
It should be appreciated that by minimizing the distance between V and Q as described, visual semantic surveillance of the video encoder is achieved at the video picture block level, thereby obtaining spatiotemporal information of the video pictures.
Specifically, the encoder model is optimized with the goal of minimizing the distance between V and Q, according to the formula:
L = ||V - Q||.
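The distance objective L = ||V - Q|| can be sketched numerically; the small 2×2 feature matrices are arbitrary illustrative values.

```python
import numpy as np

V = np.array([[1.0, 2.0], [3.0, 4.0]])   # first coded video features (illustrative)
Q = np.array([[1.5, 2.0], [3.0, 3.0]])   # second coded video features (illustrative)

L = np.linalg.norm(V - Q)                # Frobenius norm realizes L = ||V - Q||
```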
in the above embodiment, the video encoder is trained by using the visual semantic surveillance encoder and the target video picture block sequence to obtain the trained video encoder and the video text distance, so that the visual semantic surveillance on the video encoder is realized, and the spatio-temporal information of the video picture is acquired.
Optionally, as an embodiment of the present invention, the text encoder includes a multi-layer bidirectional transformer encoder;
the process of respectively encoding each target video picture block sequence by using the text encoder to obtain the text features of each target video picture block sequence includes:
respectively encoding each target video picture block sequence by using a graph reasoning mechanism and the multi-layer bidirectional transformer encoder to obtain the text global feature, the text entity feature, and the text action feature of each target video picture block sequence;
the text features of the target video picture block sequence include the text global feature, the text entity feature, and the text action feature of the target video picture block sequence.
It should be understood that, the context information encoding of the text data (i.e. the target video picture block sequence) by the text encoder results in a global event feature (i.e. the text global feature), a local entity feature (i.e. the text entity feature) and an action feature (i.e. the text action feature) of the text.
Specifically, the text encoder adopts a multi-layer bidirectional transformer encoder to obtain text feature representations carrying context information, and a graph reasoning mechanism is used to obtain the global feature C_g (i.e., the text global feature), the local entity feature C_e (i.e., the text entity feature), and the action feature C_a (i.e., the text action feature) of the text, wherein:
C_e ∈ {c_e1, c_e2, …, c_ek},
C_a ∈ {c_a1, c_a2, …, c_ak}.
In the above embodiment, each target video picture block sequence is encoded by the text encoder to obtain the text features of each target video picture block sequence, so that text feature representations with context information are obtained.
Optionally, as an embodiment of the present invention, the video feature includes a plurality of video sub-features, and the text feature includes a plurality of text sub-features;
the process of analyzing the loss function according to the video features and the text features of each target video picture block sequence to obtain a plurality of loss functions of each target video picture block sequence comprises the following steps:
analyzing video text similarity according to the video sub-features and the text sub-features of the target video picture block sequences respectively to obtain a plurality of video text similarities of the target video picture block sequences;
based on a second formula, calculating a loss function according to the similarity of a plurality of video texts of each target video picture block sequence to obtain a plurality of loss functions of each target video picture block sequence, wherein the second formula is as follows:
Loss(v_a, v_b, c_a, c_b, β) = [β + S(v_a, c_b) - S(v_a, c_a)]_+ + [β + S(v_b, c_a) - S(v_a, c_a)]_+,
wherein Loss(v_a, v_b, c_a, c_b, β) is the loss function of the a-th video sub-feature and the b-th text sub-feature, S(v_a, c_b) is the video text similarity of the a-th video sub-feature and the b-th text sub-feature, S(v_a, c_a) is the video text similarity of the a-th video sub-feature and the a-th text sub-feature, S(v_b, c_a) is the video text similarity of the b-th video sub-feature and the a-th text sub-feature, β is a preset hyper-parameter, a ∈ [1, i], b ∈ [1, j], i is the number of video sub-features, j is the number of text sub-features, and [·]_+ denotes max(·, 0).
Understandably, v_a may be the a-th video sub-feature, c_a may be the a-th text sub-feature, v_b may be the b-th video sub-feature, and c_b may be the b-th text sub-feature.
It should be appreciated that, using a margin-based ranking loss as the training target, the similarity between video-text combinations is maximized for positive samples and minimized for negative samples.
It should be understood that the distance between the randomly sampled negative samples and the positive samples is required to be greater than a fixed margin β, which is a preset hyper-parameter.
It is understood that a is not equal to b for the negative sample pairs, and [·]_+ denotes max(·, 0); that is, the model is updated by maximizing the similarity between positive video text combinations and minimizing the similarity between negative combinations.
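The ranking loss of the second formula can be sketched as follows, using a tiny hand-made similarity table; the margin value and similarity scores are arbitrary illustrative values.

```python
def hinge(x):
    # the [.]_+ operator: max(., 0)
    return max(x, 0.0)

def ranking_loss(S, a, b, beta=0.2):
    # S[i][j]: similarity of video sub-feature i and text sub-feature j;
    # (a, a) is the positive pair, (a, b) and (b, a) are negative pairs.
    return (hinge(beta + S[a][b] - S[a][a])
            + hinge(beta + S[b][a] - S[a][a]))

S = [[0.9, 0.3],
     [0.4, 0.8]]                      # illustrative similarity scores
loss = ranking_loss(S, 0, 1)          # 0.0: both negatives lie beyond the margin
```

With a larger margin (e.g. beta=0.7), the same table yields a positive loss, since the negatives no longer clear the margin.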
In this embodiment, the loss functions are obtained by loss function analysis of the video features and the text features, so that the similarity between video text combinations is maximized for positive samples and minimized for negative samples, which effectively improves the video text retrieval effect, provides a certain generalization capability, and improves the reliability and stability of the retrieval model.
Optionally, as an embodiment of the present invention, the video sub-features include a video global sub-feature, a video entity sub-feature, and a video action sub-feature, and the text sub-features include a text global sub-feature, a text entity sub-feature, and a text action sub-feature;
the process of analyzing the video text similarity according to each video sub-feature and each text sub-feature of each target video picture block sequence respectively to obtain a plurality of video text similarities of each target video picture block sequence includes:
based on a third formula, calculating global matching scores according to the video global sub-features and the text global sub-features of the target video picture block sequences to obtain a plurality of global matching scores of the target video picture block sequences, wherein the third formula is as follows:
S_g(i,j) = cos(v_g,i, c_g,j),
wherein S_g(i,j) is the global matching score of the i-th video global sub-feature and the j-th text global sub-feature, v_g,i is the i-th video global sub-feature, and c_g,j is the j-th text global sub-feature;
based on a fourth formula, calculating entity matching scores according to the video entity sub-features and the text entity sub-features of the target video picture block sequences to obtain a plurality of entity matching scores of the target video picture block sequences, wherein the fourth formula is as follows:
S_e(i,j) = cos(v_e,i, c_e,j),
wherein S_e(i,j) is the entity matching score of the i-th video entity sub-feature and the j-th text entity sub-feature, v_e,i is the i-th video entity sub-feature, and c_e,j is the j-th text entity sub-feature;
based on a fifth expression, calculating motion matching scores according to the video motion sub-features and the text motion sub-features of the target video picture block sequences to obtain a plurality of motion matching scores of the target video picture block sequences, wherein the fifth expression is as follows:
S_a(i,j) = cos(v_a,i, c_a,j),
wherein S_a(i,j) is the action matching score of the i-th video action sub-feature and the j-th text action sub-feature, v_a,i is the i-th video action sub-feature, and c_a,j is the j-th text action sub-feature;
respectively carrying out normalization processing on each global matching score, each entity matching score and each action matching score of each target video picture block sequence, and correspondingly obtaining a global attention weight parameter of each global matching score, an entity attention weight parameter of each entity matching score and an action attention weight parameter of each action matching score;
based on a sixth expression, calculating a target global matching score according to each global matching score of each target video picture block sequence and a global attention weight parameter of each global matching score to obtain a plurality of target global matching scores of each target video picture block sequence, where the sixth expression is:
S_g,i,j = r_g(i,j) S_g(i,j),
wherein S_g,i,j is the target global matching score of the i-th video global sub-feature and the j-th text global sub-feature, S_g(i,j) is the global matching score of the i-th video global sub-feature and the j-th text global sub-feature, and r_g(i,j) is the global attention weight parameter of the i-th video global sub-feature and the j-th text global sub-feature;
based on a seventh expression, calculating a target entity matching score according to each entity matching score of each target video picture block sequence and an entity attention weight parameter of each entity matching score to obtain a plurality of target entity matching scores of each target video picture block sequence, where the seventh expression is:
S_e,i,j = r_e(i,j) S_e(i,j),
wherein S_e,i,j is the target entity matching score of the i-th video entity sub-feature and the j-th text entity sub-feature, S_e(i,j) is the entity matching score of the i-th video entity sub-feature and the j-th text entity sub-feature, and r_e(i,j) is the entity attention weight parameter of the i-th video entity sub-feature and the j-th text entity sub-feature;
based on an eighth expression, calculating a target motion matching score according to each motion matching score of each target video picture block sequence and a motion attention weight parameter of each motion matching score to obtain a plurality of target motion matching scores of each target video picture block sequence, where the eighth expression is:
S_a,i,j = r_a(i,j) S_a(i,j),
wherein S_a,i,j is the target action matching score of the i-th video action sub-feature and the j-th text action sub-feature, S_a(i,j) is the action matching score of the i-th video action sub-feature and the j-th text action sub-feature, and r_a(i,j) is the action attention weight parameter of the i-th video action sub-feature and the j-th text action sub-feature;
based on a ninth expression, performing video text similarity calculation according to each target global matching score, each target entity matching score, and each target action matching score of each target video picture block sequence to obtain a plurality of video text similarities of each target video picture block sequence, where the ninth expression is:
S(v_i, c_j) = (S_g,i,j + S_e,i,j + S_a,i,j)/3,
wherein S(v_i, c_j) is the video text similarity of the i-th video feature and the j-th text feature, S_g,i,j is the target global matching score of the i-th video global sub-feature and the j-th text global sub-feature, S_e,i,j is the target entity matching score of the i-th video entity sub-feature and the j-th text entity sub-feature, and S_a,i,j is the target action matching score of the i-th video action sub-feature and the j-th text action sub-feature.
It should be appreciated that the above-described global and local features of the video and the text are projected into an aligned common space, and the similarity between the video features and the text features (i.e., the video text similarity) is calculated.
Specifically, cosine similarity is used to calculate the similarity between the global event features (i.e., the video global sub-features and the text global sub-features) to obtain the global matching score, according to the following formula:
S_g(i,j) = cos(v_g,i, c_g,j),
the similarity between the local entity features (i.e., the video entity sub-features and the text entity sub-features) is calculated using cosine similarity to obtain their matching score (i.e., the entity matching score), according to the formula:
S_e(i,j) = cos(v_e,i, c_e,j),
and the similarity between the local action features (i.e., the video action sub-features and the text action sub-features) is calculated using cosine similarity to obtain their matching score (i.e., the action matching score), according to the formula:
S_a(i,j) = cos(v_a,i, c_a,j).
Specifically, S_g(i,j), S_e(i,j), and S_a(i,j) are respectively normalized to obtain the attention weight parameters r_g(i,j) (i.e., the global attention weight parameter), r_e(i,j) (i.e., the entity attention weight parameter), and r_a(i,j) (i.e., the action attention weight parameter), so that text-to-video semantic alignment is dynamically achieved on the local features.
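The normalization producing the attention weights is not spelled out in the text; a softmax over the matching scores is one common, assumed choice, sketched here with arbitrary illustrative scores.

```python
import numpy as np

def softmax(x):
    # The normalization used to obtain the attention weights is not specified
    # in the text; softmax over matching scores is an assumed choice.
    e = np.exp(x - np.max(x))        # shift for numerical stability
    return e / e.sum()

S_e_scores = np.array([0.8, 0.1, 0.4])   # illustrative entity matching scores
r_e = softmax(S_e_scores)                # entity attention weight parameters
```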
It should be understood that the matching scores S_g(i,j), S_e(i,j), and S_a(i,j) are weighted by the corresponding attention weight parameters to obtain the final matching scores (i.e., the target global matching score, the target entity matching score, and the target action matching score), according to the formulas:
S_g,i,j = r_g(i,j) S_g(i,j),
S_e,i,j = r_e(i,j) S_e(i,j),
S_a,i,j = r_a(i,j) S_a(i,j).
It should be understood that, in training, the average value S of the matching scores at the three levels (i.e., the target global matching score, the target entity matching score, and the target action matching score) is taken as the final video text similarity (i.e., the video text similarity).
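The three-level similarity combination can be sketched end to end; random 8-dimensional vectors stand in for the real sub-features, the attention weighting is skipped for brevity, and the uniform 1/3 average follows the ninth formula.

```python
import numpy as np

def cos(u, v):
    # cosine similarity between two feature vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
v_g, v_e, v_a = (rng.standard_normal(8) for _ in range(3))   # video sub-features
c_g, c_e, c_a = (rng.standard_normal(8) for _ in range(3))   # text sub-features

S_g, S_e, S_a = cos(v_g, c_g), cos(v_e, c_e), cos(v_a, c_a)
S = (S_g + S_e + S_a) / 3     # final video text similarity (ninth formula)
```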
In the embodiment, the video text similarity is obtained by analyzing the video text similarity according to the video sub-features and the text sub-features, so that the video text retrieval effect can be effectively improved, certain generalization capability is realized, and the reliability and stability of the retrieval model are improved.
Optionally, as an embodiment of the present invention, the process of respectively performing parameter updating on the visual semantic surveillance encoder and the trained video encoder according to the plurality of loss functions of each target video picture block sequence to obtain an updated visual semantic surveillance encoder and an updated video encoder includes:
based on an exponential moving average mechanism, the trained video encoder is utilized to update parameters of the visual semantic surveillance encoder, and the updated visual semantic surveillance encoder is obtained;
and updating parameters of the trained video encoder according to the loss functions of the target video picture block sequences to obtain an updated video encoder.
It should be understood that, during training, the video encoder of the previous epoch (i.e., the trained video encoder) is used as the visual semantic surveillance encoder, thereby realizing the update of the visual semantic surveillance encoder.
Specifically, the visual semantic surveillance encoder is frozen within each epoch, and its parameters are updated at the k-th epoch based on an exponential moving average (EMA) mechanism, denoted as
{θ_q}_k = β{θ_q}_{k-1} + (1 - β){θ_v}_{k-1},
wherein {θ_v}_{k-1} represents the parameters of the video encoder at the end of the (k-1)-th epoch, and {θ_q}_{k-1} represents the parameters of the visual semantic surveillance encoder at the end of the (k-1)-th epoch.
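The EMA update can be sketched as follows; the two-element parameter vectors and the β value are illustrative only.

```python
def ema_update(theta_q, theta_v, beta=0.9):
    # {theta_q}_k = beta * {theta_q}_{k-1} + (1 - beta) * {theta_v}_{k-1}
    return [beta * q + (1 - beta) * v for q, v in zip(theta_q, theta_v)]

theta_q = [1.0, 0.0]                  # surveillance-encoder parameters (illustrative)
theta_v = [0.0, 1.0]                  # video-encoder parameters (illustrative)
theta_q = ema_update(theta_q, theta_v)
```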
In this embodiment, the parameters of the visual semantic surveillance encoder and the trained video encoder are respectively updated according to the loss functions to obtain the updated visual semantic surveillance encoder and the updated video encoder, which ensures the high efficiency of the encoder, effectively mines the spatio-temporal information of the video data and the context information of the text data, realizes more accurate semantic alignment, effectively improves the video text retrieval effect, provides a certain generalization capability, and improves the reliability and stability of the retrieval model.
Alternatively, as another embodiment of the present invention, the invention is implemented by collecting raw data and dividing the data into a test set and a training set; segmenting and projecting the videos of the training set into a series of video picture block sequences; performing spatio-temporal information encoding on the video picture block sequences by using the video coding module to obtain the global event features, local entity features, and action features of the videos; encoding the context information of the text data by using the text encoder to obtain the global event features, local entity features, and action features of the text; projecting the global and local features of the videos and the texts into an aligned common space, and calculating the similarity between the video features and the text features; training the entire model on the training set data by updating the visual semantic surveillance encoder and calculating the loss functions; and inputting the test set into the trained model to realize video text retrieval. The invention ensures the high efficiency of the encoder, can effectively mine the spatio-temporal information of the video data and the context information of the text data, realizes more accurate semantic alignment, can effectively improve the video text retrieval effect, has a certain generalization capability, and improves the reliability and stability of the retrieval model.
Fig. 2 is a block diagram of a video text retrieval apparatus according to an embodiment of the present invention.
Alternatively, as another embodiment of the present invention, as shown in fig. 2, a video text retrieval apparatus includes:
a dividing module, configured to import a plurality of videos and a plurality of natural language text descriptions in one-to-one correspondence with the videos, and to randomly divide all the videos into a training set and a test set;
a preprocessing module, configured to preprocess each video in the training set and the corresponding natural language text description, respectively, to obtain a plurality of target video frame block sequences of each video in the training set;
the training module is used for constructing a video encoder and a visual semantic surveillance encoder, and training the video encoder by using the visual semantic surveillance encoder and a plurality of target video picture block sequences of each video in the training set to obtain a video text distance of the trained video encoder and each target video picture block sequence;
the video coding module is used for coding the video text distance of each target video picture block sequence by utilizing the trained video coder to obtain the video characteristics of each target video picture block sequence;
the text coding module is used for coding each target video picture block sequence by utilizing a text coder to obtain text characteristics of each target video picture block sequence;
the loss function analysis module is used for performing loss function analysis according to the video characteristics and the text characteristics of each target video picture block sequence to obtain a plurality of loss functions of each target video picture block sequence;
the parameter updating module is used for respectively updating parameters of the visual semantic surveillance encoder and the trained video encoder according to a plurality of loss functions of each target video picture block sequence to obtain an updated visual semantic surveillance encoder and an updated video encoder;
and the retrieval result acquisition module is used for performing video text retrieval processing on the test set by using the updated visual semantic surveillance encoder and the updated video encoder to obtain a video text retrieval result.
Alternatively, another embodiment of the present invention provides a video text retrieval system, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is executed by the processor, the video text retrieval method as described above is implemented. The system may be a computer or the like.
Alternatively, another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the video text retrieval method as described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is only a logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.