CN116127132A - Time sequence language positioning method based on cross-modal text related attention - Google Patents

Time sequence language positioning method based on cross-modal text related attention

Info

Publication number
CN116127132A
CN116127132A (application CN202310199160.2A)
Authority
CN
China
Prior art keywords: layer, attention, text, video, self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310199160.2A
Other languages
Chinese (zh)
Inventor
何立火
邓夏迪
黄子涵
唐杰浩
王笛
高新波
路文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310199160.2A priority Critical patent/CN116127132A/en
Publication of CN116127132A publication Critical patent/CN116127132A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a time sequence language positioning method based on cross-modal text related attention, which mainly solves the problem that cross-modal fusion of text and video in the prior art lacks semantic relevance. The scheme is as follows: acquire a training data set and a test data set, and extract the video and text features of the training data set; construct a time sequence language positioning model based on cross-modal text related attention, which fuses the text features and the video features to obtain fusion features and uses the interaction between text semantic information and the attention of the fusion features to realize temporal localization in the video; train the time sequence language positioning model with the video and text features of the training data set; input the test data set into the trained model to obtain the time sequence language positioning result of cross-modal text related attention. The invention can retrieve rich relevant feature information in a variety of complex cross-modal videos, improves retrieval precision, and can be used to retrieve the segment of a video corresponding to a given text.

Description

Time sequence language positioning method based on cross-modal text related attention
Technical Field
The invention belongs to the technical field of multimodal video processing, and particularly relates to a cross-modal time sequence language positioning method that can be used to retrieve the video segment corresponding to a given text.
Background
With the rapid development of internet technology in recent years, video data, as an important component of multimedia data, has grown exponentially. Existing video understanding technology can already achieve a preliminary understanding of video content, but as demand increases, time sequence language positioning is becoming an important and urgent problem in the video understanding field. Given a text query, the time sequence language positioning task finds, in a long video, the segment that best matches the meaning of the text and returns the start time and end time of that segment. The task has broad application prospects and has attracted wide attention in both industry and academia.
However, the time sequence language positioning task also presents certain challenges. First, the large modal gap between the query text and the video greatly increases the difficulty of aligning the text with video segments. Second, overlapping video clips often have similar video features, which strongly interferes with distinguishing how similar different video clips are to the query text. Third, different people have different understandings of when an action occurs, which leads to inaccurate data annotation.
The patent document with application publication number CN 115238130 A discloses a time sequence language positioning method and device based on modality-customized collaborative attention interaction. The method first obtains paired untrimmed video-text query data and constructs a data set for the time sequence language positioning task; it then extracts video representations from the video and combines word-level and sentence-level query representations extracted from the text into a multi-granularity query representation; the video representations and the multi-granularity query representations are fed together into a modality-customized collaborative attention interaction module to obtain semantically aligned video representations after cross-modal video-text fusion; finally, multi-branch tasks comprising dense temporal boundary regression, semantic matching score prediction and intersection-over-union regression produce the corresponding time sequence language positioning results from the fused representations. This method improves the performance of the time sequence language positioning task by using collaborative attention, but the text query representation used in the collaborative attention interaction is semantically limited, so it cannot correspond well to the video content, the fused video-text representation cannot acquire enough cross-modal information, and the video retrieval precision is low.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a time sequence language positioning method based on cross-modal text related attention so as to acquire enough cross-modal information and improve the video retrieval precision.
In order to achieve the above purpose, the technical scheme of the invention comprises the following steps:
(1) Acquire raw, untrimmed video and text query data and divide them at a ratio of 3:1 into a training data set and a test data set;
(2) Extract video features V corresponding to the videos in the training data set through a video encoder, and extract text features S corresponding to the texts in the training data set through word segmentation and a word-embedding encoder;
(3) Construct a time sequence language positioning model based on cross-modal text related attention:
(3a) Construct fully connected layers L_v and L_S and self-attention feature extractors N_v and N_s for the video features V and the text features S respectively; pass the video features V through L_v and N_v in turn to obtain video self-attention features V_t, and pass the text features S through L_S and N_s in turn to obtain text self-attention features S_t;
(3b) Calculate a cross-modal text related attention matrix E from the video self-attention features V_t and the text self-attention features S_t;
(3c) Build a fully connected layer L_E of size 300×384 and a cross-modal fusion self-attention feature extractor N_E; pass the cross-modal text related attention matrix E through L_E and N_E in turn to obtain the cross-modal related attention feature code E_t;
(3d) Input the cross-modal related attention feature code E_t into an existing bidirectional gated recurrent unit network to obtain the temporal localization feature E_tg;
(3e) Establish a start-time localization fully connected layer L_Q and an end-time localization fully connected layer L_J, each of size 768×1;
(3f) Pass the temporal localization feature E_tg through L_Q and L_J respectively to obtain the corresponding start-time localization feature E_Q and end-time localization feature E_J, then select the corresponding start time and end time through normalization, forming the time sequence language positioning model based on cross-modal text related attention;
(4) Set the Kullback-Leibler divergence as the loss function of the time sequence language positioning model constructed in step (3), input the training data set into the constructed model, and update the model parameters with an optimizer until the loss function converges, obtaining a trained time sequence language positioning model based on cross-modal text related attention;
(5) Input the test data set into the trained time sequence language positioning model based on cross-modal text related attention for testing, and output the target segment temporal boundary regression value with the highest confidence as the time sequence language positioning result of the test data set.
Compared with the prior art, the invention has the following advantages:
according to the invention, the text features and the video features are fused to obtain the fusion features, the interaction between the text semantic information and the attention of the fusion features is used to realize the time sequence positioning of the video, and the start frame and the end frame are generated, so that rich relevant feature information can be obtained in various complex cross-mode video retrieval, and the retrieval precision is improved.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
FIG. 2 is a block diagram of a temporal language localization model of cross-modal text-dependent attention of the present invention.
Fig. 3 is a block diagram of a self-attention feature extractor in the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific embodiments.
Referring to fig. 1, this example includes the following implementation steps:
Step 1: acquire original untrimmed video and text query data, and divide them into a training data set and a test data set.
In this embodiment, the video and text inputs come from the Charades-STA dataset, which contains 6672 videos shot in daily life and 16128 video-text annotation pairs.
The video-text annotation pairs are divided into a training data set and a test data set at a ratio of about 3:1, with 12408 pairs in the training set and 3720 pairs in the test set.
Step 2: extract the video features V corresponding to the videos and the text features S corresponding to the texts in the training data set.
(2.1) Input the videos in the training data set into an existing video encoder and extract features at a rate of 4 feature blocks per second, obtaining video features V of dimension 1024 whose length v_l matches the video length.
(2.2) First segment the text query sentences in the training data set into words, then input the resulting words into an existing word-embedding encoder, obtaining text features S of dimension 300 whose length s_l matches the text length.
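For concreteness, the tensor shapes produced by step 2 are illustrated below with random stand-in values; the concrete video encoder and word-embedding encoder are not named in this disclosure, only the output dimensions 1024 and 300 and the rate of 4 feature blocks per second.

```python
import math
import torch

duration_s = 30.0                      # an example video length in seconds (stand-in)
v_l = math.ceil(4 * duration_s)        # 4 feature blocks per second -> v_l = 120 blocks
V = torch.randn(v_l, 1024)             # video features V: one 1024-dimensional vector per block
s_l = 9                                # e.g. a 9-word query after word segmentation
S = torch.randn(s_l, 300)              # text features S: one 300-dimensional embedding per word
```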
Step 3: construct the time sequence language positioning model based on cross-modal text related attention.
Referring to fig. 2, the present step is specifically implemented as follows:
(3.1) Construct the fully connected layers L_v and L_S and the self-attention feature extractors N_v and N_s for the video features V and the text features S respectively.
The self-attention feature extractor N_v of the video features and the self-attention feature extractor N_s of the text features have the same structural parameters: each self-attention feature extractor comprises a positional encoding layer, a dropout layer, a multi-head self-attention encoder P and a normalization layer connected in sequence.
The multi-head self-attention encoder P consists of 6 sequentially connected self-attention encoding layers T. Each self-attention encoding layer T comprises a first normalization layer, a self-attention module F, a first stochastic depth layer, a second normalization layer and a feed-forward layer Y connected in sequence, where the input of the first normalization layer is connected by a residual connection to the output of the first stochastic depth layer, and the output of the second normalization layer is connected by a residual connection to the output of the feed-forward layer, as shown in fig. 3.
The self-attention module F comprises a first linear layer, an attention dropout layer, a second linear layer and a first dropout layer connected in sequence.
The feed-forward layer Y comprises a third linear layer, a GELU activation function, a second dropout layer, a fourth linear layer, a third dropout layer and a second stochastic depth layer connected in sequence.
In this embodiment, the drop rate of the first stochastic depth layer is 0.3; the fully connected layer L_v of the video features V has size 1024×128 and the fully connected layer L_S of the text features S has size 300×128; the first linear layer has size 128×384, the attention dropout layer has drop rate 0.3, the second linear layer has size 128×128, and the first dropout layer has drop rate 0.3; the third linear layer has size 128×128, the second dropout layer has drop rate 0.3, the fourth linear layer has size 128×128, the third dropout layer has drop rate 0.3, and the second stochastic depth layer has drop rate 0.2.
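A minimal PyTorch sketch of one self-attention encoding layer T with the video-branch parameters quoted above is given below. The number of attention heads is not stated and is assumed to be 4, the residual placement follows the usual pre-norm convention, and interpreting the 128×384 first linear layer as a joint query-key-value projection is an assumption; this is a sketch under those assumptions, not the exact disclosed implementation.

```python
import torch
import torch.nn as nn


class DropPath(nn.Module):
    """Stochastic ("random") depth: randomly zeroes the whole residual branch during training."""
    def __init__(self, p=0.0):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        mask = x.new_empty(x.shape[0], 1, 1).bernoulli_(keep) / keep
        return x * mask


class SelfAttentionModuleF(nn.Module):
    """Self-attention module F: qkv projection, attention dropout, output projection, dropout."""
    def __init__(self, dim=128, heads=4, attn_drop=0.3, drop=0.3):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)       # first linear layer (128 x 384 for the video branch)
        self.attn_drop = nn.Dropout(attn_drop)   # attention dropout layer
        self.proj = nn.Linear(dim, dim)          # second linear layer (128 x 128)
        self.drop = nn.Dropout(drop)             # first dropout layer

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, d // self.heads).transpose(1, 2) for t in (q, k, v))
        attn = self.attn_drop(((q @ k.transpose(-2, -1)) / (d // self.heads) ** 0.5).softmax(dim=-1))
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.drop(self.proj(out))


class SelfAttentionEncodingLayerT(nn.Module):
    """One self-attention encoding layer T: pre-norm attention and feed-forward blocks with residuals."""
    def __init__(self, dim=128, attn_drop=0.3, drop=0.3, path_drop=(0.3, 0.2)):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                      # first normalization layer
        self.attn = SelfAttentionModuleF(dim, attn_drop=attn_drop, drop=drop)
        self.path1 = DropPath(path_drop[0])                 # first stochastic depth layer
        self.norm2 = nn.LayerNorm(dim)                      # second normalization layer
        self.ffn = nn.Sequential(                           # feed-forward layer Y
            nn.Linear(dim, dim), nn.GELU(), nn.Dropout(drop),
            nn.Linear(dim, dim), nn.Dropout(drop))
        self.path2 = DropPath(path_drop[1])                 # second stochastic depth layer

    def forward(self, x):
        x = x + self.path1(self.attn(self.norm1(x)))        # residual around the attention branch
        x = x + self.path2(self.ffn(self.norm2(x)))         # residual around the feed-forward branch
        return x


# A full extractor N_v or N_s would stack 6 such layers behind a positional encoding layer and a
# dropout layer, followed by a final normalization layer.
```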
(3.2) Pass the video features V through L_v and N_v in turn to obtain the video self-attention features V_t, and pass the text features S through L_S and N_s in turn to obtain the text self-attention features S_t.
(3.2.1) Pass the video features V through the fully connected layer L_v to compress the feature dimension to 128, then through the video self-attention feature extractor N_v to obtain the video self-attention feature code V_t of size 128×v_l.
(3.2.2) Pass the text features S through the fully connected layer L_S to compress the feature dimension to 128, then through the text self-attention feature extractor N_s to obtain the text self-attention feature code S_t of size 128×s_l.
(3.3) Calculate the cross-modal text related attention matrix E from the video self-attention features V_t and the text self-attention features S_t:
(3.3.1) Compute the transition matrix M from the video self-attention features V_t and the text self-attention features S_t:
M = S_t × V_t^T
(3.3.2) Compute the cross-modal text related attention matrix E from the transition matrix M and the text features S:
E = (S^T × M)^T
where E has size v_l × 300, v_l is the video length, and T denotes matrix transposition.
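The two matrix products of (3.3) can be checked with the following minimal sketch (batch-first tensors with random stand-in values; the lengths are illustrative):

```python
import torch

B, v_l, s_l = 1, 120, 9                       # batch size, video length, text length (stand-ins)
V_t = torch.randn(B, v_l, 128)                # video self-attention features V_t
S_t = torch.randn(B, s_l, 128)                # text self-attention features S_t
S = torch.randn(B, s_l, 300)                  # original text features S

M = S_t @ V_t.transpose(1, 2)                 # transition matrix M = S_t x V_t^T, shape (B, s_l, v_l)
E = (S.transpose(1, 2) @ M).transpose(1, 2)   # E = (S^T x M)^T, shape (B, v_l, 300)
```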
(3.4) Build the fully connected layer L_E and the cross-modal fusion self-attention feature extractor N_E for the cross-modal text related attention matrix E.
The fully connected layer L_E of the cross-modal text related attention matrix E has size 300×384.
The cross-modal fusion self-attention feature extractor N_E comprises a positional encoding layer, a dropout layer, a multi-head self-attention encoder P' and a normalization layer connected in sequence. The multi-head self-attention encoder P' consists of 6 sequentially connected self-attention encoding layers T'. Each self-attention encoding layer T' comprises a first normalization layer, a self-attention module F', a first stochastic depth layer, a second normalization layer and a feed-forward layer Y' connected in sequence, where the input of the first normalization layer is connected by a residual connection to the output of the first stochastic depth layer, and the output of the second normalization layer is connected by a residual connection to the output of the feed-forward layer, as shown in fig. 3.
The self-attention module F' comprises a first linear layer, an attention dropout layer, a second linear layer and a first dropout layer connected in sequence.
The feed-forward layer Y' comprises a third linear layer, a GELU activation function, a second dropout layer, a fourth linear layer, a third dropout layer and a second stochastic depth layer connected in sequence.
In this embodiment, the drop rate of the first stochastic depth layer is 0.1; the first linear layer has size 384×1152, the attention dropout layer has drop rate 0.2, the second linear layer has size 384×384, and the first dropout layer has drop rate 0.2; the third linear layer has size 384×384, the second dropout layer has drop rate 0.2, the fourth linear layer has size 1152×384, the third dropout layer has drop rate 0.2, and the second stochastic depth layer has drop rate 0.1.
(3.5) Pass the cross-modal text related attention matrix E through the fully connected layer L_E and the cross-modal fusion self-attention feature extractor N_E in turn to obtain the cross-modal related attention feature code E_t.
The cross-modal text related attention matrix E is first passed through the fully connected layer L_E, compressing the feature dimension to 384, and then through the cross-modal fusion self-attention feature extractor N_E to obtain the cross-modal related attention feature code E_t of size 384×v_l.
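A hedged stand-in for L_E and N_E using PyTorch's built-in transformer encoder is sketched below; the stated fusion-branch sizes (384 model dimension, 1152 feed-forward dimension, GELU, drop rate 0.2, 6 layers) are used, while the positional encoding layer and the stochastic depth layers are omitted and the number of heads is an assumption.

```python
import torch
import torch.nn as nn

fc_E = nn.Linear(300, 384)                        # fully connected layer L_E (300 x 384)
n_E = nn.TransformerEncoder(                      # simplified stand-in for N_E
    nn.TransformerEncoderLayer(d_model=384, nhead=8, dim_feedforward=1152,
                               dropout=0.2, activation='gelu',
                               batch_first=True, norm_first=True),
    num_layers=6)

E = torch.randn(1, 120, 300)                      # cross-modal text related attention matrix from (3.3)
E_t = n_E(fc_E(E))                                # cross-modal related attention feature code, (1, 120, 384)
```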
(3.6) Input the cross-modal related attention feature code E_t into an existing bidirectional gated recurrent unit network to obtain the temporal localization feature E_tg.
(3.6.1) Input the cross-modal related attention feature code E_t into the existing bidirectional gated recurrent unit network. The current gated recurrent unit G_t obtains the reset gate r and the update gate z from the previous gated recurrent unit G_{t-1} and the input x_t of the current node:
r = σ(W_r · [h_{t-1}, x_t])
z = σ(W_z · [h_{t-1}, x_t])
where W_r is the reset gate weight, W_z is the update gate weight, and σ is the activation function.
(3.6.2) Obtain the reset state h_{t-1}' from the reset gate r and the hidden state h_{t-1} of the previous moment:
h_{t-1}' = h_{t-1} ⊙ r
(3.6.3) Concatenate the reset state h_{t-1}' with the current input x_t and scale the data into the range [-1, 1] with the tanh activation function to obtain the intermediate quantity h':
h' = tanh(W · [h_{t-1}', x_t])
(3.6.4) Obtain the output h_t of the current gated recurrent unit from the update gate z and the intermediate quantity h':
h_t = (1 - z) ⊙ h_{t-1} + z ⊙ h'
(3.6.5) Combine the outputs h_t linearly to obtain the temporal localization feature E_tg. In this embodiment, the temporal localization feature E_tg has size 768×v_l.
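These update equations are what a standard GRU cell computes, so step (3.6) can in practice be realized with PyTorch's built-in bidirectional GRU; a hidden size of 384 per direction is assumed here so that the concatenated output is 768-dimensional, matching the stated size of E_tg.

```python
import torch
import torch.nn as nn

E_t = torch.randn(1, 120, 384)        # cross-modal related attention feature code (B, v_l, 384), stand-in

bigru = nn.GRU(input_size=384, hidden_size=384, num_layers=1,
               batch_first=True, bidirectional=True)
E_tg, _ = bigru(E_t)                  # temporal localization feature (B, v_l, 768)
print(E_tg.shape)                     # torch.Size([1, 120, 768])
```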
(3.7) Establish the start-time localization fully connected layer L_Q and the end-time localization fully connected layer L_J. In this embodiment, both L_Q and L_J have size 768×1.
(3.8) Pass the temporal localization feature E_tg through the start-time localization fully connected layer L_Q and the end-time localization fully connected layer L_J to obtain the corresponding start-time localization feature E_Q and end-time localization feature E_J.
Passing the temporal localization feature E_tg through L_Q compresses the feature dimension and yields the start-time localization feature E_Q of size v_l; passing E_tg through L_J compresses the feature dimension and yields the end-time localization feature E_J of size v_l.
(3.9) The start-time localization feature E_Q and the end-time localization feature E_J are normalized, and the corresponding start time and end time are selected, forming the time sequence language positioning model based on cross-modal text related attention.
In this embodiment, the normalization function is the Softmax function:
Softmax(g_i) = exp(g_i) / Σ_{j=1}^{n} exp(g_j)
where i denotes one of the n classes and g_i is the value of that class. The Softmax function converts the multi-class output values into a probability distribution over the range [0, 1] whose components sum to 1.
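A minimal sketch of steps (3.7)-(3.9) is given below, producing the start and end probability distributions and selecting the corresponding feature-block indices; the conversion of indices to seconds is an assumption noted in the comments.

```python
import torch
import torch.nn as nn

E_tg = torch.randn(1, 120, 768)                   # temporal localization feature (B, v_l, 768), stand-in

L_Q = nn.Linear(768, 1)                           # start-time localization fully connected layer
L_J = nn.Linear(768, 1)                           # end-time localization fully connected layer

p_start = L_Q(E_tg).squeeze(-1).softmax(dim=-1)   # start probability distribution over the v_l blocks
p_end = L_J(E_tg).squeeze(-1).softmax(dim=-1)     # end probability distribution over the v_l blocks

start_idx = p_start.argmax(dim=-1)                # block index with the highest start probability
end_idx = p_end.argmax(dim=-1)                    # block index with the highest end probability
# Converting indices to seconds by dividing by the 4 feature blocks per second of step 2 is an
# assumption; the disclosure only states that the start and end times are selected by normalization.
start_time, end_time = start_idx / 4.0, end_idx / 4.0
```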
Step 4: train the constructed time sequence language positioning model based on cross-modal text related attention.
(4.1) Set the Kullback-Leibler divergence as the loss function of the time sequence language positioning model. The loss is defined by three equations (rendered as images in the original filing): the final positioning loss L is computed with the Kullback-Leibler divergence function D_KL between the actual starting probability distribution and the predicted starting probability distribution P_start, and between the actual ending probability distribution and the predicted ending probability distribution P_end.
(4.2) Input the training data set into the constructed time sequence language positioning model and update the model parameters with an optimizer until the loss function converges, obtaining the trained time sequence language positioning model based on cross-modal text related attention.
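A common PyTorch realization of such a Kullback-Leibler divergence loss is sketched below; because the loss equations are rendered as images in the original filing, the summation of the start-side and end-side terms is an assumption.

```python
import torch
import torch.nn.functional as F

def kl_localization_loss(p_start, p_end, gt_start, gt_end, eps=1e-12):
    """KL-divergence localization loss sketch: D_KL(ground truth || prediction) for start and end.

    p_start, p_end : predicted (B, v_l) probability distributions (already Softmax-normalized)
    gt_start, gt_end : actual (ground-truth) (B, v_l) probability distributions
    Summing the two terms is an assumption; the exact combination in the filing is not recoverable.
    """
    loss_start = F.kl_div((p_start + eps).log(), gt_start, reduction='batchmean')
    loss_end = F.kl_div((p_end + eps).log(), gt_end, reduction='batchmean')
    return loss_start + loss_end
```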
Step 5: obtain the time sequence language positioning result of cross-modal text related attention.
Input the test data set into the trained time sequence language positioning model based on cross-modal text related attention, and output the target segment temporal boundary regression value with the highest confidence as the time sequence language positioning result of the test data set.
The technical effects of the invention can be further illustrated by the following simulation experiments:
1. Simulation conditions
The hardware platform of the simulation experiment is: an Intel(R) Core(TM) i7-7800X CPU with a base frequency of 3.50 GHz, 48 GB of memory, and two NVIDIA GeForce TITAN XP graphics cards.
The software platform of the simulation experiment is: Ubuntu 16.04, PyTorch 1.12.0, Python 3.9.
The video and text inputs used in the simulation experiment come from the Charades-STA dataset, which contains 6672 videos shot in daily life, most of them showing indoor activities; the average video duration is 29.76 seconds, each video has about 2.4 annotated target segments, and the average target segment duration is 8.2 seconds. The dataset also contains 16128 video-text annotation pairs, divided into a training portion of 12408 pairs and a test portion of 3720 pairs.
2. Simulation experiment contents
The invention and four existing time sequence language positioning methods (DRN, LGI, BPNet and ACRM) are used to perform time sequence language positioning on the Charades-STA data set. Under the Recall@1 evaluation metric, the percentage of videos whose temporal intersection-over-union exceeds 0.5 and 0.7 is counted and compared for each method to evaluate its time sequence language positioning ability. The results are shown in Table 1:
table 1 comparative table of evaluation results of the present invention and the prior art
Figure BDA0004108434350000081
The DRN method is the dense regression network proposed by R. Zeng et al. in "Dense Regression Network for Video Grounding", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR);
the LGI method is a Local-global video-text interaction method proposed by MUN J et al in Local-global video-text interactions for temporal grounding, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition;
the BPnet method suggests a network for the boundaries proposed by shaoming Xiao et al in "Boundary Proposal Network for Two-Stage Natural Language Video Localization, in Proceedings of the AAAI Conference on Artificial Intelligence";
the ACRM method is a Frame-by-Frame Cross-modal matching method proposed by H.Tang et al in Frame-Wise Cross-Modal Matching for Video Moment Retrieval, in IEEE Transactions on Multimedia;
as can be seen from table 1, when the result with the highest confidence is selected under the recall@1 evaluation index, the video quantity of the invention with the cross ratio greater than 0.5 and 0.7 respectively accounts for higher percentages than those of the prior art DRN, LGI, BPNet and ACRM, which shows that the invention has better time sequence language positioning capability on the Charades-STA data set.

Claims (12)

1. A time sequence language positioning method based on cross-modal text related attention, characterized by comprising the following steps:
(1) Acquiring raw, untrimmed video and text query data and dividing them at a ratio of 3:1 into a training data set and a test data set;
(2) Extracting video features V corresponding to the videos in the training data set through a video encoder, and extracting text features S corresponding to the texts in the training data set through word segmentation and a word-embedding encoder;
(3) Constructing a time sequence language positioning model based on cross-modal text related attention:
(3a) Constructing fully connected layers L_v and L_S and self-attention feature extractors N_v and N_s for the video features V and the text features S respectively; passing the video features V through L_v and N_v in turn to obtain video self-attention features V_t, and passing the text features S through L_S and N_s in turn to obtain text self-attention features S_t;
(3b) Calculating a cross-modal text related attention matrix E from the video self-attention features V_t and the text self-attention features S_t;
(3c) Building a fully connected layer L_E of size 300×384 and a cross-modal fusion self-attention feature extractor N_E, and passing the cross-modal text related attention matrix E through L_E and N_E in turn to obtain the cross-modal related attention feature code E_t;
(3d) Inputting the cross-modal related attention feature code E_t into an existing bidirectional gated recurrent unit network to obtain the temporal localization feature E_tg;
(3e) Establishing a start-time localization fully connected layer L_Q and an end-time localization fully connected layer L_J, each of size 768×1;
(3f) Passing the temporal localization feature E_tg through L_Q and L_J respectively to obtain the corresponding start-time localization feature E_Q and end-time localization feature E_J, then selecting the corresponding start time and end time through normalization, forming the time sequence language positioning model based on cross-modal text related attention;
(4) Setting the Kullback-Leibler divergence as the loss function of the time sequence language positioning model constructed in step (3), inputting the training data set into the constructed model, and updating the model parameters with an optimizer until the loss function converges, obtaining a trained time sequence language positioning model based on cross-modal text related attention;
(5) Inputting the test data set into the trained time sequence language positioning model based on cross-modal text related attention for testing, and outputting the target segment temporal boundary regression value with the highest confidence as the time sequence language positioning result of the test data set.
2. The method according to claim 1, wherein extracting the video features V corresponding to the videos in the training data set through a video encoder in step (2) comprises inputting the untrimmed videos into the video encoder and extracting features at a rate of 4 feature blocks per second, obtaining video features V of dimension 1024 whose length v_l matches the video length.
3. The method according to claim 1, wherein extracting the text features S corresponding to the texts in the training data set through word segmentation and a word-embedding encoder comprises first segmenting the text query sentence into words and then inputting the resulting words into the word-embedding encoder, obtaining text features S of dimension 300 whose length s_l matches the text length.
4. The method according to claim 1, wherein the fully connected layer L_v of the video features and the fully connected layer L_S of the text features in step (3a) have the following parameters:
the fully connected layer L_v of the video features V has size 1024×128;
the fully connected layer L_S of the text features S has size 300×128.
5. The method according to claim 1, wherein the self-attention feature extractor N_v of the video features and the self-attention feature extractor N_s of the text features in step (3a) have the same structural parameters: each self-attention feature extractor comprises a positional encoding layer, a dropout layer, a multi-head self-attention encoder P and a normalization layer connected in sequence;
the multi-head self-attention encoder P consists of 6 sequentially connected self-attention encoding layers T, each self-attention encoding layer T comprising a first normalization layer, a self-attention module F, a first stochastic depth layer, a second normalization layer and a feed-forward layer Y connected in sequence, wherein the input of the first normalization layer is connected by a residual connection to the output of the first stochastic depth layer, and the output of the second normalization layer is connected by a residual connection to the output of the feed-forward layer;
the drop rate of the first stochastic depth layer is 0.3.
6. The method according to claim 5, wherein:
the self-attention module F comprises a first linear layer, an attention dropout layer, a second linear layer and a first dropout layer connected in sequence; the first linear layer has size 128×384, the attention dropout layer has drop rate 0.3, the second linear layer has size 128×128, and the first dropout layer has drop rate 0.3;
the feed-forward layer Y comprises a third linear layer, a GELU activation function, a second dropout layer, a fourth linear layer, a third dropout layer and a second stochastic depth layer connected in sequence; the third linear layer has size 128×128, the second dropout layer has drop rate 0.3, the fourth linear layer has size 128×128, the third dropout layer has drop rate 0.3, and the second stochastic depth layer has drop rate 0.2.
7. The method according to claim 1, wherein the video self-attention features V_t and the text self-attention features S_t obtained in step (3a) have the following parameters:
the video self-attention features V_t have size 128×v_l;
the text self-attention features S_t have size 128×s_l.
8. The method according to claim 1, wherein the cross-modal text related attention matrix E in step (3b) is calculated as follows:
3b1) Obtaining the transition matrix M from the video self-attention features V_t and the text self-attention features S_t:
M = S_t × V_t^T
3b2) Obtaining the cross-modal text related attention matrix E from the transition matrix M and the text features S:
E = (S^T × M)^T
where E has size v_l × 300, v_l is the video length, and T denotes matrix transposition.
9. The method according to claim 1, wherein the cross-modal fusion self-attention feature extractor N_E in step (3c) comprises a positional encoding layer, a dropout layer, a multi-head self-attention encoder P' and a normalization layer connected in sequence; the multi-head self-attention encoder P' consists of 6 sequentially connected self-attention encoding layers T', each self-attention encoding layer T' comprising a first normalization layer, a self-attention module F', a first stochastic depth layer, a second normalization layer and a feed-forward layer Y' connected in sequence, wherein the input of the first normalization layer is connected by a residual connection to the output of the first stochastic depth layer, and the output of the second normalization layer is connected by a residual connection to the output of the feed-forward layer; the drop rate of the first stochastic depth layer is 0.1.
10. The method according to claim 9, wherein:
the self-attention module F' comprises a first linear layer, an attention dropout layer, a second linear layer and a first dropout layer connected in sequence; the first linear layer has size 384×1152, the attention dropout layer has drop rate 0.2, the second linear layer has size 384×384, and the first dropout layer has drop rate 0.2;
the feed-forward layer Y' comprises a third linear layer, a GELU activation function, a second dropout layer, a fourth linear layer, a third dropout layer and a second stochastic depth layer connected in sequence; the third linear layer has size 384×384, the second dropout layer has drop rate 0.2, the fourth linear layer has size 1152×384, the third dropout layer has drop rate 0.2, and the second stochastic depth layer has drop rate 0.1.
11. The method according to claim 1, wherein inputting the cross-modal related attention feature code E_t into an existing bidirectional gated recurrent unit network in step (3d) to obtain the temporal localization feature E_tg is implemented as follows:
3d1) Inputting the cross-modal related attention feature code E_t into the existing bidirectional gated recurrent unit network, where the current gated recurrent unit G_t obtains the reset gate r and the update gate z from the previous gated recurrent unit G_{t-1} and the input x_t of the current node:
r = σ(W_r · [h_{t-1}, x_t])
z = σ(W_z · [h_{t-1}, x_t])
where W_r is the reset gate weight, W_z is the update gate weight, and σ is the activation function;
3d2) Obtaining the reset state h_{t-1}' from the reset gate r and the hidden state h_{t-1} of the previous moment:
h_{t-1}' = h_{t-1} ⊙ r
3d3) Concatenating the reset state h_{t-1}' with the current input x_t and scaling the data into the range [-1, 1] with the tanh activation function to obtain the intermediate quantity h':
h' = tanh(W · [h_{t-1}', x_t])
3d4) Obtaining the output h_t of the current gated recurrent unit from the update gate z and the intermediate quantity h':
h_t = (1 - z) ⊙ h_{t-1} + z ⊙ h'
3d5) Combining the outputs h_t linearly to obtain the temporal localization feature E_tg.
12. The method according to claim 1, wherein the Kullback-Leibler divergence set in step (4) defines the final positioning loss L (the corresponding equations are rendered as images in the original filing) with the Kullback-Leibler divergence function D_KL between the actual starting probability distribution and the predicted starting probability distribution P_start, and between the actual ending probability distribution and the predicted ending probability distribution P_end.
CN202310199160.2A 2023-03-03 2023-03-03 Time sequence language positioning method based on cross-modal text related attention Pending CN116127132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310199160.2A CN116127132A (en) 2023-03-03 2023-03-03 Time sequence language positioning method based on cross-modal text related attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310199160.2A CN116127132A (en) 2023-03-03 2023-03-03 Time sequence language positioning method based on cross-modal text related attention

Publications (1)

Publication Number Publication Date
CN116127132A 2023-05-16

Family

ID=86302898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310199160.2A Pending CN116127132A (en) 2023-03-03 2023-03-03 Time sequence language positioning method based on cross-modal text related attention

Country Status (1)

Country Link
CN (1) CN116127132A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117171712A (en) * 2023-11-03 2023-12-05 中关村科学城城市大脑股份有限公司 Auxiliary information generation method, auxiliary information generation device, electronic equipment and computer readable medium
CN117171712B (en) * 2023-11-03 2024-02-02 中关村科学城城市大脑股份有限公司 Auxiliary information generation method, auxiliary information generation device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN110348016A (en) Text snippet generation method based on sentence association attention mechanism
CN111666764B (en) Automatic abstracting method and device based on XLNet
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN116702091B (en) Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
EP4390725A1 (en) Video retrieval method and apparatus, device, and storage medium
Zhong et al. Deep semantic and attentive network for unsupervised video summarization
CN114003682A (en) Text classification method, device, equipment and storage medium
CN116127132A (en) Time sequence language positioning method based on cross-modal text related attention
CN113806554A (en) Knowledge graph construction method for massive conference texts
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
CN116186328A (en) Video text cross-modal retrieval method based on pre-clustering guidance
CN111325033A (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
EP3703061A1 (en) Image retrieval
CN116680420B (en) Low-resource cross-language text retrieval method and device based on knowledge representation enhancement
CN116186562B (en) Encoder-based long text matching method
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems
CN112926340B (en) Semantic matching model for knowledge point positioning
CN113157914B (en) Document abstract extraction method and system based on multilayer recurrent neural network
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN114266249A (en) Mass text clustering method based on birch clustering
Yang et al. Multimodal short video rumor detection system based on contrastive learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination