CN112860945B - Method for multi-modal video question answering using frame-subtitle self-supervision
- Publication number: CN112860945B
- Application number: CN202110017595.1A
- Authority: CN (China)
- Prior art keywords: caption, question, attention, answer, features
- Legal status: Active
Classifications
- G06F16/7844 — Retrieval of video data characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
- G06F16/783 — Retrieval of video data characterised by using metadata automatically derived from the content
- G06F18/2135 — Feature extraction, e.g. by transforming the feature space, based on approximation criteria, e.g. principal component analysis
- G06F18/253 — Fusion techniques of extracted features
- G06N3/08 — Neural networks; learning methods
Abstract
The invention belongs to the field of video question answering, and particularly relates to a method for multi-modal video question answering using frame-subtitle self-supervision. The method comprises the following steps: extracting video frame features, question-answer features, caption features and caption suggestion features; computing attended frame features and attended caption features and combining them into fused features; calculating a temporal attention score from the fused features; calculating the time boundary of the question using the temporal attention score; calculating the question answer using the fused features and the temporal attention score; training a neural network with the time boundary of the question and the question answer; and optimizing the network parameters of the neural network, then using the optimal neural network to answer video questions and localize time boundaries. Instead of using expensive temporal annotations, the invention generates question-related time boundaries from a purpose-designed temporal attention score. In addition, the invention obtains more accurate answers by mining the relation between subtitles and the corresponding video content.
Description
Technical Field
The invention belongs to the field of video question answering, and particularly relates to a method for multi-modal video question answering using frame-subtitle self-supervision.
Background
The multi-modal video question-answering task is challenging and currently attracts wide attention. The task spans computer vision and natural language processing, and requires a system that can answer a question about a specific video and localize the time span in the video that corresponds to the question. Video question answering is still a young task, and research on it is not yet mature.
Existing multi-modal video question-answering methods generally encode the video with a convolutional neural network, encode the question-answer pairs and the in-video subtitles with a recurrent neural network, and fuse the question-answer encoding with the subtitle encoding and the video encoding, respectively, to obtain fused features. A decoder is then designed and trained with answer labels and timestamp labels to produce question answers and time boundaries.
Such a scheme requires timestamp annotations to train the decoder effectively, but timestamp labeling depends on annotator judgment and is expensive. In addition, it treats video frames and subtitles separately, ignoring the correspondence between frames and subtitles.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for multi-modal video question answering using frame-subtitle self-supervision.
In order to solve this technical problem, the invention adopts the following technical scheme: a method for multi-modal video question answering using frame-subtitle self-supervision, comprising the following steps:
S1: extracting video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text;
S2: fusing the video frame features and the question-answer features through an attention mechanism to obtain attended frame features; fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain attended caption features; stacking all attended frame features and attended caption features to obtain fused features; calculating a temporal attention score from the fused features;
S3: calculating the time boundary of the question using the temporal attention score;
S4: calculating the question answer using the fused features and the temporal attention score;
S5: training a neural network with the time boundary of the question and the question answer;
S6: optimizing the network parameters of the neural network to obtain an optimal neural network, and using the optimal neural network to answer video questions and localize time boundaries.
Preferably, extracting the video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text comprises:
for the input video, frames are sampled at a set frequency; for each frame, 20 candidate objects are segmented with a Faster R-CNN pre-trained model to obtain feature expressions of the candidate objects, which are reduced to 300 dimensions by principal component analysis and then projected into a 128-dimensional space by one fully-connected layer, giving video frame features v ∈ R^(T×N_o×128), where T is the number of video frames and N_o is the number of object regions per frame;
for a video question-answer input comprising a question q and 5 candidate answers a_1, ..., a_5, 5 question-answer pairs h_k = [q, a_k] are formed; the question-answer pairs and the subtitles are embedded into 768-dimensional vectors with a BERT word-embedding model and projected into a 128-dimensional space by one fully-connected layer, giving question-answer features h_k ∈ R^(L_h×128) and caption features s ∈ R^(T×L_s×128), where L_h is the number of words in each question-answer pair, L_s is the number of caption words and T is the total number of frames;
the features of two consecutive caption segments are spliced to serve as a caption suggestion; the video frames covered by a suggestion are computed from the caption timestamps, the labels of those frames are set to 1 and all other labels to 0, and this 0/1 string serves as the caption suggestion label; the caption suggestions are embedded into 768-dimensional vectors with the BERT word-embedding model and projected into a 128-dimensional space by one fully-connected layer, giving caption suggestion features s^p ∈ R^(T_sp×L_sp×128), where T_sp is the total number of caption suggestions and L_sp is the number of words in each suggestion.
Preferably, fusing the video frame features and the question-answer features through an attention mechanism to obtain the attended frame features comprises:
for a given encoded video frame feature v ∈ R^(N_o×128) and a single question-answer feature h ∈ R^(L_h×128), where N_o is the number of object regions and L_h is the number of words, the similarity matrix over all word-region pairs is computed with a word-level attention mechanism:
Sim_v = h v^T, Sim_v ∈ R^(L_h×N_o),
where Sim_v is the similarity matrix of the frame feature; Sim_v and its transpose Sim_v^T are multiplied with v and h respectively, and max pooling is used to reduce the dimension:
v_att = max(Sim_v v), v_att ∈ R^128,
h_att = max(Sim_v^T h), h_att ∈ R^128,
and the results are fused by the following formula to obtain the attended frame feature v_f ∈ R^128:
v_f = ([v_att; h_att; v_att ⊙ h_att; v_att + h_att]) W_1 + b_1,
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
Preferably, fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain the attended caption features comprises:
for a single caption suggestion feature s^p ∈ R^(L_sp×128) and a single question-answer feature h ∈ R^(L_h×128), the similarity matrix over all word pairs is computed with a word-level attention mechanism:
Sim_s = h (s^p)^T, Sim_s ∈ R^(L_h×L_sp),
where Sim_s is the similarity matrix of the caption feature; Sim_s and its transpose Sim_s^T are multiplied with s^p and h respectively, and max pooling over words is used to reduce the dimension:
s_att = max(Sim_s s^p), s_att ∈ R^128,
h_att = max(Sim_s^T h), h_att ∈ R^128,
and the results are fused by the following formula to obtain the attended caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1,
where W_1 and b_1 are the parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
Preferably, stacking all attended frame features and attended caption features to obtain the fused features comprises:
all attended frame features and attended caption features are stacked to obtain V_f ∈ R^(T×128) and S_f ∈ R^(T×128);
V_f and S_f are multiplied to obtain a similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^(T×T);
the similarity matrix is multiplied with the attended frame features V_f and the attended caption features S_f, respectively:
V_fatt = Sim_f V_f, V_fatt ∈ R^(T×128),
S_fatt = Sim_f S_f, S_fatt ∈ R^(T×128),
and the results are fused to obtain the fused features F ∈ R^(T×128):
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2,
where W_2 and b_2 denote the weight matrix and bias to be trained.
Preferably, calculating the temporal attention score from the fused features comprises:
for the fused features F ∈ R^(T×128), the temporal attention score A_k ∈ R^T is computed through a fully-connected layer and a sigmoid function; it reflects the degree of correlation between each video frame and the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F from R^(T×128) to R^T, b is a bias term, and sigmoid denotes the sigmoid function.
Preferably, calculating the time boundary of the question using the temporal attention score comprises:
given the temporal attention score A_k ∈ R^T, a threshold A_t is set, and spans of frames whose scores are greater than A_t are taken as time-boundary candidates;
for each time-boundary candidate, a refinement score A_P is computed over the span from its start time st to its end time ed, and the candidate with the highest refinement score is selected as the time-boundary scheme {st, ed} for the question.
Preferably, calculating the question answer using the fused features and the temporal attention score comprises:
given the start and end times st and ed, the temporal attention score A_k is applied to the fused features; max pooling over the frames within [st, ed] yields a local question-answer representation, while max pooling over all time steps yields a global representation; the local and global representations are concatenated and passed through a softmax layer to obtain answer probabilities, and the candidate answer with the highest probability is taken as the question answer.
Preferably, optimizing the network parameters of the neural network comprises:
obtaining the caption-feature temporal attention A and the caption suggestion labels based on the question-answer features, the caption features and the caption suggestion features;
calculating a binary cross-entropy loss and a ranking loss from the caption-feature temporal attention A and the caption suggestion labels, where the ranking loss is:
loss_rank = 1 + avg(A_out) − avg(A_in)
where A_in is the caption-feature temporal attention of frames whose caption suggestion label is 1 and A_out is that of frames whose caption suggestion label is 0;
the binary cross-entropy loss is:
loss_bce = −(1/T_in) Σ_{t: y_t = 1} log A_t − (1/T_out) Σ_{t: y_t = 0} log(1 − A_t)
where y_t is the caption suggestion label of frame t, T_in is the number of frames with label 1 and T_out is the number of frames with label 0;
the total self-supervised loss function is established as:
loss_self = loss_rank + loss_bce;
after the total self-supervised loss is computed from this function, the gradients are obtained by the chain rule (backpropagation), the parameters are updated by a gradient-descent step, and forward propagation is run again to compute a new total self-supervised loss; this is repeated until the loss converges.
The technical scheme adopted by the invention has the following beneficial effects: instead of using expensive temporal annotations, the invention generates question-related time boundaries from a purpose-designed temporal attention score. In addition, the invention obtains more accurate answers by mining the relation between subtitles and the corresponding video content.
The following detailed description of the present invention will be provided in conjunction with the accompanying drawings.
Drawings
The invention is further described with reference to the accompanying drawings and the detailed description below:
fig. 1 is a flow chart of the method for multi-modal video question answering using frame-subtitle self-supervision according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method for multi-modal video question answering using frame-subtitle self-supervision; referring to fig. 1, the method comprises the following steps:
S1: extracting video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text.
For the input video, frames are sampled at a set frequency. For each frame, 20 candidate objects are segmented with a Faster R-CNN pre-trained model to obtain feature expressions of the candidate objects; these are reduced to 300 dimensions by principal component analysis and then projected into a 128-dimensional space by one fully-connected layer, giving video frame features v ∈ R^(T×N_o×128), where T is the number of video frames and N_o is the number of object regions per frame.
For a video question-answer input comprising a question q and 5 candidate answers a_1, ..., a_5, 5 question-answer pairs h_k = [q, a_k] are formed. The question-answer pairs and the subtitles are embedded into 768-dimensional vectors with a BERT word-embedding model and projected into a 128-dimensional space by one fully-connected layer, giving question-answer features h_k ∈ R^(L_h×128) and caption features s ∈ R^(T×L_s×128), where L_h is the number of words in each question-answer pair, L_s is the number of caption words and T is the total number of frames.
The features of two consecutive caption segments are spliced to serve as a caption suggestion. The video frames covered by a suggestion are computed from the caption timestamps; the labels of those frames are set to 1 and all others to 0, and this 0/1 string serves as the caption suggestion label. The caption suggestions are embedded into 768-dimensional vectors with the BERT word-embedding model and projected into a 128-dimensional space by one fully-connected layer, giving caption suggestion features s^p ∈ R^(T_sp×L_sp×128), where T_sp is the total number of caption suggestions and L_sp is the number of words in each suggestion.
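As a concrete illustration of S1, the following sketch shows how the projection pipeline could look in PyTorch. The 2048-dimensional Faster R-CNN input size, the random stand-in for the offline-fitted PCA basis, and all names are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class FrameFeatureProjector(nn.Module):
    """Per-object Faster R-CNN features -> PCA (300 dims) -> FC (128 dims), as in S1.
    The PCA basis is assumed to be fitted offline; a random matrix stands in here."""
    def __init__(self, in_dim=2048, pca_dim=300, out_dim=128, pca_basis=None):
        super().__init__()
        if pca_basis is None:  # hypothetical placeholder for the fitted PCA projection
            pca_basis = torch.randn(in_dim, pca_dim)
        self.register_buffer("pca_basis", pca_basis)
        self.fc = nn.Linear(pca_dim, out_dim)

    def forward(self, obj_feats):                      # obj_feats: (T, N_o, in_dim)
        return self.fc(obj_feats @ self.pca_basis)     # -> (T, N_o, 128)

# Question-answer pairs and captions: 768-dim BERT embeddings projected to 128 dims.
text_proj = nn.Linear(768, 128)

T, N_o, L_h = 60, 20, 24                # frames, object regions, QA-pair words
v = FrameFeatureProjector()(torch.randn(T, N_o, 2048))  # video frame features
h = text_proj(torch.randn(5, L_h, 768))                 # 5 candidate QA pairs
```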
S2: fusing the video frame features and the question-answer features through an attention mechanism to obtain attended frame features; fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain attended caption features; stacking all attended frame features and attended caption features to obtain fused features; and calculating a temporal attention score from the fused features.
For a given encoded video frame feature v ∈ R^(N_o×128) and a single question-answer feature h ∈ R^(L_h×128), where N_o is the number of object regions and L_h is the number of words, the similarity matrix over all word-region pairs is computed with a word-level attention mechanism:
Sim_v = h v^T, Sim_v ∈ R^(L_h×N_o),
where Sim_v is the similarity matrix of the frame feature. Sim_v and its transpose Sim_v^T are multiplied with v and h respectively, and max pooling is used to reduce the dimension:
v_att = max(Sim_v v), v_att ∈ R^128,
h_att = max(Sim_v^T h), h_att ∈ R^128,
and the results are fused by the following formula to obtain the attended frame feature v_f ∈ R^128:
v_f = ([v_att; h_att; v_att ⊙ h_att; v_att + h_att]) W_1 + b_1,
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
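A minimal sketch of this word-level attention fusion follows, assuming the dot-product similarity Sim_v = h v^T reconstructed above (the patent's printed similarity formula is not legible in this text):

```python
import torch
import torch.nn as nn

def attend_fuse(v, h, fuse):
    """Word-level attention fusion for one frame.
    v: (N_o, 128) object-region features; h: (L_h, 128) QA-pair word features."""
    sim = h @ v.T                          # (L_h, N_o) word-region similarities
    v_att = (sim @ v).max(dim=0).values    # max-pool over words   -> (128,)
    h_att = (sim.T @ h).max(dim=0).values  # max-pool over regions -> (128,)
    # [v_att; h_att; v_att ⊙ h_att; v_att + h_att] W_1 + b_1
    return fuse(torch.cat([v_att, h_att, v_att * h_att, v_att + h_att]))

fuse1 = nn.Linear(4 * 128, 128)            # W_1 and b_1
v_f = attend_fuse(torch.randn(20, 128), torch.randn(24, 128), fuse1)  # (128,)
```

The caption branch described next reuses the same routine, with the caption-suggestion word features taking the place of the object regions.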
For a single caption suggestion feature s^p ∈ R^(L_sp×128) and a single question-answer feature h ∈ R^(L_h×128), the similarity matrix over all word pairs is computed with a word-level attention mechanism:
Sim_s = h (s^p)^T, Sim_s ∈ R^(L_h×L_sp),
where Sim_s is the similarity matrix of the caption feature. Sim_s and its transpose Sim_s^T are multiplied with s^p and h respectively, and max pooling over words is used to reduce the dimension:
s_att = max(Sim_s s^p), s_att ∈ R^128,
h_att = max(Sim_s^T h), h_att ∈ R^128,
and the results are fused by the following formula to obtain the attended caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1,
where W_1 and b_1 are the parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
All attended frame features and attended caption features are stacked to obtain V_f ∈ R^(T×128) and S_f ∈ R^(T×128);
V_f and S_f are multiplied to obtain a similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^(T×T);
the similarity matrix is multiplied with the attended frame features V_f and the attended caption features S_f, respectively:
V_fatt = Sim_f V_f, V_fatt ∈ R^(T×128),
S_fatt = Sim_f S_f, S_fatt ∈ R^(T×128),
and the results are fused to obtain the fused features F ∈ R^(T×128):
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2,
where W_2 and b_2 denote the weight matrix and bias to be trained.
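A sketch of this frame-caption fusion, under the same reconstruction caveats (multiplying both streams by Sim_f follows the formulas as printed):

```python
import torch
import torch.nn as nn

def cross_fuse(V_f, S_f, fuse):
    """V_f, S_f: (T, 128) stacked attended frame / caption features."""
    sim_f = V_f @ S_f.T                         # (T, T) frame-caption similarity
    V_fatt, S_fatt = sim_f @ V_f, sim_f @ S_f   # (T, 128) each
    cat = torch.cat([V_fatt, S_fatt, V_fatt * S_fatt, V_fatt + S_fatt], dim=-1)
    return fuse(cat)                            # W_2, b_2 -> fused features F: (T, 128)

fuse2 = nn.Linear(4 * 128, 128)                 # W_2 and b_2
F = cross_fuse(torch.randn(60, 128), torch.randn(60, 128), fuse2)
```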
For the resulting fused features F ∈ R^(T×128), the temporal attention score A_k ∈ R^T is computed through a fully-connected layer and a sigmoid function; it reflects the degree of correlation between each video frame and the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F from R^(T×128) to R^T, b is a bias term, and sigmoid denotes the sigmoid function.
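The temporal attention score thus reduces F to one relevance value per frame; a minimal sketch:

```python
import torch
import torch.nn as nn

F = torch.randn(60, 128)                        # fused features from S2
score_head = nn.Linear(128, 1)                  # parameter matrix W and bias b
A_k = torch.sigmoid(score_head(F)).squeeze(-1)  # (T,) per-frame relevance in (0, 1)
```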
S3: calculating the time boundary of the question using the temporal attention score.
Given the temporal attention score A_k ∈ R^T, a threshold A_t is set, and spans of frames whose scores are greater than A_t are taken as time-boundary candidates; for each candidate, a refinement score A_P is computed over the span from its start time st to its end time ed, and the candidate with the highest refinement score is selected as the time-boundary scheme {st, ed} for the question.
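A sketch of S3 follows. Since the refinement-score formula itself is not legible in this text, the mean attention over each candidate span is used here as an assumed stand-in for A_P:

```python
import torch

def localize(A_k, A_t=0.5):
    """Threshold A_k, collect runs of consecutive above-threshold frames as
    candidates, and return the candidate with the highest (assumed) score."""
    above = (A_k > A_t).float()
    padded = torch.cat([torch.zeros(1), above, torch.zeros(1)])
    diff = padded[1:] - padded[:-1]
    starts = (diff == 1).nonzero().flatten()      # run starts
    ends = (diff == -1).nonzero().flatten() - 1   # run ends (inclusive)
    if len(starts) == 0:
        return 0, len(A_k) - 1                    # fall back to the whole video
    scores = torch.stack([A_k[st:ed + 1].mean() for st, ed in zip(starts, ends)])
    best = scores.argmax()
    return starts[best].item(), ends[best].item()

st, ed = localize(torch.sigmoid(torch.randn(60)))
```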
S4: calculating the question answer using the fused features and the temporal attention score.
Given the start and end times st and ed, the temporal attention score A_k is applied to the fused features; max pooling over the frames within [st, ed] yields a local question-answer representation, while max pooling over all time steps yields a global representation. The local and global representations are concatenated and passed through a softmax layer to obtain answer probabilities, and the candidate answer with the highest probability is taken as the question answer.
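A sketch of S4 under the same caveats; the linear classifier head and the exact way A_k weights the features are assumptions where the printed formulas are missing:

```python
import torch
import torch.nn as nn

def answer_scores(F_per_answer, A_per_answer, spans, classifier):
    """For each of the 5 candidate answers: attention-weight the fused features,
    max-pool inside the predicted span (local) and over all frames (global),
    concatenate, and score; softmax over the 5 scores gives answer probabilities."""
    logits = []
    for F, A_k, (st, ed) in zip(F_per_answer, A_per_answer, spans):
        weighted = A_k.unsqueeze(-1) * F               # (T, 128)
        local = weighted[st:ed + 1].max(dim=0).values  # (128,)
        glob = weighted.max(dim=0).values              # (128,)
        logits.append(classifier(torch.cat([local, glob])))
    return torch.softmax(torch.stack(logits).squeeze(-1), dim=0)

classifier = nn.Linear(2 * 128, 1)
probs = answer_scores([torch.randn(60, 128)] * 5, [torch.rand(60)] * 5,
                      [(10, 25)] * 5, classifier)
answer = probs.argmax().item()     # candidate with the highest probability
```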
S5: training the neural network with the time boundary of the question and the question answer.
S6: optimizing the network parameters of the neural network to obtain an optimal neural network, and using the optimal neural network to answer video questions and localize time boundaries.
The caption-feature temporal attention A and the caption suggestion labels are obtained based on the question-answer features, the caption features and the caption suggestion features;
a binary cross-entropy loss and a ranking loss are calculated from the caption-feature temporal attention A and the caption suggestion labels, where the ranking loss is:
loss_rank = 1 + avg(A_out) − avg(A_in)
where A_in is the caption-feature temporal attention of frames whose caption suggestion label is 1 and A_out is that of frames whose caption suggestion label is 0;
the binary cross-entropy loss is:
loss_bce = −(1/T_in) Σ_{t: y_t = 1} log A_t − (1/T_out) Σ_{t: y_t = 0} log(1 − A_t)
where y_t is the caption suggestion label of frame t, T_in is the number of frames with label 1 and T_out is the number of frames with label 0;
the total self-supervised loss function is established as:
loss_self = loss_rank + loss_bce;
after the total self-supervised loss is computed from this function, the gradients are obtained by the chain rule (backpropagation), the parameters are updated by a gradient-descent step, and forward propagation is run again to compute a new total self-supervised loss; this is repeated until the loss converges.
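A sketch of this self-supervised objective; the BCE term is the standard form reconstructed above, since the patent's printed formula is not legible:

```python
import torch

def self_supervised_loss(A, labels, eps=1e-8):
    """loss_self = loss_rank + loss_bce over the caption-suggestion labels.
    A: (T,) caption-feature temporal attention; labels: (T,) 0/1 suggestion labels."""
    A_in, A_out = A[labels == 1], A[labels == 0]
    loss_rank = 1 + A_out.mean() - A_in.mean()        # push A_in above A_out
    loss_bce = -(torch.log(A_in + eps).mean()         # frames inside a suggestion
                 + torch.log(1 - A_out + eps).mean()) # frames outside
    return loss_rank + loss_bce

A = torch.sigmoid(torch.randn(60))
labels = (torch.rand(60) > 0.5).long()
loss = self_supervised_loss(A, labels)   # backpropagated with standard SGD/Adam
```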
Instead of using expensive temporal annotations, the invention generates question-related time boundaries from a purpose-designed temporal attention score. In addition, the invention obtains more accurate answers by mining the relation between subtitles and the corresponding video content.
The embodiment of the invention was tested on the TVQA and TVQA+ data sets. TVQA is a large-scale multi-modal video data set comprising 152,500 question-answer pairs and 21,800 videos. TVQA+ is a subset of TVQA that augments the data with frame-level, question-related bounding-box labels. This embodiment uses only question-answer pairs for supervision; results compared with the prior art STAGE are shown in Tables 1 and 2 (where "-" indicates that the method cannot report that number). For question answering, classification accuracy is reported, i.e., the proportion of samples answered correctly. For temporal localization, the mean temporal intersection-over-union (T.mIoU) is reported, i.e., the average over all samples of the IoU between the predicted and ground-truth time boundaries, together with the answer-span joint accuracy (ASA), i.e., the proportion of samples that are both answered correctly and meet the temporal IoU criterion.
Table 1. Results of the invention (WSQG) and the prior art (STAGE) on the TVQA data set

Method | Acc. | T.mIoU | ASA
---|---|---|---
STAGE | 66.38% | - | -
WSQG | 69.13% | 29.28% | 20.78%
Table 2. Results of the invention (WSQG) and the prior art (STAGE) on the TVQA+ data set

Method | Acc. | T.mIoU | ASA
---|---|---|---
STAGE | 68.31% | - | -
WSQG | 67.88% | 30.30% | 21.98%
As can be seen from Tables 1 and 2, the method of the invention achieves higher accuracy than the prior art STAGE on TVQA and comparable accuracy on TVQA+, while additionally providing time boundaries (T.mIoU and ASA) that STAGE cannot report under the same supervision.
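For reference, the two temporal metrics can be computed as follows (a sketch; the IoU threshold used for ASA is an assumption, as the patent text does not state it):

```python
def temporal_iou(pred, gt):
    """IoU between predicted and ground-truth time spans (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate(pred_spans, gt_spans, answer_correct, iou_thresh=0.5):
    """T.mIoU: mean IoU over all samples.
    ASA: fraction answered correctly AND meeting the IoU threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(pred_spans, gt_spans)]
    t_miou = sum(ious) / len(ious)
    asa = sum(ok and iou >= iou_thresh
              for ok, iou in zip(answer_correct, ious)) / len(ious)
    return t_miou, asa

t_miou, asa = evaluate([(2.0, 7.5)], [(3.0, 8.0)], [True])
```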
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that the invention is not limited thereto, and may be embodied in other forms without departing from the spirit or essential characteristics thereof. Any modifications which do not depart from the functional and structural principles of the present invention are intended to be included within the scope of the claims.
Claims (6)
1. A method for multi-modal video question answering using frame-subtitle self-supervision, characterized by comprising the following steps:
S1: extracting video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text;
S2: fusing the video frame features and the question-answer features through an attention mechanism to obtain attended frame features; fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain attended caption features; stacking all attended frame features and attended caption features to obtain fused features; calculating a temporal attention score from the fused features;
S3: calculating the time boundary of the question using the temporal attention score;
S4: calculating the question answer using the fused features and the temporal attention score;
S5: training a neural network with the time boundary of the question and the question answer;
S6: optimizing the network parameters of the neural network to obtain an optimal neural network, and using the optimal neural network to answer video questions and localize time boundaries;
wherein stacking all attended frame features and attended caption features to obtain the fused features comprises:
stacking all attended frame features and attended caption features to obtain V_f ∈ R^(T×128) and S_f ∈ R^(T×128);
multiplying V_f and S_f to obtain a similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^(T×T);
multiplying the similarity matrix with the attended frame features V_f and the attended caption features S_f, respectively:
V_fatt = Sim_f V_f, V_fatt ∈ R^(T×128),
S_fatt = Sim_f S_f, S_fatt ∈ R^(T×128),
and fusing the results to obtain the fused features F ∈ R^(T×128):
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2,
where W_2 and b_2 denote the weight matrix and bias to be trained;
wherein calculating the temporal attention score from the fused features comprises:
for the fused features F ∈ R^(T×128), computing the temporal attention score A_k ∈ R^T through a fully-connected layer and a sigmoid function, the temporal attention score reflecting the degree of correlation between each video frame and the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F from R^(T×128) to R^T, b is a bias term, and sigmoid denotes the sigmoid function;
and wherein calculating the time boundary of the question using the temporal attention score comprises:
given the temporal attention score A_k ∈ R^T, setting a threshold A_t and taking spans of frames whose scores are greater than A_t as time-boundary candidates;
for each time-boundary candidate, computing a refinement score A_P over the span from its start time st to its end time ed, and selecting the candidate with the highest refinement score as the time-boundary scheme {st, ed} for the question.
2. The method of claim 1, wherein extracting the video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text comprises:
for the input video, sampling frames at a set frequency; for each frame, segmenting 20 candidate objects with a Faster R-CNN pre-trained model to obtain feature expressions of the candidate objects, reducing them to 300 dimensions by principal component analysis, and projecting them into a 128-dimensional space by one fully-connected layer, giving video frame features v ∈ R^(T×N_o×128), where T is the number of video frames and N_o is the number of object regions per frame;
for a video question-answer input comprising a question q and 5 candidate answers a_1, ..., a_5, forming 5 question-answer pairs h_k = [q, a_k]; embedding the question-answer pairs and the subtitles into 768-dimensional vectors with a BERT word-embedding model and projecting them into a 128-dimensional space by one fully-connected layer, giving question-answer features h_k ∈ R^(L_h×128) and caption features s ∈ R^(T×L_s×128), where L_h is the number of words in each question-answer pair, L_s is the number of caption words and T is the total number of frames;
splicing the features of two consecutive caption segments to serve as a caption suggestion; computing the video frames covered by a suggestion from the caption timestamps, setting the labels of those frames to 1 and all other labels to 0, and using this 0/1 string as the caption suggestion label; embedding the caption suggestions into 768-dimensional vectors with the BERT word-embedding model and projecting them into a 128-dimensional space by one fully-connected layer, giving caption suggestion features s^p ∈ R^(T_sp×L_sp×128), where T_sp is the total number of caption suggestions and L_sp is the number of words in each suggestion.
3. The method for multi-modal video question answering using frame-subtitle self-supervision according to claim 1, wherein fusing the video frame features and the question-answer features through an attention mechanism to obtain the attended frame features comprises:
for a given encoded video frame feature v ∈ R^(N_o×128) and a single question-answer feature h ∈ R^(L_h×128), where N_o is the number of object regions and L_h is the number of words, computing the similarity matrix over all word-region pairs with a word-level attention mechanism:
Sim_v = h v^T, Sim_v ∈ R^(L_h×N_o),
where Sim_v is the similarity matrix of the frame feature; multiplying Sim_v and its transpose Sim_v^T with v and h respectively, and reducing the dimension with max pooling:
v_att = max(Sim_v v), v_att ∈ R^128,
h_att = max(Sim_v^T h), h_att ∈ R^128,
and fusing the results by the following formula to obtain the attended frame feature v_f ∈ R^128:
v_f = ([v_att; h_att; v_att ⊙ h_att; v_att + h_att]) W_1 + b_1,
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
4. The method for multi-modal video question answering using frame-subtitle self-supervision according to claim 1, wherein fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain the attended caption features comprises:
for a single caption suggestion feature s^p ∈ R^(L_sp×128) and a single question-answer feature h ∈ R^(L_h×128), computing the similarity matrix over all word pairs with a word-level attention mechanism:
Sim_s = h (s^p)^T, Sim_s ∈ R^(L_h×L_sp),
where Sim_s is the similarity matrix of the caption feature; multiplying Sim_s and its transpose Sim_s^T with s^p and h respectively, and reducing the dimension with max pooling over words:
s_att = max(Sim_s s^p), s_att ∈ R^128,
h_att = max(Sim_s^T h), h_att ∈ R^128,
and fusing the results by the following formula to obtain the attended caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1,
where W_1 and b_1 are the parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
5. The method of claim 1, wherein calculating the question answer using the fused features and the temporal attention score comprises:
given the start and end times st and ed, applying the temporal attention score A_k to the fused features; max pooling over the frames within [st, ed] yields a local question-answer representation, and max pooling over all time steps yields a global representation; the local and global representations are concatenated and passed through a softmax layer to obtain answer probabilities, and the candidate answer with the highest probability is taken as the question answer.
6. The method for multi-modal video question answering using frame-subtitle self-supervision according to claim 1, wherein optimizing the network parameters of the neural network comprises:
obtaining the caption-feature temporal attention A and the caption suggestion labels based on the question-answer features, the caption features and the caption suggestion features;
calculating a binary cross-entropy loss and a ranking loss from the caption-feature temporal attention A and the caption suggestion labels, where the ranking loss is:
loss_rank = 1 + avg(A_out) − avg(A_in)
where A_in is the caption-feature temporal attention of frames whose caption suggestion label is 1 and A_out is that of frames whose caption suggestion label is 0;
the binary cross-entropy loss is:
loss_bce = −(1/T_in) Σ_{t: y_t = 1} log A_t − (1/T_out) Σ_{t: y_t = 0} log(1 − A_t)
where y_t is the caption suggestion label of frame t, T_in is the number of frames with label 1 and T_out is the number of frames with label 0;
the total self-supervised loss function is established as:
loss_self = loss_rank + loss_bce;
after the total self-supervised loss is computed from this function, the gradients are obtained by the chain rule (backpropagation), the parameters are updated by a gradient-descent step, and forward propagation is run again to compute a new total self-supervised loss; this is repeated until the loss converges.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110017595.1A | 2021-01-07 | 2021-01-07 | Method for multi-modal video question answering using frame-subtitle self-supervision
Publications (2)

Publication Number | Publication Date
---|---
CN112860945A | 2021-05-28
CN112860945B | 2022-07-08
Family

Family ID: 76004712

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202110017595.1A (CN112860945B, active) | Method for multi-modal video question answering using frame-subtitle self-supervision | 2021-01-07 | 2021-01-07

Country Status (1)

Country | Link
---|---
CN | CN112860945B
Families Citing this family (5)

Publication | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN113590879B | 2021-08-05 | 2022-05-31 | Harbin University of Science and Technology | System, method, computer and storage medium for shortening timestamps and solving multi-event video question answering through a network
CN113688296B | 2021-08-10 | 2022-05-31 | Harbin University of Science and Technology | Method for solving video question-answering tasks based on a multi-modal progressive attention model
CN113423004B | 2021-08-23 | 2021-11-30 | Hangzhou Yizhi Intelligent Technology Co., Ltd. | Video subtitle generation method and system based on decoupled decoding
CN113837259B | 2021-09-17 | 2023-05-30 | The Sixth Affiliated Hospital of Sun Yat-sen University | Education video question-answering method and system fusing graph, text and semantics through modal interaction
CN114707022B | 2022-05-31 | 2022-09-06 | Zhejiang University | Video question-answer data set labeling method and device, storage medium and electronic equipment
Citations (5)

Publication | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
US9330084B1 | 2014-12-10 | 2016-05-03 | International Business Machines Corporation | Automatically generating question-answer pairs during content ingestion by a question answering computing system
CN107766447A | 2017-09-25 | 2018-03-06 | Zhejiang University | Method for solving video question answering using a multi-layer attention network mechanism
CN107948730A | 2017-10-30 | 2018-04-20 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device and storage medium for generating video based on pictures
CN110704601A | 2019-10-11 | 2020-01-17 | Zhejiang University | Method for solving video question-answering tasks requiring common knowledge using a question-knowledge guided progressive spatio-temporal attention network
CN111625660A | 2020-05-27 | 2020-09-04 | Tencent Technology (Shenzhen) Co., Ltd. | Dialog generation method, video comment method, apparatus, device and storage medium
Family Cites Families (1)

Publication | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
WO2018081751A1 | 2016-10-28 | 2018-05-03 | Vilynx, Inc. | Video tagging system and method
Non-Patent Citations (2)

- Junyeong Kim et al. "Gaining Extra Supervision via Multi-task Learning for Multi-Modal Video Question Answering." arXiv, 2019.
- Wang Bo. "Research and Application of Visual Semantic Representation Models in Video Question Answering." China Master's Theses Full-text Database, Information Science and Technology, 2020.
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant