CN114625849A - Context-aware progressive attention video question-answering method and system - Google Patents

Context-aware progressive attention video question-answering method and system

Info

Publication number
CN114625849A
CN114625849A (Application CN202210192397.3A)
Authority
CN
China
Prior art keywords
video
feature
attention
features
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210192397.3A
Other languages
Chinese (zh)
Inventor
周凡
张富为
林谋广
王若梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210192397.3A priority Critical patent/CN114625849A/en
Publication of CN114625849A publication Critical patent/CN114625849A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a context-aware progressive attention video question-answering method and system. The method comprises the following steps: a feature coding unit extracts the feature information in the video; a context-aware unit is constructed, and three attention modules perform multi-modal fusion and alignment; a model training unit is constructed and trained to generate answer prediction scores; finally, a target video is input to obtain the predicted answer. By constructing these three units, the method extracts multi-modal features, progressively fuses the question-related multi-modal information in the video through an attention mechanism, builds the context information among the modalities with a BiLSTM network, obtains the multi-modal information most relevant to the question with an attention mechanism, updates the multi-modal information in the memory unit with a BiLSTM network, and predicts the answer with a Softmax function, thereby improving the performance and precise localization capability of the video question-answering model.

Description

Context-aware progressive attention video question-answering method and system
Technical Field
The invention relates to the field of video question answering, in particular to a context-aware progressive attention video question answering method and system.
Background
Video question answering (VideoQA) is a fine-grained video understanding task that follows video description. Compared with the generalized descriptions produced by the video description task, video question answering must not only understand the visual content, text information and speech information, but also establish connections among the three modalities and reason over them, so the question-answering process requires more detailed descriptive information and a more complex reasoning process than video description; it is therefore important to study how to extract effective information from the large and growing volume of video. Video question-answering methods can be divided into rule-based methods and deep-learning-based methods. Rule-based methods first appeared in 2003: early methods treated the video as the content to be queried and the question as the query, located the relevant video content by retrieval, concentrated mainly on news video, modeled the video content by structuring it, and built a reasoning mechanism with a hidden Markov model (HMM); acquiring information from video in this way remains important and valuable, especially given the large amount of video produced today. Deep-learning-based video question-answering methods first appeared in 2016, with research concentrated on the corresponding data sets; because of the spatio-temporal characteristics of video question answering, constructing video question-answering data sets is itself a challenging task, which delayed progress in the field. As the data sets have gradually improved in recent years, video question-answering research has made new progress: some works explore spatial attention and temporal attention, some make breakthroughs in fusing static and dynamic features, and the dynamic memory network model from visual question answering has been extended. These networks extract useful video information and model its interactions well, achieving good performance. However, because of the complexity of the task, there is still considerable room to improve overall performance, and most current work in the field still focuses on integrating dynamic temporal information of the video and fusing multi-modal video features.
In one current prior art, Pendurkar S et al. construct an extensible attention-based multimodal fusion framework. To capture the relation between the dynamic information in video frames and the words in the question, it uses dense bidirectional LSTMs to repeatedly attend to the dynamic information and the question information and perform multimodal fusion, then adds additional weighting to keep the dimensions consistent, and finally uses an LSTM to attend to and filter the output words again and softmax to perform answer classification. Through the dense bidirectional LSTMs, the model can alleviate the long-standing problem of insufficient modeling of video information. Its disadvantage is that, because the attention features of the different modalities are extracted separately, the noise in the later fused vectors increases, which affects the final question-answering result.
In another current prior art, Kim J constructs a modality-shifting attention network (MSAN) consisting of the following components: (a) BERT-based embedded video and text representations; (b) a moment proposal network that localizes the temporal moment of interest for answering the question; (c) heterogeneous reasoning that infers the correct answer based on the localized moment; and (d) a modulation step that weights the outputs of (b) and (c) according to their importance. The model builds a heterogeneous attention reasoning mechanism by combining self-attention and co-attention, which alleviates the intra-modal and inter-modal fusion problem, and uses importance modulation to dynamically handle the fact that different question types rely on different modal information. Its disadvantage is that, because the modal information is combined by vector dot product or addition, the problem of blurred fusion of modal semantic features remains.
A third prior art, by Kim J, Ma M, Kim K, et al., constructs a progressive attention memory network (PAMN) consisting of the following components: (a) the question and candidate answers are first embedded into a common space, and the video and subtitles are embedded into a dual memory, which supports independent memory for each modality; (b) to infer the answer, a progressive attention mechanism determines the portion of time relevant to answering the question; (c) a dynamic modality fusion mechanism adaptively integrates the output of each memory by considering the contribution of each modality; (d) a belief-correction answering mechanism successively corrects the confidence of each candidate answer starting from its initial confidence. The model uses the question and the answers to filter the modal information of the video and subtitles and then dynamically fuses the filtered modal information with a soft attention mechanism, where the progressive attention uses the answer and question information to correct the confidence of the video and subtitle memories. By representing modal semantics through questions and answers, the model greatly alleviates inaccurate modal semantic representation, and the dynamic fusion module designed with soft attention, combined with the confidence-correction model, forms a reasoning mechanism that greatly improves the model's reasoning ability. Its disadvantage is that using both questions and answers for progressive attention introduces bias into the model, i.e., the model generalizes poorly.
Disclosure of Invention
The invention aims to overcome the defects of the existing methods and provides a context-aware progressive attention video question-answering method. The invention mainly addresses three problems: first, existing models suffer from bias and weak generalization ability; second, the fusion of modal semantic features is blurred; third, extracting attention features separately for different modalities increases the noise in the later fused vectors.
In order to solve the above problem, the present invention provides a method for video question answering with context-aware progressive attention, comprising:
the method comprises the steps of constructing a feature coding unit, inputting a video question and answer related data set, and outputting static features, dynamic features, question text features and caption text features in a video;
a context-aware unit is constructed, the static features, dynamic features, question text features and caption text features are input, and the corresponding multi-modal context enhancement features are output;
a model training unit is constructed, the multi-modal context enhancement features are input, and a cross-entropy loss function is used for training to obtain a pre-trained model;
and an answer prediction unit is constructed, the target question-answer video is input into the pre-trained model, and the corresponding predicted answer information is output.
Preferably, the feature coding unit takes a video question-answering data set as input and outputs the static features, dynamic features, question text features and caption text features of the video, specifically:
the feature coding unit consists of a video dynamic feature extraction module, a video static feature extraction module, a caption text feature extraction module and a question text feature extraction module; text features are word-embedded with a pre-trained BERT network model and encoded with a bidirectional recurrent network BiLSTM, static frame information in the video is extracted with a pre-trained VGG16 network model and encoded with the bidirectional recurrent network BiLSTM, and dynamic information in the video is extracted with a pre-trained C3D network model and encoded with the bidirectional recurrent network BiLSTM;
the video dynamic feature extraction module is constructed as follows: a video is first defined as V, its dynamic features are extracted by a pre-trained C3D network, and the dynamic feature m_i is taken from the last fully connected layer, where m_i (i = 1, 2, ..., N) is the ith video dynamic feature, giving the video dynamic features V_m = [m_1, m_2, ..., m_N] ∈ R^(4096×N), where N is the number of frames of the video; the video dynamic features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^m = BiLSTM(V_m), giving the encoded video dynamic features B^m = [b_1^m, b_2^m, ..., b_N^m], where b_i^m is the feature vector of the ith video dynamic feature after encoding, N is the number of frames of the video, and m denotes the video dynamic features;
the video static feature extraction module is constructed as follows: the static features of the video are extracted with a pre-trained VGG16 network model, sampling the static frames at 1 FPS, and the static feature a_i is taken from the seventh fully connected layer fc7 of the VGG16 network model, where a_i (i = 1, 2, ..., N) is the ith video static feature, giving the video static features V_a = [a_1, a_2, ..., a_N] ∈ R^(4096×N), where N is the number of frames of the video; the video static features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^a = BiLSTM(V_a), giving the encoded video static features B^a = [b_1^a, b_2^a, ..., b_N^a], where b_i^a is the feature vector of the ith video static feature after encoding, N is the number of frames of the video, and a denotes the video static features;
the caption text feature extraction module is constructed as follows: caption text features are extracted with a 12-layer pre-trained BERT network model, and the caption text feature c_i is taken from the second-to-last layer of the BERT network model, where c_i (i = 1, 2, ..., N) is the ith caption text feature, giving the video caption text features V_c = [c_1, c_2, ..., c_N] ∈ R^(768×N), where N is the number of frames of the video; the caption text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^c = BiLSTM(V_c), giving the encoded caption text features B^c = [b_1^c, b_2^c, ..., b_N^c], where b_i^c is the feature vector of the ith caption text feature after encoding, N is the number of frames of the video, and c denotes the video caption text features;
the question text feature extraction module is constructed as follows: a question is first defined as q, a 12-layer pre-trained BERT network model is selected to extract the question text features, and the question text feature q_j is taken from the second-to-last layer of the BERT network model, where q_j (j = 1, 2, ..., n) is the text feature of the jth question and n is the number of questions corresponding to each video in the data set, giving the video question text features V_q = [q_1, q_2, ..., q_n] ∈ R^(768×n), where n is the number of questions corresponding to each video in the different data sets; the question text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^q = BiLSTM(V_q), giving the encoded question text features B^q = [b_1^q, b_2^q, ..., b_n^q], where b_j^q is the feature vector of the jth question after encoding, n is the number of questions corresponding to each video in the different data sets, and q denotes the question text features.
Preferably, the context-aware unit takes the dynamic features, caption text features, static features and question text features as input and outputs the corresponding multi-modal context enhancement features, specifically:
a context-aware unit is constructed, and three Attention modules Q2A-Attention, QA2C-Attention and QAC2M-Attention are built with a Co-Attention mechanism;
the Attention module Q2A-Attention takes the question text feature vectors b_j^q and the static feature vectors b_i^a as input, where j denotes the jth question of each video and i denotes the ith static feature of each video, and the attention model is expressed as Att^qa = Attention(B^q, B^a); the Co-Attention mechanism aligns the multi-modal information and performs multi-modal fusion and alignment with a soft alignment matrix, where each entry S_(r,c) of the soft alignment matrix is a product of ReLU-transformed features, r being the row of the matrix and c the column; correlation attention is then applied to the video static features, the attention weights are used to generate the features u_a^qa and u_q^qa, and finally the generated features are concatenated into u_aq; the attention mechanism is shown by the following formulas:
S_(r,c) = ReLU(w_a b_r^a) · ReLU(w_q b_c^q)
u_a^qa = softmax(S) B^q
u_q^qa = softmax(S^T) B^a
u_aq = [u_a^qa, u_q^qa]
where S_(r,c) is the soft alignment matrix with the softmax taken over the aligned dimension, r is the row of the matrix, c is the column of the matrix, w_a and w_q are jointly learned training weights, b_i^a and b_j^q are the input feature vectors, u_a^qa and u_q^qa are the corresponding feature vectors after correlation attention, and [·,·] is the concatenation operation; correlation attention is applied to the feature vectors through the soft alignment matrix to obtain the attended feature vectors, which are then joined by the concatenation operation;
the Attention module QA2C-Attention takes the features u_aq after the previous attention layer and the caption text feature vectors b_i^c as input, where i denotes the ith first-layer attended feature and the ith caption text feature of each video, and the attention model is expressed as Att^aqc = Attention(u_aq, B^c); correlation attention is then learned over the caption text features, the attention weights are used to generate the features u_aq^aqc and u_c^aqc, and finally the generated features are concatenated, which can be expressed as:
u_aqc = [u_aq^aqc, u_c^aqc]
the Attention module QAC2M-Attention takes the output features u_aqc of the second layer and the dynamic feature vectors b_i^m as input, where i denotes the ith attended output feature and the ith dynamic feature of each video, and the attention model is expressed as Att^aqcm = Attention(u_aqc, B^m); correlation attention weights are then learned over the video dynamic features, the attention weights are used to generate the features u_aqc^aqcm and u_m^aqcm, and finally the generated features are concatenated, which can be expressed as:
u_aqcm = [u_aqc^aqcm, u_m^aqcm]
the multi-modal features in the video are attended to progressively to obtain their correlation features, and finally a bidirectional recurrent network BiLSTM is used to model the context information among the modalities, giving the multi-modal context enhancement features;
the context information among the modalities is modeled with the bidirectional recurrent network BiLSTM by first feeding the correlation information of the different modalities into the corresponding BiLSTM for self-correlation updating, and then using the updated information as the input of the BiLSTMs to update the multi-modal context information, which can be expressed as:
d_aq = BiLSTM_aq(u_aq), d_aqc = BiLSTM_aqc(u_aqc)
d_aqcm = BiLSTM_aqcm(u_aqcm), d_aq;aqc = BiLSTM_aqc(d_aq)
d_aq;aqcm = BiLSTM_aqcm(d_aq), d_aqc;aq = BiLSTM_aq(d_aqc)
d_aqc;aqcm = BiLSTM_aqcm(d_aqc), d_aqcm;aq = BiLSTM_aq(d_aqcm)
d_aqcm;aqc = BiLSTM_aqc(d_aqcm)
and finally the outputs of the bidirectional recurrent network BiLSTM encoding layers are joined with a concatenation operation, as follows:
H = Concatenate([d_aq;aqc, d_aq;aqcm, d_aqc;aqcm, d_aqc;aq, d_aqcm;aqc, d_aqcm;aq]).
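For illustration only, one such co-attention block could be sketched as follows, assuming the soft alignment matrix is the inner product of ReLU-transformed linear projections and that the two attended sequences are concatenated along the sequence dimension before the BiLSTM context layers; the class name CoAttention and the dimensional choices are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Soft-alignment co-attention between two encoded feature sequences."""
    def __init__(self, dim: int = 500):
        super().__init__()
        self.w_x = nn.Linear(dim, dim, bias=False)  # projection for the first input sequence
        self.w_y = nn.Linear(dim, dim, bias=False)  # projection for the second input sequence

    def forward(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        # X: (batch, N, dim), Y: (batch, n, dim)
        S = torch.relu(self.w_x(X)) @ torch.relu(self.w_y(Y)).transpose(1, 2)  # soft alignment matrix (batch, N, n)
        u_x = torch.softmax(S, dim=2) @ Y                   # Y-attended features for each row of X
        u_y = torch.softmax(S, dim=1).transpose(1, 2) @ X   # X-attended features for each row of Y
        # Concatenate the two attended sequences along the sequence axis (an assumption),
        # so that the result can be fed to the later BiLSTM context layers.
        return torch.cat([u_x, u_y], dim=1)

# Hypothetical usage for the first module Q2A-Attention.
B_a = torch.randn(1, 32, 500)   # encoded static features (N = 32 frames)
B_q = torch.randn(1, 5, 500)    # encoded question features (n = 5 questions)
u_aq = CoAttention(dim=500)(B_a, B_q)   # (1, 37, 500)
```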
Preferably, the model training unit takes the multi-modal context enhancement features as input and is trained with a cross-entropy loss function to obtain a pre-trained model, specifically:
a model training unit is constructed and a fourth Attention module is built with the Co-Attention mechanism; the question feature vectors b_j^q and the concatenated features H_i are taken as the input of this Attention module Q2A_Attention, where j denotes the jth question of each video and i denotes the ith output feature after the context-aware unit, and the attention model is expressed as Att^qH = Attention(B^q, H); correlation attention weights are then learned over the multi-modal context information, the attention weights are used to generate the features u_q^qH and u_H^qH, the generated features are concatenated, and the memory unit is updated with the bidirectional recurrent network BiLSTM, which can be expressed as:
u_qH = [u_q^qH, u_H^qH]
B_qH = BiLSTM(u_qH)
finally, the final modal vector h_FC = FC(B_qH) is obtained after a fully connected layer, the softmax function is used to convert the modal vector h_FC into answer prediction scores, the largest prediction score among the final answer prediction scores is selected as the answer, and the prediction scores of the candidate answers are obtained by the formula Pro = softmax(h_FC); finally, training with a cross-entropy loss L = -Σ_k y_k log(Pro_k) yields the pre-trained model, where y_k denotes the label of the kth candidate answer.
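For illustration only, the answer-prediction head and the cross-entropy training step could be sketched as follows, under the assumption that the BiLSTM output is mean-pooled before the fully connected layer (the pooling is not specified above); the names AnswerHead and num_answers are illustrative.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Memory update with a BiLSTM, then a fully connected layer producing answer scores."""
    def __init__(self, dim: int = 500, num_answers: int = 5):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(dim, num_answers)

    def forward(self, u_qH: torch.Tensor) -> torch.Tensor:
        B_qH, _ = self.bilstm(u_qH)        # (batch, seq, dim)
        h_FC = self.fc(B_qH.mean(dim=1))   # pooled modal vector -> answer logits
        return h_FC

head = AnswerHead(dim=500, num_answers=5)
u_qH = torch.randn(8, 42, 500)       # attended question/context features (dummy batch)
labels = torch.randint(0, 5, (8,))   # index of the correct candidate answer

h_FC = head(u_qH)
loss = nn.CrossEntropyLoss()(h_FC, labels)   # cross-entropy training objective
Pro = torch.softmax(h_FC, dim=-1)            # Pro = softmax(h_FC)
prediction = Pro.argmax(dim=-1)              # answer with the largest prediction score
loss.backward()
```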
Accordingly, the present invention also provides a video question-answering system of context-aware progressive attention, comprising:
the feature coding unit is used for extracting video dynamic features, video static features, caption text features and question text features and encoding them with a bidirectional recurrent network BiLSTM;
the context enhancement unit is used for performing multi-modal fusion and alignment on the extracted features to obtain multi-modal context enhancement features;
and the model training and predicting unit is used for training the model and predicting the answer.
The implementation of the invention has the following beneficial effects:
first, the present invention provides a context-aware progressive attention video question-answering method. Capturing relevance among the multiple modes through progressive attention to the multiple modes of information in the video, and modeling context information among the multiple modes through bidirectional BilSTM; secondly, the invention utilizes the correlation among multi-mode information to improve the performance of the video question-answering model; third, inspired by the attention mechanism and the two-way LSTM network, the present invention employs the attention mechanism to build a fusion alignment module between the multiple modalities and utilizes the two-way LSTM network to model context information between the modalities.
Drawings
FIG. 1 is a general flow diagram of a method for context-aware progressive attention video question answering in accordance with an embodiment of the present invention;
FIG. 2 is a diagram of a context-aware progressive-attention video question-answer model according to an embodiment of the present invention;
FIG. 3 is a block diagram of a context-aware progressive attention video question-answering system in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings; it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention fall within the scope of protection of the present invention.
Fig. 1 is a general flowchart of a video question answering method for context-aware progressive attention according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s1, constructing a feature coding unit, inputting the feature coding unit into a video question and answer related data set, and outputting the feature coding unit into a static feature, a dynamic feature, a question text feature and a caption text feature in a video;
s2, constructing a context-aware unit, inputting the static features, dynamic features, question text features and caption text features, and outputting the corresponding multi-modal context enhancement features;
s3, constructing a model training unit, inputting the multi-modal context enhancement features, and training with a cross-entropy loss function to obtain a pre-trained model;
s4, constructing an answer prediction unit, inputting the target question-answer video into the pre-trained model, and outputting the corresponding predicted answer information.
Step S1 is specifically as follows:
S1-1: FIG. 2 shows the context-aware progressive attention video question-answering model. First, a video question-answering data set from the current video question-answering field, such as TVQA or LifeQA, is input, and the encoded features of the different modalities are output. The invention embeds the text information with a pre-trained BERT network model and performs feature encoding with a BiLSTM, which constitutes the text feature extraction network of the invention; at the same time, the static feature information and dynamic feature information in the video are extracted with the pre-trained VGG16 and C3D network models respectively and encoded with a BiLSTM, which constitutes the video static feature extraction network and the video dynamic feature extraction network of the invention.
S1-2: a video dynamic feature extraction module is constructed. A video is first defined as V, its dynamic features are extracted by a pre-trained C3D network, and the dynamic feature m_i is taken from the last fully connected layer, where m_i (i = 1, 2, ..., N) is the ith video dynamic feature, giving the video dynamic features V_m = [m_1, m_2, ..., m_N] ∈ R^(4096×N), where N is the number of frames of the video. The video dynamic features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^m = BiLSTM(V_m), giving the encoded video dynamic features B^m = [b_1^m, b_2^m, ..., b_N^m], where b_i^m is the feature vector of the ith video dynamic feature after encoding, N is the number of frames of the video, and m denotes the video dynamic features.
S1-3: a video static feature extraction module is constructed. The static features of the video are extracted with a pre-trained VGG16 network model, sampling the static frames at 1 FPS, and the static feature a_i is taken from the seventh fully connected layer fc7 of the VGG16 network model, where a_i (i = 1, 2, ..., N) is the ith video static feature, giving the video static features V_a = [a_1, a_2, ..., a_N] ∈ R^(4096×N), where N is the number of frames of the video. The video static features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^a = BiLSTM(V_a), giving the encoded video static features B^a = [b_1^a, b_2^a, ..., b_N^a], where b_i^a is the feature vector of the ith video static feature after encoding, N is the number of frames of the video, and a denotes the video static features.
S1-4: a caption text feature extraction module is constructed. Caption text features are extracted with a 12-layer pre-trained BERT network model, and the caption text feature c_i is taken from the second-to-last layer of the BERT network model, where c_i (i = 1, 2, ..., N) is the ith caption text feature, giving the video caption text features V_c = [c_1, c_2, ..., c_N] ∈ R^(768×N), where N is the number of frames of the video. The caption text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^c = BiLSTM(V_c), giving the encoded caption text features B^c = [b_1^c, b_2^c, ..., b_N^c], where b_i^c is the feature vector of the ith caption text feature after encoding, N is the number of frames of the video, and c denotes the video caption text features.
S1-5: a question text feature extraction module is constructed. A question is first defined as q, a 12-layer pre-trained BERT network model is selected to extract the question text features, and the question text feature q_j is taken from the second-to-last layer of the BERT network model, where q_j (j = 1, 2, ..., n) is the text feature of the jth question and n is the number of questions corresponding to each video in the data set, giving the video question text features V_q = [q_1, q_2, ..., q_n] ∈ R^(768×n), where n is the number of questions corresponding to each video in the different data sets. The question text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^q = BiLSTM(V_q), giving the encoded question text features B^q = [b_1^q, b_2^q, ..., b_n^q], where b_j^q is the feature vector of the jth question after encoding, n is the number of questions corresponding to each video in the different data sets, and q denotes the question text features.
Step S2 is specifically as follows:
S2-1: the context-aware unit takes the features encoded in S1 as input and outputs the context-enhanced attended features. In order to capture the multi-modal correlations in the video, the invention designs three Attention modules in this unit with the Co-Attention mechanism. To attend to the multi-modal features progressively, the question is first attended jointly with the video static features, corresponding to the module Q2A-Attention; the attended correlation information is then used to attend to the video captions, corresponding to QA2C-Attention; finally the attended correlation information is used to attend to the video dynamic features, corresponding to QAC2M-Attention. The correlation information among the modalities is thereby obtained, a bidirectional BiLSTM encoding layer is then constructed to model the context information among the multi-modal information, and finally all the outputs are concatenated.
S2-2: for the first Attention module Q2A-Attention, the invention takes the question feature vectors b_j^q and the video static features b_i^a as the input of the Attention module Q2A-Attention, where j denotes the jth question of each video and i denotes the ith static feature of each video, and the attention model is expressed as Att^qa = Attention(B^q, B^a). The Co-Attention mechanism aligns the multi-modal information and performs multi-modal fusion and alignment with a soft alignment matrix, where each entry S_(r,c) of the soft alignment matrix is a product of ReLU-transformed features, r being the row of the matrix and c the column; correlation attention is then applied to the video static features, the attention weights are used to generate the features u_a^qa and u_q^qa, and finally the generated features are concatenated into u_aq. The attention mechanism is shown by the following formulas:
S_(r,c) = ReLU(w_a b_r^a) · ReLU(w_q b_c^q)
u_a^qa = softmax(S) B^q
u_q^qa = softmax(S^T) B^a
u_aq = [u_a^qa, u_q^qa]
where S_(r,c) is the soft alignment matrix with the softmax taken over the aligned dimension, r is the row of the matrix, c is the column of the matrix, w_a and w_q are jointly learned training weights, b_i^a and b_j^q are the input feature vectors, u_a^qa and u_q^qa are the corresponding feature vectors after correlation attention, and [·,·] is the concatenation operation; correlation attention is applied to the feature vectors through the soft alignment matrix to obtain the attended feature vectors, which are then joined by the concatenation operation to keep the outputs of the feature vectors aligned.
S2-3: the second attention module of the invention uses the same attention mechanism to perform the fusion and alignment operation on the caption text features. The features u_aq after the previous attention layer and the caption text features b_i^c are first taken as the input of QA2C_Attention, where i denotes the ith first-layer attended feature and the ith caption text feature of each video, and the attention model is expressed as Att^aqc = Attention(u_aq, B^c); correlation attention is then learned over the caption text features, the attention weights are used to generate the features u_aq^aqc and u_c^aqc, and finally the generated features are concatenated, which can be expressed as:
u_aqc = [u_aq^aqc, u_c^aqc]
S2-4: the invention designs the third Attention module in the same way and performs the fusion and alignment operation on the video dynamic features with the Co-Attention mechanism. The output features u_aqc of the second layer and the video dynamic features b_i^m are first taken as the input of QAC2M_Attention, where i denotes the ith attended output feature and the ith dynamic feature of each video, and the attention model is expressed as Att^aqcm = Attention(u_aqc, B^m); correlation attention weights are then learned over the video dynamic features, the attention weights are used to generate the features u_aqc^aqcm and u_m^aqcm, and finally the generated features are concatenated, which can be expressed as:
u_aqcm = [u_aqc^aqcm, u_m^aqcm]
S2-5: by progressively attending to the multi-modal features in the video to obtain their correlation features, the invention models the context information among the modalities with BiLSTMs: the correlation information of the different modalities is first fed into the corresponding BiLSTMs for self-correlation updating, and the updated information is then used as the input of the BiLSTMs to update the multi-modal context information, which can be expressed as:
d_aq = BiLSTM_aq(u_aq), d_aqc = BiLSTM_aqc(u_aqc)
d_aqcm = BiLSTM_aqcm(u_aqcm), d_aq;aqc = BiLSTM_aqc(d_aq)
d_aq;aqcm = BiLSTM_aqcm(d_aq), d_aqc;aq = BiLSTM_aq(d_aqc)
d_aqc;aqcm = BiLSTM_aqcm(d_aqc), d_aqcm;aq = BiLSTM_aq(d_aqcm)
d_aqcm;aqc = BiLSTM_aqc(d_aqcm)
and finally the outputs of the BiLSTM encoding layers are joined with a concatenation operation, as follows:
H = Concatenate([d_aq;aqc, d_aq;aqcm, d_aqc;aqcm, d_aqc;aq, d_aqcm;aqc, d_aqcm;aq])
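For illustration only, steps S2-2 through S2-5 could be wired together roughly as follows, repeating the soft-alignment co-attention block sketched earlier for completeness; the module names ContextAwareUnit and CoAttention, the shared feature dimension and the sequence-axis concatenations are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Soft-alignment co-attention block (repeated from the earlier sketch for completeness)."""
    def __init__(self, dim: int = 500):
        super().__init__()
        self.w_x = nn.Linear(dim, dim, bias=False)
        self.w_y = nn.Linear(dim, dim, bias=False)

    def forward(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        S = torch.relu(self.w_x(X)) @ torch.relu(self.w_y(Y)).transpose(1, 2)  # soft alignment matrix
        u_x = torch.softmax(S, dim=2) @ Y
        u_y = torch.softmax(S, dim=1).transpose(1, 2) @ X
        return torch.cat([u_x, u_y], dim=1)   # concatenation along the sequence axis (assumed)

class ContextAwareUnit(nn.Module):
    """Progressive attention (Q2A -> QA2C -> QAC2M) followed by cross-modal context BiLSTMs."""
    def __init__(self, dim: int = 500):
        super().__init__()
        self.q2a, self.qa2c, self.qac2m = CoAttention(dim), CoAttention(dim), CoAttention(dim)
        self.bilstm_aq = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.bilstm_aqc = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.bilstm_aqcm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, B_a, B_q, B_c, B_m):
        u_aq = self.q2a(B_a, B_q)        # question <-> static frames
        u_aqc = self.qa2c(u_aq, B_c)     # previous output <-> captions
        u_aqcm = self.qac2m(u_aqc, B_m)  # previous output <-> dynamic features
        run = lambda lstm, x: lstm(x)[0]
        d_aq, d_aqc, d_aqcm = run(self.bilstm_aq, u_aq), run(self.bilstm_aqc, u_aqc), run(self.bilstm_aqcm, u_aqcm)
        # H = Concatenate([d_aq;aqc, d_aq;aqcm, d_aqc;aqcm, d_aqc;aq, d_aqcm;aqc, d_aqcm;aq])
        H = torch.cat([run(self.bilstm_aqc, d_aq), run(self.bilstm_aqcm, d_aq),
                       run(self.bilstm_aqcm, d_aqc), run(self.bilstm_aq, d_aqc),
                       run(self.bilstm_aqc, d_aqcm), run(self.bilstm_aq, d_aqcm)], dim=1)
        return H

unit = ContextAwareUnit(dim=500)
B_a, B_q = torch.randn(1, 32, 500), torch.randn(1, 5, 500)
B_c, B_m = torch.randn(1, 32, 500), torch.randn(1, 32, 500)
H = unit(B_a, B_q, B_c, B_m)   # multi-modal context enhancement features
```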
step S3 is specifically as follows:
S3-1: the model training unit takes the attention-enhanced context features of S2 as input and outputs the corresponding candidate answer index. In order to obtain the multi-modal context information most relevant to the question, the invention constructs a fourth Attention module in this unit with the Co-Attention mechanism. The question feature vectors b_j^q and the concatenated features H_i are taken as the input of the Attention module Q2A_Attention, where j denotes the jth question of each video and i denotes the ith output feature after the context-aware unit, and the attention model is expressed as Att^qH = Attention(B^q, H); correlation attention weights are then learned over the multi-modal context information, the attention weights are used to generate the features u_q^qH and u_H^qH, the generated features are concatenated, and the memory unit is updated with the BiLSTM, which can be expressed as:
u_qH = [u_q^qH, u_H^qH]
B_qH = BiLSTM(u_qH)
S3-2: finally, the final modal vector h_FC = FC(B_qH) is obtained after a fully connected layer, softmax converts the feature vector h_FC into answer prediction scores, the largest prediction score among the final answer prediction scores is selected as the answer, and the prediction scores of the candidate answers are obtained by the formula Pro = softmax(h_FC); finally, the invention trains with a cross-entropy loss L = -Σ_k y_k log(Pro_k), where y_k denotes the label of the kth candidate answer.
Step S4 is specifically as follows:
S4-1: the answer prediction unit obtains the pre-trained model through the above training and predicts the final answer index with argmax.
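As a minimal usage illustration of this step, with a dummy logit vector standing in for h_FC:

```python
import torch

h_FC = torch.tensor([[0.2, 1.7, -0.3, 0.9, 0.1]])   # dummy scores for five candidate answers
Pro = torch.softmax(h_FC, dim=-1)                    # Pro = softmax(h_FC)
answer_index = int(Pro.argmax(dim=-1))               # predicted answer index via argmax
```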
Accordingly, the present invention also provides a video question-answering system with context-aware progressive attention, as shown in fig. 3, comprising:
and the feature coding unit 1 is used for extracting video dynamic features, video static features, subtitle text features and problem text features and coding by using a bidirectional circulation network BilSTM.
The method specifically comprises a video dynamic feature extraction module, a video static feature extraction module, a caption text feature extraction module and a problem text feature extraction module, wherein word embedding is carried out on text features by utilizing a pre-trained BERT network model, coding is carried out by utilizing a bidirectional circulation network BilSTM, static frame information in the video is extracted by utilizing a pre-trained VGG16 network model, coding is carried out by utilizing the bidirectional circulation network BilSTM, dynamic information in the video is extracted by utilizing a pre-trained C3D network model, and coding is carried out by utilizing the bidirectional circulation network BilSTM.
And the context enhancement unit 2 is used for performing multi-mode fusion and alignment on the extracted features to obtain multi-mode context enhancement features.
Specifically, a context sensing unit is constructed, three Attention modules Q2A-Attention, QA2C-Attention and QAC2M-Attention are constructed by utilizing a Co-Attention mechanism, the dynamic feature, the subtitle text feature, the static feature and the question text feature are input, and the dynamic feature, the subtitle text feature, the static feature and the question text feature are output as corresponding multi-modal context enhancement features.
And the model training and predicting unit 3 is used for training the model and predicting the answer.
Specifically, a model training unit is constructed, a fourth Attention module is constructed by using a Co-Attention mechanism to obtain a final modal vector, a softmax function is used for converting the modal vector into an answer prediction score, the maximum prediction score is selected from the final answer prediction scores to serve as an answer, and a formula is calculated: pro ═ softmax (h)FC) And finally, training by adopting a cross entropy loss function to obtain a pre-training model, inputting the target question-answer video into the pre-training model, and outputting the target question-answer video as corresponding predicted answer information.
Therefore, the invention firstly uses a pre-trained network to respectively extract static information and dynamic information characteristics in a video, utilizes a BERT network to embed caption information and question-answer pair information in a data set, then respectively utilizes a BilSTM network to encode multi-mode information, then utilizes a Co-Attention mechanism to construct a progressive Attention module, performs fusion alignment operation on the multi-mode information, utilizes a bidirectional BilSTM network to construct a context coding layer, acquires correlation information between modes, uses the Attention mode again to obtain the multi-mode information most relevant to a question, and finally utilizes the BilSTM to update a memory unit and connects a full connection layer and a Softmax layer to predict an answer.
The context-aware progressive attention video question-answering method and system provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the above description of the embodiments is only intended to help understand the method and its core ideas; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (8)

1. A context-aware progressive attention video question-answering method, the method comprising:
the method comprises the steps of constructing a feature coding unit, inputting a video question and answer related data set, and outputting static features, dynamic features, question text features and caption text features in a video;
a context-aware unit is constructed, the static features, dynamic features, question text features and caption text features are input, and the corresponding multi-modal context enhancement features are output;
a model training unit is constructed, the multi-modal context enhancement features are input, and a cross-entropy loss function is used for training to obtain a pre-trained model;
and an answer prediction unit is constructed, the target question-answer video is input into the pre-trained model, and the corresponding predicted answer information is output.
2. The method as claimed in claim 1, wherein the feature coding unit takes a video question-answering data set as input and outputs the static features, dynamic features, question text features and caption text features of the video, specifically:
the feature coding unit consists of a video dynamic feature extraction module, a video static feature extraction module, a caption text feature extraction module and a question text feature extraction module; text features are word-embedded with a pre-trained BERT network model and encoded with a bidirectional recurrent network BiLSTM, static frame information in the video is extracted with a pre-trained VGG16 network model and encoded with the bidirectional recurrent network BiLSTM, and dynamic information in the video is extracted with a pre-trained C3D network model and encoded with the bidirectional recurrent network BiLSTM;
the video dynamic feature extraction module is constructed as follows: a video is defined as V, its dynamic features are extracted by a pre-trained C3D network, and the dynamic feature m_i is taken from the last fully connected layer, where m_i (i = 1, 2, ..., N) is the ith video dynamic feature, giving the video dynamic features V_m = [m_1, m_2, ..., m_N] ∈ R^(4096×N), where N is the number of frames of the video; the video dynamic features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^m = BiLSTM(V_m), giving the encoded video dynamic features B^m = [b_1^m, b_2^m, ..., b_N^m], where b_i^m is the feature vector of the ith video dynamic feature after encoding, N is the number of frames of the video, and m denotes the video dynamic features;
the video static feature extraction module is constructed as follows: the static features of the video are extracted with a pre-trained VGG16 network model, sampling the static frames at 1 FPS, and the static feature a_i is taken from the seventh fully connected layer fc7 of the VGG16 network model, where a_i (i = 1, 2, ..., N) is the ith video static feature, giving the video static features V_a = [a_1, a_2, ..., a_N] ∈ R^(4096×N), where N is the number of frames of the video; the video static features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^a = BiLSTM(V_a), giving the encoded video static features B^a = [b_1^a, b_2^a, ..., b_N^a], where b_i^a is the feature vector of the ith video static feature after encoding, N is the number of frames of the video, and a denotes the video static features;
the caption text feature extraction module is constructed as follows: caption text features are extracted with a 12-layer pre-trained BERT network model, and the caption text feature c_i is taken from the second-to-last layer of the BERT network model, where c_i (i = 1, 2, ..., N) is the ith caption text feature, giving the video caption text features V_c = [c_1, c_2, ..., c_N] ∈ R^(768×N), where N is the number of frames of the video; the caption text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^c = BiLSTM(V_c), giving the encoded caption text features B^c = [b_1^c, b_2^c, ..., b_N^c], where b_i^c is the feature vector of the ith caption text feature after encoding, N is the number of frames of the video, and c denotes the video caption text features;
the question text feature extraction module is constructed as follows: a question is first defined as q, a 12-layer pre-trained BERT network model is selected to extract the question text features, and the question text feature q_j is taken from the second-to-last layer of the BERT network model, where q_j (j = 1, 2, ..., n) is the text feature of the jth question and n is the number of questions corresponding to each video in the data set, giving the video question text features V_q = [q_1, q_2, ..., q_n] ∈ R^(768×n), where n is the number of questions corresponding to each video in the different data sets; the question text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^q = BiLSTM(V_q), giving the encoded question text features B^q = [b_1^q, b_2^q, ..., b_n^q], where b_j^q is the feature vector of the jth question after encoding, n is the number of questions corresponding to each video in the different data sets, and q denotes the question text features.
3. The method as claimed in claim 2, wherein the context-aware unit takes the dynamic features, caption text features, static features and question text features as input and outputs the corresponding multi-modal context enhancement features, specifically:
a context-aware unit is constructed, and three Attention modules Q2A-Attention, QA2C-Attention and QAC2M-Attention are built with a Co-Attention mechanism;
wherein the Attention module Q2A-Attention takes the question text feature vectors b_j^q and the static feature vectors b_i^a as input, where j denotes the jth question of each video and i denotes the ith static feature of each video, and the attention model is expressed as Att^qa = Attention(B^q, B^a); the Co-Attention mechanism aligns the multi-modal information and performs multi-modal fusion and alignment with a soft alignment matrix, where each entry S_(r,c) of the soft alignment matrix is a product of ReLU-transformed features, r being the row of the matrix and c the column; correlation attention is then applied to the video static features, the attention weights are used to generate the features u_a^qa and u_q^qa, and finally the generated features are concatenated into u_aq; the attention mechanism is shown by the following formulas:
S_(r,c) = ReLU(w_a b_r^a) · ReLU(w_q b_c^q)
u_a^qa = softmax(S) B^q
u_q^qa = softmax(S^T) B^a
u_aq = [u_a^qa, u_q^qa]
where S_(r,c) is the soft alignment matrix with the softmax taken over the aligned dimension, r is the row of the matrix, c is the column of the matrix, w_a and w_q are jointly learned training weights, b_i^a and b_j^q are the input feature vectors, u_a^qa and u_q^qa are the corresponding feature vectors after correlation attention, and [·,·] is the concatenation operation; correlation attention is applied to the feature vectors through the soft alignment matrix to obtain the attended feature vectors, which are then joined by the concatenation operation;
wherein the Attention module QA2C-Attention takes the features u_aq after the previous attention layer and the caption text feature vectors b_i^c as input, where i denotes the ith first-layer attended feature and the ith caption text feature of each video, and the attention model is expressed as Att^aqc = Attention(u_aq, B^c); correlation attention is then learned over the caption text features, the attention weights are used to generate the features u_aq^aqc and u_c^aqc, and finally the generated features are concatenated, which can be expressed as:
u_aqc = [u_aq^aqc, u_c^aqc]
wherein the Attention module QAC2M-Attention takes the output features u_aqc of the second layer and the dynamic feature vectors b_i^m as input, where i denotes the ith attended output feature and the ith dynamic feature of each video, and the attention model is expressed as Att^aqcm = Attention(u_aqc, B^m); correlation attention weights are then learned over the video dynamic features, the attention weights are used to generate the features u_aqc^aqcm and u_m^aqcm, and finally the generated features are concatenated, which can be expressed as:
u_aqcm = [u_aqc^aqcm, u_m^aqcm]
the multi-modal features in the video are attended to progressively to obtain their correlation features, and finally a bidirectional recurrent network BiLSTM is used to model the context information among the modalities, giving the multi-modal context enhancement features;
the context information among the modalities is modeled with the bidirectional recurrent network BiLSTM by first feeding the correlation information of the different modalities into the corresponding BiLSTM for self-correlation updating, and then using the updated information as the input of the BiLSTMs to update the multi-modal context information, which can be expressed as:
d_aq = BiLSTM_aq(u_aq), d_aqc = BiLSTM_aqc(u_aqc)
d_aqcm = BiLSTM_aqcm(u_aqcm), d_aq;aqc = BiLSTM_aqc(d_aq)
d_aq;aqcm = BiLSTM_aqcm(d_aq), d_aqc;aq = BiLSTM_aq(d_aqc)
d_aqc;aqcm = BiLSTM_aqcm(d_aqc), d_aqcm;aq = BiLSTM_aq(d_aqcm)
d_aqcm;aqc = BiLSTM_aqc(d_aqcm)
and finally the outputs of the bidirectional recurrent network BiLSTM encoding layers are joined with a concatenation operation, as follows:
H = Concatenate([d_aq;aqc, d_aq;aqcm, d_aqc;aqcm, d_aqc;aq, d_aqcm;aqc, d_aqcm;aq]).
4. the method according to claim 3, wherein the input of the model building training unit is the multi-modal context enhancement feature, and the model building training unit is trained by using a cross entropy loss function to obtain a pre-training model, specifically:
constructing a model training unit, constructing a fourth attention module with the Co-Attention mechanism, and taking the question feature vector q_j and the concatenated feature H_i as input of the attention module Q2A-Attention, where j denotes the j-th question of each video and i denotes the i-th output feature after the context-aware unit; the attention model is expressed as A_qH = Attention(q_j, H_i); a correlation attention weight is then learned over the multi-modal context information, the attention weights are used to generate the attended features q'_j and H'_i, the generated features are concatenated, and the memory cells are updated with the bidirectional recurrent network BiLSTM, which can be expressed as:
u_qH = Concatenate([q'_j, H'_i]),    B_qH = BiLSTM(u_qH)
finally, the final modal vector h_FC = FC(B_qH) is obtained after a fully connected layer; the softmax function converts the modal vector h_FC into answer prediction scores via Prob = softmax(h_FC), the maximum prediction score among the final answer prediction scores is selected as the answer, and a cross-entropy loss function over the candidate answers a_k is used for training to obtain the pre-trained model, where a_k denotes the k-th candidate answer.
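The snippet below is a hedged sketch of this prediction and training step in PyTorch; the layer sizes, the use of the last BiLSTM time step as h_FC, and the number of candidate answers are assumptions.

import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    # Sketch: BiLSTM over the fused question/context feature u_qH, a fully
    # connected layer producing h_FC, and softmax scores over candidate answers.
    def __init__(self, dim=1024, hidden=256, num_answers=5):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_answers)

    def forward(self, u_qH):                                # (batch, seq, dim)
        b_qH, _ = self.bilstm(u_qH)
        return self.fc(b_qH[:, -1])                         # logits playing the role of h_FC

head = AnswerHead()
logits = head(torch.randn(8, 20, 1024))                     # toy batch of fused features
prob = torch.softmax(logits, dim=-1)                        # Prob = softmax(h_FC)
answer = prob.argmax(dim=-1)                                # highest-scoring candidate answer
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 5, (8,)))  # cross-entropy training loss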
5. A context-aware progressive attention video question-answering system, the system comprising:
a feature encoding unit, configured to extract video dynamic features, video static features, subtitle text features and question text features, and to encode them with a bidirectional recurrent network BiLSTM;
a context enhancement unit, configured to perform multi-modal fusion and alignment on the extracted features to obtain multi-modal context-enhanced features; and
a model training and prediction unit, configured to train the model and predict the answer.
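As a rough composition of the three claimed units, the skeleton below chains a feature encoder, a context enhancer and a prediction head; the interfaces are hypothetical and only indicate how data would flow between the units.

import torch.nn as nn

class VideoQASystem(nn.Module):
    # Skeleton of the claimed system: feature encoding unit -> context
    # enhancement unit -> model training and prediction unit.
    def __init__(self, encoder, context_enhancer, answer_head):
        super().__init__()
        self.encoder = encoder
        self.context_enhancer = context_enhancer
        self.answer_head = answer_head

    def forward(self, frames, clips, subtitles, question):
        motion, static, caption, query = self.encoder(frames, clips, subtitles, question)
        fused = self.context_enhancer(query, static, caption, motion)  # multi-modal context feature
        return self.answer_head(fused)                                 # answer logits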
6. The system according to claim 5, wherein the feature encoding unit comprises a video dynamic feature extraction module, a video static feature extraction module, a subtitle text feature extraction module and a question text feature extraction module; the feature encoding unit performs word embedding on the text features with a pre-trained BERT network model and encodes them with a bidirectional recurrent network BiLSTM, extracts static frame information from the video with a pre-trained VGG16 network model and encodes it with a BiLSTM, and extracts dynamic information from the video with a pre-trained C3D network model and encodes it with a BiLSTM.
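A minimal sketch of such a feature encoding unit is shown below, assuming the Hugging Face transformers package for BERT and torchvision for VGG16; no standard pretrained C3D ships with torchvision, so the C3D backbone is passed in as a placeholder, and all hidden sizes are illustrative.

import torch
import torch.nn as nn
import torchvision.models as models
from transformers import BertModel, BertTokenizer

class FeatureEncoder(nn.Module):
    # Sketch of the feature encoding unit: BERT embeddings for text, VGG16 for
    # static frames, a placeholder C3D backbone for clips, each followed by its
    # own BiLSTM encoder.
    def __init__(self, c3d_backbone, hidden=256):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.vgg_features = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                          *list(vgg.classifier.children())[:-1])  # 4096-d fc7
        self.c3d = c3d_backbone                       # placeholder producing 4096-d clip features
        self.text_bilstm   = nn.LSTM(768,  hidden, batch_first=True, bidirectional=True)
        self.static_bilstm = nn.LSTM(4096, hidden, batch_first=True, bidirectional=True)
        self.motion_bilstm = nn.LSTM(4096, hidden, batch_first=True, bidirectional=True)

    def encode_text(self, sentences):
        tokens = self.tokenizer(sentences, return_tensors="pt", padding=True)
        emb = self.bert(**tokens).last_hidden_state   # (batch, seq, 768)
        out, _ = self.text_bilstm(emb)
        return out

    def encode_frames(self, frames):                  # (num_frames, 3, 224, 224)
        feats = self.vgg_features(frames).unsqueeze(0)  # (1, num_frames, 4096)
        out, _ = self.static_bilstm(feats)
        return out

    def encode_clips(self, clips):                    # clips pre-sampled for the C3D backbone
        feats = self.c3d(clips).unsqueeze(0)          # (1, num_clips, 4096)
        out, _ = self.motion_bilstm(feats)
        return out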
7. The context-aware progressive attention video question-answering system according to claim 5, wherein the context enhancement unit builds a context-aware unit consisting of three attention modules Q2A-Attention, QA2C-Attention and QAC2M-Attention constructed with a Co-Attention mechanism; it takes the dynamic feature, the subtitle text feature, the static feature and the question text feature as input, and outputs the corresponding multi-modal context-enhanced features.
8. The system according to claim 5, wherein the model training and prediction unit is configured to build a model training unit, build a fourth attention module with a Co-Attention mechanism to obtain the final modal vector h_FC, convert the modal vector into answer prediction scores with the softmax function Prob = softmax(h_FC), and select the maximum prediction score among the final answer prediction scores as the answer; a cross-entropy loss function is then used for training to obtain a pre-trained model, the target question-answering video is input into the pre-trained model, and the corresponding predicted answer information is output.
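For the prediction step of claim 8, the hedged snippet below loads a saved checkpoint and maps the highest softmax score to a candidate answer; the checkpoint name pretrained_videoqa.pt and the VideoQASystem interface are assumptions carried over from the earlier sketches.

import torch

model = VideoQASystem(encoder, context_enhancer, answer_head)   # hypothetical sub-modules
model.load_state_dict(torch.load("pretrained_videoqa.pt", map_location="cpu"))
model.eval()

with torch.no_grad():
    logits = model(frames, clips, subtitles, question)           # target question-answering video
    prob = torch.softmax(logits, dim=-1)                         # Prob = softmax(h_FC)
    predicted_answer = prob.argmax(dim=-1).item()                # index of the predicted answer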
CN202210192397.3A 2022-02-28 2022-02-28 Context-aware progressive attention video question-answering method and system Pending CN114625849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210192397.3A CN114625849A (en) 2022-02-28 2022-02-28 Context-aware progressive attention video question-answering method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210192397.3A CN114625849A (en) 2022-02-28 2022-02-28 Context-aware progressive attention video question-answering method and system

Publications (1)

Publication Number Publication Date
CN114625849A true CN114625849A (en) 2022-06-14

Family

ID=81900412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210192397.3A Pending CN114625849A (en) 2022-02-28 2022-02-28 Context-aware progressive attention video question-answering method and system

Country Status (1)

Country Link
CN (1) CN114625849A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860152A (en) * 2023-02-20 2023-03-28 南京星耀智能科技有限公司 Cross-modal joint learning method oriented to character military knowledge discovery
CN116681087A (en) * 2023-07-25 2023-09-01 云南师范大学 Automatic problem generation method based on multi-stage time sequence and semantic information enhancement
CN116681087B (en) * 2023-07-25 2023-10-10 云南师范大学 Automatic problem generation method based on multi-stage time sequence and semantic information enhancement

Similar Documents

Publication Publication Date Title
WO2023035610A1 (en) Video question-answering method and system based on keyword perception multi-modal attention
WO2019205562A1 (en) Attention regression-based method and device for positioning sentence in video timing sequence
CN114625849A (en) Context-aware progressive attention video question-answering method and system
CN114911914A (en) Cross-modal image-text retrieval method
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN113486669B (en) Semantic recognition method for emergency rescue input voice
CN113792177B (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN111368142B (en) Video intensive event description method based on generation countermeasure network
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN114020891A (en) Double-channel semantic positioning multi-granularity attention mutual enhancement video question-answering method and system
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN111723779B (en) Chinese sign language recognition system based on deep learning
CN113486190A (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN114969298A (en) Video question-answering method based on cross-modal heterogeneous graph neural network
CN115019142B (en) Image title generation method and system based on fusion characteristics and electronic equipment
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN117635964A (en) Multimode aesthetic quality evaluation method based on transducer
CN116894085A (en) Dialog generation method and device, electronic equipment and storage medium
CN116561305A (en) False news detection method based on multiple modes and transformers
Peng et al. Temporal pyramid transformer with multimodal interaction for video question answering
CN114911930A (en) Global and local complementary bidirectional attention video question-answering method and system
CN116704198A (en) Knowledge enhancement visual question-answering method based on multi-mode information guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination