CN112488055A - Video question-answering method based on progressive graph attention network - Google Patents

Video question-answering method based on progressive graph attention network

Info

Publication number
CN112488055A
Authority
CN
China
Prior art keywords
video
question
feature
video frame
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011501849.9A
Other languages
Chinese (zh)
Other versions
CN112488055B (en)
Inventor
杨阳 (Yang Yang)
彭亮 (Peng Liang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Guangdong Electronic Information Engineering Research Institute of UESTC
Original Assignee
Guizhou University
Guangdong Electronic Information Engineering Research Institute of UESTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University, Guangdong Electronic Information Engineering Research Institute of UESTC
Priority to CN202011501849.9A
Publication of CN112488055A
Application granted
Publication of CN112488055B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video question-answering method based on a progressive graph attention network, in which a novel progressive graph attention network is adopted and the visual interactions at the target level, the video frame level and the video clip level are explored in a progressive manner. In the progressive graph attention network, the graph structure at the target level captures the spatio-temporal relations between targets in the same frame or in different frames, the graph structure at the video frame level studies the interrelations between video frames, and the graph structure at the video clip level models the temporal relations between different actions. The invention further uses an attention mechanism to focus on the vertices and edges of each graph that are related to the question, and connects the graph features of these different levels in a progressive manner. In this way, each graph can focus on its spatio-temporally neighbouring vertices and on finer-grained visual content according to visual relevance, which improves the accuracy of predicting the answer to the question.

Description

Video question-answering method based on progressive graph attention network
Technical Field
The invention belongs to the technical field of Video Question Answering (Video-QA), and particularly relates to a Video Question Answering method based on a progressive graph attention network.
Background
In the prior art, video question answering (Video-QA) mainly aims at answering natural-language questions about the content of a video and is therefore crucial for video content understanding. The classical video question-answering pipeline consists of three steps: 1) extracting the video features and the question features with a convolutional neural network (CNN) model and a recurrent neural network (RNN) model, respectively; 2) under the guidance of the question features, attending to the question-relevant parts of the video features, thereby obtaining a more expressive video representation; 3) fusing the video features and the question features into a multi-modal representation and predicting the answer to the question with a question-answering module.
Based on this classical framework, existing video question-answering methods mainly focus on visual reasoning along the temporal and spatial dimensions. Some methods use a spatio-temporal attention mechanism to focus on the spatio-temporal information in the video that is relevant and valuable to the question. Other methods explore the visual relation features present in the video, thereby providing more effective semantic information for answer reasoning.
Most existing methods use an attention mechanism or a graph network structure (GCN) to explore a single kind of interaction between the objects or frames in a video. However, such a single interaction is often insufficient to represent complex scenes, because a video involves not only the spatio-temporal relations between objects and the interrelations between video frames, but also the temporal relations of the actions it contains; as a result, the accuracy of the predicted answers is low.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and to provide a video question-answering method based on a progressive graph attention network, so as to improve the accuracy of the predicted answers.
In order to achieve the above object, the video question-answering method based on the progressive graph attention network of the present invention is characterized by comprising the following steps:
(1) visual feature extraction
Dividing a video consisting of a frame sequence into N video segments, wherein each segment comprises L frames;
firstly, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n of each video segment, n = 1, 2, ..., N; each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video segments;
then a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l} of each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L; each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames;
then Faster R-CNN (faster region-based convolutional neural network) is used to extract the target-level feature o_{n,l,k} of each target in each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, where K is the number of targets extracted per video frame; each target-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K target-level features;
finally, the question is encoded using a Long Short-Term Memory (LSTM) network to obtain a representation of the question:
all words in the question are first encoded into a word-vector sequence by a word-embedding model, and the word-vector sequence is then fed into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question; finally, a self-attention mechanism is used to focus on the important words in the question, yielding the question representation according to formulas (1) and (2) (both given only as images in the original document), where W ∈ R^{d_h×d_q} is a learnable parameter (R^{d_h×d_q} denoting the set of real matrices with d_h rows), α_s is the attention weight of the s-th word of the question, and v_q is the question representation;
(2) constructing a progressive graph attention network (comprising three graph attention networks with different levels)
2.1) constructing a target hierarchical graph attention network for obtaining the space-time relation between targets
a target-level graph G^o = {V^o, ε^o, A^o} is built, where V^o is the set of graph vertices, each vertex representing a detected target, ε^o is the set of graph edges, representing the relations among all targets in every video frame, and A^o is the associated adjacency matrix;
the question representation and the target-level features o_{n,l,k} (for simplicity, the target-level feature o_{n,l,k} is written as o_i, i = 1, 2, ..., NLK, with NLK = N×L×K) are used jointly to generate a suitable adjacency matrix:
first, the question feature v_q is aggregated with each target-level feature o_i according to formula (3) (given only as an image in the original document), where φ′(·) and φ″(·) are fully connected networks with ReLU activation that map their inputs to d_h-dimensional vectors and the two projections are combined by a dot product;
then the dependency value A^o_{i,j} between the i-th target and the j-th target in the adjacency matrix A^o is given by formula (4) (image in the original), where T denotes the transpose;
each target-level feature o_i is then updated to o′_i by formulas (5) and (6) (images in the original), which aggregate the features of the related targets according to A^o and add a residual connection "+ o_i";
the updated target-level features o′_i are concatenated to obtain the tensor O′, with NL = N×L;
an attention mechanism is used to focus on the targets in the video frames that are related to the question:
v_o = Attention(O′, v_q)   (7)
where v_o is the aggregated target-level feature with dimension d_o;
2.2) constructing a video frame hierarchy graph attention network for obtaining the correlation among the video frames
a video-frame-level graph G^f = {V^f, ε^f, A^f} is constructed, where V^f is the set of graph vertices, each vertex representing a video frame, ε^f is the set of graph edges, representing the relations among the video frames, and A^f is the associated adjacency matrix;
for simplicity, the frame-level feature f_{n,l} is written as f_{i′}, i′ = 1, 2, ..., NL; the NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_NL} = {f′_{i′} | i′ = 1, 2, ..., NL} are obtained by formula (8) (given only as an image in the original document), which fuses each frame-level feature with the aggregated target-level feature v_o by bit-wise addition and passes the result through a fully connected network with ReLU activation that outputs a d_f-dimensional vector;
first, the question feature v_q is aggregated with each fused frame feature f′_{i′} according to formula (9) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors;
then the dependency value A^f_{i′,j′} between the i′-th video frame and the j′-th video frame in the adjacency matrix A^f is given by formula (10) (image in the original);
each fused frame-level feature f′_{i′} is updated to f″_{i′} by formulas (11) and (12) (images in the original), and the updated frame-level features f″_{i′} are then concatenated to obtain the tensor F″;
under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video-frame feature v_f:
v_f = Attention(F″, v_q)   (13)
where v_f is the aggregated video-frame feature;
2.3) constructing a video-clip-level graph attention network for establishing the temporal and semantic relations between the actions in the video clips
a video-clip-level graph G^c = {V^c, ε^c, A^c} is constructed, where V^c is the set of graph vertices, each vertex representing a video clip, ε^c is the set of graph edges, representing the relations among the video clips, and A^c is the associated adjacency matrix;
the clip-level features C of the N video clips are merged with the aggregated video-frame feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is given by formula (14) (given only as an image in the original document), in which ω′(·) is a fully connected network with ReLU activation that outputs a d_c-dimensional vector;
first, the question feature v_q is aggregated with each fused clip-level feature c′_n according to formula (15) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors;
then the dependency value A^c_{n,k} between the n-th video clip and the k-th video clip in the adjacency matrix A^c is given by formula (16) (image in the original);
each fused clip-level feature c′_n is updated to c″_n by formulas (17) and (18) (images in the original), and the updated clip-level features c″_n are then concatenated to obtain the tensor C″;
under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)   (19)
where the aggregated video feature v_c has dimension d_c;
(3) Answer prediction
For the open-ended task, the visual information and the question information are first fused, and the fused representation is then fed into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)   (20)
p = softmax(W_o g)   (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation, W_o is a learnable parameter and p is the probability vector; the fully connected network parameters and the learnable parameters of the softmax classifier are updated with a cross-entropy loss function;
for the multiple-choice task, the visual information, the question information and the answer representation are first fused, the fused feature is then sent to a final classifier for linear regression, and the answer index y is output:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)   (22)
y = W_m g′   (23)
where v_a is the answer representation and W_m is a learnable parameter; the fully connected network parameters and the learnable parameters of the classifier are updated with a pairwise-comparison hinge loss function;
for the counting task, a linear regression function takes g from formula (20) as input, the counting result is then obtained with a rounding function, and the parameters of the linear regression function are updated with a mean squared error (MSE) loss function.
The object of the invention is thus achieved.
The invention relates to a video question-answering method based on a progressive graph attention network, in which a novel progressive graph attention network is adopted and the visual interactions at the target level, the video frame level and the video clip level are explored in a progressive manner. In the progressive graph attention network, the graph structure at the target level captures the spatio-temporal relations between targets in the same frame or in different frames, the graph structure at the video frame level studies the interrelations between video frames, and the graph structure at the video clip level models the temporal relations between different actions. The invention further uses an attention mechanism to focus on the vertices and edges of each graph that are related to the question, and connects the graph features of these different levels in a progressive manner. In this way, each graph can focus on its spatio-temporally neighbouring vertices and on finer-grained visual content according to visual relevance, which improves the accuracy of predicting the answer to the question.
Drawings
FIG. 1 is a flow chart of an embodiment of a video question-answering method based on a progressive graph attention network according to the present invention;
fig. 2 is a schematic diagram of a video question-answering method based on a progressive graph attention network according to an embodiment of the present invention.
Detailed Description
The following description of embodiments of the invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
Fig. 1 is a flowchart of an embodiment of a video question-answering method based on a progressive graph attention network according to the present invention.
In this embodiment, as shown in fig. 1, the video question-answering method based on the progressive graph attention network of the present invention includes the following steps:
step S1: visual feature extraction
In this embodiment, as shown in fig. 2, the present invention extracts three different levels of visual features, and divides a video V composed of a frame sequence into N video segments, each segment including L frames.
Firstly, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n of each video segment, n = 1, 2, ..., N; each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video segments. In this embodiment, the 3D CNN is a ResNeXt-101 network.
Then a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l} of each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L; each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames. In this embodiment, the 2D CNN is a ResNet-152 network.
Then Faster R-CNN (faster region-based convolutional neural network) is used to extract the target-level feature o_{n,l,k} of each target in each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, where K is the number of targets extracted per video frame; each target-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K target-level features.
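A sketch of this three-level feature extraction, assuming PyTorch/torchvision, is given below. The concrete backbones (torchvision's r3d_18 in place of the ResNeXt-101 3D CNN, resnet152 for the 2D CNN, fasterrcnn_resnet50_fpn as the detector), the crop-and-re-encode shortcut for the target features, and the helper name extract_visual_features are illustrative assumptions, not the exact configuration of the embodiment.

# Hedged sketch of step S1 visual feature extraction (model choices are stand-ins).
import torch
import torchvision.transforms.functional as TF
from torchvision.models import resnet152
from torchvision.models.video import r3d_18
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def extract_visual_features(video, N, L, K=5):
    """video: float tensor (N*L, 3, H, W) with pixel values in [0, 1], already split into N clips of L frames."""
    frames = video

    # Clip-level features c_n: a 3D CNN over each L-frame clip.
    c3d = r3d_18(weights=None)
    c3d.fc = torch.nn.Identity()
    clips = frames.view(N, L, 3, *frames.shape[-2:]).permute(0, 2, 1, 3, 4)  # (N, 3, L, H, W)
    C = c3d(clips)                                                           # (N, d_c)

    # Frame-level features f_{n,l}: a 2D CNN on every frame.
    c2d = resnet152(weights=None)
    c2d.fc = torch.nn.Identity()
    F = c2d(frames)                                                          # (N*L, d_f)

    # Target-level features o_{n,l,k}: keep the top-K detections per frame and
    # re-encode their crops with the 2D CNN (a simplification of RoI features).
    detector = fasterrcnn_resnet50_fpn(weights=None)
    detector.eval()
    with torch.no_grad():
        detections = detector(list(frames))
    per_frame = []
    for img, det in zip(frames, detections):
        boxes = det["boxes"][:K]
        crops = [TF.resized_crop(img, int(b[1]), int(b[0]),
                                 int(b[3] - b[1]) + 1, int(b[2] - b[0]) + 1, [224, 224])
                 for b in boxes]
        crops = torch.stack(crops) if crops else torch.zeros(K, 3, 224, 224)
        per_frame.append(c2d(crops))                                         # (<=K, d_o)
    O = torch.nn.utils.rnn.pad_sequence(per_frame, batch_first=True)         # (N*L, K, d_o)
    return C, F, O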
Finally, the question Q is encoded using a Long Short Term Memory (LSTM) network to obtain a representation of the question:
All words in the question Q are first encoded into a word-vector sequence by a GloVe word-embedding model, and the word-vector sequence is then fed into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question Q. Finally, a self-attention mechanism is used to focus on the important words in the question and obtain the representation of the question Q, computed by formulas (1) and (2) (both given only as images in the original document), where W ∈ R^{d_h×d_q} is a learnable parameter (R^{d_h×d_q} denoting the set of real matrices with d_h rows), α_s is the attention weight of the s-th word of the question, and v_q is the question representation.
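The exact forms of formulas (1) and (2) are available only as images in the original document, so the sketch of the question encoder below assumes a standard single-head additive self-attention over the LSTM hidden states (a learned projection followed by a softmax and a weighted sum); the embedding layer stands in for pretrained GloVe vectors, and all layer sizes are illustrative.

# Hedged sketch of the question encoder of step S1.
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, d_emb=300, d_q=512, d_h=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)   # stand-in for pretrained GloVe vectors
        self.lstm = nn.LSTM(d_emb, d_q, batch_first=True)
        self.W = nn.Linear(d_q, d_h)                   # W in R^{d_h x d_q}
        self.w = nn.Linear(d_h, 1, bias=False)         # scores each word

    def forward(self, tokens):                         # tokens: (B, S) word indices
        H, _ = self.lstm(self.embed(tokens))           # H: (B, S, d_q)
        scores = self.w(torch.tanh(self.W(H)))         # (B, S, 1)
        alpha = torch.softmax(scores, dim=1)           # attention weight alpha_s of each word
        v_q = (alpha * H).sum(dim=1)                   # (B, d_q) question representation
        return v_q, alpha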
Step S2: building a progressive graph attention network
In this embodiment, as shown in fig. 2, the invention designs a progressive graph attention network for reasoning about the valuable visual information related to the question. It comprises three graph attention networks at different levels: the first is a target-level graph attention network, used to obtain the spatio-temporal relations between targets; the second is a video-frame-level graph attention network, used to explore the interrelations between video frames; the last is a video-clip-level graph attention network, used to establish the temporal and semantic relations between the actions in the video clips.
Step S2.1: constructing a target hierarchical graph attention network for obtaining the time-space relationship between targets
A target-level graph G^o = {V^o, ε^o, A^o} is built, where V^o is the set of graph vertices, each vertex representing a detected target, ε^o is the set of graph edges, representing the relations among all targets in every video frame, and A^o is the associated adjacency matrix.
The main purpose of the target-level graph structure is to establish, under the guidance of the question feature, the relation between any two targets in the video; the question feature and the visual features therefore need to be combined to generate a suitable adjacency matrix.
The question representation and the target-level features o_{n,l,k} (for simplicity, the target-level feature o_{n,l,k} is written as o_i, i = 1, 2, ..., NLK, with NLK = N×L×K) are used jointly to generate the adjacency matrix A^o:
First, the question feature v_q is aggregated with each target-level feature o_i according to formula (3) (given only as an image in the original document), where φ′(·) and φ″(·) are fully connected networks with ReLU activation that map their inputs to d_h-dimensional vectors and the two projections are combined by a dot product.
Then the dependency value A^o_{i,j} between the i-th target and the j-th target in the adjacency matrix A^o is given by formula (4) (image in the original), where T denotes the transpose.
Based on the computed adjacency matrix A^o, each target feature o_i is updated with the other related targets, so that local and long-range dependencies between the targets are preserved. Specifically, each target-level feature o_i is updated to o′_i by formulas (5) and (6) (images in the original), in which the term "+ o_i" serves as a residual connection.
The updated target-level features o′_i are concatenated to obtain the tensor O′, with NL = N×L.
An attention mechanism is used to focus on the targets in the video frames that are related to the question:
v_o = Attention(O′, v_q)   (7)
where v_o is the aggregated target-level feature with dimension d_o.
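Formulas (3)-(6) and the Attention(·, v_q) operator of formula (7) are given only as images in the original document, so the sketch below assumes one common realisation of a question-guided graph attention layer: question-conditioned node projections combined by a dot product, a softmax-normalised affinity matrix as the adjacency, a GCN-style update with the residual connection "+ o_i", and question-guided attention pooling over the updated nodes. The class name and hidden sizes are illustrative.

# Hedged sketch of one question-guided graph attention level (step S2.1 pattern).
import torch
import torch.nn as nn

class QuestionGuidedGraphLayer(nn.Module):
    def __init__(self, d_node, d_q, d_h=256):
        super().__init__()
        self.phi_q = nn.Sequential(nn.Linear(d_q, d_h), nn.ReLU())     # phi'(.)
        self.phi_x = nn.Sequential(nn.Linear(d_node, d_h), nn.ReLU())  # phi''(.)
        self.update = nn.Linear(d_node, d_node)                        # node update after neighbour aggregation
        self.att_q = nn.Linear(d_q, d_h)                               # pooling: question projection
        self.att_x = nn.Linear(d_node, d_h)                            # pooling: node projection
        self.att_w = nn.Linear(d_h, 1, bias=False)

    def forward(self, X, v_q):
        # X: (B, M, d_node) node features (targets, frames or clips); v_q: (B, d_q).
        x = self.phi_q(v_q).unsqueeze(1) * self.phi_x(X)               # question-conditioned nodes, (B, M, d_h)
        A = torch.softmax(x @ x.transpose(1, 2), dim=-1)               # adjacency of dependency values
        X_new = torch.relu(self.update(A @ X)) + X                     # GCN-style update with residual "+ o_i"
        # Question-guided attention pooling over the updated nodes, Attention(X', v_q).
        scores = self.att_w(torch.tanh(self.att_x(X_new) + self.att_q(v_q).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)                           # (B, M, 1)
        v = (alpha * X_new).sum(dim=1)                                 # aggregated feature, (B, d_node)
        return v, X_new, A

The same layer can be instantiated at the target, frame and clip levels of steps S2.1 to S2.3; only the node set and feature dimension change.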
Step S2.2: constructing a video frame hierarchy graph attention network for obtaining the interrelation between video frames
The relations between different video frames record the changes of detailed appearance information when a motion occurs or a transition takes place in the video. The invention builds a graph structure at the video-frame level to capture these detailed appearance changes.
A video-frame-level graph G^f = {V^f, ε^f, A^f} is constructed, where V^f is the set of graph vertices, each vertex representing a video frame, ε^f is the set of graph edges, representing the relations among the video frames, and A^f is the associated adjacency matrix.
For simplicity, the frame-level feature f_{n,l} is written as f_{i′}, i′ = 1, 2, ..., NL.
The NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_NL} = {f′_{i′} | i′ = 1, 2, ..., NL} are obtained by fusing two features according to formula (8) (given only as an image in the original document): each frame-level feature is combined with the aggregated target-level feature v_o by bit-wise addition, and the result is passed through a fully connected network with ReLU activation that outputs a d_f-dimensional vector.
First, the question feature v_q is aggregated with each fused frame feature f′_{i′} according to formula (9) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors.
Then the dependency value A^f_{i′,j′} between the i′-th video frame and the j′-th video frame in the adjacency matrix A^f is given by formula (10) (image in the original).
Each fused frame-level feature f′_{i′} is updated to f″_{i′} by formulas (11) and (12) (images in the original), and the updated frame-level features f″_{i′} are then concatenated to obtain the tensor F″.
Under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video-frame feature v_f:
v_f = Attention(F″, v_q)   (13)
where v_f is the aggregated video-frame feature.
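The progressive connection into the frame level, i.e. the fusion of formula (8) described above (bit-wise addition of the aggregated target-level feature with each frame feature, followed by a fully connected layer with ReLU), might be sketched as follows; the projection of v_o to the frame dimension and the module name are assumptions.

# Hedged sketch of the frame-level fusion of formula (8).
import torch
import torch.nn as nn

class FrameLevelFusion(nn.Module):
    def __init__(self, d_f, d_o):
        super().__init__()
        self.proj_o = nn.Linear(d_o, d_f)                          # bring v_o to the frame dimension
        self.fuse = nn.Sequential(nn.Linear(d_f, d_f), nn.ReLU())  # FC + ReLU of formula (8)

    def forward(self, F, v_o):
        # F: (B, N*L, d_f) frame features; v_o: (B, d_o) aggregated target-level feature.
        fused = self.fuse(F + self.proj_o(v_o).unsqueeze(1))       # bit-wise addition, then FC + ReLU
        return fused                                               # F' = {f'_{i'}}

The fused frame features F′ are then fed to the question-guided graph layer sketched above to obtain A^f, F″ and v_f.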
Step S2.3: constructing a video clip hierarchy map attention network for establishing time sequence and semantic relation between actions in video clips
The invention divides the video into a number of short video clips and constructs a video-clip-level graph (video segment level graph) to represent the temporal and semantic relations between the actions in different clips.
A video-clip-level graph G^c = {V^c, ε^c, A^c} is constructed, where V^c is the set of graph vertices, each vertex representing a video clip, ε^c is the set of graph edges, representing the relations among the video clips, and A^c is the associated adjacency matrix.
The clip-level features C of the N video clips are merged with the aggregated video-frame feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is given by formula (14) (given only as an image in the original document), in which ω′(·) is a fully connected network with ReLU activation that outputs a d_c-dimensional vector.
First, the question feature v_q is aggregated with each fused clip-level feature c′_n according to formula (15) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors.
Then the dependency value A^c_{n,k} between the n-th video clip and the k-th video clip in the adjacency matrix A^c is given by formula (16) (image in the original).
Each fused clip-level feature c′_n is updated to c″_n by formulas (17) and (18) (images in the original), and the updated clip-level features c″_n are then concatenated to obtain the tensor C″.
Under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)   (19)
where the aggregated video feature v_c has dimension d_c.
In this way, the aggregated video feature v_c incorporates target-level information as well as the global and dynamic information of the video frames, which improves the accuracy of predicting the answer to the question.
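Chaining the pieces sketched above gives the progressive pipeline of step S2 (cf. fig. 2): each level fuses the aggregated feature handed up from the previous level before running its own question-guided graph layer. The sketch assumes that the clip-level fusion of formula (14) takes the same bit-wise-addition form as formula (8), and that QuestionGuidedGraphLayer and FrameLevelFusion from the earlier sketches are in scope.

# Hedged sketch of the full progressive graph attention module of step S2.
import torch
import torch.nn as nn

class ProgressiveGraphAttention(nn.Module):
    def __init__(self, d_o, d_f, d_c, d_q, d_h=256):
        super().__init__()
        self.target_graph = QuestionGuidedGraphLayer(d_o, d_q, d_h)
        self.frame_fuse = FrameLevelFusion(d_f, d_o)
        self.frame_graph = QuestionGuidedGraphLayer(d_f, d_q, d_h)
        self.clip_fuse = FrameLevelFusion(d_c, d_f)     # assumed to mirror the frame-level fusion
        self.clip_graph = QuestionGuidedGraphLayer(d_c, d_q, d_h)

    def forward(self, O, F, C, v_q):
        # O: (B, N*L*K, d_o) targets, F: (B, N*L, d_f) frames, C: (B, N, d_c) clips.
        v_o, _, _ = self.target_graph(O, v_q)                             # target-level reasoning -> v_o
        v_f, _, _ = self.frame_graph(self.frame_fuse(F, v_o), v_q)        # frame level, conditioned on v_o
        v_c, _, _ = self.clip_graph(self.clip_fuse(C, v_f), v_q)          # clip level, conditioned on v_f
        return v_c, v_f, v_o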
Step S3: answer prediction
For the open-ended task, the visual information and the question information are first fused, and the fused representation is then fed into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)   (20)
p = softmax(W_o g)   (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation and W_o is a learnable parameter; the fully connected network parameters and the learnable parameters of the softmax classifier are updated with a cross-entropy loss function.
For the multiple-choice task, the visual information, the question information and the answer representation are first fused, the fused feature is then sent to a final classifier for linear regression, and the answer index y is output:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)   (22)
y = W_m g′   (23)
where v_a is the answer representation and W_m is a learnable parameter; the fully connected network parameters and the learnable parameters of the classifier are updated with a pairwise-comparison hinge loss function.
For the counting task, a linear regression function takes g from formula (20) as input, the counting result is then obtained with a rounding function, and the parameters of the linear regression function are updated with a mean squared error (MSE) loss function.
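A sketch of the three answer heads of this step is given below. The element-wise-product fusion follows formulas (20)-(23); the hidden sizes, the hinge margin and the exact form of the pairwise-comparison hinge loss are assumptions.

# Hedged sketch of the step S3 answer heads.
import torch
import torch.nn as nn

class AnswerHeads(nn.Module):
    def __init__(self, d_c, d_q, d_a, n_answers, d_h=512):
        super().__init__()
        self.rho_v = nn.Sequential(nn.Linear(d_c, d_h), nn.ReLU())   # rho'(.)
        self.rho_q = nn.Sequential(nn.Linear(d_q, d_h), nn.ReLU())   # rho''(.)
        self.rho_a = nn.Sequential(nn.Linear(d_a, d_h), nn.ReLU())   # rho'''(.)
        self.open_cls = nn.Linear(d_h, n_answers)                    # W_o of formula (21)
        self.choice_reg = nn.Linear(d_h, 1)                          # W_m of formula (23)
        self.count_reg = nn.Linear(d_h, 1)

    def open_ended(self, v_c, v_q):
        g = self.rho_v(v_c) * self.rho_q(v_q)                        # formula (20)
        return torch.softmax(self.open_cls(g), dim=-1)               # answer probabilities p

    def multiple_choice(self, v_c, v_q, V_a):
        # V_a: (B, n_choices, d_a) candidate-answer representations; returns one score per choice.
        g = self.rho_v(v_c).unsqueeze(1) * self.rho_q(v_q).unsqueeze(1) * self.rho_a(V_a)  # formula (22)
        return self.choice_reg(g).squeeze(-1)                        # scores y, formula (23)

    def count(self, v_c, v_q):
        g = self.rho_v(v_c) * self.rho_q(v_q)
        # Rounding is applied at inference; training uses MSE on the raw regression output.
        return torch.round(self.count_reg(g)).squeeze(-1)

def pairwise_hinge_loss(scores, correct_idx, margin=1.0):
    # Assumed standard pairwise-comparison hinge loss for the multiple-choice task.
    # scores: (B, n_choices); correct_idx: (B,) long indices of the correct choice.
    pos = scores.gather(1, correct_idx.unsqueeze(1))                 # (B, 1)
    loss = torch.clamp(margin + scores - pos, min=0)
    loss.scatter_(1, correct_idx.unsqueeze(1), 0.0)                  # ignore the correct choice itself
    return loss.mean()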
Examples of the invention
Experiments show that the two multiple-choice (Multi-Choice) sub-datasets of the existing large video question-answering dataset TGIF-QA suffer from a severe answer bias, which has a large impact on model accuracy. To address this problem, this example builds a new dataset, TGIF-QA-R, based on TGIF-QA; in this dataset the candidate answers are independent of one another, so the influence of the answer bias is effectively reduced.
The method is evaluated on three large benchmark datasets, TGIF-QA, MSVD-QA and MSRVTT-QA, as well as on the newly constructed TGIF-QA-R dataset, and it performs better than the state-of-the-art methods.
1. Test results on TGIF-QA and TGIF-QA-R datasets
Table 1 (reproduced as an image in the original publication)
As can be seen from Table 1, the present invention performed best in most of the subtasks, with 57.6% and 65.6% accuracy in the Action and Trans. subtasks of TGIF-QA-R, and 79.5%, 85.3% and 62.8% accuracy in the Action, Trans, and Frame subtasks of TGIF-QA, respectively.
2. Test results on MSVD-QA dataset
Table 2 (reproduced as an image in the original publication)
From table 2, it can be seen that the present invention achieves the highest level of performance in terms of overall accuracy, increasing the accuracy from 36.5% to 39.8%.
3. Test results on the MSRVTT-QA dataset:
Table 3 (reproduced as an image in the original publication)
from table 3, it can be seen that the present invention achieves the highest level of performance in terms of overall accuracy, increasing the accuracy from 35.5% to 38.2%.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the invention as defined by the appended claims, and all inventive matter that makes use of the inventive concept is protected.

Claims (1)

1. A video question-answering method based on a progressive graph attention network is characterized by comprising the following steps:
(1) visual feature extraction
Dividing a video consisting of a frame sequence into N video segments, wherein each segment comprises L frames;
firstly, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n of each video segment, n = 1, 2, ..., N; each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video segments;
then a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l} of each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L; each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames;
then Faster R-CNN (faster region-based convolutional neural network) is used to extract the target-level feature o_{n,l,k} of each target in each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, where K is the number of targets extracted per video frame; each target-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K target-level features;
finally, the question is encoded using a Long Short-Term Memory (LSTM) network to obtain a representation of the question:
all words in the question are first encoded into a word-vector sequence by a word-embedding model, and the word-vector sequence is then fed into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question; finally, a self-attention mechanism is used to focus on the important words in the question, yielding the question representation according to formulas (1) and (2) (both given only as images in the original document), where W ∈ R^{d_h×d_q} is a learnable parameter (R^{d_h×d_q} denoting the set of real matrices with d_h rows), α_s is the attention weight of the s-th word of the question, and v_q is the question representation;
(2) constructing a progressive graph attention network (comprising three graph attention networks with different levels)
2.1) constructing a target hierarchical graph attention network for obtaining the space-time relation between targets
a target-level graph G^o = {V^o, ε^o, A^o} is built, where V^o is the set of graph vertices, each vertex representing a detected target, ε^o is the set of graph edges, representing the relations among all targets in every video frame, and A^o is the associated adjacency matrix;
the question representation and the target-level features o_{n,l,k} (for simplicity, the target-level feature o_{n,l,k} is written as o_i, i = 1, 2, ..., NLK, with NLK = N×L×K) are used jointly to generate a suitable adjacency matrix:
first, the question feature v_q is aggregated with each target-level feature o_i according to formula (3) (given only as an image in the original document), where φ′(·) and φ″(·) are fully connected networks with ReLU activation that map their inputs to d_h-dimensional vectors and the two projections are combined by a dot product;
then the dependency value A^o_{i,j} between the i-th target and the j-th target in the adjacency matrix A^o is given by formula (4) (image in the original), where T denotes the transpose;
each target-level feature o_i is then updated to o′_i by formulas (5) and (6) (images in the original), which aggregate the features of the related targets according to A^o and add a residual connection "+ o_i";
the updated target-level features o′_i are concatenated to obtain the tensor O′, with NL = N×L;
an attention mechanism is used to focus on the targets in the video frames that are related to the question:
v_o = Attention(O′, v_q)   (7)
where v_o is the aggregated target-level feature with dimension d_o;
2.2) constructing a video frame hierarchy graph attention network for obtaining the correlation among the video frames
a video-frame-level graph G^f = {V^f, ε^f, A^f} is constructed, where V^f is the set of graph vertices, each vertex representing a video frame, ε^f is the set of graph edges, representing the relations among the video frames, and A^f is the associated adjacency matrix;
for simplicity, the frame-level feature f_{n,l} is written as f_{i′}, i′ = 1, 2, ..., NL; the NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_NL} = {f′_{i′} | i′ = 1, 2, ..., NL} are obtained by formula (8) (given only as an image in the original document), which fuses each frame-level feature with the aggregated target-level feature v_o by bit-wise addition and passes the result through a fully connected network with ReLU activation that outputs a d_f-dimensional vector;
first, the question feature v_q is aggregated with each fused frame feature f′_{i′} according to formula (9) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors;
then the dependency value A^f_{i′,j′} between the i′-th video frame and the j′-th video frame in the adjacency matrix A^f is given by formula (10) (image in the original);
each fused frame-level feature f′_{i′} is updated to f″_{i′} by formulas (11) and (12) (images in the original), and the updated frame-level features f″_{i′} are then concatenated to obtain the tensor F″;
under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video-frame feature v_f:
v_f = Attention(F″, v_q)   (13)
where v_f is the aggregated video-frame feature;
2.3) constructing a video-clip-level graph attention network for establishing the temporal and semantic relations between the actions in the video clips
a video-clip-level graph G^c = {V^c, ε^c, A^c} is constructed, where V^c is the set of graph vertices, each vertex representing a video clip, ε^c is the set of graph edges, representing the relations among the video clips, and A^c is the associated adjacency matrix;
the clip-level features C of the N video clips are merged with the aggregated video-frame feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is given by formula (14) (given only as an image in the original document), in which ω′(·) is a fully connected network with ReLU activation that outputs a d_c-dimensional vector;
first, the question feature v_q is aggregated with each fused clip-level feature c′_n according to formula (15) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors;
then the dependency value A^c_{n,k} between the n-th video clip and the k-th video clip in the adjacency matrix A^c is given by formula (16) (image in the original);
each fused clip-level feature c′_n is updated to c″_n by formulas (17) and (18) (images in the original), and the updated clip-level features c″_n are then concatenated to obtain the tensor C″;
under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)   (19)
where the aggregated video feature v_c has dimension d_c;
(3) Answer prediction
For the open-ended task, the visual information and the question information are first fused, and the fused representation is then fed into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)   (20)
p = softmax(W_o g)   (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation, W_o is a learnable parameter and p is the probability vector; the fully connected network parameters and the learnable parameters of the softmax classifier are updated with a cross-entropy loss function;
for the multiple-choice task, the visual information, the question information and the answer representation are first fused, the fused feature is then sent to a final classifier for linear regression, and the answer index y is output:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)   (22)
y = W_m g′   (23)
where v_a is the answer representation and W_m is a learnable parameter; the fully connected network parameters and the learnable parameters of the classifier are updated with a pairwise-comparison hinge loss function;
for the counting task, a linear regression function takes g from formula (20) as input, the counting result is then obtained with a rounding function, and the parameters of the linear regression function are updated with a mean squared error (MSE) loss function.
CN202011501849.9A 2020-12-18 2020-12-18 Video question-answering method based on progressive graph attention network Active CN112488055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011501849.9A CN112488055B (en) 2020-12-18 2020-12-18 Video question-answering method based on progressive graph attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011501849.9A CN112488055B (en) 2020-12-18 2020-12-18 Video question-answering method based on progressive graph attention network

Publications (2)

Publication Number Publication Date
CN112488055A true CN112488055A (en) 2021-03-12
CN112488055B CN112488055B (en) 2022-09-06

Family

ID=74914783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011501849.9A Active CN112488055B (en) 2020-12-18 2020-12-18 Video question-answering method based on progressive graph attention network

Country Status (1)

Country Link
CN (1) CN112488055B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 A method for solving video question answering using a multi-layer attention network mechanism
CN108829756A (en) * 2018-05-25 2018-11-16 杭州知智能科技有限公司 A method for solving multi-turn video question answering using a hierarchical attention context network
CN109829049A (en) * 2019-01-28 2019-05-31 杭州一知智能科技有限公司 A method for solving video question-answering tasks using a knowledge-base progressive spatio-temporal attention network
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A visual question-answering method based on a combined-relation attention network
CN110704601A (en) * 2019-10-11 2020-01-17 浙江大学 A method for solving video question-answering tasks requiring common knowledge using a question-knowledge-guided progressive spatio-temporal attention network
CN111652202A (en) * 2020-08-10 2020-09-11 浙江大学 Method and system for solving video question answering by improving video-language representation learning through an adaptive spatio-temporal graph model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIKYAS T. DESTA, LARRY CHEN, TOMASZ KORNUTA: "Object-Based Reasoning in VQA", 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) *
闫茹玉 (Yan Ruyu) et al.: "Visual question answering model combining a bottom-up attention mechanism and a memory network", Journal of Image and Graphics *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536952A (en) * 2021-06-22 2021-10-22 电子科技大学 Video question-answering method based on attention network of motion capture
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 Video question-answering method based on attention network of motion capture
CN113609330A (en) * 2021-07-15 2021-11-05 哈尔滨理工大学 Video question-answering system, method, computer and storage medium based on text attention and fine-grained information
CN116074575A (en) * 2021-11-01 2023-05-05 国际商业机器公司 Transducer for real world video question answering
CN114398961A (en) * 2021-12-28 2022-04-26 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN114495282A (en) * 2022-02-14 2022-05-13 中国科学技术大学 Video motion detection method, system, device and storage medium
CN116385937A (en) * 2023-04-07 2023-07-04 哈尔滨理工大学 Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Medical image problem vision solving method based on fine granularity cross attention

Also Published As

Publication number Publication date
CN112488055B (en) 2022-09-06

Similar Documents

Publication Publication Date Title
CN112488055B (en) Video question-answering method based on progressive graph attention network
CN110008338B (en) E-commerce evaluation emotion analysis method integrating GAN and transfer learning
CN109544524B (en) Attention mechanism-based multi-attribute image aesthetic evaluation system
CN108563653B (en) Method and system for constructing knowledge acquisition model in knowledge graph
WO2019056628A1 (en) Generation of point of interest copy
CN108765383A (en) Video presentation method based on depth migration study
Zhang et al. Recurrent attention network using spatial-temporal relations for action recognition
CN110046353B (en) Aspect level emotion analysis method based on multi-language level mechanism
CN114339450B (en) Video comment generation method, system, device and storage medium
CN114625882B (en) Network construction method for improving unique diversity of image text description
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN114912419B (en) Unified machine reading understanding method based on recombination countermeasure
CN115510814B (en) Chapter-level complex problem generation method based on dual planning
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN114048314B (en) Natural language steganalysis method
Liu et al. The use of deep learning technology in dance movement generation
Zhu et al. PBGN: Phased bidirectional generation network in text-to-image synthesis
Alrashidi et al. Hybrid CNN-based Recommendation System
CN113783715A (en) Opportunistic network topology prediction method adopting causal convolutional neural network
CN116148864A (en) Radar echo extrapolation method based on DyConvGRU and Unet prediction refinement structure
Fu et al. Gendds: Generating diverse driving video scenarios with prompt-to-video generative model
KR20190134308A (en) Data augmentation method and apparatus using convolution neural network
KR20230121507A (en) Knowledge distillation for graph-based video captioning
CN114818739A (en) Visual question-answering method optimized by using position information
CN117012180B (en) Voice conversion model training method, voice conversion method and device

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant