CN112488055A - Video question-answering method based on progressive graph attention network - Google Patents
- Publication number
- CN112488055A (application CN202011501849.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- question
- feature
- video frame
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Abstract
The invention discloses a video question-answering method based on a progressive graph attention network, in which a novel progressive graph attention network is adopted to explore visual interactions at the object level, the video frame level and the video clip level in a progressive manner. In the progressive graph attention network, the graph structure at the object level captures the spatio-temporal relationships between objects in the same or different frames, the graph structure at the video frame level models the interrelations between video frames, and the graph structure at the video clip level establishes the temporal relationships between different actions. The invention also uses an attention mechanism to focus on the vertices and edges of the graphs that are relevant to the question, and connects the graph features of these different levels in a progressive manner. In this way, each graph can focus on its spatio-temporally neighboring vertices and on finer-grained visual content according to visual relevance, thereby improving the accuracy of predicting the answer to the question.
Description
Technical Field
The invention belongs to the technical field of video question answering (Video-QA), and particularly relates to a video question-answering method based on a progressive graph attention network.
Background
In the prior art, video question answering (Video-QA) mainly aims at answering natural language questions related to video content, and is therefore crucial for video content understanding. The classic video question-answering method mainly comprises three steps: 1) extracting video features and question features with a convolutional neural network (CNN) model and a recurrent neural network (RNN) model, respectively; 2) under the guidance of the question features, attending to the parts of the video features that are relevant to answering the question, thereby obtaining a more expressive video representation; 3) fusing the video features and the question features into a multi-modal feature representation, and predicting the answer to the question through a question-answering module.
Based on this classical framework, existing video question-answering methods mainly focus on visual reasoning along the temporal and spatial dimensions. Some methods use a spatio-temporal attention mechanism (Spatial-Temporal Attention) to focus on the question-relevant, valuable spatio-temporal information in the video. Other methods explore the visual relationship features present in the video, thereby providing more effective semantic information for answer reasoning.
Most existing methods use an attention mechanism (Attention) or a graph network structure (GCN) to explore a single type of interaction between objects or frames in a video. However, such interactions are often insufficient to represent complex video scenes, because a video involves not only the spatio-temporal relationships between objects and the interrelations between video frames, but also the temporal relationships between the actions therein; the accuracy of the predicted answer is therefore low.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a video question-answering method based on a progressive graph attention network, so as to improve the accuracy of the predicted answer to the question.
In order to achieve the above object, the video question-answering method based on the progressive graph attention network of the present invention is characterized by comprising the following steps:
(1) visual feature extraction
Dividing a video consisting of a frame sequence into N video segments, wherein each segment comprises L frames;
First, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n, n = 1, 2, ..., N, of each video clip, where each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video clips;
Then, a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l}, n = 1, 2, ..., N, l = 1, 2, ..., L, of each video frame, where each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames;
Then, Faster R-CNN (faster region-based convolutional neural network) is used to extract the object-level feature o_{n,l,k}, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, of each object in each video frame, where K is the number of objects extracted from each video frame, each object-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K object-level features;
Finally, the question is encoded using a long short-term memory (LSTM) network to obtain a representation of the question:
All words in the question are first encoded into a sequence of word vectors using a word-embedding model, and the word-vector sequence is then input into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question; finally, a self-attention mechanism is used to focus on the important words in the question to obtain the question representation, which is expressed by the following formula:
where the learnable parameter is a real-valued matrix with d_h rows, the attention weight of the s-th word in the question is computed from it, and v_q is the question representation;
(2) Constructing a progressive graph attention network (comprising three graph attention networks at different levels)
2.1) Constructing an object-level graph attention network to capture the spatio-temporal relationships between objects
Build the object-level graph G_o = {V_o, ε_o, A_o}, where V_o is the set of vertices of the graph, each vertex representing a detected object, ε_o is the set of edges of the graph, representing the relationships between all objects in each video frame, and A_o is the associated adjacency matrix;
The question representation and the object-level features o_{n,l,k} (for simplicity, the object-level feature o_{n,l,k} is denoted o_i, i = 1, 2, ..., NLK) are used jointly to generate a suitable adjacency matrix:
First, the question feature v_q is aggregated with each object-level feature o_i:
where φ′(·) and φ″(·) are fully connected networks with ReLU activation that convert the features into d_h-dimensional vectors, the operation combining them is a dot product, and NLK = N×L×K;
Then, the dependency value in the adjacency matrix A_o between the i-th object and the j-th object is given by the following formula:
where T denotes the transpose;
Each object-level feature o_i is updated to o′_i:
An attention mechanism is then used to focus on the objects in the video frames that are relevant to the question; this attention process is expressed by the following formula:
v_o = Attention(O′, v_q)    (7)
2.2) Constructing a video-frame-level graph attention network to capture the interrelations between video frames
Build the frame-level graph G_f = {V_f, ε_f, A_f}, where V_f is the set of vertices of the graph, each vertex representing a video frame, ε_f is the set of edges of the graph, representing the relationships between the video frames, and A_f is the associated adjacency matrix;
For simplicity, the frame-level feature f_{n,l} is denoted f_{i′}, i′ = 1, 2, ..., NL; the NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_{NL}} = {f′_{i′} | i′ = 1, 2, ..., NL} are obtained, where each fused frame-level feature f′_{i′} is:
where the operator denotes element-wise addition and the fully connected network with ReLU activation converts the feature into a d_f-dimensional vector;
First, the question feature v_q is aggregated with each fused frame-level feature f′_{i′}:
where the two mappings are fully connected networks with ReLU activation that convert the features into d_h-dimensional vectors;
Then, the dependency value in the adjacency matrix A_f between the i′-th video frame and the j′-th video frame is given by the following formula:
Each fused frame-level feature f′_{i′} is updated to f″_{i′}:
Under the guidance of the question features, an attention mechanism is used to obtain the aggregated frame-level feature v_f:
v_f = Attention(F″, v_q)    (13)
2.3) Constructing a video-clip-level graph attention network to establish the temporal and semantic relationships between actions in the video clips
Build the clip-level graph G_c = {V_c, ε_c, A_c}, where V_c is the set of vertices, each vertex representing a video clip, ε_c is the set of edges of the graph, representing the relationships between the video clips, and A_c is the associated adjacency matrix;
The clip-level features C of the N video clips are fused with the aggregated frame-level feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is:
where ω′(·) is a fully connected network with ReLU activation that converts the feature into a d_c-dimensional vector;
The question feature v_q is then aggregated with each fused clip-level feature c′_n, where the two fully connected networks with ReLU activation convert the features into d_h-dimensional vectors;
Then, the dependency value in the adjacency matrix A_c between the n-th video clip and the k-th video clip is given by the following formula:
Each fused clip-level feature c′_n is updated to c″_n:
Under the guidance of the question features, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)    (19)
where the aggregated video feature v_c has dimension d_c;
(3) Answer prediction
For the open-ended task, the visual information and the question information are first fused, and the fused information is then input into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)    (20)
p = softmax(W_o g)    (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation, W_o is a learnable parameter, and p is the probability vector; a cross-entropy loss function is used to update the fully connected network parameters and the parameters of the softmax classifier;
For the multiple-choice task, the visual information, the question information and the answer representation are first fused in series, and the fused feature is then fed into a final classifier for linear regression, which outputs the answer index y:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)    (22)
y = W_m g′    (23)
where v_a is the answer representation and W_m is a learnable parameter; a pairwise-comparison hinge loss function is used to update the fully connected network parameters and the classifier parameters;
For the counting task, a linear regression function is used with g from equation (20) as input; the counting result is then computed with a rounding function, and the linear regression parameters are updated using a mean squared error (MSE) loss function.
The object of the invention is thus achieved.
The invention relates to a video question-answering method based on a progressive graph attention network, in which a novel progressive graph attention network is adopted to explore visual interactions at the object level, the video frame level and the video clip level in a progressive manner. In the progressive graph attention network, the graph structure at the object level captures the spatio-temporal relationships between objects in the same or different frames, the graph structure at the video frame level models the interrelations between video frames, and the graph structure at the video clip level establishes the temporal relationships between different actions. The invention also uses an attention mechanism to focus on the vertices and edges of the graphs that are relevant to the question, and connects the graph features of these different levels in a progressive manner. In this way, each graph can focus on its spatio-temporally neighboring vertices and on finer-grained visual content according to visual relevance, thereby improving the accuracy of predicting the answer to the question.
Drawings
FIG. 1 is a flow chart of an embodiment of a video question-answering method based on a progressive graph attention network according to the present invention;
FIG. 2 is a schematic diagram of a video question-answering method based on a progressive graph attention network according to an embodiment of the present invention.
Detailed Description
The following description of specific embodiments of the present invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the present invention. It should be expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the present invention.
Fig. 1 is a flowchart of an embodiment of a video question-answering method based on a progressive graph attention network according to the present invention.
In this embodiment, as shown in fig. 1, the video question-answering method based on the progressive graph attention network of the present invention includes the following steps:
step S1: visual feature extraction
In this embodiment, as shown in fig. 2, the present invention extracts three different levels of visual features, and divides a video V composed of a frame sequence into N video segments, each segment including L frames.
First, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n, n = 1, 2, ..., N, of each video clip, where each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video clips. In this embodiment, the 3D CNN adopts a ResNeXt-101 network.
Then, a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l}, n = 1, 2, ..., N, l = 1, 2, ..., L, of each video frame, where each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames. In this embodiment, the 2D CNN adopts a ResNet-152 network.
Then, Faster R-CNN (faster region-based convolutional neural network) is used to extract the object-level feature o_{n,l,k}, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, of each object in each video frame, where K is the number of objects extracted from each video frame, each object-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K object-level features.
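For illustration only, the following is a minimal PyTorch sketch of this three-level feature extraction. The torchvision models r3d_18, resnet152 and fasterrcnn_resnet50_fpn are assumed stand-ins for the ResNeXt-101 3D CNN, ResNet-152 and Faster R-CNN backbones named in this embodiment; the tensor shapes, pooling, and the use of detection boxes as object placeholders are assumptions rather than the claimed implementation.

```python
# Hypothetical sketch of Step S1's visual feature extraction; backbone choices,
# shapes and the box placeholder for object features are assumptions.
import torch
import torchvision

class VisualFeatureExtractor(torch.nn.Module):
    def __init__(self, K=5):
        super().__init__()
        self.K = K                                             # objects kept per frame
        self.clip_cnn = torchvision.models.video.r3d_18(weights="DEFAULT")
        self.clip_cnn.fc = torch.nn.Identity()                 # clip-level feature c_n
        frame_cnn = torchvision.models.resnet152(weights="DEFAULT")
        frame_cnn.fc = torch.nn.Identity()                     # frame-level feature f_{n,l}
        self.frame_cnn = frame_cnn
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

    @torch.no_grad()
    def forward(self, video):                                  # video: (N, L, 3, H, W) in [0, 1]
        N, L = video.shape[:2]
        C = self.clip_cnn(video.transpose(1, 2))               # (N, d_c): one feature per clip
        frames = video.flatten(0, 1)                           # (N*L, 3, H, W)
        F = self.frame_cnn(frames)                             # (N*L, d_f): one feature per frame
        self.detector.eval()
        dets = self.detector(list(frames))                     # per-frame detections
        # Boxes serve only as placeholders here; a full implementation would
        # RoI-pool region features o_{n,l,k} for the top-K objects of each frame.
        O = [d["boxes"][: self.K] for d in dets]
        return C, F, O
```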
Finally, the question Q is encoded using a Long Short Term Memory (LSTM) network to obtain a representation of the question:
All words in the question Q are first encoded into a sequence of word vectors using a GloVe word-embedding model, and the word-vector sequence is then input into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question Q; finally, a self-attention mechanism is used to focus on the important words in the question to obtain the representation of the question Q, which is expressed by the following formula:
where the learnable parameter is a real-valued matrix with d_h rows, the attention weight of the s-th word in the question is computed from it, and v_q is the question representation.
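As an illustrative, non-limiting sketch, the question encoder of this step can be written as follows; the two-layer scoring network used for the self-attention weights is an assumption, since the corresponding formula appears only in the drawings.

```python
# Hypothetical question encoder (Step S1): word embeddings -> LSTM -> self-attention
# pooling into the question representation v_q.  The scorer architecture is assumed.
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, d_word=300, d_q=512, d_h=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_word)      # GloVe vectors would be loaded here
        self.lstm = nn.LSTM(d_word, d_q, batch_first=True)
        self.score = nn.Sequential(                         # assumed self-attention scorer
            nn.Linear(d_q, d_h), nn.Tanh(), nn.Linear(d_h, 1)
        )

    def forward(self, question_ids):                        # (B, S) word indices
        H, _ = self.lstm(self.embed(question_ids))          # hidden outputs H = {h_1, ..., h_S}
        alpha = torch.softmax(self.score(H), dim=1)         # attention weight of each word
        v_q = (alpha * H).sum(dim=1)                        # question representation v_q
        return v_q, H
```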
Step S2: building a progressive graph attention network
In this embodiment, as shown in fig. 2, the present invention designs a progressive graph attention network for reasoning about the valuable visual information related to the question, which comprises three graph attention networks at different levels: the first is an object-level graph attention network, which captures the spatio-temporal relationships between objects; the second is a video-frame-level graph attention network, which explores the interrelations between video frames; the last is a video-clip-level graph attention network, which establishes the temporal and semantic relationships between actions in the video clips.
Step S2.1: Constructing an object-level graph attention network to capture the spatio-temporal relationships between objects
Build the object-level graph G_o = {V_o, ε_o, A_o}, where V_o is the set of vertices of the graph, each vertex representing a detected object, ε_o is the set of edges of the graph, representing the relationships between all objects in each video frame, and A_o is the associated adjacency matrix.
The invention builds the object-level graph structure mainly to establish the relationship between any two objects in the video under the guidance of the question features. Therefore, the question features and the visual features must be combined to generate a suitable adjacency matrix.
The question representation and the object-level features o_{n,l,k} (for simplicity, the object-level feature o_{n,l,k} is denoted o_i, i = 1, 2, ..., NLK) are used jointly to generate a suitable adjacency matrix A_o.
First, the question feature v_q is aggregated with each object-level feature o_i:
where φ′(·) and φ″(·) are fully connected networks with ReLU activation that convert the features into d_h-dimensional vectors, the operation combining them is a dot product, and NLK = N×L×K.
Then, the dependency value in the adjacency matrix A_o between the i-th object and the j-th object is given by the following formula:
where T denotes the transpose.
Based on the computed adjacency matrix A_o, the invention updates each object feature o_i with the other related objects, thereby preserving both local and long-term dependencies between objects. Specifically, each object-level feature o_i is updated to o′_i:
where the "+ o_i" term serves as a residual connection.
An attention mechanism is then used to focus on the objects in the video frames that are relevant to the question; this attention process is expressed by the following formula:
v_o = Attention(O′, v_q)    (7)
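For illustration, a question-guided graph attention layer of the kind described above, together with the Attention(·, v_q) pooling reused in equations (7), (13) and (19), could be sketched as follows; the row-wise softmax over the adjacency matrix and the additive pooling form are assumptions, as the exact formulas appear only in the drawings.

```python
# Hypothetical question-guided graph layer (Step S2.1) and Attention(X, v_q) pooling;
# the softmax-normalised adjacency and the additive attention form are assumptions.
import torch
import torch.nn as nn

class QuestionGuidedGraphLayer(nn.Module):
    def __init__(self, d_in, d_q, d_h):
        super().__init__()
        self.phi_x = nn.Sequential(nn.Linear(d_in, d_h), nn.ReLU())  # node branch (phi')
        self.phi_q = nn.Sequential(nn.Linear(d_q, d_h), nn.ReLU())   # question branch (phi'')
        self.proj = nn.Linear(d_h, d_in)

    def forward(self, X, v_q):
        # X: (B, M, d_in) vertex features (objects, frames or clips); v_q: (B, d_q)
        Z = self.phi_x(X) * self.phi_q(v_q).unsqueeze(1)              # aggregate v_q into each vertex
        A = torch.softmax(Z @ Z.transpose(1, 2), dim=-1)              # adjacency values A^(i,j)
        return self.proj(A @ Z) + X                                   # update with residual "+ o_i"

class QuestionGuidedAttention(nn.Module):
    def __init__(self, d_in, d_q, d_h):
        super().__init__()
        self.w_x = nn.Linear(d_in, d_h)
        self.w_q = nn.Linear(d_q, d_h)
        self.w_s = nn.Linear(d_h, 1)

    def forward(self, X, v_q):                                        # Attention(X, v_q)
        s = self.w_s(torch.tanh(self.w_x(X) + self.w_q(v_q).unsqueeze(1)))
        return (torch.softmax(s, dim=1) * X).sum(dim=1)               # pooled v_o / v_f / v_c
```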
Step S2.2: Constructing a video-frame-level graph attention network to capture the interrelations between video frames
The relationships between different video frames can record the changes in detailed appearance information when an action occurs or transitions in the video. The invention builds a frame-level graph structure to capture these detailed appearance changes.
Build the frame-level graph G_f = {V_f, ε_f, A_f}, where V_f is the set of vertices of the graph, each vertex representing a video frame, ε_f is the set of edges of the graph, representing the relationships between the video frames, and A_f is the associated adjacency matrix.
For simplicity, the frame-level feature f_{n,l} is denoted f_{i′}, i′ = 1, 2, ..., NL.
The two features are fused to obtain the NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_{NL}} = {f′_{i′} | i′ = 1, 2, ..., NL}, where each fused frame-level feature f′_{i′} is:
where the operator denotes element-wise addition and the fully connected network with ReLU activation converts the feature into a d_f-dimensional vector.
First, the question feature v_q is aggregated with each fused frame-level feature f′_{i′}:
where the two mappings are fully connected networks with ReLU activation that convert the features into d_h-dimensional vectors.
Then, the dependency value in the adjacency matrix A_f between the i′-th video frame and the j′-th video frame is given by the following formula:
Each fused frame-level feature f′_{i′} is updated to f″_{i′}:
Under the guidance of the question features, an attention mechanism is used to obtain the aggregated frame-level feature v_f:
v_f = Attention(F″, v_q)    (13)
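The element-wise fusion used in this step (and analogously at the clip level in Step S2.3) can be illustrated with the following sketch, assuming, as the progressive design suggests, that the finer-level summary (v_o for frames, v_f for clips) is what gets projected and added; the projection form is an assumption.

```python
# Hypothetical progressive fusion (Steps S2.2 / S2.3): project the finer-level
# summary and add it element-wise to every vertex of the next level.
import torch.nn as nn

class ProgressiveFusion(nn.Module):
    def __init__(self, d_fine, d_coarse):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_fine, d_coarse), nn.ReLU())  # FC + ReLU

    def forward(self, coarse_feats, fine_summary):
        # coarse_feats: (B, M, d_coarse) frame or clip features
        # fine_summary: (B, d_fine)      v_o (for frames) or v_f (for clips)
        return coarse_feats + self.proj(fine_summary).unsqueeze(1)         # element-wise addition
```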
Step S2.3: Constructing a video-clip-level graph attention network to establish the temporal and semantic relationships between actions in the video clips
The invention divides the video into several short video clips and constructs a clip-level graph structure (video segment level graph) to represent the temporal and semantic relationships between the actions in different video clips.
Build the clip-level graph G_c = {V_c, ε_c, A_c}, where V_c is the set of vertices, each vertex representing a video clip, ε_c is the set of edges of the graph, representing the relationships between the video clips, and A_c is the associated adjacency matrix.
The clip-level features C of the N video clips are fused with the aggregated frame-level feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is:
where ω′(·) is a fully connected network with ReLU activation that converts the feature into a d_c-dimensional vector.
The question feature v_q is then aggregated with each fused clip-level feature c′_n, where the two fully connected networks with ReLU activation convert the features into d_h-dimensional vectors.
Then, the dependency value in the adjacency matrix A_c between the n-th video clip and the k-th video clip is given by the following formula:
Each fused clip-level feature c′_n is updated to c″_n:
Under the guidance of the question features, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)    (19)
where the aggregated video feature v_c has dimension d_c.
In this way, the aggregated video feature v_c incorporates the object information as well as the global and dynamic information of the video frames, which improves the accuracy of predicting the answer to the question.
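Putting the pieces together, the progressive flow of Steps S2.1–S2.3 can be illustrated by composing the modules sketched above (QuestionGuidedGraphLayer, QuestionGuidedAttention and ProgressiveFusion); the wiring and dimensions below are assumptions for illustration only.

```python
# Hypothetical end-to-end forward pass of the progressive graph attention network,
# reusing the classes sketched in the previous steps.  Wiring is an assumption.
import torch.nn as nn

class ProgressiveGraphAttention(nn.Module):
    def __init__(self, d_o, d_f, d_c, d_q, d_h):
        super().__init__()
        self.obj_graph = QuestionGuidedGraphLayer(d_o, d_q, d_h)
        self.obj_pool = QuestionGuidedAttention(d_o, d_q, d_h)
        self.fuse_obj_frame = ProgressiveFusion(d_o, d_f)
        self.frame_graph = QuestionGuidedGraphLayer(d_f, d_q, d_h)
        self.frame_pool = QuestionGuidedAttention(d_f, d_q, d_h)
        self.fuse_frame_clip = ProgressiveFusion(d_f, d_c)
        self.clip_graph = QuestionGuidedGraphLayer(d_c, d_q, d_h)
        self.clip_pool = QuestionGuidedAttention(d_c, d_q, d_h)

    def forward(self, O, F, C, v_q):
        # O: (B, N*L*K, d_o) objects, F: (B, N*L, d_f) frames, C: (B, N, d_c) clips
        v_o = self.obj_pool(self.obj_graph(O, v_q), v_q)               # Step S2.1, Eq. (7)
        F_fused = self.fuse_obj_frame(F, v_o)                          # Step S2.2 fusion
        v_f = self.frame_pool(self.frame_graph(F_fused, v_q), v_q)     # Eq. (13)
        C_fused = self.fuse_frame_clip(C, v_f)                         # Step S2.3 fusion
        v_c = self.clip_pool(self.clip_graph(C_fused, v_q), v_q)       # Eq. (19)
        return v_c
```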
Step S3: answer prediction
For the open-ended task, the visual information and the question information are first fused, and the fused information is then input into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)    (20)
p = softmax(W_o g)    (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation and W_o is a learnable parameter; a cross-entropy loss function is used to update the fully connected network parameters and the parameters of the softmax classifier.
For the multiple-choice task, the visual information, the question information and the answer representation are first fused in series, and the fused feature is then fed into a final classifier for linear regression, which outputs the answer index y:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)    (22)
y = W_m g′    (23)
where v_a is the answer representation and W_m is a learnable parameter; a pairwise-comparison hinge loss function is used to update the fully connected network parameters and the classifier parameters.
For the counting task, a linear regression function is used with g from equation (20) as input; the counting result is then computed with a rounding function, and the linear regression parameters are updated using a mean squared error (MSE) loss function.
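A sketch of the three answer heads described in this step is given below for illustration; the hidden sizes are assumptions, and only the losses (cross-entropy, pairwise hinge, MSE) and the rounding of the count are taken from the description above.

```python
# Hypothetical answer-prediction heads (Step S3); hidden sizes are assumptions.
import torch
import torch.nn as nn

class AnswerHeads(nn.Module):
    def __init__(self, d_c, d_q, d_a, d_h, num_answers):
        super().__init__()
        self.rho_v = nn.Sequential(nn.Linear(d_c, d_h), nn.ReLU())   # rho'
        self.rho_q = nn.Sequential(nn.Linear(d_q, d_h), nn.ReLU())   # rho''
        self.rho_a = nn.Sequential(nn.Linear(d_a, d_h), nn.ReLU())   # rho'''
        self.W_o = nn.Linear(d_h, num_answers)                       # open-ended classifier
        self.W_m = nn.Linear(d_h, 1)                                 # multiple-choice scorer
        self.W_count = nn.Linear(d_h, 1)                             # counting regressor

    def open_ended(self, v_c, v_q):
        g = self.rho_v(v_c) * self.rho_q(v_q)                        # Eq. (20)
        return torch.softmax(self.W_o(g), dim=-1)                    # Eq. (21), cross-entropy training

    def multiple_choice(self, v_c, v_q, v_a):
        g2 = self.rho_v(v_c) * self.rho_q(v_q) * self.rho_a(v_a)     # Eq. (22)
        return self.W_m(g2).squeeze(-1)                              # Eq. (23), pairwise hinge training

    def count(self, v_c, v_q):
        raw = self.W_count(self.rho_v(v_c) * self.rho_q(v_q)).squeeze(-1)
        return torch.round(raw)                                      # MSE on raw value; round at inference
```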
Examples of the invention
Experiments show that the two multiple-choice sub-datasets in the existing large video question-answering dataset TGIF-QA suffer from severe answer bias, which can strongly affect model accuracy. To address this problem, this example builds a new dataset, TGIF-QA-R, based on TGIF-QA. In this dataset the candidate answers are mutually independent, which effectively reduces the influence of answer bias.
The method is evaluated on three large benchmark datasets, TGIF-QA, MSVD-QA and MSRVTT-QA, as well as on the newly constructed TGIF-QA-R dataset, and performs better than state-of-the-art methods.
1. Test results on TGIF-QA and TGIF-QA-R datasets
TABLE 1
As can be seen from Table 1, the present invention performs best on most of the subtasks, with 57.6% and 65.6% accuracy on the Action and Trans. subtasks of TGIF-QA-R, and 79.5%, 85.3% and 62.8% accuracy on the Action, Trans. and Frame subtasks of TGIF-QA, respectively.
2. Test results on MSVD-QA dataset
TABLE 2
From Table 2, it can be seen that the present invention achieves the highest level of performance in terms of overall accuracy, increasing the accuracy from 36.5% to 39.8%.
3. Test results on the MSRVTT-QA dataset:
From Table 3, it can be seen that the present invention achieves the highest level of performance in terms of overall accuracy, increasing the accuracy from 35.5% to 38.2%.
Although illustrative embodiments of the present invention have been described above to facilitate understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited to the scope of these specific embodiments. Various changes will be apparent to those skilled in the art as long as they remain within the spirit and scope of the present invention as defined by the appended claims, and all inventions utilizing the inventive concept are protected.
Claims (1)
1. A video question-answering method based on a progressive graph attention network is characterized by comprising the following steps:
(1) visual feature extraction
Dividing a video consisting of a frame sequence into N video segments, wherein each segment comprises L frames;
first, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n, n = 1, 2, ..., N, of each video clip, where each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video clips;
then, a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l}, n = 1, 2, ..., N, l = 1, 2, ..., L, of each video frame, where each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames;
then, Faster R-CNN (faster region-based convolutional neural network) is used to extract the object-level feature o_{n,l,k}, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, of each object in each video frame, where K is the number of objects extracted from each video frame, each object-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K object-level features;
finally, the question is encoded using a long short-term memory (LSTM) network to obtain a representation of the question:
all words in the question are first encoded into a sequence of word vectors using a word-embedding model, and the word-vector sequence is then input into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question; finally, a self-attention mechanism is used to focus on the important words in the question to obtain the question representation, which is expressed by the following formula:
where the learnable parameter is a real-valued matrix with d_h rows, the attention weight of the s-th word in the question is computed from it, and v_q is the question representation;
(2) constructing a progressive graph attention network (comprising three graph attention networks at different levels)
2.1) constructing an object-level graph attention network to capture the spatio-temporal relationships between objects
build the object-level graph G_o = {V_o, ε_o, A_o}, where V_o is the set of vertices of the graph, each vertex representing a detected object, ε_o is the set of edges of the graph, representing the relationships between all objects in each video frame, and A_o is the associated adjacency matrix;
the question representation and the object-level features o_{n,l,k} (for simplicity, the object-level feature o_{n,l,k} is denoted o_i, i = 1, 2, ..., NLK) are used jointly to generate a suitable adjacency matrix:
first, the question feature v_q is aggregated with each object-level feature o_i:
where φ′(·) and φ″(·) are fully connected networks with ReLU activation that convert the features into d_h-dimensional vectors, the operation combining them is a dot product, and NLK = N×L×K;
then, the dependency value in the adjacency matrix A_o between the i-th object and the j-th object is given by the following formula:
where T denotes the transpose;
each object-level feature o_i is updated to o′_i:
an attention mechanism is then used to focus on the objects in the video frames that are relevant to the question; this attention process is expressed by the following formula:
v_o = Attention(O′, v_q)    (7)
2.2) constructing a video-frame-level graph attention network to capture the interrelations between video frames
build the frame-level graph G_f = {V_f, ε_f, A_f}, where V_f is the set of vertices of the graph, each vertex representing a video frame, ε_f is the set of edges of the graph, representing the relationships between the video frames, and A_f is the associated adjacency matrix;
for simplicity, the frame-level feature f_{n,l} is denoted f_{i′}, i′ = 1, 2, ..., NL; the NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_{NL}} = {f′_{i′} | i′ = 1, 2, ..., NL} are obtained, where each fused frame-level feature f′_{i′} is:
where the operator denotes element-wise addition and the fully connected network with ReLU activation converts the feature into a d_f-dimensional vector;
first, the question feature v_q is aggregated with each fused frame-level feature f′_{i′}:
where the two mappings are fully connected networks with ReLU activation that convert the features into d_h-dimensional vectors;
then, the dependency value in the adjacency matrix A_f between the i′-th video frame and the j′-th video frame is given by the following formula:
each fused frame-level feature f′_{i′} is updated to f″_{i′}:
under the guidance of the question features, an attention mechanism is used to obtain the aggregated frame-level feature v_f:
v_f = Attention(F″, v_q)    (13)
2.3) constructing a video-clip-level graph attention network to establish the temporal and semantic relationships between actions in the video clips
build the clip-level graph G_c = {V_c, ε_c, A_c}, where V_c is the set of vertices, each vertex representing a video clip, ε_c is the set of edges of the graph, representing the relationships between the video clips, and A_c is the associated adjacency matrix;
the clip-level features C of the N video clips are fused with the aggregated frame-level feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is:
where ω′(·) is a fully connected network with ReLU activation that converts the feature into a d_c-dimensional vector;
the question feature v_q is then aggregated with each fused clip-level feature c′_n, where the two fully connected networks with ReLU activation convert the features into d_h-dimensional vectors;
then, the dependency value in the adjacency matrix A_c between the n-th video clip and the k-th video clip is given by the following formula:
each fused clip-level feature c′_n is updated to c″_n:
under the guidance of the question features, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)    (19)
where the aggregated video feature v_c has dimension d_c;
(3) Answer prediction
for the open-ended task, the visual information and the question information are first fused, and the fused information is then input into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)    (20)
p = softmax(W_o g)    (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation, W_o is a learnable parameter, and p is the probability vector; a cross-entropy loss function is used to update the fully connected network parameters and the parameters of the softmax classifier;
for the multiple-choice task, the visual information, the question information and the answer representation are first fused in series, and the fused feature is then fed into a final classifier for linear regression, which outputs the answer index y:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)    (22)
y = W_m g′    (23)
where v_a is the answer representation and W_m is a learnable parameter; a pairwise-comparison hinge loss function is used to update the fully connected network parameters and the classifier parameters;
for the counting task, a linear regression function is used with g from equation (20) as input; the counting result is then computed with a rounding function, and the linear regression parameters are updated using a mean squared error (MSE) loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011501849.9A CN112488055B (en) | 2020-12-18 | 2020-12-18 | Video question-answering method based on progressive graph attention network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011501849.9A CN112488055B (en) | 2020-12-18 | 2020-12-18 | Video question-answering method based on progressive graph attention network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112488055A true CN112488055A (en) | 2021-03-12 |
CN112488055B CN112488055B (en) | 2022-09-06 |
Family
ID=74914783
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011501849.9A Active CN112488055B (en) | 2020-12-18 | 2020-12-18 | Video question-answering method based on progressive graph attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112488055B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN107766447A (en) * | 2017-09-25 | 2018-03-06 | 浙江大学 | It is a kind of to solve the method for video question and answer using multilayer notice network mechanism |
CN108829756A (en) * | 2018-05-25 | 2018-11-16 | 杭州知智能科技有限公司 | A method of more wheel video question and answer are solved using layering attention context network |
CN109829049A (en) * | 2019-01-28 | 2019-05-31 | 杭州一知智能科技有限公司 | The method for solving video question-answering task using the progressive space-time attention network of knowledge base |
CN110222770A (en) * | 2019-06-10 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of vision answering method based on syntagmatic attention network |
CN110704601A (en) * | 2019-10-11 | 2020-01-17 | 浙江大学 | Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network |
CN111652202A (en) * | 2020-08-10 | 2020-09-11 | 浙江大学 | Method and system for solving video question-answer problem by improving video-language representation learning through self-adaptive space-time diagram model |
Non-Patent Citations (2)
Title |
---|
MIKYAS T. DESTA, LARRY CHEN, TOMASZ KORNUTA: "Object-Based Reasoning in VQA", 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) * |
YAN Ruyu et al.: "Visual question answering model combining a bottom-up attention mechanism and memory networks", Journal of Image and Graphics (《中国图象图形学报》) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113536952A (en) * | 2021-06-22 | 2021-10-22 | 电子科技大学 | Video question-answering method based on attention network of motion capture |
CN113536952B (en) * | 2021-06-22 | 2023-04-21 | 电子科技大学 | Video question-answering method based on attention network of motion capture |
CN113609330A (en) * | 2021-07-15 | 2021-11-05 | 哈尔滨理工大学 | Video question-answering system, method, computer and storage medium based on text attention and fine-grained information |
CN116074575A (en) * | 2021-11-01 | 2023-05-05 | 国际商业机器公司 | Transducer for real world video question answering |
CN114398961A (en) * | 2021-12-28 | 2022-04-26 | 西南交通大学 | Visual question-answering method based on multi-mode depth feature fusion and model thereof |
CN114495282A (en) * | 2022-02-14 | 2022-05-13 | 中国科学技术大学 | Video motion detection method, system, device and storage medium |
CN116385937A (en) * | 2023-04-07 | 2023-07-04 | 哈尔滨理工大学 | Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework |
CN117235670A (en) * | 2023-11-10 | 2023-12-15 | 南京信息工程大学 | Medical image problem vision solving method based on fine granularity cross attention |
Also Published As
Publication number | Publication date |
---|---|
CN112488055B (en) | 2022-09-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112488055B (en) | Video question-answering method based on progressive graph attention network | |
CN110008338B (en) | E-commerce evaluation emotion analysis method integrating GAN and transfer learning | |
CN109544524B (en) | Attention mechanism-based multi-attribute image aesthetic evaluation system | |
CN108563653B (en) | Method and system for constructing knowledge acquisition model in knowledge graph | |
WO2019056628A1 (en) | Generation of point of interest copy | |
CN108765383A (en) | Video presentation method based on depth migration study | |
Zhang et al. | Recurrent attention network using spatial-temporal relations for action recognition | |
CN110046353B (en) | Aspect level emotion analysis method based on multi-language level mechanism | |
CN114339450B (en) | Video comment generation method, system, device and storage medium | |
CN114625882B (en) | Network construction method for improving unique diversity of image text description | |
CN112527993B (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
CN114912419B (en) | Unified machine reading understanding method based on recombination countermeasure | |
CN115510814B (en) | Chapter-level complex problem generation method based on dual planning | |
CN113627550A (en) | Image-text emotion analysis method based on multi-mode fusion | |
CN114048314B (en) | Natural language steganalysis method | |
Liu et al. | The use of deep learning technology in dance movement generation | |
Zhu et al. | PBGN: Phased bidirectional generation network in text-to-image synthesis | |
Alrashidi et al. | Hybrid CNN-based Recommendation System | |
CN113783715A (en) | Opportunistic network topology prediction method adopting causal convolutional neural network | |
CN116148864A (en) | Radar echo extrapolation method based on DyConvGRU and Unet prediction refinement structure | |
Fu et al. | Gendds: Generating diverse driving video scenarios with prompt-to-video generative model | |
KR20190134308A (en) | Data augmentation method and apparatus using convolution neural network | |
KR20230121507A (en) | Knowledge distillation for graph-based video captioning | |
CN114818739A (en) | Visual question-answering method optimized by using position information | |
CN117012180B (en) | Voice conversion model training method, voice conversion method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||