CN112488055A - Video question-answering method based on progressive graph attention network - Google Patents

Video question-answering method based on progressive graph attention network

Info

Publication number
CN112488055A
Authority
CN
China
Prior art keywords
video
question
feature
video frame
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011501849.9A
Other languages
Chinese (zh)
Other versions
CN112488055B (en)
Inventor
杨阳 (Yang Yang)
彭亮 (Peng Liang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Guangdong Electronic Information Engineering Research Institute of UESTC
Original Assignee
Guizhou University
Guangdong Electronic Information Engineering Research Institute of UESTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University, Guangdong Electronic Information Engineering Research Institute of UESTC
Priority to CN202011501849.9A
Publication of CN112488055A
Application granted
Publication of CN112488055B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video question-answering method based on a progressive graph attention network, in which a novel progressive graph attention network is adopted and the visual interactions at the target level, the video frame level and the video clip level are explored in a progressive manner. In the progressive graph attention network, the graph structure at the target level captures the spatio-temporal relations between targets in the same frame or in different frames, the graph structure at the video frame level studies the interrelations between video frames, and the graph structure at the video clip level models the temporal relations between different actions. The invention further uses an attention mechanism to focus on the vertices and edges of each graph that are related to the question, and connects the graph features of these different levels in a progressive manner. In this way, each graph can focus on its spatio-temporally neighbouring vertices and on finer-grained visual content according to visual relevance, which improves the accuracy of predicting the answer to the question.

Description

Video question-answering method based on progressive graph attention network
Technical Field
The invention belongs to the technical field of Video Question Answering (Video-QA), and particularly relates to a Video Question Answering method based on a progressive graph attention network.
Background
In the prior art, video question answering (Video-QA) mainly aims at answering natural-language questions about the content of a video and is therefore crucial for video content understanding. The classical video question-answering pipeline consists of three steps: 1) extracting the video features and the question features with a convolutional neural network (CNN) model and a recurrent neural network (RNN) model, respectively; 2) under the guidance of the question features, attending to the question-relevant parts of the video features, thereby obtaining a more expressive video representation; 3) fusing the video features and the question features into a multi-modal representation and predicting the answer to the question with a question-answering module.
Based on this classical framework, existing video question-answering methods mainly focus on visual reasoning along the temporal and spatial dimensions. Some methods use a spatio-temporal attention mechanism to focus on the spatio-temporal information in the video that is relevant and valuable to the question. Other methods explore the visual relation features present in the video, thereby providing more effective semantic information for answer reasoning.
Most existing methods use an attention mechanism or a graph network structure (GCN) to explore a single kind of interaction between the objects or frames in a video. However, such a single interaction is often insufficient to represent complex scenes, because a video involves not only the spatio-temporal relations between objects and the interrelations between video frames, but also the temporal relations of the actions it contains; as a result, the accuracy of the predicted answers is low.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and to provide a video question-answering method based on a progressive graph attention network, so as to improve the accuracy of the predicted answers.
In order to achieve the above object, the video question-answering method based on the progressive graph attention network of the present invention is characterized by comprising the following steps:
(1) visual feature extraction
Dividing a video consisting of a frame sequence into N video segments, wherein each segment comprises L frames;
firstly, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n of each video segment, n = 1, 2, ..., N; each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video segments;
then a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l} of each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L; each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames;
then Faster R-CNN (faster region-based convolutional neural network) is used to extract the target-level feature o_{n,l,k} of each target in each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, where K is the number of targets extracted per video frame; each target-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K target-level features;
finally, the question is encoded using a Long Short-Term Memory (LSTM) network to obtain a representation of the question:
all words in the question are first encoded into a word-vector sequence by a word-embedding model, and the word-vector sequence is then fed into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question; finally, a self-attention mechanism is used to focus on the important words in the question, yielding the question representation according to formulas (1) and (2) (both given only as images in the original document), where W ∈ R^{d_h×d_q} is a learnable parameter (R^{d_h×d_q} denoting the set of real matrices with d_h rows), α_s is the attention weight of the s-th word of the question, and v_q is the question representation;
(2) constructing a progressive graph attention network (comprising three graph attention networks with different levels)
2.1) constructing a target hierarchical graph attention network for obtaining the space-time relation between targets
a target-level graph G^o = {V^o, ε^o, A^o} is built, where V^o is the set of graph vertices, each vertex representing a detected target, ε^o is the set of graph edges, representing the relations among all targets in every video frame, and A^o is the associated adjacency matrix;
the question representation and the target-level features o_{n,l,k} (for simplicity, the target-level feature o_{n,l,k} is written as o_i, i = 1, 2, ..., NLK, with NLK = N×L×K) are used jointly to generate a suitable adjacency matrix:
first, the question feature v_q is aggregated with each target-level feature o_i according to formula (3) (given only as an image in the original document), where φ′(·) and φ″(·) are fully connected networks with ReLU activation that map their inputs to d_h-dimensional vectors and the two projections are combined by a dot product;
then the dependency value A^o_{i,j} between the i-th target and the j-th target in the adjacency matrix A^o is given by formula (4) (image in the original), where T denotes the transpose;
each target-level feature o_i is then updated to o′_i by formulas (5) and (6) (images in the original), which aggregate the features of the related targets according to A^o and add a residual connection "+ o_i";
the updated target-level features o′_i are concatenated to obtain the tensor O′, with NL = N×L;
an attention mechanism is used to focus on the targets in the video frames that are related to the question:
v_o = Attention(O′, v_q)   (7)
where v_o is the aggregated target-level feature with dimension d_o;
2.2) constructing a video frame hierarchy graph attention network for obtaining the correlation among the video frames
a video-frame-level graph G^f = {V^f, ε^f, A^f} is constructed, where V^f is the set of graph vertices, each vertex representing a video frame, ε^f is the set of graph edges, representing the relations among the video frames, and A^f is the associated adjacency matrix;
for simplicity, the frame-level feature f_{n,l} is written as f_{i′}, i′ = 1, 2, ..., NL; the NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_NL} = {f′_{i′} | i′ = 1, 2, ..., NL} are obtained by formula (8) (given only as an image in the original document), which fuses each frame-level feature with the aggregated target-level feature v_o by bit-wise addition and passes the result through a fully connected network with ReLU activation that outputs a d_f-dimensional vector;
first, the question feature v_q is aggregated with each fused frame feature f′_{i′} according to formula (9) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors;
then the dependency value A^f_{i′,j′} between the i′-th video frame and the j′-th video frame in the adjacency matrix A^f is given by formula (10) (image in the original);
each fused frame-level feature f′_{i′} is updated to f″_{i′} by formulas (11) and (12) (images in the original), and the updated frame-level features f″_{i′} are then concatenated to obtain the tensor F″;
under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video-frame feature v_f:
v_f = Attention(F″, v_q)   (13)
where v_f is the aggregated video-frame feature;
2.3) constructing a video-clip-level graph attention network for establishing the temporal and semantic relations between the actions in the video clips
a video-clip-level graph G^c = {V^c, ε^c, A^c} is constructed, where V^c is the set of graph vertices, each vertex representing a video clip, ε^c is the set of graph edges, representing the relations among the video clips, and A^c is the associated adjacency matrix;
the clip-level features C of the N video clips are merged with the aggregated video-frame feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is given by formula (14) (given only as an image in the original document), in which ω′(·) is a fully connected network with ReLU activation that outputs a d_c-dimensional vector;
first, the question feature v_q is aggregated with each fused clip-level feature c′_n according to formula (15) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors;
then the dependency value A^c_{n,k} between the n-th video clip and the k-th video clip in the adjacency matrix A^c is given by formula (16) (image in the original);
each fused clip-level feature c′_n is updated to c″_n by formulas (17) and (18) (images in the original), and the updated clip-level features c″_n are then concatenated to obtain the tensor C″;
under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)   (19)
where the aggregated video feature v_c has dimension d_c;
(3) Answer prediction
For the open-ended task, the visual information and the question information are first fused, and the fused representation is then fed into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)   (20)
p = softmax(W_o g)   (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation, W_o is a learnable parameter and p is the probability vector; the fully connected network parameters and the learnable parameters of the softmax classifier are updated with a cross-entropy loss function;
for the multiple-choice task, the visual information, the question information and the answer representation are first fused, the fused feature is then sent to a final classifier for linear regression, and the answer index y is output:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)   (22)
y = W_m g′   (23)
where v_a is the answer representation and W_m is a learnable parameter; the fully connected network parameters and the learnable parameters of the classifier are updated with a pairwise-comparison hinge loss function;
for the counting task, a linear regression function takes g from formula (20) as input, the counting result is then obtained with a rounding function, and the parameters of the linear regression function are updated with a mean squared error (MSE) loss function.
The object of the invention is thus achieved.
The invention relates to a video question-answering method based on a progressive graph attention network, in which a novel progressive graph attention network is adopted and the visual interactions at the target level, the video frame level and the video clip level are explored in a progressive manner. In the progressive graph attention network, the graph structure at the target level captures the spatio-temporal relations between targets in the same frame or in different frames, the graph structure at the video frame level studies the interrelations between video frames, and the graph structure at the video clip level models the temporal relations between different actions. The invention further uses an attention mechanism to focus on the vertices and edges of each graph that are related to the question, and connects the graph features of these different levels in a progressive manner. In this way, each graph can focus on its spatio-temporally neighbouring vertices and on finer-grained visual content according to visual relevance, which improves the accuracy of predicting the answer to the question.
Drawings
FIG. 1 is a flow chart of an embodiment of a video question-answering method based on a progressive graph attention network according to the present invention;
fig. 2 is a schematic diagram of a video question-answering method based on a progressive graph attention network according to an embodiment of the present invention.
Detailed Description
The following description of embodiments of the invention is provided with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It should be expressly noted that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
Fig. 1 is a flowchart of an embodiment of a video question-answering method based on a progressive graph attention network according to the present invention.
In this embodiment, as shown in fig. 1, the video question-answering method based on the progressive graph attention network of the present invention includes the following steps:
step S1: visual feature extraction
In this embodiment, as shown in fig. 2, the present invention extracts three different levels of visual features, and divides a video V composed of a frame sequence into N video segments, each segment including L frames.
Firstly, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n of each video segment, n = 1, 2, ..., N; each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video segments. In this embodiment, the 3D CNN is a ResNeXt-101 network.
Then a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l} of each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L; each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames. In this embodiment, the 2D CNN is a ResNet-152 network.
Then Faster R-CNN (faster region-based convolutional neural network) is used to extract the target-level feature o_{n,l,k} of each target in each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, where K is the number of targets extracted per video frame; each target-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K target-level features.
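A sketch of this three-level feature extraction, assuming PyTorch/torchvision, is given below. The concrete backbones (torchvision's r3d_18 in place of the ResNeXt-101 3D CNN, resnet152 for the 2D CNN, fasterrcnn_resnet50_fpn as the detector), the crop-and-re-encode shortcut for the target features, and the helper name extract_visual_features are illustrative assumptions, not the exact configuration of the embodiment.

# Hedged sketch of step S1 visual feature extraction (model choices are stand-ins).
import torch
import torchvision.transforms.functional as TF
from torchvision.models import resnet152
from torchvision.models.video import r3d_18
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def extract_visual_features(video, N, L, K=5):
    """video: float tensor (N*L, 3, H, W) with pixel values in [0, 1], already split into N clips of L frames."""
    frames = video

    # Clip-level features c_n: a 3D CNN over each L-frame clip.
    c3d = r3d_18(weights=None)
    c3d.fc = torch.nn.Identity()
    clips = frames.view(N, L, 3, *frames.shape[-2:]).permute(0, 2, 1, 3, 4)  # (N, 3, L, H, W)
    C = c3d(clips)                                                           # (N, d_c)

    # Frame-level features f_{n,l}: a 2D CNN on every frame.
    c2d = resnet152(weights=None)
    c2d.fc = torch.nn.Identity()
    F = c2d(frames)                                                          # (N*L, d_f)

    # Target-level features o_{n,l,k}: keep the top-K detections per frame and
    # re-encode their crops with the 2D CNN (a simplification of RoI features).
    detector = fasterrcnn_resnet50_fpn(weights=None)
    detector.eval()
    with torch.no_grad():
        detections = detector(list(frames))
    per_frame = []
    for img, det in zip(frames, detections):
        boxes = det["boxes"][:K]
        crops = [TF.resized_crop(img, int(b[1]), int(b[0]),
                                 int(b[3] - b[1]) + 1, int(b[2] - b[0]) + 1, [224, 224])
                 for b in boxes]
        crops = torch.stack(crops) if crops else torch.zeros(K, 3, 224, 224)
        per_frame.append(c2d(crops))                                         # (<=K, d_o)
    O = torch.nn.utils.rnn.pad_sequence(per_frame, batch_first=True)         # (N*L, K, d_o)
    return C, F, O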
Finally, the question Q is encoded using a Long Short Term Memory (LSTM) network to obtain a representation of the question:
All words in the question Q are first encoded into a word-vector sequence by a GloVe word-embedding model, and the word-vector sequence is then fed into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question Q. Finally, a self-attention mechanism is used to focus on the important words in the question and obtain the representation of the question Q, computed by formulas (1) and (2) (both given only as images in the original document), where W ∈ R^{d_h×d_q} is a learnable parameter (R^{d_h×d_q} denoting the set of real matrices with d_h rows), α_s is the attention weight of the s-th word of the question, and v_q is the question representation.
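The exact forms of formulas (1) and (2) are available only as images in the original document, so the sketch of the question encoder below assumes a standard single-head additive self-attention over the LSTM hidden states (a learned projection followed by a softmax and a weighted sum); the embedding layer stands in for pretrained GloVe vectors, and all layer sizes are illustrative.

# Hedged sketch of the question encoder of step S1.
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, d_emb=300, d_q=512, d_h=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)   # stand-in for pretrained GloVe vectors
        self.lstm = nn.LSTM(d_emb, d_q, batch_first=True)
        self.W = nn.Linear(d_q, d_h)                   # W in R^{d_h x d_q}
        self.w = nn.Linear(d_h, 1, bias=False)         # scores each word

    def forward(self, tokens):                         # tokens: (B, S) word indices
        H, _ = self.lstm(self.embed(tokens))           # H: (B, S, d_q)
        scores = self.w(torch.tanh(self.W(H)))         # (B, S, 1)
        alpha = torch.softmax(scores, dim=1)           # attention weight alpha_s of each word
        v_q = (alpha * H).sum(dim=1)                   # (B, d_q) question representation
        return v_q, alpha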
Step S2: building a progressive graph attention network
In this embodiment, as shown in fig. 2, the invention designs a progressive graph attention network for reasoning about the valuable visual information related to the question. It comprises three graph attention networks at different levels: the first is a target-level graph attention network, used to obtain the spatio-temporal relations between targets; the second is a video-frame-level graph attention network, used to explore the interrelations between video frames; the last is a video-clip-level graph attention network, used to establish the temporal and semantic relations between the actions in the video clips.
Step S2.1: constructing a target hierarchical graph attention network for obtaining the time-space relationship between targets
A target-level graph G^o = {V^o, ε^o, A^o} is built, where V^o is the set of graph vertices, each vertex representing a detected target, ε^o is the set of graph edges, representing the relations among all targets in every video frame, and A^o is the associated adjacency matrix.
The main purpose of the target-level graph structure is to establish, under the guidance of the question feature, the relation between any two targets in the video; the question feature and the visual features therefore need to be combined to generate a suitable adjacency matrix.
The question representation and the target-level features o_{n,l,k} (for simplicity, the target-level feature o_{n,l,k} is written as o_i, i = 1, 2, ..., NLK, with NLK = N×L×K) are used jointly to generate the adjacency matrix A^o:
First, the question feature v_q is aggregated with each target-level feature o_i according to formula (3) (given only as an image in the original document), where φ′(·) and φ″(·) are fully connected networks with ReLU activation that map their inputs to d_h-dimensional vectors and the two projections are combined by a dot product.
Then the dependency value A^o_{i,j} between the i-th target and the j-th target in the adjacency matrix A^o is given by formula (4) (image in the original), where T denotes the transpose.
Based on the computed adjacency matrix A^o, each target feature o_i is updated with the other related targets, so that local and long-range dependencies between the targets are preserved. Specifically, each target-level feature o_i is updated to o′_i by formulas (5) and (6) (images in the original), in which the term "+ o_i" serves as a residual connection.
The updated target-level features o′_i are concatenated to obtain the tensor O′, with NL = N×L.
An attention mechanism is used to focus on the targets in the video frames that are related to the question:
v_o = Attention(O′, v_q)   (7)
where v_o is the aggregated target-level feature with dimension d_o.
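Formulas (3)-(6) and the Attention(·, v_q) operator of formula (7) are given only as images in the original document, so the sketch below assumes one common realisation of a question-guided graph attention layer: question-conditioned node projections combined by a dot product, a softmax-normalised affinity matrix as the adjacency, a GCN-style update with the residual connection "+ o_i", and question-guided attention pooling over the updated nodes. The class name and hidden sizes are illustrative.

# Hedged sketch of one question-guided graph attention level (step S2.1 pattern).
import torch
import torch.nn as nn

class QuestionGuidedGraphLayer(nn.Module):
    def __init__(self, d_node, d_q, d_h=256):
        super().__init__()
        self.phi_q = nn.Sequential(nn.Linear(d_q, d_h), nn.ReLU())     # phi'(.)
        self.phi_x = nn.Sequential(nn.Linear(d_node, d_h), nn.ReLU())  # phi''(.)
        self.update = nn.Linear(d_node, d_node)                        # node update after neighbour aggregation
        self.att_q = nn.Linear(d_q, d_h)                               # pooling: question projection
        self.att_x = nn.Linear(d_node, d_h)                            # pooling: node projection
        self.att_w = nn.Linear(d_h, 1, bias=False)

    def forward(self, X, v_q):
        # X: (B, M, d_node) node features (targets, frames or clips); v_q: (B, d_q).
        x = self.phi_q(v_q).unsqueeze(1) * self.phi_x(X)               # question-conditioned nodes, (B, M, d_h)
        A = torch.softmax(x @ x.transpose(1, 2), dim=-1)               # adjacency of dependency values
        X_new = torch.relu(self.update(A @ X)) + X                     # GCN-style update with residual "+ o_i"
        # Question-guided attention pooling over the updated nodes, Attention(X', v_q).
        scores = self.att_w(torch.tanh(self.att_x(X_new) + self.att_q(v_q).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)                           # (B, M, 1)
        v = (alpha * X_new).sum(dim=1)                                 # aggregated feature, (B, d_node)
        return v, X_new, A

The same layer can be instantiated at the target, frame and clip levels of steps S2.1 to S2.3; only the node set and feature dimension change.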
Step S2.2: constructing a video frame hierarchy graph attention network for obtaining the interrelation between video frames
The relations between different video frames record the changes of detailed appearance information when a motion occurs or a transition takes place in the video. The invention builds a graph structure at the video-frame level to capture these detailed appearance changes.
A video-frame-level graph G^f = {V^f, ε^f, A^f} is constructed, where V^f is the set of graph vertices, each vertex representing a video frame, ε^f is the set of graph edges, representing the relations among the video frames, and A^f is the associated adjacency matrix.
For simplicity, the frame-level feature f_{n,l} is written as f_{i′}, i′ = 1, 2, ..., NL.
The NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_NL} = {f′_{i′} | i′ = 1, 2, ..., NL} are obtained by fusing two features according to formula (8) (given only as an image in the original document): each frame-level feature is combined with the aggregated target-level feature v_o by bit-wise addition, and the result is passed through a fully connected network with ReLU activation that outputs a d_f-dimensional vector.
First, the question feature v_q is aggregated with each fused frame feature f′_{i′} according to formula (9) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors.
Then the dependency value A^f_{i′,j′} between the i′-th video frame and the j′-th video frame in the adjacency matrix A^f is given by formula (10) (image in the original).
Each fused frame-level feature f′_{i′} is updated to f″_{i′} by formulas (11) and (12) (images in the original), and the updated frame-level features f″_{i′} are then concatenated to obtain the tensor F″.
Under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video-frame feature v_f:
v_f = Attention(F″, v_q)   (13)
where v_f is the aggregated video-frame feature.
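The progressive connection into the frame level, i.e. the fusion of formula (8) described above (bit-wise addition of the aggregated target-level feature with each frame feature, followed by a fully connected layer with ReLU), might be sketched as follows; the projection of v_o to the frame dimension and the module name are assumptions.

# Hedged sketch of the frame-level fusion of formula (8).
import torch
import torch.nn as nn

class FrameLevelFusion(nn.Module):
    def __init__(self, d_f, d_o):
        super().__init__()
        self.proj_o = nn.Linear(d_o, d_f)                          # bring v_o to the frame dimension
        self.fuse = nn.Sequential(nn.Linear(d_f, d_f), nn.ReLU())  # FC + ReLU of formula (8)

    def forward(self, F, v_o):
        # F: (B, N*L, d_f) frame features; v_o: (B, d_o) aggregated target-level feature.
        fused = self.fuse(F + self.proj_o(v_o).unsqueeze(1))       # bit-wise addition, then FC + ReLU
        return fused                                               # F' = {f'_{i'}}

The fused frame features F′ are then fed to the question-guided graph layer sketched above to obtain A^f, F″ and v_f.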
Step S2.3: constructing a video clip hierarchy map attention network for establishing time sequence and semantic relation between actions in video clips
The invention divides the video into a number of short video clips and constructs a video-clip-level graph (video segment level graph) to represent the temporal and semantic relations between the actions in different clips.
A video-clip-level graph G^c = {V^c, ε^c, A^c} is constructed, where V^c is the set of graph vertices, each vertex representing a video clip, ε^c is the set of graph edges, representing the relations among the video clips, and A^c is the associated adjacency matrix.
The clip-level features C of the N video clips are merged with the aggregated video-frame feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is given by formula (14) (given only as an image in the original document), in which ω′(·) is a fully connected network with ReLU activation that outputs a d_c-dimensional vector.
First, the question feature v_q is aggregated with each fused clip-level feature c′_n according to formula (15) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors.
Then the dependency value A^c_{n,k} between the n-th video clip and the k-th video clip in the adjacency matrix A^c is given by formula (16) (image in the original).
Each fused clip-level feature c′_n is updated to c″_n by formulas (17) and (18) (images in the original), and the updated clip-level features c″_n are then concatenated to obtain the tensor C″.
Under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)   (19)
where the aggregated video feature v_c has dimension d_c.
In this way, the aggregated video feature v_c incorporates target-level information as well as the global and dynamic information of the video frames, which improves the accuracy of predicting the answer to the question.
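Chaining the pieces sketched above gives the progressive pipeline of step S2 (cf. fig. 2): each level fuses the aggregated feature handed up from the previous level before running its own question-guided graph layer. The sketch assumes that the clip-level fusion of formula (14) takes the same bit-wise-addition form as formula (8), and that QuestionGuidedGraphLayer and FrameLevelFusion from the earlier sketches are in scope.

# Hedged sketch of the full progressive graph attention module of step S2.
import torch
import torch.nn as nn

class ProgressiveGraphAttention(nn.Module):
    def __init__(self, d_o, d_f, d_c, d_q, d_h=256):
        super().__init__()
        self.target_graph = QuestionGuidedGraphLayer(d_o, d_q, d_h)
        self.frame_fuse = FrameLevelFusion(d_f, d_o)
        self.frame_graph = QuestionGuidedGraphLayer(d_f, d_q, d_h)
        self.clip_fuse = FrameLevelFusion(d_c, d_f)     # assumed to mirror the frame-level fusion
        self.clip_graph = QuestionGuidedGraphLayer(d_c, d_q, d_h)

    def forward(self, O, F, C, v_q):
        # O: (B, N*L*K, d_o) targets, F: (B, N*L, d_f) frames, C: (B, N, d_c) clips.
        v_o, _, _ = self.target_graph(O, v_q)                             # target-level reasoning -> v_o
        v_f, _, _ = self.frame_graph(self.frame_fuse(F, v_o), v_q)        # frame level, conditioned on v_o
        v_c, _, _ = self.clip_graph(self.clip_fuse(C, v_f), v_q)          # clip level, conditioned on v_f
        return v_c, v_f, v_o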
Step S3: answer prediction
For the open-ended task, the visual information and the question information are first fused, and the fused representation is then fed into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)   (20)
p = softmax(W_o g)   (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation and W_o is a learnable parameter; the fully connected network parameters and the learnable parameters of the softmax classifier are updated with a cross-entropy loss function.
For the multiple-choice task, the visual information, the question information and the answer representation are first fused, the fused feature is then sent to a final classifier for linear regression, and the answer index y is output:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)   (22)
y = W_m g′   (23)
where v_a is the answer representation and W_m is a learnable parameter; the fully connected network parameters and the learnable parameters of the classifier are updated with a pairwise-comparison hinge loss function.
For the counting task, a linear regression function takes g from formula (20) as input, the counting result is then obtained with a rounding function, and the parameters of the linear regression function are updated with a mean squared error (MSE) loss function.
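A sketch of the three answer heads of this step is given below. The element-wise-product fusion follows formulas (20)-(23); the hidden sizes, the hinge margin and the exact form of the pairwise-comparison hinge loss are assumptions.

# Hedged sketch of the step S3 answer heads.
import torch
import torch.nn as nn

class AnswerHeads(nn.Module):
    def __init__(self, d_c, d_q, d_a, n_answers, d_h=512):
        super().__init__()
        self.rho_v = nn.Sequential(nn.Linear(d_c, d_h), nn.ReLU())   # rho'(.)
        self.rho_q = nn.Sequential(nn.Linear(d_q, d_h), nn.ReLU())   # rho''(.)
        self.rho_a = nn.Sequential(nn.Linear(d_a, d_h), nn.ReLU())   # rho'''(.)
        self.open_cls = nn.Linear(d_h, n_answers)                    # W_o of formula (21)
        self.choice_reg = nn.Linear(d_h, 1)                          # W_m of formula (23)
        self.count_reg = nn.Linear(d_h, 1)

    def open_ended(self, v_c, v_q):
        g = self.rho_v(v_c) * self.rho_q(v_q)                        # formula (20)
        return torch.softmax(self.open_cls(g), dim=-1)               # answer probabilities p

    def multiple_choice(self, v_c, v_q, V_a):
        # V_a: (B, n_choices, d_a) candidate-answer representations; returns one score per choice.
        g = self.rho_v(v_c).unsqueeze(1) * self.rho_q(v_q).unsqueeze(1) * self.rho_a(V_a)  # formula (22)
        return self.choice_reg(g).squeeze(-1)                        # scores y, formula (23)

    def count(self, v_c, v_q):
        g = self.rho_v(v_c) * self.rho_q(v_q)
        # Rounding is applied at inference; training uses MSE on the raw regression output.
        return torch.round(self.count_reg(g)).squeeze(-1)

def pairwise_hinge_loss(scores, correct_idx, margin=1.0):
    # Assumed standard pairwise-comparison hinge loss for the multiple-choice task.
    # scores: (B, n_choices); correct_idx: (B,) long indices of the correct choice.
    pos = scores.gather(1, correct_idx.unsqueeze(1))                 # (B, 1)
    loss = torch.clamp(margin + scores - pos, min=0)
    loss.scatter_(1, correct_idx.unsqueeze(1), 0.0)                  # ignore the correct choice itself
    return loss.mean()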
Examples of the invention
Experiments show that the two multiple-choice (Multi-Choice) sub-datasets of the existing large video question-answering dataset TGIF-QA suffer from a severe answer bias, which has a large impact on model accuracy. To address this problem, this example builds a new dataset, TGIF-QA-R, based on TGIF-QA; in this dataset the candidate answers are independent of one another, so the influence of the answer bias is effectively reduced.
The method is evaluated on three large benchmark datasets, TGIF-QA, MSVD-QA and MSRVTT-QA, as well as on the newly constructed TGIF-QA-R dataset, and it performs better than the state-of-the-art methods.
1. Test results on TGIF-QA and TGIF-QA-R datasets
Table 1 (reproduced as an image in the original publication)
As can be seen from Table 1, the present invention performed best in most of the subtasks, with 57.6% and 65.6% accuracy in the Action and Trans. subtasks of TGIF-QA-R, and 79.5%, 85.3% and 62.8% accuracy in the Action, Trans, and Frame subtasks of TGIF-QA, respectively.
2. Test results on MSVD-QA dataset
Table 2 (reproduced as an image in the original publication)
From table 2, it can be seen that the present invention achieves the highest level of performance in terms of overall accuracy, increasing the accuracy from 36.5% to 39.8%.
3. Test results on the MSRVTT-QA dataset:
Table 3 (reproduced as an image in the original publication)
from table 3, it can be seen that the present invention achieves the highest level of performance in terms of overall accuracy, increasing the accuracy from 35.5% to 38.2%.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the invention as defined by the appended claims, and all inventive matter that makes use of the inventive concept is protected.

Claims (1)

1. A video question-answering method based on a progressive graph attention network is characterized by comprising the following steps:
(1) visual feature extraction
Dividing a video consisting of a frame sequence into N video segments, wherein each segment comprises L frames;
firstly, a 3D CNN (three-dimensional convolutional neural network) is used to extract the clip-level feature c_n of each video segment, n = 1, 2, ..., N; each clip-level feature c_n has dimension d_c, and C = {c_1, c_2, ..., c_N} denotes the clip-level features of the N video segments;
then a 2D CNN (two-dimensional convolutional neural network) is used to extract the frame-level feature f_{n,l} of each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L; each frame-level feature f_{n,l} has dimension d_f, and F = {f_{1,1}, f_{1,2}, ..., f_{N,L}} denotes the frame-level features of the N×L video frames;
then Faster R-CNN (faster region-based convolutional neural network) is used to extract the target-level feature o_{n,l,k} of each target in each video frame, n = 1, 2, ..., N, l = 1, 2, ..., L, k = 1, 2, ..., K, where K is the number of targets extracted per video frame; each target-level feature o_{n,l,k} has dimension d_o, and O = {o_{1,1,1}, o_{1,1,2}, ..., o_{N,L,K}} denotes the N×L×K target-level features;
finally, the question is encoded using a Long Short-Term Memory (LSTM) network to obtain a representation of the question:
all words in the question are first encoded into a word-vector sequence by a word-embedding model, and the word-vector sequence is then fed into the LSTM network to obtain its hidden output sequence H = {h_1, h_2, ..., h_S}, where each feature h_s, s = 1, 2, ..., S, has dimension d_q and S is the length of the question; finally, a self-attention mechanism is used to focus on the important words in the question, yielding the question representation according to formulas (1) and (2) (both given only as images in the original document), where W ∈ R^{d_h×d_q} is a learnable parameter (R^{d_h×d_q} denoting the set of real matrices with d_h rows), α_s is the attention weight of the s-th word of the question, and v_q is the question representation;
(2) constructing a progressive graph attention network (comprising three graph attention networks with different levels)
2.1) constructing a target hierarchical graph attention network for obtaining the space-time relation between targets
a target-level graph G^o = {V^o, ε^o, A^o} is built, where V^o is the set of graph vertices, each vertex representing a detected target, ε^o is the set of graph edges, representing the relations among all targets in every video frame, and A^o is the associated adjacency matrix;
the question representation and the target-level features o_{n,l,k} (for simplicity, the target-level feature o_{n,l,k} is written as o_i, i = 1, 2, ..., NLK, with NLK = N×L×K) are used jointly to generate a suitable adjacency matrix:
first, the question feature v_q is aggregated with each target-level feature o_i according to formula (3) (given only as an image in the original document), where φ′(·) and φ″(·) are fully connected networks with ReLU activation that map their inputs to d_h-dimensional vectors and the two projections are combined by a dot product;
then the dependency value A^o_{i,j} between the i-th target and the j-th target in the adjacency matrix A^o is given by formula (4) (image in the original), where T denotes the transpose;
each target-level feature o_i is then updated to o′_i by formulas (5) and (6) (images in the original), which aggregate the features of the related targets according to A^o and add a residual connection "+ o_i";
the updated target-level features o′_i are concatenated to obtain the tensor O′, with NL = N×L;
an attention mechanism is used to focus on the targets in the video frames that are related to the question:
v_o = Attention(O′, v_q)   (7)
where v_o is the aggregated target-level feature with dimension d_o;
2.2) constructing a video frame hierarchy graph attention network for obtaining the correlation among the video frames
a video-frame-level graph G^f = {V^f, ε^f, A^f} is constructed, where V^f is the set of graph vertices, each vertex representing a video frame, ε^f is the set of graph edges, representing the relations among the video frames, and A^f is the associated adjacency matrix;
for simplicity, the frame-level feature f_{n,l} is written as f_{i′}, i′ = 1, 2, ..., NL; the NL fused frame-level features F′ = {f′_1, f′_2, ..., f′_NL} = {f′_{i′} | i′ = 1, 2, ..., NL} are obtained by formula (8) (given only as an image in the original document), which fuses each frame-level feature with the aggregated target-level feature v_o by bit-wise addition and passes the result through a fully connected network with ReLU activation that outputs a d_f-dimensional vector;
first, the question feature v_q is aggregated with each fused frame feature f′_{i′} according to formula (9) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors;
then the dependency value A^f_{i′,j′} between the i′-th video frame and the j′-th video frame in the adjacency matrix A^f is given by formula (10) (image in the original);
each fused frame-level feature f′_{i′} is updated to f″_{i′} by formulas (11) and (12) (images in the original), and the updated frame-level features f″_{i′} are then concatenated to obtain the tensor F″;
under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video-frame feature v_f:
v_f = Attention(F″, v_q)   (13)
where v_f is the aggregated video-frame feature;
2.3) constructing a video-clip-level graph attention network for establishing the temporal and semantic relations between the actions in the video clips
a video-clip-level graph G^c = {V^c, ε^c, A^c} is constructed, where V^c is the set of graph vertices, each vertex representing a video clip, ε^c is the set of graph edges, representing the relations among the video clips, and A^c is the associated adjacency matrix;
the clip-level features C of the N video clips are merged with the aggregated video-frame feature v_f to generate the fused clip-level features C′ = {c′_1, c′_2, ..., c′_N} = {c′_n | n = 1, 2, ..., N}, where each fused clip-level feature c′_n is given by formula (14) (given only as an image in the original document), in which ω′(·) is a fully connected network with ReLU activation that outputs a d_c-dimensional vector;
first, the question feature v_q is aggregated with each fused clip-level feature c′_n according to formula (15) (image in the original), where the mapping functions are fully connected networks with ReLU activation that output d_h-dimensional vectors;
then the dependency value A^c_{n,k} between the n-th video clip and the k-th video clip in the adjacency matrix A^c is given by formula (16) (image in the original);
each fused clip-level feature c′_n is updated to c″_n by formulas (17) and (18) (images in the original), and the updated clip-level features c″_n are then concatenated to obtain the tensor C″;
under the guidance of the question feature, an attention mechanism is used to obtain the aggregated video feature v_c:
v_c = Attention(C″, v_q)   (19)
where the aggregated video feature v_c has dimension d_c;
(3) Answer prediction
For the open-ended task, the visual information and the question information are first fused, and the fused representation is then fed into a softmax classifier to compute the answer probabilities:
g = ρ′(v_c) * ρ″(v_q)   (20)
p = softmax(W_o g)   (21)
where ρ′(·) and ρ″(·) are fully connected networks with ReLU activation, W_o is a learnable parameter and p is the probability vector; the fully connected network parameters and the learnable parameters of the softmax classifier are updated with a cross-entropy loss function;
for the multiple-choice task, the visual information, the question information and the answer representation are first fused, the fused feature is then sent to a final classifier for linear regression, and the answer index y is output:
g′ = ρ′(v_c) * ρ″(v_q) * ρ‴(v_a)   (22)
y = W_m g′   (23)
where v_a is the answer representation and W_m is a learnable parameter; the fully connected network parameters and the learnable parameters of the classifier are updated with a pairwise-comparison hinge loss function;
for the counting task, a linear regression function takes g from formula (20) as input, the counting result is then obtained with a rounding function, and the parameters of the linear regression function are updated with a mean squared error (MSE) loss function.
CN202011501849.9A 2020-12-18 2020-12-18 Video question-answering method based on progressive graph attention network Active CN112488055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011501849.9A CN112488055B (en) 2020-12-18 2020-12-18 Video question-answering method based on progressive graph attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011501849.9A CN112488055B (en) 2020-12-18 2020-12-18 Video question-answering method based on progressive graph attention network

Publications (2)

Publication Number Publication Date
CN112488055A true CN112488055A (en) 2021-03-12
CN112488055B CN112488055B (en) 2022-09-06

Family

ID=74914783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011501849.9A Active CN112488055B (en) 2020-12-18 2020-12-18 Video question-answering method based on progressive graph attention network

Country Status (1)

Country Link
CN (1) CN112488055B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 A method for solving video question answering using a multi-layer attention network mechanism
CN108829756A (en) * 2018-05-25 2018-11-16 杭州知智能科技有限公司 A method for solving multi-turn video question answering using a hierarchical attention context network
CN109829049A (en) * 2019-01-28 2019-05-31 杭州一知智能科技有限公司 A method for solving video question-answering tasks using a knowledge-base progressive spatio-temporal attention network
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A visual question-answering method based on a combined-relation attention network
CN110704601A (en) * 2019-10-11 2020-01-17 浙江大学 A method for solving video question-answering tasks requiring common knowledge using a question-knowledge-guided progressive spatio-temporal attention network
CN111652202A (en) * 2020-08-10 2020-09-11 浙江大学 Method and system for solving video question answering by improving video-language representation learning through an adaptive spatio-temporal graph model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIKYAS T. DESTA, LARRY CHEN, TOMASZ KORNUTA: "Object-Based Reasoning in VQA", 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) *
闫茹玉 (Yan Ruyu) et al.: "Visual question answering model combining a bottom-up attention mechanism and a memory network", Journal of Image and Graphics *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536952A (en) * 2021-06-22 2021-10-22 电子科技大学 Video question-answering method based on attention network of motion capture
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 Video question-answering method based on attention network of motion capture
CN113609330A (en) * 2021-07-15 2021-11-05 哈尔滨理工大学 Video question-answering system, method, computer and storage medium based on text attention and fine-grained information
CN116074575A (en) * 2021-11-01 2023-05-05 国际商业机器公司 Transducer for real world video question answering
CN114398961A (en) * 2021-12-28 2022-04-26 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN114495282A (en) * 2022-02-14 2022-05-13 中国科学技术大学 Video motion detection method, system, device and storage medium
CN116385937A (en) * 2023-04-07 2023-07-04 哈尔滨理工大学 Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Medical image problem vision solving method based on fine granularity cross attention

Also Published As

Publication number Publication date
CN112488055B (en) 2022-09-06

Similar Documents

Publication Publication Date Title
CN112488055B (en) Video question-answering method based on progressive graph attention network
CN110008338B (en) E-commerce evaluation emotion analysis method integrating GAN and transfer learning
CN109544524B (en) Attention mechanism-based multi-attribute image aesthetic evaluation system
CN108563653B (en) Method and system for constructing knowledge acquisition model in knowledge graph
WO2019056628A1 (en) Generation of point of interest copy
CN108765383A (en) Video presentation method based on depth migration study
Zhang et al. Recurrent attention network using spatial-temporal relations for action recognition
CN110046353B (en) Aspect level emotion analysis method based on multi-language level mechanism
CN114339450B (en) Video comment generation method, system, device and storage medium
CN114625882B (en) Network construction method for improving unique diversity of image text description
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN114912419B (en) Unified machine reading understanding method based on recombination countermeasure
CN115510814B (en) Chapter-level complex problem generation method based on dual planning
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN114048314B (en) Natural language steganalysis method
Liu et al. The use of deep learning technology in dance movement generation
Zhu et al. PBGN: Phased bidirectional generation network in text-to-image synthesis
Alrashidi et al. Hybrid CNN-based Recommendation System
CN113783715A (en) Opportunistic network topology prediction method adopting causal convolutional neural network
CN116148864A (en) Radar echo extrapolation method based on DyConvGRU and Unet prediction refinement structure
Fu et al. Gendds: Generating diverse driving video scenarios with prompt-to-video generative model
KR20190134308A (en) Data augmentation method and apparatus using convolution neural network
KR20230121507A (en) Knowledge distillation for graph-based video captioning
CN114818739A (en) Visual question-answering method optimized by using position information
CN117012180B (en) Voice conversion model training method, voice conversion method and device

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant