CN114428866A - Video question-answering method based on object-oriented double-flow attention network - Google Patents
Video question-answering method based on an object-oriented dual-stream attention network
- Publication number: CN114428866A (application CN202210094738.3A)
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06F16/483—Multimedia information retrieval using metadata automatically derived from the content
- G06F16/487—Retrieval using geographical or spatial metadata
- G06F16/489—Retrieval using time metadata
- G06F18/253—Pattern recognition; fusion techniques of extracted features
- G06N3/045—Neural networks; combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Neural network learning methods
- H04N19/23—Video object coding with coding of regions present throughout a whole video segment, e.g. sprites, background or mosaic
- H04N19/30—Video coding using hierarchical techniques, e.g. scalability
Abstract
The invention discloses a video question-answering method based on an object-oriented dual-stream attention network. The visual content of a video is represented by a dual-stream mechanism: one stream is the static appearance stream of the foreground objects and the other is their dynamic behavior stream. In each stream, an object's features combine the features of the object itself with its spatio-temporal encoding and the contextual features of the scene it appears in, so that the subsequent graph convolution operations that extract deep features can explore the relative spatio-temporal relationships and context-aware relationships between objects. The dual-stream mechanism also addresses the shortcoming of prior video question-answering models, which consider only static object features and lack dynamic-information analysis. The invention improves the exploration of intra-modal interactions and inter-modal semantic alignment, and achieves better results on the relevant video question-answering datasets.
Description
Technical Field
The invention relates to the field of video question answering, and in particular to a video question-answering method based on an object-oriented dual-stream attention network comprising a static appearance stream and a dynamic behavior stream.
Background
Video question answering is a cross-disciplinary task spanning computer vision and natural language processing: given a video and a natural-language question, the system must output the correct answer. The task requires the system to fully understand information from both modalities, namely the appearance and behavior characteristics of the foreground objects in the video and the content of the question, and to explore the semantic interactions between the two modalities. Most existing video question-answering approaches extract visual features frame by frame; such feature extraction does not learn the foreground objects, the actual subjects of the questions, in sufficient detail, and it also omits the modeling of interaction relationships between objects. It is therefore important to design a model that can mine the features of the target objects in a video together with the interaction relationships among them.
Video question answering and image question answering both belong to the visual question answering family: both output correct answers from a given piece of visual information and a question. Although video question answering extends image question answering, the two differ in many respects, so image question-answering methods cannot simply be transplanted to the video task. The most important difference is that video question answering handles long image sequences that contain rich appearance and motion information rather than a single still image. Consequently, a video question-answering system must model both the appearance information and the motion information of the objects in the video to answer questions accurately.
Attention networks are a common mechanism for modeling object relationships and are widely used across machine learning tasks such as natural language processing, image recognition, and speech recognition. An attention mechanism helps the model assign different weights to each part of the input and extract the more critical information, so the model can make more accurate judgments without incurring large computation or storage overheads. The present invention proposes two attention variants to model intra-modal and inter-modal interactions.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video question-answering method based on an object-oriented dual-stream attention network that learns the static appearance features and dynamic behavior features of foreground objects in a video in parallel, accurately models the interaction relationships between objects in the video, and achieves cross-modal semantic alignment. Better results are obtained on the relevant video question-answering datasets.
A video question-answering method based on an object-oriented double-flow attention network comprises the following steps:
step (1), data preprocessing:
Input the video segment to be answered. First, a convolutional network extracts the static appearance features and dynamic behavior features of each video frame; an object detection algorithm then locates the target objects in the video, and the convolutional network likewise extracts each object's static appearance and dynamic behavior features. The extracted frame-level static appearance and dynamic behavior features serve as the context representation features of the target objects.
step (2), video feature coding:
The video features are encoded with a dual-stream mechanism, the two streams being the static appearance stream and the dynamic behavior stream of the target objects.
In each stream, the features of a target object, its spatio-temporal encoding, and its context representation features are combined; the combined features serve as nodes to construct a feature graph. Graph convolution operations on the constructed graph then yield the high-order features of the target objects in the video.
step (3), question feature encoding:
the question is encoded with a recurrent neural network to obtain the local code of each word and the global code of the question sentence;
step (4), cross-modal fusion of video and question:
In each stream, the high-order features of the target objects are first fed into a self-attention network to obtain the intra-modal interaction relationships between objects, and the high-order object features are reconstructed according to these relationships. Similarly, the local features of each word are fed into a self-attention network to obtain the intra-modal interactions among the words, and the question features are reconstructed accordingly. The reconstructed high-order object features and question features are then input into a question-guided attention network to explore the cross-modal semantic relationships between words and object features, and the high-order video features are updated.
The high-order features of the static appearance stream and of the dynamic behavior stream of the target objects are fused to obtain the dual-stream high-order features of the video, which are then fused with the global question code to obtain the answer code;
step (5), decode the answer code obtained in step (4) to obtain the final answer:
different decoding modes are adopted for different question types: a linear classifier decodes open-ended and multiple-choice questions, and a linear regressor decodes counting questions;
Further, the specific method of step (1) is as follows:
Input the video segment to be answered and divide it into a fixed number of video frames. For each frame, extract the static appearance features and the dynamic behavior features; these frame-level features serve as the context representation features of the target objects. Then locate the position information b of each target object on each frame with an object detection algorithm, and use a RoIAlign network to align the position information to the frame's static appearance and dynamic behavior features, obtaining each object's static appearance features O^a and dynamic behavior features O^m.
The object position information b is defined as:
b = [c1, c2, w, h] (Formula 1)
where c1 and c2 are the center coordinates of the target object's bounding box, and w and h are its width and height, respectively.
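As a concrete illustration, the position vector of Formula 1 can be derived from a detector's corner-format output; the (x1, y1, x2, y2) input convention is an assumption of this sketch, since the patent does not specify the detector's box format:

```python
def box_to_center_format(x1, y1, x2, y2):
    """Convert a corner-format detection box to b = [c1, c2, w, h] (Formula 1)."""
    w = x2 - x1          # width of the target object
    h = y2 - y1          # height of the target object
    c1 = x1 + w / 2.0    # center x-coordinate
    c2 = y1 + h / 2.0    # center y-coordinate
    return [c1, c2, w, h]
```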
Further, the specific method of the step (2) is as follows:
The video features are encoded with a dual-stream mechanism: the static appearance features and the dynamic behavior features of the objects are encoded in parallel by the same encoding mechanism. The network that encodes the static appearance features is called the static appearance stream, and the network that encodes the dynamic behavior features is called the dynamic behavior stream. The outputs are the high-order features of the video's static appearance stream and of its dynamic behavior stream, respectively. The specific process is as follows:
Since the two streams use the same network structure, the superscript a/m is used to denote the features of either stream.
Given the features of the target objects in the video and the set of their position information, the position information b of each target object is first passed into a spatial position encoding network and a temporal position encoding network to obtain the object's spatial position code d_s and temporal position code d_t, respectively. The spatial position encoding network is a multi-layer perceptron, and the temporal position encoding network encodes the temporal position of the target object with sine and cosine functions of different frequencies. The specific formulas are:
d_s = MLP(b) (Formula 2)
d_t(pos, 2j) = sin(pos / 10000^(2j/d_p)) (Formula 3)
d_t(pos, 2j+1) = cos(pos / 10000^(2j/d_p)) (Formula 4)
where b is the position information of the target object from Formula 1; pos is the temporal position (frame index) of the object; N is the number of target objects; and d_p is the encoding dimension, with 0 ≤ 2j ≤ d_p.
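A minimal numpy sketch of the two position encoders, assuming a two-layer perceptron with ReLU for the spatial code and the standard sinusoidal form for the temporal code (the hidden width and the random weights are illustrative, not specified by the patent):

```python
import numpy as np

def temporal_encoding(pos, d_p):
    """Temporal position code d_t (Formulas 3-4): sine on even dims, cosine on odd dims."""
    j = np.arange(d_p // 2)
    angles = pos / np.power(10000.0, 2.0 * j / d_p)  # a different frequency per dimension pair
    d_t = np.zeros(d_p)
    d_t[0::2] = np.sin(angles)
    d_t[1::2] = np.cos(angles)
    return d_t

def spatial_encoding(b, W1, b1, W2, b2):
    """Spatial position code d_s = MLP(b) (Formula 2) over b = [c1, c2, w, h]."""
    hidden = np.maximum(0.0, np.asarray(b) @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_p = 8
d_t = temporal_encoding(pos=0, d_p=d_p)  # code for an object on frame 0
d_s = spatial_encoding([5.0, 10.0, 10.0, 20.0],
                       rng.standard_normal((4, 16)), np.zeros(16),
                       rng.standard_normal((16, d_p)), np.zeros(d_p))
```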
The features of the target object are then combined with its spatio-temporal position codes to obtain position-aware object features:
F_loc^{a/m} = O^{a/m} + d_s + d_t (Formula 5)
The position-aware object features are combined with the video-frame features to obtain context-aware features:
F^{a/m} = F_loc^{a/m} + F_c^{a/m} (Formula 6)
where F_c^{a/m} denotes the frame-level context representation features of the scene containing the object.
After the position-aware and context-aware features of each object are obtained, an undirected complete graph G = (V, E) is constructed with the object features as the node set V and the interaction relationships between objects as the edge set E. The interaction relationship matrix between objects is computed as:
A^{a/m} = softmax(φ(F^{a/m}) φ(F^{a/m})^T) (Formula 7)
where φ(F^{a/m}) = F^{a/m} W, and W is a learnable weight matrix.
Two graph convolution operations are then performed on the constructed undirected complete graph to obtain the high-order features of the objects; the graph convolution is:
GCN(F^{a/m}; A^{a/m}) = A^{a/m} (F^{a/m} W) (Formula 8)
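Formulas 7 and 8 can be sketched in numpy as follows; the feature dimension, object count, and random weights are illustrative only, and per the text the layer is applied twice in succession in each stream:

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def graph_conv_layer(F, W_phi, W):
    """Formula 7: A = softmax(phi(F) phi(F)^T) with phi(F) = F W_phi.
    Formula 8: GCN(F; A) = A (F W)."""
    phi = F @ W_phi
    A = softmax_rows(phi @ phi.T)   # N x N interaction matrix; each row sums to 1
    return A @ (F @ W)              # message passing over the complete object graph

rng = np.random.default_rng(0)
N, d = 5, 16                        # N target objects with d-dimensional features
F = rng.standard_normal((N, d))
H = graph_conv_layer(F, rng.standard_normal((d, d)), rng.standard_normal((d, d)))
H = graph_conv_layer(H, rng.standard_normal((d, d)), rng.standard_normal((d, d)))  # second pass
```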
Further, the step (3) of performing feature coding on the problem by using the recurrent neural network specifically includes the following steps:
firstly, mapping each word in the question to a 300-dimensional word vector space initialized by GloVe to obtain the code of each word, and then transmitting each code vector to a bidirectional LSTM network to obtain the local code F of each wordQ∈RL×dGlobal encoding oF problem sentences oFq∈Rd。
Further, the cross-modal fusion step described in step (4) is as follows:
as in step (2), the two streams use the same network structure, and therefore the characteristics of the two streams are denoted by superscript a/m.
A self-attention network is defined to explore the interaction relationships within a modality:
f = MultiHead(X, X, X) = [head_1, head_2, ..., head_h] W^O (Formula 9)
head_i = Attention(X W_i^Q, X W_i^K, X W_i^V) (Formula 10)
where d is the feature dimension; h is the number of heads in the multi-head attention mechanism; X denotes the visual features; and W^O, W_i^Q, W_i^K, W_i^V are learnable parameter matrices.
In addition, a question-guided attention network is defined to explore the interaction relationships between modalities:
f = MultiHead(X, Y, Y) = [head_1, head_2, ..., head_h] W^O (Formula 11)
head_i = Attention(X W_i^Q, Y W_i^K, Y W_i^V) (Formula 12)
where X denotes the visual features and Y denotes the question features.
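A compact numpy sketch of the shared multi-head attention operator: called as MultiHead(X, X, X) it is the self-attention of Formulas 9-10, and as MultiHead(X, Y, Y) it is the question-guided attention of Formulas 11-12. The sequence lengths, head count, and random weights are illustrative:

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """[head_1, ..., head_h] W^O, with scaled dot-product attention per head."""
    dk = Wq[0].shape[1]                      # per-head dimension d/h
    heads = []
    for Wqi, Wki, Wvi in zip(Wq, Wk, Wv):
        q, k, v = Q @ Wqi, K @ Wki, V @ Wvi
        heads.append(softmax_rows(q @ k.T / np.sqrt(dk)) @ v)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(1)
d, h, Lx, Ly = 16, 4, 6, 5
Wq = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wk = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wv = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wo = rng.standard_normal((d, d))
X = rng.standard_normal((Lx, d))    # visual (object) features
Y = rng.standard_normal((Ly, d))    # question word features

f_self = multi_head(X, X, X, Wq, Wk, Wv, Wo)      # Formula 9: intra-modal
f_guided = multi_head(X, Y, Y, Wq, Wk, Wv, Wo)    # Formula 11: question-guided
```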
The high-order features F^{a/m} obtained in step (2) are then passed into the self-attention network to capture the intra-modal interaction relationships between objects, and the local text features F_Q ∈ R^{L×d} obtained in step (3) are passed into the self-attention network to capture the intra-modal interactions among words. The outputs of the two self-attention networks are fed into the question-guided attention network to capture the cross-modal semantic relationships between words and objects, and the high-order features of each video stream are updated under the guidance of the question.
Finally, the global feature code of the question is fused with the dual-stream features of the video to obtain the answer code.
Further, the feature decoding step in step (5) is as follows:
for open questions, a linear classifier is used to predict the probability that each answer in the answer set is the correct answer, and cross entropy function is used as loss function to perform parameter training:
wherein the content of the first and second substances,for the probability that each answer is the correct answer, yi1 denotes aie.A is the correct answer.
For counting questions, a linear regressor predicts the answer, and the mean squared error is used as the loss for parameter training:
L = (ŷ - y)^2 (Formula 14)
for the multi-choice question, a linear classifier is adopted to predict the probability of each candidate answer being a correct answer, and the model is trained by using a hinge loss function as a loss function:
wherein s ispAnd snRepresenting the scores of the correct answer and the wrong answer, respectively.
The invention has the beneficial effects that:
the invention uses a double-flow mechanism to represent the visual content of the video on the basis of the predecessor. One of the streams is a static appearance stream of foreground objects and the other stream is a dynamic behavior stream of foreground objects. In each stream, the characteristics of the object include both the characteristics of the object itself, as well as the spatio-temporal coding of the object and the contextual information characteristics of the scene in which the object is located. Therefore, when deep feature extraction is carried out in subsequent graph convolution operation, the relative space-time relation and the context perception relation between the objects can be explored. Meanwhile, the problem that the prior video question-answer model only considers the static characteristics of the object and lacks dynamic information analysis is solved by using a double-flow mechanism.
For cross-modal semantic alignment and the exploration of intra-modal interaction relationships, the invention provides a question-guided attention network model, improving to a certain extent the exploration of intra-modal interactions and inter-modal semantic alignment. The invention achieves better results on the relevant video question-answering datasets.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a network framework according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following figures and examples.
Examples
As shown in fig. 1 and fig. 2, a video question-answering method based on an object-oriented dual-flow attention network includes the following steps:
the method comprises the following steps of (1) carrying out data preprocessing on input data, and firstly sampling video frames in an average sampling mode aiming at an input video segment, wherein the sampling number of each video segment is T-10 frames. And then generating a target object on each frame by adopting a fast-RCNN target detection algorithm to obtain a plurality of candidate frames. A convolutional network is additionally used to extract static appearance features and dynamic behavior features for each video frame. In the invention, a ResNet-152 network trained on an ImageNet image library is used for extracting static appearance characteristics, and an I3D network trained on a Kinetics action recognition data set is used for extracting dynamic behavior characteristics of a video frame. And finally, mapping the candidate box of each target object to the feature map of the Conv5 layer of ResNet-152 and the feature map of the convolution layer of the last layer of I3D respectively by using a RoIAlign method to obtain the static appearance feature and the dynamic behavior feature of the target object.
Step (2): encode the video features with a dual-stream mechanism, where one stream is the static appearance stream of the objects and the other is their dynamic behavior stream.
2-1. Multidimensional feature aggregation. Given the set of position information of the target objects, the sets of static appearance and dynamic behavior features of the objects, and the static appearance and dynamic behavior features of the video frames, compute position-aware and context-aware features for each target object: the spatial and temporal position codes of each target object are computed with Formulas 2, 3, and 4; the position-aware features with Formula 5; and the context-aware features F^{a/m} with Formula 6.
2-2. The dual-stream high-order feature representation of the target objects is computed by passing the position- and context-aware features obtained in 2-1 through two graph convolution layers (Equation 7) and a residual network.
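The graph step of 2-2 (the Equation 7 adjacency, the Equation 8 graph convolution applied twice, plus a residual connection) can be sketched in NumPy; the weight shapes, random initialization, and exact residual placement are our assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_adjacency(F, W):
    """Equation 7: A = softmax(phi(F) phi(F)^T), with phi(F) = F W."""
    phi = F @ W
    return softmax(phi @ phi.T, axis=-1)

def gcn_layer(F, A, W):
    """Equation 8: GCN(F; A) = A (F W)."""
    return A @ (F @ W)

rng = np.random.default_rng(0)
N, d = 5, 16                      # 5 detected objects, 16-dim features
F = rng.standard_normal((N, d))   # position/context-aware object features
A = graph_adjacency(F, rng.standard_normal((d, d)) * 0.1)
H = gcn_layer(F, A, rng.standard_normal((d, d)) * 0.1)
H = gcn_layer(H, A, rng.standard_normal((d, d)) * 0.1)  # second graph conv
H = H + F                         # residual connection (our placement)
```

Each row of `A` is a softmax-normalized set of interaction weights over the other objects.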
Step (3): coding the input natural language question. Each word in the question is first embedded into a 300-dimensional feature space initialized with GloVe to obtain an embedding of each word. The word embeddings are then fed, in order, into a bidirectional recurrent neural network to obtain the local code F_Q ∈ R^{L×d} of each word and the global code F_q ∈ R^d of the question.
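A minimal stand-in for the question encoder: a plain bidirectional tanh recurrence replaces the bidirectional LSTM the invention uses, just to show how the per-word local codes F_Q and the global code F_q are formed; all weights here are random placeholders:

```python
import numpy as np

def birnn_encode(E, Wx, Wh, b):
    """Run a simple (Elman) recurrence over word embeddings E (L x e) in both
    directions and concatenate, giving local codes F_Q (L x 2h) and a global
    code F_q (2h,) from the final states of each direction."""
    h = Wh.shape[0]

    def run(seq):
        s = np.zeros(h)
        out = []
        for x in seq:
            s = np.tanh(x @ Wx + s @ Wh + b)
            out.append(s)
        return np.stack(out)

    fwd = run(E)                   # left-to-right states
    bwd = run(E[::-1])[::-1]       # right-to-left states, re-aligned
    F_Q = np.concatenate([fwd, bwd], axis=1)   # per-word local codes
    F_q = np.concatenate([fwd[-1], bwd[0]])    # global question code
    return F_Q, F_q

rng = np.random.default_rng(1)
L, e, h = 7, 300, 64               # 7 words, 300-d GloVe embeddings
E = rng.standard_normal((L, e)) * 0.01
F_Q, F_q = birnn_encode(E,
                        rng.standard_normal((e, h)) * 0.01,
                        rng.standard_normal((h, h)) * 0.01,
                        np.zeros(h))
```

Here d = 2h = 128, matching the F_Q ∈ R^{L×d}, F_q ∈ R^d shapes above.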
Step (4): cross-modal fusion of the video and the question. First, intra-modal interaction relations are analyzed and the feature representations of the two modalities are updated according to intra-modal self-attention weights: the dual-stream high-order features F^{a/m} of the video obtained in step (2) and the word local codes F_Q ∈ R^{L×d} obtained in step (3) are reconstructed according to Equations 9 and 10, respectively. Cross-modal semantic alignment is then performed between the reconstructed question features and the high-order target object features according to Equations 11 and 12, and the dual-stream high-order features of the video are updated. Finally, the updated dual-stream high-order features are fused with the global question code to obtain the answer code, with the specific formula:
a = FC(FC(f_a + f_m) + FC(f_q)) (Equation 16)
where FC denotes a fully connected layer, f_a is the global static appearance feature code, f_m is the global dynamic behavior feature code, and f_q is the global question feature code.
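Equation 16 is a small computation; a sketch with random placeholder weights (the bias terms are our assumption, since the patent writes only FC):

```python
import numpy as np

def fc(x, W, b):
    """A bare fully connected layer; any activation is omitted."""
    return x @ W + b

rng = np.random.default_rng(2)
d = 16
f_a = rng.standard_normal(d)   # global static-appearance code
f_m = rng.standard_normal(d)   # global dynamic-behavior code
f_q = rng.standard_normal(d)   # global question code

W1, W2, W3 = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
b = np.zeros(d)
# Equation 16: a = FC(FC(f_a + f_m) + FC(f_q))
a = fc(fc(f_a + f_m, W1, b) + fc(f_q, W2, b), W3, b)
```

The answer code `a` is then handed to the decoder of step (5).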
Step (5): decode the answer code obtained in step (4) to obtain the final answer.
A decoding mode is selected according to the question type. For open-ended questions, a linear classifier is used for decoding, with the parameters trained under the loss defined by Equation 13; for counting questions, a linear regressor is used for decoding and the parameters are trained with the loss defined by Equation 14; for multiple-choice questions, a linear classifier is used for decoding and the parameters are trained with the loss defined by Equation 15.
To test the contribution of the dual-stream mechanism and the attention mechanism in the proposed video question-answering method based on the object-oriented dual-stream attention network, ablation experiments were run with different model structures. The results are shown in Table 1. The dataset used in the ablation experiments is the TGIF-QA dataset, which contains four question types: Count (counting), Action (action recognition), Transition (state transition), and FrameQA (single-frame question answering). The first and third rows, Appearance and Motion, denote models that use only the static appearance stream or only the dynamic behavior stream as the video representation and do not use the attention module for cross-modal semantic alignment during fusion. The second and fourth rows denote models that likewise use only one stream but do apply the attention module for cross-modal semantic alignment. The last row shows the performance of the full dual-stream attention network model of the invention.
TABLE 1. Ablation performance comparison of different model structures
To further test the performance of the video question-answering method based on the object-oriented dual-stream attention network, the invention was compared with the L-GCN model, the best-performing prior model on the TGIF-QA dataset; the experimental comparison results are shown in Table 2.
TABLE 2. Performance comparison of the invention with the L-GCN model
Claims (6)
1. A video question-answering method based on an object-oriented double-flow attention network is characterized by comprising the following steps:
step (1), data preprocessing:
inputting a video segment to be asked and answered, firstly extracting static appearance characteristics and dynamic behavior characteristics of a video frame by using a convolution network, then extracting a target object in the video by using a target detection algorithm, and simultaneously extracting the static appearance characteristics and the dynamic behavior characteristics of the object by using the convolution network; taking the extracted static appearance characteristics and dynamic behavior characteristics of the video frame as context representation characteristics of a target object;
step (2), video feature coding:
coding video characteristics by adopting a double-stream mechanism, wherein the double streams are a static appearance stream and a dynamic behavior stream of a target object respectively;
in each stream, respectively combining the characteristics of the target object, the space-time coding of the target object and the context representation characteristics of the target object to obtain characteristics serving as nodes to construct a characteristic graph; performing graph convolution operation on the constructed feature graph to acquire high-order features of the target object in the video;
step (3), problem feature coding:
coding the problem by utilizing a recurrent neural network to obtain the local code of each word and the global code of the problem sentence;
step (4), cross-modal fusion of videos and problems:
in each stream, the high-order features of the target objects are first input into a self-attention network to obtain the intra-modal interaction relations between objects, and the high-order features of the video target objects are reconstructed according to these relations; similarly, the local features of each word are fed into a self-attention network to obtain the intra-modal interaction relations between words, and the question features are reconstructed according to these relations; the reconstructed high-order object features and the reconstructed question features are then input into a question-guided attention network to explore the cross-modal semantic relations between words and object features, and the high-order features of the video are updated;
fusing the high-order characteristics of the static appearance flow of the target object and the high-order characteristics of the dynamic behavior flow to obtain the double-flow high-order characteristics of the video;
fusing the double-current high-order characteristics of the video with the global problem code to obtain an answer code;
and (5) decoding the answer code obtained in the step (4) to obtain a final answer:
different decoding modes are adopted for different question types: a linear classifier is used for decoding open-ended questions and multiple-choice questions, and a linear regressor is used for decoding counting questions.
2. The video question-answering method based on the object-oriented dual-flow attention network according to claim 1, wherein the specific method in the step (1) is as follows:
inputting a video clip to be answered, dividing it into a fixed number of video frames, and performing static appearance feature extraction and dynamic behavior feature extraction on each frame to obtain the static appearance features and dynamic behavior features of the video frames, which serve as the context representation features of the target objects; then locating the position information b of each target object on each frame with an object detection algorithm, and aligning the position information onto the static appearance and dynamic behavior feature maps of the video frames with a RoIAlign alignment network, obtaining the static appearance feature O_a and dynamic behavior feature O_m of each object;
The specific formula of the object position information b is as follows:
b = [c_1, c_2, w, h] (Equation 1)
where c_1 and c_2 are the coordinates of the center of the target object, and w and h denote its width and height, respectively.
3. The video question-answering method based on the object-oriented dual-flow attention network according to claim 2, wherein the specific method in the step (2) is as follows:
the method comprises the steps of representing feature coding of a video by adopting a double-flow mechanism, coding static appearance features of an object and dynamic behavior features of the object in parallel by adopting the same coding mechanism, wherein a network for coding the static appearance features is called static appearance flow, and a network for coding the dynamic behavior features is called dynamic behavior flow; finally, respectively obtaining the high-order characteristics of the static appearance flow and the high-order characteristics of the dynamic behavior flow of the video; the specific process is as follows:
the static appearance stream and the dynamic behavior stream adopt the same network structure, so to simplify notation the superscript a/m denotes the features of both streams simultaneously;
given the features of the target objects in the video and the set of position information of the target objects, the position information b of each target object is first fed into a spatial position coding network and a temporal position coding network to obtain the spatial position code d_s and the temporal position code d_t of the object, respectively; the spatial position coding network consists of a multi-layer perceptron, and the temporal position coding network encodes the temporal position of the target object with sine and cosine functions of different frequencies; the specific formulas are as follows:
d_s = MLP(b) (Equation 2)
where b denotes the position information of the target object obtained by Equation 1; N denotes the number of target objects; d_p denotes the dimension of the code, with 0 ≤ 2j ≤ d_p;
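Equations 3 and 4 (the sine/cosine temporal code) are not reproduced in the text above; assuming they follow the standard form implied by the terms d_p and 2j (sine at even index 2j, cosine at 2j + 1), the temporal code could be sketched as:

```python
import numpy as np

def temporal_position_code(t, d_p):
    """Sine/cosine temporal code of dimension d_p for frame index t; this is
    the standard sinusoidal form, assumed here since the patent's exact
    Equations 3-4 are not reproduced in the text."""
    assert d_p % 2 == 0, "d_p is assumed even in this sketch"
    j = np.arange(0, d_p, 2)
    angle = t / np.power(10000.0, j / d_p)
    code = np.zeros(d_p)
    code[0::2] = np.sin(angle)   # even positions 2j
    code[1::2] = np.cos(angle)   # odd positions 2j + 1
    return code

d_t = temporal_position_code(t=3, d_p=8)
```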
then, the features of the target object are combined with its spatio-temporal position codes to obtain the position-aware features of the target object, with the specific formula as follows:
the position-aware features of the target object are combined with the features of the video frame to obtain the context-aware features, with the specific formula as follows:
after the position- and context-aware features of each object are obtained, an undirected complete graph G = (V, ε) is built with the features of each object as nodes and the interaction relations between objects as edges, where V denotes the node set and ε denotes the edge set; the interaction relation matrix between the objects is calculated by the following formula:
A^{a/m} = softmax(φ(F^{a/m}) φ(F^{a/m})^T) (Equation 7)
where φ(F^{a/m}) = F^{a/m} W, and W is a learnable weight matrix;
two graph convolution operations are then performed on the constructed undirected complete graph to obtain the high-order features of the objects, with the graph convolution formula as follows:
GCN(F^{a/m}; A^{a/m}) = A^{a/m}(F^{a/m} W) (Equation 8).
4. The video question-answering method based on the object-oriented dual-flow attention network according to claim 3, wherein the specific process of the step (3) is as follows:
first, each word in the question is mapped into a 300-dimensional word vector space initialized with GloVe to obtain the embedding of each word; each embedding vector is then fed into a bidirectional LSTM network to obtain the local code F_Q ∈ R^{L×d} of each word and the global code F_q ∈ R^d of the question sentence.
5. The video question-answering method based on the object-oriented dual-flow attention network according to claim 4, wherein the cross-modal fusion step in the step (4) is as follows:
as in step (2), the two streams use the same network structure, so the superscript a/m denotes the features of both streams;
defining a self-attention network to explore interaction relationships within the modality, the self-attention network formulation being as follows:
f = MultiHead(X, X, X) = [head_1, head_2, ..., head_h] W^O (Equation 9)
where d is the feature dimension; h denotes the number of heads of the multi-head attention; X denotes the visual features; and W^O and the per-head projection matrices are learnable parameter matrices;
a problem-oriented attention network is further defined to explore the interaction relationships among the modalities, and the formula of the problem-oriented attention network is as follows:
f = MultiHead(X, Y, Y) = [head_1, head_2, ..., head_h] W^O (Equation 11)
Wherein X represents a visual feature and Y represents a problem feature;
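Equations 9 and 11 share one multi-head attention computation and differ only in where the keys and values come from; a NumPy sketch, where the head split along the feature dimension and the 1/sqrt(d_k) scaling are standard assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(X, Y, Z, Wq, Wk, Wv, Wo, h):
    """Scaled dot-product attention with h heads.
    multi_head(X, X, X, ...) gives the self-attention of Equation 9;
    multi_head(X, Y, Y, ...) gives the question-guided attention of Equation 11."""
    d = X.shape[1]
    dk = d // h
    Q, K, V = X @ Wq, Y @ Wk, Z @ Wv
    heads = []
    for i in range(h):
        q = Q[:, i * dk:(i + 1) * dk]
        k = K[:, i * dk:(i + 1) * dk]
        v = V[:, i * dk:(i + 1) * dk]
        heads.append(softmax(q @ k.T / np.sqrt(dk)) @ v)
    return np.concatenate(heads, axis=1) @ Wo

rng = np.random.default_rng(3)
N, L, d, h = 5, 7, 16, 4           # 5 objects, 7 words, 4 heads
X = rng.standard_normal((N, d))    # visual (object) features
Y = rng.standard_normal((L, d))    # question (word) features
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
f_self = multi_head(X, X, X, *Ws, h=h)    # intra-modal (Equation 9)
f_cross = multi_head(X, Y, Y, *Ws, h=h)   # cross-modal (Equation 11)
```

In both cases the output has one row per visual feature, so the updated object features keep their shape.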
then, the high-order features F^{a/m} obtained in step (2) are fed into the defined self-attention network to obtain the intra-modal interaction relations between objects, and the word local features F_Q ∈ R^{L×d} obtained in step (3) are likewise fed into a self-attention network; the outputs of the two self-attention networks are fed into the defined question-guided attention network to obtain the cross-modal semantic relations between words and objects, and the high-order features of each stream of the video are updated under the guidance of the question;
finally, the global question code is fused with the dual-stream features of the video to obtain the answer code.
6. The video question-answering method based on the object-oriented dual-stream attention network of claim 5, wherein the feature decoding step of step (5) is as follows:
for open-ended questions, a linear classifier is used to predict the probability that each answer in the answer set is the correct answer, and the cross-entropy function is used as the loss function for parameter training:
where ŷ_i is the predicted probability that each answer is the correct answer, and y_i = 1 denotes that a_i is the correct answer;
for counting questions, a linear regressor is used to predict the answer, and the mean squared error is used as the loss function to train the parameters:
for multiple-choice questions, a linear classifier is used to predict the probability of each candidate answer being the correct answer, and the model is trained with the hinge loss as the loss function:
wherein s ispAnd snRepresenting the scores of the correct answer and the wrong answer, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210094738.3A CN114428866A (en) | 2022-01-26 | 2022-01-26 | Video question-answering method based on object-oriented double-flow attention network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210094738.3A CN114428866A (en) | 2022-01-26 | 2022-01-26 | Video question-answering method based on object-oriented double-flow attention network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114428866A true CN114428866A (en) | 2022-05-03 |
Family
ID=81313608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210094738.3A Pending CN114428866A (en) | 2022-01-26 | 2022-01-26 | Video question-answering method based on object-oriented double-flow attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114428866A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463760A (en) * | 2022-04-08 | 2022-05-10 | 华南理工大学 | Character image writing track recovery method based on double-stream coding |
CN114818989A (en) * | 2022-06-21 | 2022-07-29 | 中山大学深圳研究院 | Gait-based behavior recognition method and device, terminal equipment and storage medium |
CN116847101A (en) * | 2023-09-01 | 2023-10-03 | 易方信息科技股份有限公司 | Video bit rate ladder prediction method, system and equipment based on transform network |
WO2024012574A1 (en) * | 2022-07-15 | 2024-01-18 | 中国电信股份有限公司 | Image coding method and apparatus, image decoding method and apparatus, readable medium, and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||