CN114428866A - Video question-answering method based on object-oriented double-flow attention network - Google Patents

Video question-answering method based on object-oriented double-flow attention network

Info

Publication number
CN114428866A
CN114428866A
Authority
CN
China
Prior art keywords
video
target object
network
stream
flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210094738.3A
Other languages
Chinese (zh)
Inventor
俞俊
张欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202210094738.3A
Publication of CN114428866A
Legal status: Pending

Classifications

    • G06F 16/483 - Retrieval of multimedia data characterised by using metadata automatically derived from the content
    • G06F 16/487 - Retrieval of multimedia data characterised by using metadata, using geographical or spatial information, e.g. location
    • G06F 16/489 - Retrieval of multimedia data characterised by using metadata, using time information
    • G06F 18/253 - Pattern recognition; fusion techniques of extracted features
    • G06N 3/045 - Neural networks; combinations of networks
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 - Neural networks; learning methods
    • H04N 19/23 - Video object coding with coding of regions that are present throughout a whole video segment, e.g. sprites, background or mosaic
    • H04N 19/30 - Coding or decoding of digital video signals using hierarchical techniques, e.g. scalability

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a video question-answering method based on an object-oriented dual-stream attention network. The visual content of a video is represented by a dual-stream mechanism, in which one stream is the static appearance stream of the foreground objects and the other is their dynamic behavior stream. In each stream, the representation of an object combines the features of the object itself with its spatio-temporal coding and the contextual features of the scene in which it appears, so that the subsequent graph convolution operations can explore both the relative spatio-temporal relations and the context-aware relations between objects during deep feature extraction. The dual-stream mechanism also remedies the shortcoming of existing video question-answering models, which consider only the static features of objects and lack an analysis of dynamic information. The invention improves the exploration of intra-modal interactions and inter-modal semantic alignment and achieves better results on the relevant video question-answering datasets.

Description

Video question-answering method based on object-oriented double-flow attention network
Technical Field
The invention relates to the field of video question answering, and in particular to a video question-answering method based on an object-oriented dual-stream attention network comprising a static appearance stream and a dynamic behavior stream.
Background
Video question answering is a cross-disciplinary task spanning computer vision and natural language processing: given a video and a natural-language question, the system must output the correct answer. The task requires the system to fully understand information from two modalities, namely the appearance and behavior characteristics of the foreground objects in the video and the content asked by the question, and also to explore the semantic interactions between the two modalities. Most existing video question-answering methods extract visual features at the frame level; such feature extraction does not learn the actual subjects of the question, i.e. the foreground objects in the video, in sufficient detail, and it also ignores the modeling of interactions between objects. It is therefore important to design a model capable of mining the features of the target objects in a video as well as the interaction relationships between them.
Video question answering and image question answering both belong to the branch of visual question answering, in which the correct answer is produced from a given piece of visual information and a question. Although video question answering extends image question answering, there are many differences between them, so image question-answering methods cannot simply be transferred to the video task. The most important difference is that video question answering deals with long image sequences containing rich appearance and motion information rather than a single still image. In video question answering, the system therefore has to model both the appearance information and the motion information of the objects in the video to answer questions accurately.
Attention networks are a common mechanism for modeling relations between objects and are widely used in many types of machine learning tasks, such as natural language processing, image recognition and speech recognition. An attention mechanism helps the model assign different weights to different parts of the input so that more critical and important information is extracted and more accurate judgments are made, without imposing a large computational or memory overhead. The present invention proposes two variants of the attention mechanism to model intra-modal and inter-modal interactions.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a video question-answering method based on an object-oriented dual-stream attention network, which learns the static appearance features and dynamic behavior features of the foreground objects in a video in parallel, accurately models the interactions between the objects in the video, and realizes cross-modal semantic alignment. Better results are obtained on the relevant video question-answering datasets.
A video question-answering method based on an object-oriented double-flow attention network comprises the following steps:
step (1), data preprocessing:
inputting a video segment to be answered; first extracting the static appearance features and dynamic behavior features of the video frames with convolutional networks, then detecting the target objects in the video with an object detection algorithm and extracting the static appearance features and dynamic behavior features of each object with the convolutional networks; taking the extracted frame-level static appearance features and dynamic behavior features as the context representation features of the target objects;
step (2), video feature coding:
encoding the video features with a dual-stream mechanism, where the two streams are the static appearance stream and the dynamic behavior stream of the target objects.
In each stream, combining the features of each target object with its spatio-temporal coding and its context representation features, and using the combined features as nodes to construct a feature graph; performing graph convolution on the constructed feature graph to obtain the high-order features of the target objects in the video;
step (3), question feature encoding:
encoding the question with a recurrent neural network to obtain the local code of each word and the global code of the question sentence;
step (4), cross-modal fusion of the video and the question:
in each stream, the high-order features of the target objects are first fed into a self-attention network to obtain the intra-modal interactions between objects, and the high-order features of the video objects are reconstructed according to these interactions; similarly, the local features of the words are fed into a self-attention network to obtain the intra-modal interactions between words, and the question features are reconstructed according to these interactions. The reconstructed high-order object features and the reconstructed question features are then input into a question-guided attention network to explore the cross-modal semantic relations between words and object features, and the high-order features of the video are updated.
The high-order features of the static appearance stream and of the dynamic behavior stream of the target objects are fused to obtain the dual-stream high-order features of the video.
The dual-stream high-order features of the video are fused with the global question code to obtain the answer code;
step (5), decoding the answer code obtained in step (4) to obtain the final answer:
different decoders are adopted for different question types: a linear classifier is used to decode open-ended questions and multiple-choice questions, and a linear regressor is used to decode counting questions;
Further, the specific method of step (1) is as follows:
Input the video clip to be answered, divide it into a fixed number of video frames, and extract the corresponding static appearance features and dynamic behavior features of each frame; the resulting frame-level static appearance features and dynamic behavior features serve as the context representation features of the target objects. Then locate the position information b of each target object on each frame with an object detection algorithm, and align this position information to the static appearance features and dynamic behavior features of the video frames with a RoIAlign network to obtain the static appearance features O^a and dynamic behavior features O^m of each object.
The position information b of a target object is given by:
b = [c1, c2, w, h]   (formula 1)
where c1 and c2 are the coordinates of the center of the target object, and w and h are its width and height, respectively.
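By way of illustration only, the following minimal PyTorch sketch shows how per-object features can be pooled from a frame feature map with RoIAlign, given boxes in the b = [c1, c2, w, h] form of formula 1. The tensor shapes, the 224x224 image size and the coordinate-conversion helper are assumptions for the example, not part of the claimed method.

```python
import torch
from torchvision.ops import roi_align

def center_to_corner(boxes_cwh: torch.Tensor) -> torch.Tensor:
    """Convert [c1, c2, w, h] (centre x/y, width, height) to [x1, y1, x2, y2]."""
    c1, c2, w, h = boxes_cwh.unbind(-1)
    return torch.stack([c1 - w / 2, c2 - h / 2, c1 + w / 2, c2 + h / 2], dim=-1)

# One frame's appearance feature map, e.g. a ResNet conv output: (1, C, H, W)
frame_feat = torch.randn(1, 2048, 14, 14)
# Two detected objects on that frame, in image coordinates (image assumed to be 224x224)
boxes_cwh = torch.tensor([[112.0, 112.0, 80.0, 60.0],
                          [ 60.0,  50.0, 40.0, 40.0]])
rois = torch.cat([torch.zeros(len(boxes_cwh), 1),      # batch-index column required by roi_align
                  center_to_corner(boxes_cwh)], dim=1)
# spatial_scale maps image coordinates (224) onto the 14x14 feature map
pooled = roi_align(frame_feat, rois, output_size=(7, 7), spatial_scale=14 / 224)
obj_feat = pooled.mean(dim=(2, 3))                     # (N, 2048) per-object features, cf. O^a
print(obj_feat.shape)
```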
Further, the specific method of the step (2) is as follows:
the method adopts a double-stream mechanism to represent the feature coding of the video, the static appearance feature of an object and the dynamic behavior feature of the object are coded in parallel by adopting the same coding mechanism, a network for coding the static appearance feature is called static appearance flow, and a network for coding the dynamic behavior feature is called dynamic behavior flow. And finally, respectively obtaining the high-order characteristics of the static appearance flow and the high-order characteristics of the dynamic behavior flow of the video. The specific process is as follows:
the static appearance stream and the dynamic behavior stream adopt the same network structure, so for simplifying the expression, the upper corner mark a/m is used for simultaneously representing the characteristics of the two streams.
Given characteristics of target objects in video
Figure BDA0003490585570000041
And a set of position information of the target object
Figure BDA0003490585570000042
Firstly, the position information b of the target object is transmitted into a space position coding network and a time position coding network to respectively obtain the space position codes d of the objectsAnd a time-position code dtThe spatial position coding network is composed of a plurality of layers of perceptrons, and the time position coding network adopts sine and cosine functions with different frequencies to code the time position of the target object. The specific formula is as follows:
dsmlp (b) (equation 2)
Figure BDA0003490585570000043
Figure BDA0003490585570000044
Wherein b represents the position information of the target object found by formula 1; n represents the number of target objects; dpRepresents the dimension of the code, and 0 ≦ 2j ≦ dp
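By way of illustration only, a minimal PyTorch sketch of formulas 2-4 is given below: a small MLP encodes the box b into the spatial code d_s, and sinusoids of different frequencies encode the frame index into d_t. The layer sizes and the value of d_p are assumptions for the example.

```python
import torch
import torch.nn as nn

d_p = 128
spatial_mlp = nn.Sequential(nn.Linear(4, d_p), nn.ReLU(), nn.Linear(d_p, d_p))

def temporal_encoding(frame_idx: torch.Tensor, dim: int = d_p) -> torch.Tensor:
    """Sine/cosine temporal position code, one row per object (formulas 3-4)."""
    two_j = torch.arange(0, dim, 2, dtype=torch.float32)     # the values 2j
    freq = torch.pow(10000.0, two_j / dim)                   # 10000^(2j/d_p)
    angles = frame_idx.unsqueeze(-1) / freq                  # (N, d_p/2)
    d_t = torch.zeros(frame_idx.size(0), dim)
    d_t[:, 0::2] = torch.sin(angles)                         # even positions
    d_t[:, 1::2] = torch.cos(angles)                         # odd positions
    return d_t

b = torch.tensor([[112.0, 112.0, 80.0, 60.0]])               # [c1, c2, w, h]
d_s = spatial_mlp(b)                                         # formula 2
d_t = temporal_encoding(torch.tensor([3.0]))                 # object appearing on frame 3
print(d_s.shape, d_t.shape)
```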
The features of each target object are then combined with its spatio-temporal position codes to obtain the position-aware features of the target object (formula 5).
The position-aware features of the target object are then combined with the features of the video frames to obtain the context-aware features F^{a/m} (formula 6).
After the features of each object with position information and context information are obtained, an undirected complete graph G = (V, ε) is constructed, in which the features of each object form a node and the interaction between two objects forms an edge; V denotes the node set and ε denotes the edge set. The interaction relation matrix between the objects is computed as:
A^{a/m} = softmax(φ(F^{a/m}) φ(F^{a/m})^T)   (formula 7)
where φ(F^{a/m}) = F^{a/m} W and W is a learnable weight matrix.
Two graph convolution operations are then performed on the constructed undirected complete graph to obtain the high-order features of the objects, where the graph convolution is:
GCN(F^{a/m}; A^{a/m}) = A^{a/m} (F^{a/m} W)   (formula 8)
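By way of illustration only, the following minimal PyTorch sketch implements formulas 7 and 8: a learnable projection φ produces a soft adjacency A = softmax(φ(F)φ(F)^T), and two graph-convolution layers A(FW) refine the object features. The feature dimension and the ReLU between the two layers are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as fn

class ObjectGraph(nn.Module):
    def __init__(self, d: int = 512):
        super().__init__()
        self.phi = nn.Linear(d, d, bias=False)   # phi(F) = F W
        self.gc1 = nn.Linear(d, d, bias=False)   # weight of the first graph convolution
        self.gc2 = nn.Linear(d, d, bias=False)   # weight of the second graph convolution

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # feats: (N, d) context-aware features
        p = self.phi(feats)
        adj = fn.softmax(p @ p.t(), dim=-1)      # formula 7: inter-object relation matrix A
        h = fn.relu(adj @ self.gc1(feats))       # formula 8, first graph convolution
        return adj @ self.gc2(h)                 # formula 8, second graph convolution

objects = torch.randn(12, 512)                   # 12 objects with context-aware features
high_order = ObjectGraph()(objects)              # high-order object features
print(high_order.shape)
```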
Further, the step (3) of performing feature coding on the problem by using the recurrent neural network specifically includes the following steps:
firstly, mapping each word in the question to a 300-dimensional word vector space initialized by GloVe to obtain the code of each word, and then transmitting each code vector to a bidirectional LSTM network to obtain the local code F of each wordQ∈RL×dGlobal encoding oF problem sentences oFq∈Rd
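By way of illustration only, a minimal PyTorch sketch of the question encoder follows: GloVe-style 300-dimensional word embeddings feed a bidirectional LSTM, whose per-word outputs give F_Q; taking the final hidden states as the global code F_q is an assumption for the example.

```python
import torch
import torch.nn as nn

vocab_size, d = 10000, 256
embed = nn.Embedding(vocab_size, 300)            # in practice initialised from GloVe vectors
bilstm = nn.LSTM(input_size=300, hidden_size=d // 2,
                 batch_first=True, bidirectional=True)

question = torch.randint(0, vocab_size, (1, 9))  # one question of L = 9 word ids
words = embed(question)                          # (1, L, 300)
F_Q, (h_n, _) = bilstm(words)                    # F_Q: (1, L, d) local word codes
F_q = torch.cat([h_n[0], h_n[1]], dim=-1)        # (1, d) global question code
print(F_Q.shape, F_q.shape)
```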
Further, the cross-modal fusion step described in step (4) is as follows:
as in step (2), the two streams use the same network structure, and therefore the characteristics of the two streams are denoted by superscript a/m.
Defining a self-attention network to explore interaction relationships within the modality, the self-attention network formulation being as follows:
f=MultiHead(X,X,X)=[head1,head2,...,headh]WO(formula 9)
Figure BDA0003490585570000051
Wherein the content of the first and second substances,
Figure BDA0003490585570000052
d is a characteristic dimension; h indicates that the multi-head attention machine is provided with h heads; x represents the visual feature: wO
Figure BDA0003490585570000053
Figure BDA0003490585570000054
Respectively, learnable parameter matrices.
Additionally, a problem-oriented attention network is defined to explore the interaction relationships between modalities, the formula of the problem-oriented attention network is as follows:
f=MultiHead(X,Y,Y)=[head1,head2,...,headh]WO(formula 11)
Figure BDA0003490585570000061
Wherein X represents a visual feature and Y represents a problem feature.
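By way of illustration only, the following minimal PyTorch sketch uses torch.nn.MultiheadAttention for both attention variants: MultiHead(X, X, X) for intra-modal self-attention (formulas 9-10) and MultiHead(X, Y, Y) for question-guided attention (formulas 11-12). The dimensions and the reuse of one module for both modalities are assumptions; the projection matrices W^Q, W^K, W^V and W^O live inside the module.

```python
import torch
import torch.nn as nn

d, h = 512, 8
self_attn = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)
guided_attn = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)

X = torch.randn(1, 12, d)   # high-order object features of one stream
Y = torch.randn(1, 9, d)    # local word features F_Q

X_intra, _ = self_attn(X, X, X)                        # formulas 9-10 on the visual modality
Y_intra, _ = self_attn(Y, Y, Y)                        # the same self-attention form on the words
X_cross, _ = guided_attn(X_intra, Y_intra, Y_intra)    # formulas 11-12: question-guided attention
print(X_cross.shape)
```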
The high-order features F^{a/m} obtained in step (2) are then fed into the defined self-attention network to obtain the intra-modal interactions between objects, and the local text features F_Q ∈ R^{L×d} obtained in step (3) are fed into a self-attention network to obtain the intra-modal interactions between words; the outputs of the two self-attention networks are passed into the question-guided attention network to obtain the cross-modal semantic relations between words and objects, and the high-order features of each stream of the video are updated under the guidance of the question.
Finally, the global feature code of the question is fused with the dual-stream features of the video to obtain the answer code.
Further, the feature decoding step in step (5) is as follows:
for open questions, a linear classifier is used to predict the probability that each answer in the answer set is the correct answer, and cross entropy function is used as loss function to perform parameter training:
Figure BDA0003490585570000062
wherein the content of the first and second substances,
Figure BDA0003490585570000063
for the probability that each answer is the correct answer, yi1 denotes aie.A is the correct answer.
For counting type problems, a linear regressor is adopted to predict answers, and the mean square error is used as a loss function to train parameters:
Figure BDA0003490585570000071
for the multi-choice question, a linear classifier is adopted to predict the probability of each candidate answer being a correct answer, and the model is trained by using a hinge loss function as a loss function:
Figure BDA0003490585570000072
wherein s ispAnd snRepresenting the scores of the correct answer and the wrong answer, respectively.
The invention has the beneficial effects that:
the invention uses a double-flow mechanism to represent the visual content of the video on the basis of the predecessor. One of the streams is a static appearance stream of foreground objects and the other stream is a dynamic behavior stream of foreground objects. In each stream, the characteristics of the object include both the characteristics of the object itself, as well as the spatio-temporal coding of the object and the contextual information characteristics of the scene in which the object is located. Therefore, when deep feature extraction is carried out in subsequent graph convolution operation, the relative space-time relation and the context perception relation between the objects can be explored. Meanwhile, the problem that the prior video question-answer model only considers the static characteristics of the object and lacks dynamic information analysis is solved by using a double-flow mechanism.
The invention provides an attention network model for problem guidance aiming at the problem of cross-modal semantic alignment and intra-modal interaction relation exploration, and improves the exploration capability of intra-modal interaction and inter-modal semantic alignment to a certain extent. The invention achieves better results on the relevant video question-answer data set.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a network framework according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following figures and examples.
The invention provides a video question-answering method based on an object-oriented double-flow attention network, which comprises the following steps:
step (1), data preprocessing:
Input the video clip to be answered, divide it into a fixed number of video frames, and extract the corresponding static appearance features and dynamic behavior features of each frame; the resulting frame-level static appearance features and dynamic behavior features serve as the context representation features of the target objects. Then locate the position information b of each target object on each frame with an object detection algorithm, and align this position information to the static appearance features and dynamic behavior features of the video frames with a RoIAlign network to obtain the static appearance features O^a and dynamic behavior features O^m of each object.
The position information b of a target object is given by:
b = [c1, c2, w, h]   (formula 1)
where c1 and c2 are the coordinates of the center of the target object, and w and h are its width and height, respectively.
Step (2), video feature coding:
A dual-stream mechanism is adopted for the feature encoding of the video: the static appearance features and the dynamic behavior features of the objects are encoded in parallel with the same encoding mechanism, where the network encoding the static appearance features is called the static appearance stream and the network encoding the dynamic behavior features is called the dynamic behavior stream. This finally yields the high-order features of the static appearance stream and of the dynamic behavior stream of the video. The specific process is as follows:
The static appearance stream and the dynamic behavior stream use the same network structure, so to simplify notation the superscript a/m denotes the features of either stream.
Given the features O^{a/m} of the N target objects in the video and the set {b_i, i = 1, ..., N} of their position information, the position information b of each target object is first passed through a spatial position encoding network and a temporal position encoding network to obtain the spatial position code d_s and the temporal position code d_t of the object. The spatial position encoding network is a multi-layer perceptron, and the temporal position encoding network encodes the temporal position of the target object with sine and cosine functions of different frequencies. The specific formulas are:
d_s = MLP(b)   (formula 2)
d_t(t, 2j) = sin(t / 10000^(2j/d_p))   (formula 3)
d_t(t, 2j+1) = cos(t / 10000^(2j/d_p))   (formula 4)
where b is the position information of the target object given by formula 1, t is the temporal position (frame index) of the target object, N is the number of target objects, and d_p is the dimension of the code, with 0 ≤ 2j ≤ d_p.
The features of each target object are then combined with its spatio-temporal position codes to obtain the position-aware features of the target object (formula 5).
The position-aware features of the target object are then combined with the features of the video frames to obtain the context-aware features F^{a/m} (formula 6).
After the features of each object with position information and context information are obtained, an undirected complete graph G = (V, ε) is constructed, in which the features of each object form a node and the interaction between two objects forms an edge; V denotes the node set and ε denotes the edge set. The interaction relation matrix between the objects is computed as:
A^{a/m} = softmax(φ(F^{a/m}) φ(F^{a/m})^T)   (formula 7)
where φ(F^{a/m}) = F^{a/m} W and W is a learnable weight matrix.
Two graph convolution operations are then performed on the constructed undirected complete graph to obtain the high-order features of the objects, where the graph convolution is:
GCN(F^{a/m}; A^{a/m}) = A^{a/m} (F^{a/m} W)   (formula 8)
Step (3), the question is feature-encoded with a recurrent neural network; the specific process is as follows:
Each word in the question is first mapped into a 300-dimensional word-vector space initialized with GloVe to obtain the code of each word, and the code vectors are then fed into a bidirectional LSTM network to obtain the local codes F_Q ∈ R^{L×d} of the words and the global code F_q ∈ R^d of the question sentence.
Step (4), cross-modal fusion of the video and the question:
As in step (2), the two streams use the same network structure, so the features of the two streams are denoted by the superscript a/m.
A self-attention network is defined to explore the interactions within a modality; its formulation is:
f = MultiHead(X, X, X) = [head_1, head_2, ..., head_h] W^O   (formula 9)
head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)   (formula 10)
where d is the feature dimension, h is the number of heads of the multi-head attention, X denotes the visual features, and W^O, W_i^Q, W_i^K and W_i^V are learnable parameter matrices.
In addition, a question-guided attention network is defined to explore the interactions between modalities; its formulation is:
f = MultiHead(X, Y, Y) = [head_1, head_2, ..., head_h] W^O   (formula 11)
head_i = Attention(X W_i^Q, Y W_i^K, Y W_i^V)   (formula 12)
where X denotes the visual features and Y denotes the question features.
The high-order features F^{a/m} obtained in step (2) are then fed into the defined self-attention network to obtain the intra-modal interactions between objects, and the local text features F_Q ∈ R^{L×d} obtained in step (3) are fed into a self-attention network to obtain the intra-modal interactions between words; the outputs of the two self-attention networks are passed into the question-guided attention network to obtain the cross-modal semantic relations between words and objects, and the high-order features of each stream of the video are updated under the guidance of the question.
Finally, the global feature code of the question is fused with the dual-stream features of the video to obtain the answer code.
Step (5), decoding the answer code obtained in step (4) to obtain the final answer:
For open-ended questions, a linear classifier is used to predict the probability that each answer in the answer set is the correct answer, and a cross-entropy function is used as the loss function for parameter training:
L = -Σ_i y_i log(p_i)   (formula 13)
where p_i is the predicted probability that answer a_i is the correct answer, and y_i = 1 indicates that a_i ∈ A is the correct answer.
For counting questions, a linear regressor is used to predict the answer, and the mean squared error between the predicted value ŷ and the ground-truth answer y is used as the loss function for parameter training:
L = (ŷ - y)^2   (formula 14)
For multiple-choice questions, a linear classifier is used to predict the probability of each candidate answer being the correct answer, and the model is trained with a hinge loss as the loss function:
L = max(0, 1 + s_n - s_p)   (formula 15)
where s_p and s_n denote the scores of the correct answer and of a wrong answer, respectively.
Examples
As shown in Fig. 1 and Fig. 2, a video question-answering method based on an object-oriented dual-stream attention network comprises the following steps:
the method comprises the following steps of (1) carrying out data preprocessing on input data, and firstly sampling video frames in an average sampling mode aiming at an input video segment, wherein the sampling number of each video segment is T-10 frames. And then generating a target object on each frame by adopting a fast-RCNN target detection algorithm to obtain a plurality of candidate frames. A convolutional network is additionally used to extract static appearance features and dynamic behavior features for each video frame. In the invention, a ResNet-152 network trained on an ImageNet image library is used for extracting static appearance characteristics, and an I3D network trained on a Kinetics action recognition data set is used for extracting dynamic behavior characteristics of a video frame. And finally, mapping the candidate box of each target object to the feature map of the Conv5 layer of ResNet-152 and the feature map of the convolution layer of the last layer of I3D respectively by using a RoIAlign method to obtain the static appearance feature and the dynamic behavior feature of the target object.
Step (2), the video is feature-encoded with a dual-stream mechanism, where one stream is the static appearance stream of the objects and the other is their dynamic behavior stream.
2-1. Multidimensional feature aggregation. Given the set of position information of the target objects, the sets of static appearance features and dynamic behavior features of the objects, and the static appearance features and dynamic behavior features of the video frames, features with location awareness and context awareness are computed for each target object:
The spatial position code and the temporal position code of each target object are computed according to formulas 2, 3 and 4, and the position-aware features of each target object are computed according to formula 5.
The context-aware features F^{a/m} of each target object are computed according to formula 6.
2-2. The dual-stream high-order feature representation of the target objects is computed by passing the position-aware and context-aware features obtained in 2-1 through two graph convolution layers (formulas 7 and 8) and a residual network.
Step (3), the input natural-language question is encoded: each word in the question is embedded into a 300-dimensional feature space initialized with GloVe to obtain its local code, and the local codes are then fed in order into a bidirectional recurrent neural network to obtain the local codes F_Q ∈ R^{L×d} of the words and the global code F_q ∈ R^d of the question.
Step (4), cross-modal fusion of the video and the question. First, the intra-modal interactions are analyzed and the feature representations of the two modalities are updated according to the intra-modal self-attention weights: the dual-stream high-order features F^{a/m} of the video obtained in step (2) and the local word features F_Q ∈ R^{L×d} obtained in step (3) are reconstructed according to formulas 9 and 10, respectively. Cross-modal semantic alignment is then performed between the reconstructed question features and the high-order object features according to formulas 11 and 12, and the dual-stream high-order features of the video are updated. Finally, the updated dual-stream high-order features are fused with the global question code to obtain the answer code, as follows:
a = FC(FC(f^a + f^m) + FC(f^q))   (formula 16)
where FC denotes a fully connected layer, f^a is the global static appearance feature code, f^m is the global dynamic behavior feature code, and f^q is the global question feature code.
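By way of illustration only, formula 16 can be written as the following minimal PyTorch sketch; the feature dimension and the use of three separate fully connected layers are assumptions for the example.

```python
import torch
import torch.nn as nn

d = 512
fc_v, fc_q, fc_out = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

f_a = torch.randn(1, d)      # global static-appearance code f^a
f_m = torch.randn(1, d)      # global dynamic-behavior code f^m
f_q = torch.randn(1, d)      # global question code f^q

a = fc_out(fc_v(f_a + f_m) + fc_q(f_q))   # a = FC(FC(f^a + f^m) + FC(f^q)), formula 16
print(a.shape)
```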
Step (5), decoding the answer code obtained in step (4) to obtain the final answer;
Different decoders are selected according to the question type: for open-ended questions a linear classifier is used for decoding and the parameters are trained with the loss defined by formula 13; for counting questions a linear regressor is used for decoding and the parameters are trained with the loss defined by formula 14; for multiple-choice questions a linear classifier is used for decoding and the parameters are trained with the loss defined by formula 15.
In order to test the performance of the dual-stream mechanism and the attention mechanism in the video question-answering method based on the object-oriented dual-stream attention network, ablation experiments were conducted with different model structures. The results are shown in Table 1. The dataset used for the ablation experiments is the TGIF-QA dataset, which contains four types of questions: Count (counting), Action (action recognition), Transition (state transition) and FrameQA (image question answering). The first and third rows, Appearance and Motion, represent models that use only the static appearance stream or only the dynamic behavior stream as the representation of the video and do not use the attention module for cross-modal semantic alignment during cross-modal fusion. The second and fourth rows represent models that use only the static appearance stream or only the dynamic behavior stream while employing the attention module for cross-modal semantic alignment. The last row shows the performance of the dual-stream attention network model of the invention.
Table 1. Comparison of ablation experiment performance for different model structures (table not reproduced).
To further test the performance of the video question-answering method based on the object-oriented dual-stream attention network, the invention was compared with the L-GCN model, the best-performing model on the TGIF-QA dataset; the experimental comparison is shown in Table 2.
Table 2. Performance comparison between the invention and the L-GCN model (table not reproduced).

Claims (6)

1. A video question-answering method based on an object-oriented double-flow attention network is characterized by comprising the following steps:
step (1), data preprocessing:
inputting a video segment to be answered; first extracting the static appearance features and dynamic behavior features of the video frames with convolutional networks, then detecting the target objects in the video with an object detection algorithm and extracting the static appearance features and dynamic behavior features of each object with the convolutional networks; taking the extracted frame-level static appearance features and dynamic behavior features as the context representation features of the target objects;
step (2), video feature coding:
encoding the video features with a dual-stream mechanism, wherein the two streams are the static appearance stream and the dynamic behavior stream of the target objects respectively;
in each stream, combining the features of each target object with its spatio-temporal coding and its context representation features, and using the combined features as nodes to construct a feature graph; performing graph convolution on the constructed feature graph to obtain the high-order features of the target objects in the video;
step (3), question feature encoding:
encoding the question with a recurrent neural network to obtain the local code of each word and the global code of the question sentence;
step (4), cross-modal fusion of the video and the question:
in each stream, first inputting the high-order features of the target objects into a self-attention network to obtain the intra-modal interactions between objects, and reconstructing the high-order features of the video objects according to these interactions; similarly, feeding the local features of the words into a self-attention network to obtain the intra-modal interactions between words, and reconstructing the question features according to these interactions; then inputting the reconstructed high-order object features and the reconstructed question features into a question-guided attention network to explore the cross-modal semantic relations between words and object features, and updating the high-order features of the video;
fusing the high-order features of the static appearance stream and of the dynamic behavior stream of the target objects to obtain the dual-stream high-order features of the video;
fusing the dual-stream high-order features of the video with the global question code to obtain the answer code;
step (5), decoding the answer code obtained in step (4) to obtain the final answer:
different decoders being adopted for different question types: a linear classifier is used to decode open-ended questions and multiple-choice questions, and a linear regressor is used to decode counting questions.
2. The video question-answering method based on the object-oriented dual-flow attention network according to claim 1, wherein the specific method in the step (1) is as follows:
inputting the video clip to be answered, dividing the video clip into a fixed number of video frames, and extracting the corresponding static appearance features and dynamic behavior features of each frame to obtain the frame-level static appearance features and dynamic behavior features, which serve as the context representation features of the target objects; then locating the position information b of each target object on each frame with an object detection algorithm, and aligning this position information to the static appearance features and dynamic behavior features of the video frames with a RoIAlign network to obtain the static appearance features O^a and dynamic behavior features O^m of each object;
the position information b of a target object is given by:
b = [c1, c2, w, h]   (formula 1)
wherein c1 and c2 are the coordinates of the center of the target object, and w and h are its width and height, respectively.
3. The video question-answering method based on the object-oriented dual-flow attention network according to claim 2, wherein the specific method in the step (2) is as follows:
representing the feature encoding of the video with a dual-stream mechanism, wherein the static appearance features and the dynamic behavior features of the objects are encoded in parallel with the same encoding mechanism, the network encoding the static appearance features is called the static appearance stream, and the network encoding the dynamic behavior features is called the dynamic behavior stream; finally obtaining the high-order features of the static appearance stream and of the dynamic behavior stream of the video respectively; the specific process is as follows:
the static appearance stream and the dynamic behavior stream use the same network structure, so to simplify notation the superscript a/m denotes the features of either stream;
given the features O^{a/m} of the N target objects in the video and the set {b_i, i = 1, ..., N} of their position information, the position information b of each target object is first passed through a spatial position encoding network and a temporal position encoding network to obtain the spatial position code d_s and the temporal position code d_t of the object; the spatial position encoding network is a multi-layer perceptron, and the temporal position encoding network encodes the temporal position of the target object with sine and cosine functions of different frequencies; the specific formulas are:
d_s = MLP(b)   (formula 2)
d_t(t, 2j) = sin(t / 10000^(2j/d_p))   (formula 3)
d_t(t, 2j+1) = cos(t / 10000^(2j/d_p))   (formula 4)
wherein b is the position information of the target object given by formula 1, t is the temporal position (frame index) of the target object, N is the number of target objects, and d_p is the dimension of the code, with 0 ≤ 2j ≤ d_p;
then combining the features of each target object with its spatio-temporal position codes to obtain the position-aware features of the target object (formula 5);
combining the position-aware features of the target object with the features of the video frames to obtain the context-aware features F^{a/m} (formula 6);
after the features of each object with position information and context information are obtained, constructing an undirected complete graph G = (V, ε), in which the features of each object form a node and the interaction between two objects forms an edge, wherein V represents the node set and ε represents the edge set; the interaction relation matrix between the objects is computed as:
A^{a/m} = softmax(φ(F^{a/m}) φ(F^{a/m})^T)   (formula 7)
wherein φ(F^{a/m}) = F^{a/m} W and W is a learnable weight matrix;
then performing two graph convolution operations on the constructed undirected complete graph to obtain the high-order features of the objects, wherein the graph convolution is:
GCN(F^{a/m}; A^{a/m}) = A^{a/m} (F^{a/m} W)   (formula 8).
4. The video question-answering method based on the object-oriented dual-flow attention network according to claim 3, wherein the specific process of the step (3) is as follows:
firstly, mapping each word in the question into a 300-dimensional word-vector space initialized with GloVe to obtain the code of each word, and then feeding the code vectors into a bidirectional LSTM network to obtain the local codes F_Q ∈ R^{L×d} of the words and the global code F_q ∈ R^d of the question sentence.
5. The video question-answering method based on the object-oriented dual-flow attention network according to claim 4, wherein the cross-modal fusion step in the step (4) is as follows:
as in step (2), the two streams use the same network structure, so the features of the two streams are denoted by the superscript a/m;
a self-attention network is defined to explore the interactions within a modality, and its formulation is as follows:
f = MultiHead(X, X, X) = [head_1, head_2, ..., head_h] W^O   (formula 9)
head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)   (formula 10)
wherein d is the feature dimension, h is the number of heads of the multi-head attention, X denotes the visual features, and W^O, W_i^Q, W_i^K and W_i^V are learnable parameter matrices;
a question-guided attention network is further defined to explore the interactions between the modalities, and its formulation is as follows:
f = MultiHead(X, Y, Y) = [head_1, head_2, ..., head_h] W^O   (formula 11)
head_i = Attention(X W_i^Q, Y W_i^K, Y W_i^V)   (formula 12)
wherein X denotes the visual features and Y denotes the question features;
then feeding the high-order features F^{a/m} obtained in step (2) into the defined self-attention network to obtain the intra-modal interactions between objects, feeding the local text features F_Q ∈ R^{L×d} obtained in step (3) into a self-attention network to obtain the intra-modal interactions between words, passing the outputs of the two self-attention networks into the defined question-guided attention network to obtain the cross-modal semantic relations between words and objects, and updating the high-order features of each stream of the video under the guidance of the question;
finally, fusing the global feature code of the question with the dual-stream features of the video to obtain the answer code.
6. The video question-answering method based on the object-oriented dual-stream attention network of claim 5, wherein the feature decoding step of step (5) is as follows:
for open-ended questions, a linear classifier is used to predict the probability that each answer in the answer set is the correct answer, and a cross-entropy function is used as the loss function for parameter training:
L = -Σ_i y_i log(p_i)   (formula 13)
wherein p_i is the predicted probability that answer a_i is the correct answer, and y_i = 1 indicates that a_i ∈ A is the correct answer;
for counting questions, a linear regressor is used to predict the answer, and the mean squared error between the predicted value ŷ and the ground-truth answer y is used as the loss function for parameter training:
L = (ŷ - y)^2   (formula 14)
for multiple-choice questions, a linear classifier is used to predict the probability of each candidate answer being the correct answer, and the model is trained with a hinge loss as the loss function:
L = max(0, 1 + s_n - s_p)   (formula 15)
wherein s_p and s_n denote the scores of the correct answer and of a wrong answer, respectively.
CN202210094738.3A 2022-01-26 2022-01-26 Video question-answering method based on object-oriented double-flow attention network Pending CN114428866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210094738.3A CN114428866A (en) 2022-01-26 2022-01-26 Video question-answering method based on object-oriented double-flow attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210094738.3A CN114428866A (en) 2022-01-26 2022-01-26 Video question-answering method based on object-oriented double-flow attention network

Publications (1)

Publication Number Publication Date
CN114428866A true CN114428866A (en) 2022-05-03

Family

ID=81313608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210094738.3A Pending CN114428866A (en) 2022-01-26 2022-01-26 Video question-answering method based on object-oriented double-flow attention network

Country Status (1)

Country Link
CN (1) CN114428866A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463760A (en) * 2022-04-08 2022-05-10 华南理工大学 Character image writing track recovery method based on double-stream coding
CN114463760B (en) * 2022-04-08 2022-06-28 华南理工大学 Character image writing track recovery method based on double-stream coding
CN114818989A (en) * 2022-06-21 2022-07-29 中山大学深圳研究院 Gait-based behavior recognition method and device, terminal equipment and storage medium
CN114818989B (en) * 2022-06-21 2022-11-08 中山大学深圳研究院 Gait-based behavior recognition method and device, terminal equipment and storage medium
WO2024012574A1 (en) * 2022-07-15 2024-01-18 中国电信股份有限公司 Image coding method and apparatus, image decoding method and apparatus, readable medium, and electronic device
CN116847101A (en) * 2023-09-01 2023-10-03 易方信息科技股份有限公司 Video bit rate ladder prediction method, system and equipment based on transform network
CN116847101B (en) * 2023-09-01 2024-02-13 易方信息科技股份有限公司 Video bit rate ladder prediction method, system and equipment based on transform network

Similar Documents

Publication Publication Date Title
CN114428866A (en) Video question-answering method based on object-oriented double-flow attention network
CN109766427B (en) Intelligent question-answering method based on collaborative attention for virtual learning environment
CN110148318B (en) Digital teaching assistant system, information interaction method and information processing method
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN112487949B (en) Learner behavior recognition method based on multi-mode data fusion
JP2022018066A (en) Loop detection method based on convolutional perception hash algorithm
CN110852256A (en) Method, device and equipment for generating time sequence action nomination and storage medium
CN113870395A (en) Animation video generation method, device, equipment and storage medium
CN113283336A (en) Text recognition method and system
Sha et al. Neural knowledge tracing
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN109840506B (en) Method for solving video question-answering task by utilizing video converter combined with relational interaction
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN114328943A (en) Question answering method, device, equipment and storage medium based on knowledge graph
CN115599954B (en) Video question-answering method based on scene graph reasoning
Guo Analysis of artificial intelligence technology and its application in improving the effectiveness of physical education teaching
CN113609355B (en) Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning
Zheng et al. Modular graph attention network for complex visual relational reasoning
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN113569867A (en) Image processing method and device, computer equipment and storage medium
Kheldoun et al. Algsl89: An algerian sign language dataset
CN116661940B (en) Component identification method, device, computer equipment and storage medium
Wollowski et al. Constructing mutual context in human-robot collaborative problem solving with multimodal input
CN117520209B (en) Code review method, device, computer equipment and storage medium
Zhang et al. Student Classroom Teaching Behavior Recognition Based on DSCNN Model in Intelligent Campus Education

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination