CN114428866A - Video question-answering method based on object-oriented double-flow attention network - Google Patents
Video question-answering method based on an object-oriented dual-stream attention network
- Publication number: CN114428866A (application CN202210094738.3A)
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06F16/483—Multimedia information retrieval using metadata automatically derived from the content
- G06F16/487—Retrieval using geographical or spatial metadata
- G06F16/489—Retrieval using time metadata
- G06F18/253—Pattern recognition; fusion techniques of extracted features
- G06N3/045—Neural networks; combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Neural network learning methods
- H04N19/23—Video object coding with coding of regions present throughout a whole video segment, e.g. sprites, background or mosaic
- H04N19/30—Video coding using hierarchical techniques, e.g. scalability
Abstract
The invention discloses a video question-answering method based on an object-oriented dual-stream attention network. The visual content of a video is represented by a dual-stream mechanism: one stream is the static appearance stream of the foreground objects and the other is their dynamic behavior stream. In each stream, an object's features combine the features of the object itself with its spatio-temporal encoding and the contextual features of the scene it appears in, so that the subsequent graph convolution operations that extract deep features can explore the relative spatio-temporal relationships and context-aware relationships between objects. The dual-stream mechanism also addresses the shortcoming of prior video question-answering models, which consider only static object features and lack dynamic-information analysis. The invention improves the exploration of intra-modal interactions and inter-modal semantic alignment, and achieves better results on the relevant video question-answering datasets.
Description
Technical Field
The invention relates to the field of video question answering, and in particular to a video question-answering method based on an object-oriented dual-stream attention network comprising a static appearance stream and a dynamic behavior stream.
Background
Video question answering is a cross-disciplinary task spanning computer vision and natural language processing: given a video and a natural-language question, the system must output the correct answer. The task requires the system to fully understand information from both modalities, namely the appearance and behavior characteristics of the foreground objects in the video and the content of the question, and to explore the semantic interactions between the two modalities. Most existing video question-answering approaches extract visual features frame by frame; such feature extraction does not learn the foreground objects, the actual subjects of the questions, in sufficient detail, and it also omits the modeling of interaction relationships between objects. It is therefore important to design a model that can mine the features of the target objects in a video together with the interaction relationships among them.
Video question answering and image question answering both belong to the visual question answering family: both output correct answers from a given piece of visual information and a question. Although video question answering extends image question answering, the two differ in many respects, so image question-answering methods cannot simply be transplanted to the video task. The most important difference is that video question answering handles long image sequences that contain rich appearance and motion information rather than a single still image. Consequently, a video question-answering system must model both the appearance information and the motion information of the objects in the video to answer questions accurately.
Attention networks are a common mechanism for modeling object relationships and are widely used across machine learning tasks such as natural language processing, image recognition, and speech recognition. An attention mechanism helps the model assign different weights to each part of the input and extract the more critical information, so the model can make more accurate judgments without incurring large computation or storage overheads. The present invention proposes two attention variants to model intra-modal and inter-modal interactions.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video question-answering method based on an object-oriented dual-stream attention network that learns the static appearance features and dynamic behavior features of foreground objects in a video in parallel, accurately models the interaction relationships between objects in the video, and achieves cross-modal semantic alignment. Better results are obtained on the relevant video question-answering datasets.
A video question-answering method based on an object-oriented double-flow attention network comprises the following steps:
step (1), data preprocessing:
Input the video segment to be answered. First, a convolutional network extracts the static appearance features and dynamic behavior features of each video frame; an object detection algorithm then locates the target objects in the video, and the convolutional network likewise extracts each object's static appearance and dynamic behavior features. The extracted frame-level static appearance and dynamic behavior features serve as the context representation features of the target objects.
step (2), video feature coding:
The video features are encoded with a dual-stream mechanism, the two streams being the static appearance stream and the dynamic behavior stream of the target objects.
In each stream, the features of a target object, its spatio-temporal encoding, and its context representation features are combined; the combined features serve as nodes to construct a feature graph. Graph convolution operations on the constructed graph then yield the high-order features of the target objects in the video.
step (3), question feature encoding:
the question is encoded with a recurrent neural network to obtain the local code of each word and the global code of the question sentence;
step (4), cross-modal fusion of video and question:
In each stream, the high-order features of the target objects are first fed into a self-attention network to obtain the intra-modal interaction relationships between objects, and the high-order object features are reconstructed according to these relationships. Similarly, the local features of each word are fed into a self-attention network to obtain the intra-modal interactions among the words, and the question features are reconstructed accordingly. The reconstructed high-order object features and question features are then input into a question-guided attention network to explore the cross-modal semantic relationships between words and object features, and the high-order video features are updated.
The high-order features of the static appearance stream and of the dynamic behavior stream of the target objects are fused to obtain the dual-stream high-order features of the video, which are then fused with the global question code to obtain the answer code;
step (5), decode the answer code obtained in step (4) to obtain the final answer:
different decoding modes are adopted for different question types: a linear classifier decodes open-ended and multiple-choice questions, and a linear regressor decodes counting questions;
Further, the specific method of step (1) is as follows:
Input the video segment to be answered and divide it into a fixed number of video frames. For each frame, extract the static appearance features and the dynamic behavior features; these frame-level features serve as the context representation features of the target objects. Then locate the position information b of each target object on each frame with an object detection algorithm, and use a RoIAlign network to align the position information to the frame's static appearance and dynamic behavior features, obtaining each object's static appearance features O^a and dynamic behavior features O^m.
The object position information b is defined as:
b = [c1, c2, w, h] (Formula 1)
where c1 and c2 are the center coordinates of the target object's bounding box, and w and h are its width and height, respectively.
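As a concrete illustration, the position vector of Formula 1 can be derived from a detector's corner-format output; the (x1, y1, x2, y2) input convention is an assumption of this sketch, since the patent does not specify the detector's box format:

```python
def box_to_center_format(x1, y1, x2, y2):
    """Convert a corner-format detection box to b = [c1, c2, w, h] (Formula 1)."""
    w = x2 - x1          # width of the target object
    h = y2 - y1          # height of the target object
    c1 = x1 + w / 2.0    # center x-coordinate
    c2 = y1 + h / 2.0    # center y-coordinate
    return [c1, c2, w, h]
```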
Further, the specific method of the step (2) is as follows:
The video features are encoded with a dual-stream mechanism: the static appearance features and the dynamic behavior features of the objects are encoded in parallel by the same encoding mechanism. The network that encodes the static appearance features is called the static appearance stream, and the network that encodes the dynamic behavior features is called the dynamic behavior stream. The outputs are the high-order features of the video's static appearance stream and of its dynamic behavior stream, respectively. The specific process is as follows:
Since the two streams use the same network structure, the superscript a/m is used to denote the features of either stream.
Given the features of the target objects in the video and the set of their position information, the position information b of each target object is first passed into a spatial position encoding network and a temporal position encoding network to obtain the object's spatial position code d_s and temporal position code d_t, respectively. The spatial position encoding network is a multi-layer perceptron, and the temporal position encoding network encodes the temporal position of the target object with sine and cosine functions of different frequencies. The specific formulas are:
d_s = MLP(b) (Formula 2)
d_t(pos, 2j) = sin(pos / 10000^(2j/d_p)) (Formula 3)
d_t(pos, 2j+1) = cos(pos / 10000^(2j/d_p)) (Formula 4)
where b is the position information of the target object from Formula 1; pos is the temporal position (frame index) of the object; N is the number of target objects; and d_p is the encoding dimension, with 0 ≤ 2j ≤ d_p.
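A minimal numpy sketch of the two position encoders, assuming a two-layer perceptron with ReLU for the spatial code and the standard sinusoidal form for the temporal code (the hidden width and the random weights are illustrative, not specified by the patent):

```python
import numpy as np

def temporal_encoding(pos, d_p):
    """Temporal position code d_t (Formulas 3-4): sine on even dims, cosine on odd dims."""
    j = np.arange(d_p // 2)
    angles = pos / np.power(10000.0, 2.0 * j / d_p)  # a different frequency per dimension pair
    d_t = np.zeros(d_p)
    d_t[0::2] = np.sin(angles)
    d_t[1::2] = np.cos(angles)
    return d_t

def spatial_encoding(b, W1, b1, W2, b2):
    """Spatial position code d_s = MLP(b) (Formula 2) over b = [c1, c2, w, h]."""
    hidden = np.maximum(0.0, np.asarray(b) @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_p = 8
d_t = temporal_encoding(pos=0, d_p=d_p)  # code for an object on frame 0
d_s = spatial_encoding([5.0, 10.0, 10.0, 20.0],
                       rng.standard_normal((4, 16)), np.zeros(16),
                       rng.standard_normal((16, d_p)), np.zeros(d_p))
```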
The features of the target object are then combined with its spatio-temporal position codes to obtain position-aware object features:
F_loc^{a/m} = O^{a/m} + d_s + d_t (Formula 5)
The position-aware object features are combined with the video-frame features to obtain context-aware features:
F^{a/m} = F_loc^{a/m} + F_c^{a/m} (Formula 6)
where F_c^{a/m} denotes the frame-level context representation features of the scene containing the object.
After the position-aware and context-aware features of each object are obtained, an undirected complete graph G = (V, E) is constructed with the object features as the node set V and the interaction relationships between objects as the edge set E. The interaction relationship matrix between objects is computed as:
A^{a/m} = softmax(φ(F^{a/m}) φ(F^{a/m})^T) (Formula 7)
where φ(F^{a/m}) = F^{a/m} W, and W is a learnable weight matrix.
Two graph convolution operations are then performed on the constructed undirected complete graph to obtain the high-order features of the objects; the graph convolution is:
GCN(F^{a/m}; A^{a/m}) = A^{a/m} (F^{a/m} W) (Formula 8)
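Formulas 7 and 8 can be sketched in numpy as follows; the feature dimension, object count, and random weights are illustrative only, and per the text the layer is applied twice in succession in each stream:

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def graph_conv_layer(F, W_phi, W):
    """Formula 7: A = softmax(phi(F) phi(F)^T) with phi(F) = F W_phi.
    Formula 8: GCN(F; A) = A (F W)."""
    phi = F @ W_phi
    A = softmax_rows(phi @ phi.T)   # N x N interaction matrix; each row sums to 1
    return A @ (F @ W)              # message passing over the complete object graph

rng = np.random.default_rng(0)
N, d = 5, 16                        # N target objects with d-dimensional features
F = rng.standard_normal((N, d))
H = graph_conv_layer(F, rng.standard_normal((d, d)), rng.standard_normal((d, d)))
H = graph_conv_layer(H, rng.standard_normal((d, d)), rng.standard_normal((d, d)))  # second pass
```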
Further, the step (3) of performing feature coding on the problem by using the recurrent neural network specifically includes the following steps:
firstly, mapping each word in the question to a 300-dimensional word vector space initialized by GloVe to obtain the code of each word, and then transmitting each code vector to a bidirectional LSTM network to obtain the local code F of each wordQ∈RL×dGlobal encoding oF problem sentences oFq∈Rd。
Further, the cross-modal fusion step described in step (4) is as follows:
as in step (2), the two streams use the same network structure, and therefore the characteristics of the two streams are denoted by superscript a/m.
A self-attention network is defined to explore the interaction relationships within a modality:
f = MultiHead(X, X, X) = [head_1, head_2, ..., head_h] W^O (Formula 9)
head_i = Attention(X W_i^Q, X W_i^K, X W_i^V) (Formula 10)
where d is the feature dimension; h is the number of heads in the multi-head attention mechanism; X denotes the visual features; and W^O, W_i^Q, W_i^K, W_i^V are learnable parameter matrices.
In addition, a question-guided attention network is defined to explore the interaction relationships between modalities:
f = MultiHead(X, Y, Y) = [head_1, head_2, ..., head_h] W^O (Formula 11)
head_i = Attention(X W_i^Q, Y W_i^K, Y W_i^V) (Formula 12)
where X denotes the visual features and Y denotes the question features.
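A compact numpy sketch of the shared multi-head attention operator: called as MultiHead(X, X, X) it is the self-attention of Formulas 9-10, and as MultiHead(X, Y, Y) it is the question-guided attention of Formulas 11-12. The sequence lengths, head count, and random weights are illustrative:

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """[head_1, ..., head_h] W^O, with scaled dot-product attention per head."""
    dk = Wq[0].shape[1]                      # per-head dimension d/h
    heads = []
    for Wqi, Wki, Wvi in zip(Wq, Wk, Wv):
        q, k, v = Q @ Wqi, K @ Wki, V @ Wvi
        heads.append(softmax_rows(q @ k.T / np.sqrt(dk)) @ v)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(1)
d, h, Lx, Ly = 16, 4, 6, 5
Wq = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wk = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wv = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wo = rng.standard_normal((d, d))
X = rng.standard_normal((Lx, d))    # visual (object) features
Y = rng.standard_normal((Ly, d))    # question word features

f_self = multi_head(X, X, X, Wq, Wk, Wv, Wo)      # Formula 9: intra-modal
f_guided = multi_head(X, Y, Y, Wq, Wk, Wv, Wo)    # Formula 11: question-guided
```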
The high-order features F^{a/m} obtained in step (2) are then passed into the self-attention network to capture the intra-modal interaction relationships between objects, and the local text features F_Q ∈ R^{L×d} obtained in step (3) are passed into the self-attention network to capture the intra-modal interactions among words. The outputs of the two self-attention networks are fed into the question-guided attention network to capture the cross-modal semantic relationships between words and objects, and the high-order features of each video stream are updated under the guidance of the question.
Finally, the global feature code of the question is fused with the dual-stream features of the video to obtain the answer code.
Further, the feature decoding step in step (5) is as follows:
for open questions, a linear classifier is used to predict the probability that each answer in the answer set is the correct answer, and cross entropy function is used as loss function to perform parameter training:
wherein the content of the first and second substances,for the probability that each answer is the correct answer, yi1 denotes aie.A is the correct answer.
For counting questions, a linear regressor predicts the answer, and the mean squared error is used as the loss for parameter training:
L = (ŷ - y)^2 (Formula 14)
for the multi-choice question, a linear classifier is adopted to predict the probability of each candidate answer being a correct answer, and the model is trained by using a hinge loss function as a loss function:
wherein s ispAnd snRepresenting the scores of the correct answer and the wrong answer, respectively.
The invention has the beneficial effects that:
the invention uses a double-flow mechanism to represent the visual content of the video on the basis of the predecessor. One of the streams is a static appearance stream of foreground objects and the other stream is a dynamic behavior stream of foreground objects. In each stream, the characteristics of the object include both the characteristics of the object itself, as well as the spatio-temporal coding of the object and the contextual information characteristics of the scene in which the object is located. Therefore, when deep feature extraction is carried out in subsequent graph convolution operation, the relative space-time relation and the context perception relation between the objects can be explored. Meanwhile, the problem that the prior video question-answer model only considers the static characteristics of the object and lacks dynamic information analysis is solved by using a double-flow mechanism.
For cross-modal semantic alignment and the exploration of intra-modal interaction relationships, the invention provides a question-guided attention network model, improving to a certain extent the exploration of intra-modal interactions and inter-modal semantic alignment. The invention achieves better results on the relevant video question-answering datasets.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a network framework according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following figures and examples.
Examples
As shown in fig. 1 and fig. 2, a video question-answering method based on an object-oriented dual-flow attention network includes the following steps:
the method comprises the following steps of (1) carrying out data preprocessing on input data, and firstly sampling video frames in an average sampling mode aiming at an input video segment, wherein the sampling number of each video segment is T-10 frames. And then generating a target object on each frame by adopting a fast-RCNN target detection algorithm to obtain a plurality of candidate frames. A convolutional network is additionally used to extract static appearance features and dynamic behavior features for each video frame. In the invention, a ResNet-152 network trained on an ImageNet image library is used for extracting static appearance characteristics, and an I3D network trained on a Kinetics action recognition data set is used for extracting dynamic behavior characteristics of a video frame. And finally, mapping the candidate box of each target object to the feature map of the Conv5 layer of ResNet-152 and the feature map of the convolution layer of the last layer of I3D respectively by using a RoIAlign method to obtain the static appearance feature and the dynamic behavior feature of the target object.
Step (2): encode the video features with a dual-stream mechanism, where one stream is the static appearance stream of the objects and the other is their dynamic behavior stream.
2-1. Multidimensional feature aggregation. Given the set of position information of the target objects, the sets of static appearance and dynamic behavior features of the objects, and the static appearance and dynamic behavior features of the video frames, compute position-aware and context-aware features for each target object: the spatial and temporal position codes of each target object are computed with Formulas 2, 3, and 4; the position-aware features with Formula 5; and the context-aware features F^{a/m} with Formula 6.
2-2. The dual-stream high-order feature representation of the target objects is computed by passing the position- and context-aware features obtained in 2-1 through two graph convolution layers (Equation 7) and a residual network.
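The graph step of 2-2 (the Equation 7 adjacency, the Equation 8 graph convolution applied twice, plus a residual connection) can be sketched in NumPy; the weight shapes, random initialization, and exact residual placement are our assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_adjacency(F, W):
    """Equation 7: A = softmax(phi(F) phi(F)^T), with phi(F) = F W."""
    phi = F @ W
    return softmax(phi @ phi.T, axis=-1)

def gcn_layer(F, A, W):
    """Equation 8: GCN(F; A) = A (F W)."""
    return A @ (F @ W)

rng = np.random.default_rng(0)
N, d = 5, 16                      # 5 detected objects, 16-dim features
F = rng.standard_normal((N, d))   # position/context-aware object features
A = graph_adjacency(F, rng.standard_normal((d, d)) * 0.1)
H = gcn_layer(F, A, rng.standard_normal((d, d)) * 0.1)
H = gcn_layer(H, A, rng.standard_normal((d, d)) * 0.1)  # second graph conv
H = H + F                         # residual connection (our placement)
```

Each row of `A` is a softmax-normalized set of interaction weights over the other objects.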
Step (3): coding the input natural language question. Each word in the question is first embedded into a 300-dimensional feature space initialized with GloVe to obtain an embedding of each word. The word embeddings are then fed, in order, into a bidirectional recurrent neural network to obtain the local code F_Q ∈ R^{L×d} of each word and the global code F_q ∈ R^d of the question.
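A minimal stand-in for the question encoder: a plain bidirectional tanh recurrence replaces the bidirectional LSTM the invention uses, just to show how the per-word local codes F_Q and the global code F_q are formed; all weights here are random placeholders:

```python
import numpy as np

def birnn_encode(E, Wx, Wh, b):
    """Run a simple (Elman) recurrence over word embeddings E (L x e) in both
    directions and concatenate, giving local codes F_Q (L x 2h) and a global
    code F_q (2h,) from the final states of each direction."""
    h = Wh.shape[0]

    def run(seq):
        s = np.zeros(h)
        out = []
        for x in seq:
            s = np.tanh(x @ Wx + s @ Wh + b)
            out.append(s)
        return np.stack(out)

    fwd = run(E)                   # left-to-right states
    bwd = run(E[::-1])[::-1]       # right-to-left states, re-aligned
    F_Q = np.concatenate([fwd, bwd], axis=1)   # per-word local codes
    F_q = np.concatenate([fwd[-1], bwd[0]])    # global question code
    return F_Q, F_q

rng = np.random.default_rng(1)
L, e, h = 7, 300, 64               # 7 words, 300-d GloVe embeddings
E = rng.standard_normal((L, e)) * 0.01
F_Q, F_q = birnn_encode(E,
                        rng.standard_normal((e, h)) * 0.01,
                        rng.standard_normal((h, h)) * 0.01,
                        np.zeros(h))
```

Here d = 2h = 128, matching the F_Q ∈ R^{L×d}, F_q ∈ R^d shapes above.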
Step (4): cross-modal fusion of the video and the question. First, intra-modal interaction relations are analyzed and the feature representations of the two modalities are updated according to intra-modal self-attention weights: the dual-stream high-order features F^{a/m} of the video obtained in step (2) and the word local codes F_Q ∈ R^{L×d} obtained in step (3) are reconstructed according to Equations 9 and 10, respectively. Cross-modal semantic alignment is then performed between the reconstructed question features and the high-order target object features according to Equations 11 and 12, and the dual-stream high-order features of the video are updated. Finally, the updated dual-stream high-order features are fused with the global question code to obtain the answer code, with the specific formula:
a = FC(FC(f_a + f_m) + FC(f_q)) (Equation 16)
where FC denotes a fully connected layer, f_a is the global static appearance feature code, f_m is the global dynamic behavior feature code, and f_q is the global question feature code.
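Equation 16 is a small computation; a sketch with random placeholder weights (the bias terms are our assumption, since the patent writes only FC):

```python
import numpy as np

def fc(x, W, b):
    """A bare fully connected layer; any activation is omitted."""
    return x @ W + b

rng = np.random.default_rng(2)
d = 16
f_a = rng.standard_normal(d)   # global static-appearance code
f_m = rng.standard_normal(d)   # global dynamic-behavior code
f_q = rng.standard_normal(d)   # global question code

W1, W2, W3 = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
b = np.zeros(d)
# Equation 16: a = FC(FC(f_a + f_m) + FC(f_q))
a = fc(fc(f_a + f_m, W1, b) + fc(f_q, W2, b), W3, b)
```

The answer code `a` is then handed to the decoder of step (5).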
Step (5): decode the answer code obtained in step (4) to obtain the final answer.
A decoding mode is selected according to the question type. For open-ended questions, a linear classifier is used for decoding, with the parameters trained under the loss defined by Equation 13; for counting questions, a linear regressor is used for decoding and the parameters are trained with the loss defined by Equation 14; for multiple-choice questions, a linear classifier is used for decoding and the parameters are trained with the loss defined by Equation 15.
To test the contribution of the dual-stream mechanism and the attention mechanism in the proposed video question-answering method based on the object-oriented dual-stream attention network, ablation experiments were run with different model structures. The results are shown in Table 1. The dataset used in the ablation experiments is the TGIF-QA dataset, which contains four question types: Count (counting), Action (action recognition), Transition (state transition), and FrameQA (single-frame question answering). The first and third rows, Appearance and Motion, denote models that use only the static appearance stream or only the dynamic behavior stream as the video representation and do not use the attention module for cross-modal semantic alignment during fusion. The second and fourth rows denote models that likewise use only one stream but do apply the attention module for cross-modal semantic alignment. The last row shows the performance of the full dual-stream attention network model of the invention.
TABLE 1. Ablation performance comparison of different model structures
To further test the performance of the video question-answering method based on the object-oriented dual-stream attention network, the invention was compared with the L-GCN model, the best-performing prior model on the TGIF-QA dataset; the experimental comparison results are shown in Table 2.
TABLE 2. Performance comparison of the invention with the L-GCN model
Claims (6)
1. A video question-answering method based on an object-oriented double-flow attention network is characterized by comprising the following steps:
step (1), data preprocessing:
inputting a video segment to be asked and answered, firstly extracting static appearance characteristics and dynamic behavior characteristics of a video frame by using a convolution network, then extracting a target object in the video by using a target detection algorithm, and simultaneously extracting the static appearance characteristics and the dynamic behavior characteristics of the object by using the convolution network; taking the extracted static appearance characteristics and dynamic behavior characteristics of the video frame as context representation characteristics of a target object;
step (2), video feature coding:
coding video characteristics by adopting a double-stream mechanism, wherein the double streams are a static appearance stream and a dynamic behavior stream of a target object respectively;
in each stream, respectively combining the characteristics of the target object, the space-time coding of the target object and the context representation characteristics of the target object to obtain characteristics serving as nodes to construct a characteristic graph; performing graph convolution operation on the constructed feature graph to acquire high-order features of the target object in the video;
step (3), problem feature coding:
coding the problem by utilizing a recurrent neural network to obtain the local code of each word and the global code of the problem sentence;
step (4), cross-modal fusion of videos and problems:
in each stream, the high-order features of the target objects are first input into a self-attention network to obtain the intra-modal interaction relations between objects, and the high-order features of the video target objects are reconstructed according to these relations; similarly, the local features of each word are fed into a self-attention network to obtain the intra-modal interaction relations between words, and the question features are reconstructed according to these relations; the reconstructed high-order object features and the reconstructed question features are then input into a question-guided attention network to explore the cross-modal semantic relations between words and object features, and the high-order features of the video are updated;
fusing the high-order characteristics of the static appearance flow of the target object and the high-order characteristics of the dynamic behavior flow to obtain the double-flow high-order characteristics of the video;
fusing the double-current high-order characteristics of the video with the global problem code to obtain an answer code;
and (5) decoding the answer code obtained in the step (4) to obtain a final answer:
different decoding modes are adopted for different question types: a linear classifier is used for decoding open-ended questions and multiple-choice questions, and a linear regressor is used for decoding counting questions.
2. The video question-answering method based on the object-oriented dual-flow attention network according to claim 1, wherein the specific method in the step (1) is as follows:
inputting a video clip to be answered, dividing it into a fixed number of video frames, and performing static appearance feature extraction and dynamic behavior feature extraction on each frame to obtain the static appearance features and dynamic behavior features of the video frames, which serve as the context representation features of the target objects; then locating the position information b of each target object on each frame with an object detection algorithm, and aligning the position information onto the static appearance and dynamic behavior feature maps of the video frames with a RoIAlign alignment network, obtaining the static appearance feature O_a and dynamic behavior feature O_m of each object;
The specific formula of the object position information b is as follows:
b = [c_1, c_2, w, h] (Equation 1)
where c_1 and c_2 are the coordinates of the center of the target object, and w and h denote its width and height, respectively.
3. The video question-answering method based on the object-oriented dual-flow attention network according to claim 2, wherein the specific method in the step (2) is as follows:
the method comprises the steps of representing feature coding of a video by adopting a double-flow mechanism, coding static appearance features of an object and dynamic behavior features of the object in parallel by adopting the same coding mechanism, wherein a network for coding the static appearance features is called static appearance flow, and a network for coding the dynamic behavior features is called dynamic behavior flow; finally, respectively obtaining the high-order characteristics of the static appearance flow and the high-order characteristics of the dynamic behavior flow of the video; the specific process is as follows:
the static appearance stream and the dynamic behavior stream adopt the same network structure, so to simplify notation the superscript a/m denotes the features of both streams simultaneously;
given the features of the target objects in the video and the set of position information of the target objects, the position information b of each target object is first fed into a spatial position coding network and a temporal position coding network to obtain the spatial position code d_s and the temporal position code d_t of the object, respectively; the spatial position coding network consists of a multi-layer perceptron, and the temporal position coding network encodes the temporal position of the target object with sine and cosine functions of different frequencies; the specific formulas are as follows:
d_s = MLP(b) (Equation 2)
where b denotes the position information of the target object obtained by Equation 1; N denotes the number of target objects; d_p denotes the dimension of the code, with 0 ≤ 2j ≤ d_p;
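Equations 3 and 4 (the sine/cosine temporal code) are not reproduced in the text above; assuming they follow the standard form implied by the terms d_p and 2j (sine at even index 2j, cosine at 2j + 1), the temporal code could be sketched as:

```python
import numpy as np

def temporal_position_code(t, d_p):
    """Sine/cosine temporal code of dimension d_p for frame index t; this is
    the standard sinusoidal form, assumed here since the patent's exact
    Equations 3-4 are not reproduced in the text."""
    assert d_p % 2 == 0, "d_p is assumed even in this sketch"
    j = np.arange(0, d_p, 2)
    angle = t / np.power(10000.0, j / d_p)
    code = np.zeros(d_p)
    code[0::2] = np.sin(angle)   # even positions 2j
    code[1::2] = np.cos(angle)   # odd positions 2j + 1
    return code

d_t = temporal_position_code(t=3, d_p=8)
```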
then, the features of the target object are combined with its spatio-temporal position codes to obtain the position-aware features of the target object, with the specific formula as follows:
the position-aware features of the target object are combined with the features of the video frame to obtain the context-aware features, with the specific formula as follows:
after the position- and context-aware features of each object are obtained, an undirected complete graph G = (V, ε) is built with the features of each object as nodes and the interaction relations between objects as edges, where V denotes the node set and ε denotes the edge set; the interaction relation matrix between the objects is calculated by the following formula:
A^{a/m} = softmax(φ(F^{a/m}) φ(F^{a/m})^T) (Equation 7)
where φ(F^{a/m}) = F^{a/m} W, and W is a learnable weight matrix;
two graph convolution operations are then performed on the constructed undirected complete graph to obtain the high-order features of the objects, with the graph convolution formula as follows:
GCN(F^{a/m}; A^{a/m}) = A^{a/m}(F^{a/m} W) (Equation 8).
4. The video question-answering method based on the object-oriented dual-flow attention network according to claim 3, wherein the specific process of the step (3) is as follows:
first, each word in the question is mapped into a 300-dimensional word vector space initialized with GloVe to obtain the embedding of each word; each embedding vector is then fed into a bidirectional LSTM network to obtain the local code F_Q ∈ R^{L×d} of each word and the global code F_q ∈ R^d of the question sentence.
5. The video question-answering method based on the object-oriented dual-flow attention network according to claim 4, wherein the cross-modal fusion step in the step (4) is as follows:
as in step (2), the two streams use the same network structure, so the superscript a/m denotes the features of both streams;
defining a self-attention network to explore interaction relationships within the modality, the self-attention network formulation being as follows:
f = MultiHead(X, X, X) = [head_1, head_2, ..., head_h] W^O (Equation 9)
where d is the feature dimension; h denotes the number of heads of the multi-head attention; X denotes the visual features; and W^O and the per-head projection matrices are learnable parameter matrices;
a problem-oriented attention network is further defined to explore the interaction relationships among the modalities, and the formula of the problem-oriented attention network is as follows:
f = MultiHead(X, Y, Y) = [head_1, head_2, ..., head_h] W^O (Equation 11)
Wherein X represents a visual feature and Y represents a problem feature;
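Equations 9 and 11 share one multi-head attention computation and differ only in where the keys and values come from; a NumPy sketch, where the head split along the feature dimension and the 1/sqrt(d_k) scaling are standard assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(X, Y, Z, Wq, Wk, Wv, Wo, h):
    """Scaled dot-product attention with h heads.
    multi_head(X, X, X, ...) gives the self-attention of Equation 9;
    multi_head(X, Y, Y, ...) gives the question-guided attention of Equation 11."""
    d = X.shape[1]
    dk = d // h
    Q, K, V = X @ Wq, Y @ Wk, Z @ Wv
    heads = []
    for i in range(h):
        q = Q[:, i * dk:(i + 1) * dk]
        k = K[:, i * dk:(i + 1) * dk]
        v = V[:, i * dk:(i + 1) * dk]
        heads.append(softmax(q @ k.T / np.sqrt(dk)) @ v)
    return np.concatenate(heads, axis=1) @ Wo

rng = np.random.default_rng(3)
N, L, d, h = 5, 7, 16, 4           # 5 objects, 7 words, 4 heads
X = rng.standard_normal((N, d))    # visual (object) features
Y = rng.standard_normal((L, d))    # question (word) features
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
f_self = multi_head(X, X, X, *Ws, h=h)    # intra-modal (Equation 9)
f_cross = multi_head(X, Y, Y, *Ws, h=h)   # cross-modal (Equation 11)
```

In both cases the output has one row per visual feature, so the updated object features keep their shape.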
then, the high-order features F^{a/m} obtained in step (2) are fed into the defined self-attention network to obtain the intra-modal interaction relations between objects, and the word local features F_Q ∈ R^{L×d} obtained in step (3) are likewise fed into a self-attention network; the outputs of the two self-attention networks are fed into the defined question-guided attention network to obtain the cross-modal semantic relations between words and objects, and the high-order features of each stream of the video are updated under the guidance of the question;
finally, the global question code is fused with the dual-stream features of the video to obtain the answer code.
6. The video question-answering method based on the object-oriented dual-stream attention network of claim 5, wherein the feature decoding step of step (5) is as follows:
for open-ended questions, a linear classifier is used to predict the probability that each answer in the answer set is the correct answer, and the cross-entropy function is used as the loss function for parameter training:
where ŷ_i is the predicted probability that each answer is the correct answer, and y_i = 1 denotes that a_i is the correct answer;
for counting questions, a linear regressor is used to predict the answer, and the mean squared error is used as the loss function to train the parameters:
for multiple-choice questions, a linear classifier is used to predict the probability of each candidate answer being the correct answer, and the model is trained with the hinge loss as the loss function:
wherein s ispAnd snRepresenting the scores of the correct answer and the wrong answer, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210094738.3A CN114428866A (en) | 2022-01-26 | 2022-01-26 | Video question-answering method based on object-oriented double-flow attention network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210094738.3A CN114428866A (en) | 2022-01-26 | 2022-01-26 | Video question-answering method based on object-oriented double-flow attention network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114428866A true CN114428866A (en) | 2022-05-03 |
Family
ID=81313608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210094738.3A Pending CN114428866A (en) | 2022-01-26 | 2022-01-26 | Video question-answering method based on object-oriented double-flow attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114428866A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463760A (en) * | 2022-04-08 | 2022-05-10 | 华南理工大学 | Character image writing track recovery method based on double-stream coding |
CN114818989A (en) * | 2022-06-21 | 2022-07-29 | 中山大学深圳研究院 | Gait-based behavior recognition method and device, terminal equipment and storage medium |
CN116847101A (en) * | 2023-09-01 | 2023-10-03 | 易方信息科技股份有限公司 | Video bit rate ladder prediction method, system and equipment based on transform network |
WO2024012574A1 (en) * | 2022-07-15 | 2024-01-18 | 中国电信股份有限公司 | Image coding method and apparatus, image decoding method and apparatus, readable medium, and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||