CN113688296A - Method for solving a video question-answering task based on a multi-modal progressive attention model - Google Patents

Method for solving a video question-answering task based on a multi-modal progressive attention model

Info

Publication number
CN113688296A
CN113688296A (application CN202110915934.8A)
Authority
CN
China
Prior art keywords
video
question
feature
audio
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110915934.8A
Other languages
Chinese (zh)
Other versions
CN113688296B (en)
Inventor
孙广路
刘昕雨
梁丽丽
李天麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110915934.8A priority Critical patent/CN113688296B/en
Publication of CN113688296A publication Critical patent/CN113688296A/en
Application granted granted Critical
Publication of CN113688296B publication Critical patent/CN113688296B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/90332: Information retrieval; querying; natural language query formulation or dialogue systems
    • G06F 16/7834: Information retrieval of video data using metadata automatically derived from the content, using audio features
    • G06F 16/7847: Information retrieval of video data using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F 40/284: Handling natural language data; lexical analysis, e.g. tokenisation or collocates
    • G06N 3/049: Computing arrangements based on biological models; neural networks; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique


Abstract

The embodiment of the invention provides a method for solving a video question-answering task based on a multi-modal progressive attention model. The method comprises the following steps: first, for the various kinds of modality information in the video question-answering task, the corresponding modality features are extracted; second, the question is used to attend to each of the extracted modality features and to compute corresponding weight scores, and the question is then used to iteratively attend to the most important modality features so as to locate the modality features most relevant to the question; third, a multi-modal fusion algorithm is used to achieve cross-modal fusion of the features, and the question is then used to attend to the multi-modal fusion representation of the video to find the important video features related to the question; fourth, the effective partial outputs of the model are fused for answer generation. Compared with existing video question-answering solutions, the method can locate the video frames or video picture regions related to the question more accurately, and achieves better results on the video question-answering task than traditional methods.

Description

Method for solving a video question-answering task based on a multi-modal progressive attention model
Technical Field
The embodiments of the invention relate to the technical field of video question answering, and in particular to a method for solving a video question-answering task based on a multi-modal progressive attention model.
Background
In recent years, video question answering has emerged as a very challenging new area that is attracting the attention of researchers. The task requires a model to understand the semantic information between a video and a question and to generate an answer accordingly. Open-ended questions are a difficult question type in current video question-answering tasks, since they require the model to automatically generate natural-language answers.
In the question-answering task, video information is more complex than image information. A video is an image sequence with strong temporal dynamics, and it contains a large number of redundant frames that are irrelevant to the question; this weakens the relevance between the video representation and the question and prevents the model from accurately locating the video information relevant to the question. Experiments show that applying an attention model to the video question-answering task effectively alleviates this problem and significantly improves the accuracy of the model.
Most current approaches to the video question-answering task extract only the frame features and clip features of the video and completely ignore its audio features, so the effective information in the video is not exploited to the fullest extent. Moreover, because different modality features overlap in the information they carry and differ in how they express it, feature fusion using only basic operations such as element-wise multiplication and concatenation cannot model the complex relationship between two modalities. To solve these problems, the present method uses a multi-modal progressive attention model to accurately locate, stage by stage, the video frames or video picture regions related to the question.
Disclosure of Invention
In this context, embodiments of the present invention provide a method for solving a video question-answering task based on a multi-modal progressive attention model, so as to overcome the inability of the prior art to provide sufficiently accurate answers to video question-answering tasks.
In a first aspect of the embodiments of the present invention, there is provided a method for solving a video question-answering task based on a multi-modal progressive attention model, including: step S1, obtaining a video and a question to be processed; step S2, extracting frame features, clip features and audio features of the video as the plurality of modality features of the video, and extracting the text features of the question; step S3, using the question to attend to the plurality of modality features of the video to obtain a plurality of modality representations with question guidance, using the question to compute a weight score for each modality, and selecting the modality representation with the highest weight score among the plurality of modalities as the key modality; step S4, fusing the plurality of modality representations based on a multi-modal fusion algorithm according to the obtained modality representations and weight scores to obtain a video fusion representation of the video; step S5, using the question to attend to the video fusion representation of the video to obtain a video fusion representation with question guidance; step S6, performing multi-step attention on the features of the key modality with the question and locating the key modality features most relevant to the question through multiple rounds of iteration; and step S7, obtaining a predicted answer at least based on the question features, the video fusion representation with question guidance and the results of the multi-step attention and multiple rounds of iteration.
Further, the step of extracting the frame features, clip features and audio features of the video in step S2 includes: step S21, extracting the frame features $v_f=\{f_1,f_2,\ldots,f_{N_1}\}$ of the video with a pre-trained ResNet model, where $f_i\in\mathbb{R}^d$ denotes the frame feature of the $i$-th frame in the video, $i=1,2,3,\ldots,N_1$, $N_1$ denotes the number of frames and $d$ denotes the dimension of the frame features; step S22, extracting the clip features $v_c=\{c_1,c_2,\ldots,c_{N_2}\}$ of the video with a pre-trained TSN network, where $c_j\in\mathbb{R}^d$ denotes the clip feature of the $j$-th clip in the video, $j=1,2,3,\ldots,N_2$, $N_2$ denotes the number of clips, and the dimension of the clip features is the same as the dimension of the frame features; step S23, converting the audio in the video into a spectrogram based on Mel cepstral coefficients as the input of a pre-trained GoogLeNet model, and then extracting the audio features $v_a=\{a_1,a_2,\ldots,a_{N_3}\}$ of the video with the pre-trained GoogLeNet model, where $a_k\in\mathbb{R}^d$ denotes the audio feature of the $k$-th audio segment in the video, $k=1,2,3,\ldots,N_3$, $N_3$ denotes the number of audio segments, and the dimension of the audio features is the same as that of the frame features;
the step of extracting the question features in step S2 includes: step S24, representing every word in the question with one-hot encoding to obtain the question representation $q=\{q_1,q_2,\ldots,q_T\}$, where $q_t$ is the one-hot representation of the $t$-th word in the question, $t=1,2,3,\ldots,T$, and $T$ denotes the length of the question; step S25, obtaining a word embedding matrix $E\in\mathbb{R}^{|N_{vocab}|\times 300}$ with the pre-trained word embedding model GloVe, where $|N_{vocab}|$ denotes the number of words in the dataset and 300 is the feature dimension of each word vector; step S26, embedding the question $q$ into a low-dimensional continuous vector space through the word embedding matrix $E$ to obtain the word embedding vectors $x_t=E\,q_t$, $t=1,2,\ldots,T$; step S27, encoding the word embedding vectors with an LSTM to obtain the text feature of the question $h_q=\mathrm{LSTM}_q(x_1,\ldots,x_T)$, where $\mathrm{LSTM}_q(\cdot)$ denotes the long short-term memory network that processes the word embedding vectors.
Further, the plurality of modality representations with question guidance obtained in step S3 includes a frame representation with question guidance, obtained by the following steps: S31, using a compatibility function to rescale the question feature $h_q$ and the frame features $v_f=(f_1,f_2,\ldots,f_{N_1})$, i.e. mapping the question feature and the frame features from a high-dimensional feature space into the same low-dimensional feature space for similarity calculation, to obtain the corresponding frame vector group $e^f$, where each frame vector is computed as

$$e^f_i=\frac{\langle h_q, f_i\rangle}{\sqrt{d}},$$

the compatibility function used being a scaled dot-product function, $e^f=(e^f_1,\ldots,e^f_{N_1})$ denoting the resulting frame vector group, $e^f_i$ denoting the $i$-th frame vector in the frame vector group, $f_i$ denoting the frame feature of the $i$-th frame in the video, $i=1,2,3,\ldots,N_1$, and $d$ denoting a preset scaling factor; S32, using an alignment function to convert each frame vector $e^f_i$ of the frame vector group $e^f$ into a corresponding frame attention weight score $\alpha^f_i$, i.e. the normalized similarity between the question feature and the frame feature, where each frame attention weight score is computed as

$$\alpha^f_i=\frac{\exp(e^f_i)}{\sum_{i'=1}^{N_1}\exp(e^f_{i'})},$$

the normalization function being the softmax function and $\exp(\cdot)$ denoting the exponential function with the natural base $e$; S33, using a generate-context function to compute the weighted sum of each frame feature $f_i$ with its corresponding frame attention weight score $\alpha^f_i$, obtaining the frame representation with question guidance $p_f$ as shown in the following formula:

$$p_f=W_1\Big(\sum_{i=1}^{N_1}\alpha^f_i f_i\Big)+b_1,$$

where $W_1$ denotes a trainable weight matrix and $b_1$ denotes a trainable bias vector.
Further, the plurality of modality representations with question guidance obtained in step S3 includes a clip representation with question guidance, obtained by the following steps: S34, using a compatibility function to rescale the question feature $h_q$ and the clip features $v_c=(c_1,c_2,\ldots,c_{N_2})$, i.e. mapping the question feature and the clip features from a high-dimensional feature space into the same low-dimensional feature space for similarity calculation, to obtain the corresponding clip vector group $e^c$, where each clip vector is computed as

$$e^c_j=\frac{\langle h_q, c_j\rangle}{\sqrt{d}},$$

the compatibility function used being a scaled dot-product function, $e^c=(e^c_1,\ldots,e^c_{N_2})$ denoting the resulting clip vector group, $e^c_j$ denoting the $j$-th clip vector in the clip vector group, $c_j$ denoting the clip feature of the $j$-th clip in the video, $j=1,2,3,\ldots,N_2$, and $d$ denoting a preset scaling factor; S35, using an alignment function to convert each clip vector $e^c_j$ of the clip vector group $e^c$ into a corresponding clip attention weight score $\alpha^c_j$, i.e. the normalized similarity between the question feature and the clip feature, where each clip attention weight score is computed as

$$\alpha^c_j=\frac{\exp(e^c_j)}{\sum_{j'=1}^{N_2}\exp(e^c_{j'})},$$

the normalization function being the softmax function and $\exp(\cdot)$ denoting the exponential function with the natural base $e$; S36, using a generate-context function to compute the weighted sum of each clip feature $c_j$ with its corresponding clip attention weight score $\alpha^c_j$, obtaining the clip representation with question guidance $p_c$ as shown in the following formula:

$$p_c=W_2\Big(\sum_{j=1}^{N_2}\alpha^c_j c_j\Big)+b_2,$$

where $W_2$ denotes a trainable weight matrix and $b_2$ denotes a trainable bias vector.
Further, the plurality of modality representations with question guidance obtained in step S3 includes an audio representation with question guidance, obtained by the following steps: S37, using a compatibility function to rescale the question feature $h_q$ and the audio features $v_a=(a_1,a_2,\ldots,a_{N_3})$, i.e. mapping the question feature and the audio features from a high-dimensional feature space into the same low-dimensional feature space for similarity calculation, to obtain the corresponding audio vector group $e^a$, where each audio vector is computed as

$$e^a_k=\frac{\langle h_q, a_k\rangle}{\sqrt{d}},$$

the compatibility function used being a scaled dot-product function, $e^a=(e^a_1,\ldots,e^a_{N_3})$ denoting the resulting audio vector group, $e^a_k$ denoting the $k$-th audio vector in the audio vector group, $a_k$ denoting the audio feature of the $k$-th audio segment in the video, $k=1,2,3,\ldots,N_3$, and $d$ denoting a preset scaling factor; S38, using an alignment function to convert each audio vector $e^a_k$ of the audio vector group $e^a$ into a corresponding audio attention weight score $\alpha^a_k$, i.e. the normalized similarity between the question feature and the audio feature, where each audio attention weight score is computed as

$$\alpha^a_k=\frac{\exp(e^a_k)}{\sum_{k'=1}^{N_3}\exp(e^a_{k'})},$$

the normalization function being the softmax function and $\exp(\cdot)$ denoting the exponential function with the natural base $e$; S39, using a generate-context function to compute the weighted sum of each audio feature $a_k$ with its corresponding audio attention weight score $\alpha^a_k$, obtaining the audio representation with question guidance $p_a$ as shown in the following formula:

$$p_a=W_3\Big(\sum_{k=1}^{N_3}\alpha^a_k a_k\Big)+b_3,$$

where $W_3$ denotes a trainable weight matrix and $b_3$ denotes a trainable bias vector.
Further, step S3 also includes: using the question to compute weight scores for the frame representation with question guidance $p_f$, the clip representation with question guidance $p_c$ and the audio representation with question guidance $p_a$ according to the following formulas, obtaining the weight score results $s_f$, $s_c$, $s_a$, and selecting among $s_f$, $s_c$, $s_a$ the modality with the highest weight score as the key modality $p$:

$$H=\langle h_q, P\rangle, \qquad S=\mathrm{softmax}(H),$$

where $\langle\cdot,\cdot\rangle$ denotes the cosine similarity calculation, $P=\{p_f,p_c,p_a\}$ denotes the modality features with question guidance, $H=\{H_f,H_c,H_a\}$ denotes the similarity between the question feature $h_q$ and the different modality features with question guidance $P=\{p_f,p_c,p_a\}$, $S=\{s_f,s_c,s_a\}$ denotes the weight score results obtained after the question feature $h_q$ attends to the different modality features with question guidance $P=\{p_f,p_c,p_a\}$, and $p$ denotes the modality most relevant to the question, $p\in\{p_f,p_c,p_a\}$.
Further, the multi-modal fusion representation of the video in step S4 is obtained as follows: the frame representation with question guidance $p_f$, the clip representation with question guidance $p_c$, the audio representation with question guidance $p_a$ and their respective weight scores $s_f$, $s_c$, $s_a$ are fused by the multi-modal compact bilinear model MCB according to the following formula, yielding the video fusion representation $v_u$: $v_u=\mathrm{MCBFusion}(s_f p_f,\, s_c p_c,\, s_a p_a)$.
Further, step S5 includes: step S51, using the obtained video fusion representation $v_u$ and the hidden state $h^q_t$ output at time $t$ by the long short-term memory network $\mathrm{LSTM}_q$ that encodes the question, computing according to the following formula and using the result as the input at time $t$ of the bidirectional long short-term memory network $\mathrm{Bi\_LSTM}_a$:

$$h^a_t=\mathrm{Bi\_LSTM}_a(v_u\odot h^q_t),$$

where $\odot$ denotes element-wise multiplication, $\mathrm{Bi\_LSTM}_a(\cdot)$ denotes the bidirectional long short-term memory network, and $h^a_t$ denotes the hidden state of $\mathrm{Bi\_LSTM}_a$ at time $t$ during encoding; step S52, using the hidden state $h^a_t$ of $\mathrm{Bi\_LSTM}_a$ at time $t$ to attend to the obtained video fusion representation $v_u$ according to the following formulas, obtaining the video fusion representation with question guidance $v_o$:

$$e_t=W_6\tanh(W_4 v_u+W_5 h^a_t+b_5)+b_6,$$
$$\alpha_t=\mathrm{softmax}(e_t),$$
$$v_o=\sum_{t=1}^{T}\alpha_t h^a_t,$$

where $W_4$, $W_5$ and $W_6$ denote trainable weight matrices, $b_5$ and $b_6$ denote trainable bias vectors, $e_t$ denotes the weight obtained by computing the similarity between the video fusion feature and the video feature, and $\alpha_t$ denotes the attention weight distribution after normalization.
Further, step S6 includes: step S61, initializing the query condition according to the following formula: $z_0=h_q$, where $h_q$ denotes the question feature; step S62, using the query condition $z_r$ to attend to the obtained key modality $p$ according to the following formulas, obtaining the key modality representation with question guidance $\tilde{p}^{(r)}$:

$$e_r=\tanh(W_7[p;z_r]+b_7),$$
$$\alpha_r=\mathrm{softmax}(W_8 e_r+b_8),$$
$$\tilde{p}^{(r)}=W_9(\alpha_r\odot p)+b_9,$$

where $W_7$, $W_8$ and $W_9$ denote trainable weight matrices, $b_7$, $b_8$ and $b_9$ denote trainable bias vectors, $p$ denotes the modality most relevant to the question, $p\in\{p_f,p_c,p_a\}$, $z_r$ denotes the query condition updated at the $r$-th iteration, $r=0,1,2,\ldots,R$, $e_r$ denotes the weight computed from the similarity between the question feature and the key modality feature, and $\alpha_r$ denotes the attention weight distribution after normalization; step S63, iteratively updating the query condition according to the following formula:

$$z_r=z_{r-1}+\tilde{p}^{(r-1)},$$

where $z_{r-1}$ denotes the query condition updated at the $(r-1)$-th iteration and $\tilde{p}^{(r-1)}$ denotes the key modality feature with question guidance obtained by the $(r-1)$-th query, $r=1,2,\ldots,R$; step S64, executing step S62 with the query condition updated in step S63 to perform multi-step iterative attention on the key modality $p$, obtaining the key modality feature $\tilde{p}^{(R)}$ most relevant to the question.
Further, in step S7, the predicted answer is obtained as follows: the memory cell state $c^q_T$ output by $\mathrm{LSTM}_q$ in step S2, the memory cell state $c^a_T$ output by $\mathrm{Bi\_LSTM}_a$ in step S5, the video representation with question guidance $v_o$ obtained in step S5 and the iterative attention result $\tilde{p}^{(R)}$ obtained in step S6 are fused for answer generation:

$$\mathrm{Answer}=\arg\max\big(\mathrm{softmax}(W_{answer}\,W_{10}\,[c^q_T;\,c^a_T;\,v_o;\,\tilde{p}^{(R)}])\big),$$

where $W_{10}$ denotes a trainable weight matrix, $W_{answer}$ denotes the weight matrix of the vocabulary, Answer denotes the generated answer, and argmax denotes selecting the entry with the highest score as the prediction result.
In a second aspect of embodiments of the present invention, there is provided a storage medium storing a program which, when executed by a processor, implements a method of solving a video question and answer task based on a multi-modal progressive attention model as described above.
In a third aspect of embodiments of the present invention, there is provided a computing device comprising the storage medium described above.
The method for solving a video question-answering task based on a multi-modal progressive attention model according to the embodiments of the present invention achieves the following effects:
(1) Compared with the prior art, the method uses several attention models with different functions working together to locate the video frames or video picture regions related to the question more accurately.
(2) The invention achieves cross-modal fusion of the features with an improved multi-modal fusion algorithm, which improves the representational power of the fused features.
Drawings
The foregoing and other objects, features and advantages of exemplary embodiments of the present invention will be readily understood by reading the following detailed description with reference to the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a flow diagram of one exemplary process for a method for solving a video question-and-answer task based on a multi-modal progressive attention model, according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an achievable system architecture of the method for solving the video question-answering task based on the multi-modal progressive attention model of the present invention;
FIG. 3 is a diagram illustrating an example of the results of the method of the present invention for solving a video question-answering task based on a multi-modal progressive attention model;
FIG. 4 schematically shows a schematic structural diagram of a computer according to an embodiment of the present invention;
FIG. 5 schematically shows an illustrative diagram of a computer-readable storage medium according to an embodiment of the invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given only for the purpose of enabling those skilled in the art to better understand and to implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a method for solving a video question-answering task based on a multi-mode progressive attention model is provided.
In this document, it is to be understood that the number of any element in the figures is intended to be illustrative rather than restrictive, and that any nomenclature is used for differentiation only and not in any limiting sense.
The principles and spirit of the present invention are explained in detail below with reference to several exemplary embodiments thereof.
Summary of The Invention
The inventors' approach first extracts multiple kinds of modality features from the video and extracts features from the question, then feeds the extracted multi-modal features into several attention models, and finally fuses the effective information in the outputs of all modules for answer generation.
The method for solving a video question-answering task based on a multi-modal progressive attention model comprises the following steps: step S1, obtaining the video and question to be processed; step S2, extracting frame features, clip features and audio features of the video as the plurality of modality features of the video, and extracting the text features of the question; step S3, using the question to attend to the plurality of modality features of the video to obtain a plurality of modality representations with question guidance, using the question to compute a weight score for each modality, and selecting the modality representation with the highest weight score among the plurality of modalities as the key modality; step S4, fusing the plurality of modality representations based on a multi-modal fusion algorithm according to the obtained modality representations and weight scores to obtain a video fusion representation of the video; step S5, using the question to attend to the video fusion representation of the video to obtain a video fusion representation with question guidance; step S6, performing multi-step attention on the features of the key modality with the question and locating the key modality features most relevant to the question through multiple rounds of iteration; step S7, obtaining a predicted answer at least based on the question features, the video fusion representation with question guidance, and the results of the multi-step attention and multiple rounds of iteration.
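To make the overall flow easier to follow, the sketch below wires steps S1 to S7 together in PyTorch-style Python. The module and method names (extractors, model.attend, model.fuse and so on) are illustrative assumptions introduced here for readability, not identifiers from the patented implementation; concrete versions of each block are sketched in the exemplary method section below.

```python
def answer_video_question(video, question, extractors, model):
    """Illustrative wiring of steps S1-S7; all names are assumptions."""
    # S2: multi-modal video features and question features
    v_f = extractors["frame"](video)     # (N1, d) frame features (ResNet)
    v_c = extractors["clip"](video)      # (N2, d) clip features (TSN)
    v_a = extractors["audio"](video)     # (N3, d) audio features (GoogLeNet on mel spectrograms)
    h_q_steps, h_q, c_q = extractors["question"](question)   # GloVe + LSTM_q

    # S3: question-guided attention per modality, then key-modality selection
    p_f, p_c, p_a = (model.attend(h_q, v) for v in (v_f, v_c, v_a))
    scores, key_modality = model.score_modalities(h_q, (p_f, p_c, p_a))

    # S4 + S5: multi-modal fusion, then question-guided attention over the fusion
    v_u = model.fuse(scores, (p_f, p_c, p_a))
    v_o, c_a = model.secondary_attention(v_u, h_q_steps)

    # S6: multi-step iterative attention on the key modality
    p_tilde = model.iterative_attention(h_q, key_modality)

    # S7: fuse the four effective outputs and decode the answer word
    return model.generate_answer(c_q, c_a, v_o, p_tilde)
```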
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Exemplary method
Fig. 1 schematically illustrates an exemplary process flow of a method for solving a video question-answering task based on a multi-modal progressive attention model according to an embodiment of the present disclosure. Fig. 2 shows a system structure that can be realized by the method.
As shown in fig. 1, when the process flow starts, step S1 is first executed.
In step S1, the video to be processed and the question are obtained.
For example, the video and questions to be processed may be user input, may be received externally by the system, or may be downloaded from a predetermined website, etc.
As an example, in the embodiment of the present invention, the specific processing flow is described by taking english as an example. It should be understood that the language of the question is not limited to the english language shown in fig. 2, but may be other languages such as chinese, japanese, korean, french, and the like. Accordingly, the language of the predicted answer may be the same as the language of the question, or may be set to one or more selectable languages according to the user's selection.
In step S2, frame features, clip features and audio features of the video are extracted as the plurality of modality features of the video, and the text features of the question are extracted.
As an example, in step S2, the frame feature, the clip feature, and the audio feature of the video may be extracted, for example, by steps S21 to S23 described below.
In step S21, the frame features of the video are extracted using the pre-trained ResNet model.
For example, with $v_f$ denoting the frame features of the video:

$$v_f=\{f_1,f_2,\ldots,f_{N_1}\},\qquad f_i\in\mathbb{R}^d,$$

where $f_i$ denotes the frame feature of the $i$-th frame in the video, $i=1,2,3,\ldots,N_1$, $N_1$ denotes the number of frames and is a natural number, and $d$ denotes a preset scaling factor, which here (when corresponding to the frame features of the video) equals the frame feature dimension.
As described above, in the embodiment of the present invention the ResNet model is pre-trained on ImageNet, i.e. the trained ResNet model is used to extract the frame features of the video. It should be noted that the ResNet model does not limit this example; in other examples, other pre-trained models for extracting frame features may also be used, which is not described in detail here.
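As a concrete illustration, a per-frame feature extractor of this kind could look like the following sketch, which assumes torchvision's pre-trained ResNet-152 and standard ImageNet preprocessing; the way frames are sampled and the 2048-dimensional pooled output are assumptions rather than values fixed by the text.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ResNet with the classification head removed: one d-dimensional vector per frame.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()          # keep the 2048-d pooled feature
resnet.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_frame_features(frames):
    """frames: list of HxWx3 uint8 arrays sampled from the video; returns v_f of shape (N1, 2048)."""
    batch = torch.stack([preprocess(f) for f in frames])
    return resnet(batch)
```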
Similarly, in step S22, the clip features of the video can be extracted using a pre-trained TSN (Temporal Segment Networks) network.
For example, with $v_c$ denoting the clip features of the video:

$$v_c=\{c_1,c_2,\ldots,c_{N_2}\},\qquad c_j\in\mathbb{R}^d,$$

where $c_j$ denotes the clip feature of the $j$-th clip in the video, $j=1,2,3,\ldots,N_2$, $N_2$ denotes the number of clips and is a natural number, and $d$ denotes a preset scaling factor, which here (when corresponding to the clip features of the video) equals the clip feature dimension. The dimension of the clip features is the same as the frame feature dimension.
As described above, in the embodiment of the present invention the TSN network is trained in advance, i.e. the trained TSN network is used to extract the clip features of the video. It should be noted that the TSN network does not limit this example; in other examples, other pre-trained network models for extracting clip features may also be used, which is not described in detail here.
Then, in step S23, the audio in the video is converted into a spectrogram based on Mel cepstral coefficients and used as the input of a pre-trained GoogLeNet model, which is then used to extract the audio features of the video.
For example, with $v_a$ denoting the audio features of the video:

$$v_a=\{a_1,a_2,\ldots,a_{N_3}\},\qquad a_k\in\mathbb{R}^d,$$

where $a_k$ denotes the audio feature of the $k$-th audio segment in the video, $k=1,2,3,\ldots,N_3$, $N_3$ denotes the number of audio segments and is a natural number, and $d$ denotes a preset scaling factor, which here (when corresponding to the audio features of the video) equals the audio feature dimension. The dimension of the audio features is the same as the frame feature dimension.
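The audio branch could be sketched as follows, assuming librosa for the Mel spectrogram and torchvision's pre-trained GoogLeNet; the segment length, mel resolution and the final 1024-dimensional feature (which would still need a linear projection to match the frame feature dimension d) are all assumptions.

```python
import librosa
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models

googlenet = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
googlenet.fc = torch.nn.Identity()       # 1024-d pooled feature per spectrogram
googlenet.eval()

@torch.no_grad()
def extract_audio_features(wav_path, segment_sec=2.0, sr=16000, n_mels=224):
    """Split the audio track into segments, turn each segment into a log-mel
    spectrogram image and encode it with GoogLeNet; returns v_a of shape (N3, 1024)."""
    y, _ = librosa.load(wav_path, sr=sr)
    seg_len = int(segment_sec * sr)
    feats = []
    for start in range(0, max(len(y) - seg_len, 1), seg_len):
        seg = y[start:start + seg_len]
        mel = librosa.feature.melspectrogram(y=seg, sr=sr, n_mels=n_mels)
        mel_db = librosa.power_to_db(mel, ref=np.max)               # log-mel "image"
        spec = torch.tensor(mel_db, dtype=torch.float32)
        spec = (spec - spec.mean()) / (spec.std() + 1e-6)           # normalise values
        spec = spec.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)       # (1, 3, n_mels, frames)
        spec = F.interpolate(spec, size=(224, 224), mode="bilinear", align_corners=False)
        feats.append(googlenet(spec).squeeze(0))                    # (1024,)
    return torch.stack(feats)
```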
In this way, a plurality of modal features of the video may be extracted in step S2 by the method described above.
It is worth mentioning that, in the above example, three features of a frame feature, a clip feature, and an audio feature of a video are adopted as the plurality of modal features of the video, but the embodiment of the present invention is not limited thereto.
For example, in the embodiment of the present invention, at least two of the frame feature, the clip feature, the audio feature, the clip audio feature, and the frame audio feature of the video may be selected as the plurality of modal features of the video.
For example, the clip features $v_c$ and the audio features $v_a$ can be fused to obtain the clip-audio features $v_{ca}$. Specific fusion methods include linear addition, linear multiplication, concatenation and the like. Taking the concatenation of the two features as an example, the clip-audio features are obtained as

$$v_{ca}=[v_c,\,v_a],$$

where $[\cdot,\cdot]$ denotes the operation of concatenating two features. Compared with the individual clip and audio features, the clip-audio features carry richer information and stronger semantic information.
As another example, the frame-audio features $v_{fa}$ can be extracted in a way similar to the fusion used for the clip-audio features $v_{ca}$, and the resulting frame-audio features $v_{fa}$ are more effective than the feature information of a single modality.
Further, in step S2, the question features may be extracted through steps S24 to S27 described below.
In step S24, all the words in the question are represented by one-hot encoding to obtain the corresponding question representation $q=\{q_1,q_2,\ldots,q_T\}$. For example, when the language of the question is English, all the words in the question can be represented by one-hot encoding in step S24. Here, $q_t$ is the one-hot representation of the $t$-th word in the question, $t=1,2,3,\ldots,T$, where $T$ is the length of the question (i.e. the number of words it contains) and is a natural number.
Then, in step S25, a word embedding matrix $E\in\mathbb{R}^{|N_{vocab}|\times 300}$ is obtained with a pre-trained word embedding model (such as the GloVe model), where $|N_{vocab}|$ denotes the number of words in the predetermined dataset and 300 is the feature dimension of each word vector in the word embedding matrix.
Next, in step S26, the question $q$ is embedded into a low-dimensional continuous vector space through the obtained word embedding matrix, giving the corresponding word embedding vectors $x_t=E\,q_t$.
In this way, in step S27, the word embedding vectors obtained in step S26 can be encoded with an LSTM (Long Short-Term Memory network) to obtain the text feature of the question:

$$h_q=\mathrm{LSTM}_q(x_1,\ldots,x_T),$$

where $\mathrm{LSTM}_q(\cdot)$ denotes the long short-term memory network that processes the word embedding vectors. Unlike a picture, the question is sequence data, so encoding the question features with the LSTM avoids losing information and better preserves the semantic information of each word in the question.
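A minimal question encoder along the lines of steps S24 to S27 is sketched below; looking up word indices in an nn.Embedding initialised with GloVe is equivalent to multiplying the one-hot vector q_t by the matrix E. The hidden size is an assumption.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """One-hot words -> GloVe embeddings (300-d) -> LSTM; returns the per-step
    hidden states, the final hidden state h_q and the final memory cell c_q."""
    def __init__(self, glove_matrix, hidden_size=512):
        super().__init__()
        # glove_matrix: (|N_vocab|, 300) float tensor built from pre-trained GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove_matrix, freeze=False)
        self.lstm = nn.LSTM(input_size=300, hidden_size=hidden_size, batch_first=True)

    def forward(self, word_ids):            # word_ids: (B, T) integer word indices
        x = self.embed(word_ids)            # (B, T, 300) == x_t = E * q_t
        h_steps, (h_n, c_n) = self.lstm(x)  # h_steps: (B, T, H)
        return h_steps, h_n[-1], c_n[-1]    # per-step states h_t^q, h_q, c_T^q
```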
As shown in fig. 2, for example, the corresponding video features can be extracted by the three models shown in the figure, ResNet, TSN and GoogLeNet (as the video feature extraction models), and the text features of the question can be extracted by the GloVe + LSTM model shown in fig. 2 (as the question feature extraction model).
In step S3, the question is used to attend to the plurality of modality features of the video to obtain a plurality of modality representations with question guidance, the question is used to calculate the weight score of each modality, and the modality representation with the highest weight score among the plurality of modalities is selected as the key modality.
It should be noted that the plurality of modalities refer to modalities corresponding to the frame feature, the clip feature, and the audio feature (optionally, other features may be included).
In step S3, the obtained plurality of modal representations with issue guidance includes, for example: a frame representation with issue guidance; a clip representation with issue guidance; audio presentation with problem guidance.
The above frame representation with question guidance can be obtained by, for example, steps S31 to S33 described below.
In step S31, a compatibility function is used to rescale the question feature $h_q$ and the frame features $v_f=(f_1,f_2,\ldots,f_{N_1})$, i.e. the question feature $h_q$ and the frame features $v_f$ are mapped from a high-dimensional feature space into the same low-dimensional feature space for similarity calculation (that is, to compute the semantic similarity between the question feature and the frame features), giving the corresponding frame vector group $e^f$. Each frame vector in the frame vector group $e^f$ is computed as follows:

$$e^f_i=\frac{\langle h_q, f_i\rangle}{\sqrt{d}},$$

where the compatibility function used in step S31 is a scaled dot-product function, $e^f=(e^f_1,\ldots,e^f_{N_1})$ denotes the resulting frame vector group, $e^f_i$ denotes the $i$-th frame vector in the frame vector group, $f_i$ denotes the frame feature of the $i$-th frame in the video, $i=1,2,3,\ldots,N_1$, and $d$ is a preset scaling factor.
It should be noted that, in the embodiment of the present invention, mapping a and B from the high-dimensional feature space to the same low-dimensional feature space means that a and B are both mapped from the high-dimensional feature space to the same low-dimensional feature space, for example, a is mapped from 2048-dimensional feature space to 256-dimensional feature space, and B is also mapped from 2048-dimensional feature space to 256-dimensional feature space; alternatively, a maps from 2048-dimensional feature space to 256-dimensional feature space and B maps from 1024-dimensional feature space to 256-dimensional feature space. In other words, a and B are each mapped from a respective high-dimensional space to a low-dimensional space of the same dimension.
Next, in step S32, an alignment function is used to convert each frame vector $e^f_i$ of the frame vector group $e^f$ into a corresponding frame attention weight score $\alpha^f_i$, i.e. the normalized similarity between the question feature and the frame feature. The frame attention weight score corresponding to each frame vector is computed as follows:

$$\alpha^f_i=\frac{\exp(e^f_i)}{\sum_{i_1=1}^{N_1}\exp(e^f_{i_1})},$$

where the normalization function used in step S32 is the softmax function, $\exp(\cdot)$ denotes the exponential function with the natural base $e$, and $e^f_{i_1}$ denotes $e^f_i$ with $i=i_1$, the index $i_1$ ranging from 1 to $N_1$.
Thus, in step S33, a generate-context function is used to compute the weighted sum of each frame feature $f_i$ with its corresponding frame attention weight score $\alpha^f_i$ (i.e. the frame features are weighted and summed based on the frame attention weight score corresponding to each frame feature), giving the frame representation with question guidance $p_f$, as shown in the following formula:

$$p_f=W_1\Big(\sum_{i=1}^{N_1}\alpha^f_i f_i\Big)+b_1,$$

where $W_1$ denotes a trainable weight matrix and $b_1$ denotes a trainable bias vector.
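A hedged sketch of this question-guided attention (compatibility, alignment and generate-context steps) is given below; the same module can serve the frame, clip and audio branches of steps S31 to S39. The explicit linear projections into a shared low-dimensional space and the dimension values are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Scaled dot-product attention of the question feature over one modality
    (frames, clips or audio), followed by the generate-context step W * sum + b."""
    def __init__(self, q_dim, feat_dim, low_dim=256, out_dim=512):
        super().__init__()
        self.proj_q = nn.Linear(q_dim, low_dim)      # map the question into the shared space
        self.proj_v = nn.Linear(feat_dim, low_dim)   # map the modality into the shared space
        self.context = nn.Linear(feat_dim, out_dim)  # plays the role of W_1 / W_2 / W_3 and bias

    def forward(self, h_q, feats):                   # h_q: (B, q_dim); feats: (B, N, feat_dim)
        q = self.proj_q(h_q).unsqueeze(1)            # (B, 1, low_dim)
        v = self.proj_v(feats)                       # (B, N, low_dim)
        e = (v * q).sum(-1) / math.sqrt(v.size(-1))  # compatibility: scaled dot product
        alpha = F.softmax(e, dim=-1)                 # alignment: attention weight scores
        ctx = (alpha.unsqueeze(-1) * feats).sum(1)   # weighted sum of the modality features
        return self.context(ctx), alpha              # p_f / p_c / p_a and the weights
```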
As another example, the above clip representation with question guidance can be obtained through steps S34 to S36 described below.
In step S34, a compatibility function is used to rescale the question feature $h_q$ and the clip features $v_c=(c_1,c_2,\ldots,c_{N_2})$, i.e. the question feature and the clip features are mapped from a high-dimensional feature space into the same low-dimensional feature space for similarity calculation (that is, to compute the semantic similarity between the question feature and the clip features), giving the corresponding clip vector group $e^c$. Each clip vector in the clip vector group $e^c$ is computed as follows:

$$e^c_j=\frac{\langle h_q, c_j\rangle}{\sqrt{d}},$$

where the compatibility function used in step S34 is a scaled dot-product function, $e^c=(e^c_1,\ldots,e^c_{N_2})$ denotes the resulting clip vector group, $e^c_j$ denotes the $j$-th clip vector in the clip vector group, $c_j$ denotes the clip feature of the $j$-th clip in the video, $j=1,2,3,\ldots,N_2$, and $d$ is a preset scaling factor.
Next, in step S35, an alignment function is used to convert each clip vector $e^c_j$ of the clip vector group $e^c$ into a corresponding clip attention weight score $\alpha^c_j$, i.e. the normalized similarity between the question feature and the clip feature. The clip attention weight score corresponding to each clip vector is computed as follows:

$$\alpha^c_j=\frac{\exp(e^c_j)}{\sum_{i_2=1}^{N_2}\exp(e^c_{i_2})},$$

where the normalization function used in step S35 is the softmax function, $e^c_{i_2}$ denotes the $i_2$-th clip vector in the clip vector group, i.e. $e^c_j$ with $j=i_2$, and the index $i_2$ ranges from 1 to $N_2$.
Thus, in step S36, a generate-context function is used to compute the weighted sum of each clip feature $c_j$ with its corresponding clip attention weight score $\alpha^c_j$ (i.e. the clip features are weighted and summed based on the clip attention weight score corresponding to each clip feature), giving the clip representation with question guidance $p_c$, as shown in the following formula:

$$p_c=W_2\Big(\sum_{j=1}^{N_2}\alpha^c_j c_j\Big)+b_2,$$

where $W_2$ denotes a trainable weight matrix and $b_2$ denotes a trainable bias vector.
Further, the above audio representation with question guidance can also be obtained through steps S37 to S39 described below.
In step S37, a compatibility function is used to rescale the question feature $h_q$ and the audio features $v_a=(a_1,a_2,\ldots,a_{N_3})$, i.e. the question feature and the audio features are mapped from a high-dimensional feature space into the same low-dimensional feature space for similarity calculation (that is, to compute the semantic similarity between the question feature and the audio features), giving the corresponding audio vector group $e^a$. Each audio vector in the audio vector group $e^a$ is computed as follows:

$$e^a_k=\frac{\langle h_q, a_k\rangle}{\sqrt{d}},$$

where the compatibility function used in step S37 is a scaled dot-product function, $e^a=(e^a_1,\ldots,e^a_{N_3})$ denotes the resulting audio vector group, $e^a_k$ denotes the $k$-th audio vector in the audio vector group, $a_k$ denotes the audio feature of the $k$-th audio segment in the video, $k=1,2,3,\ldots,N_3$, and $d$ is a preset scaling factor.
Next, in step S38, an alignment function is used to convert each audio vector $e^a_k$ of the audio vector group $e^a$ into a corresponding audio attention weight score $\alpha^a_k$, i.e. the normalized similarity between the question feature and the audio feature. The audio attention weight score corresponding to each audio vector is computed as follows:

$$\alpha^a_k=\frac{\exp(e^a_k)}{\sum_{i_3=1}^{N_3}\exp(e^a_{i_3})},$$

where the normalization function used in step S38 may, for example, be the softmax function, $e^a_{i_3}$ denotes the $i_3$-th audio vector in the audio vector group, i.e. $e^a_k$ with $k=i_3$, and the index $i_3$ ranges from 1 to $N_3$.
Thus, in step S39, a generate-context function is used to compute the weighted sum of each audio feature $a_k$ with its corresponding audio attention weight score $\alpha^a_k$, giving the audio representation with question guidance $p_a$, as shown in the following formula:

$$p_a=W_3\Big(\sum_{k=1}^{N_3}\alpha^a_k a_k\Big)+b_3,$$

where $W_3$ denotes a trainable weight matrix and $b_3$ denotes a trainable bias vector.
In this way, in the above steps S31 to S39, the frame attention weight score, the clip attention weight score, and the audio attention weight score are obtained separately, and for the sake of clarity, may be recorded as a first weight score so as to be distinguished from a second weight score which will be described later.
Further, in step S3, the question can again be used to compute weight scores for the frame representation with question guidance $p_f$, the clip representation with question guidance $p_c$ and the audio representation with question guidance $p_a$, obtaining the weight score results $s_f$, $s_c$, $s_a$ (e.g. as the respective second weight scores), and the modality with the highest weight score among $s_f$, $s_c$, $s_a$ is selected as the key modality $p$, as shown in the following formulas:

$$H=\langle h_q, P\rangle,\qquad P=\{p_f,p_c,p_a\},\qquad H=\{H_f,H_c,H_a\},$$
$$S=\mathrm{softmax}(H),\qquad S=\{s_f,s_c,s_a\},$$

where $\langle\cdot,\cdot\rangle$ denotes the cosine similarity calculation; for example, $\langle h_q, P\rangle$ denotes the cosine similarity of $h_q$ with $P$, where $P=\{p_f,p_c,p_a\}$ denotes the modality features with question guidance, and $H=\{H_f,H_c,H_a\}$ denotes the similarity between the question feature $h_q$ and the different modality features with question guidance $P=\{p_f,p_c,p_a\}$.
$S=\{s_f,s_c,s_a\}$ denotes the weight score results obtained after the question feature $h_q$ attends to the different modality features with question guidance $P=\{p_f,p_c,p_a\}$, and $p$ denotes the modality most relevant to the question, $p\in\{p_f,p_c,p_a\}$.
Furthermore, $s_f$ denotes the second weight score obtained after the question feature $h_q$ attends to the frame representation with question guidance $p_f$, $s_c$ denotes the second weight score obtained after the question feature $h_q$ attends to the clip representation with question guidance $p_c$, and $s_a$ denotes the second weight score obtained after the question feature $h_q$ attends to the audio representation with question guidance $p_a$.
Thus, as shown in fig. 2, the above processing of step S3 can be completed by the video sequence attention module shown in the figure, based on the video features and question features extracted in step S2, to obtain the key modality.
In step S4, the plurality of modality representations are fused based on a multi-modal fusion algorithm according to the obtained modality representations and weight scores to obtain a video fusion representation of the video.
In step S4, for example, the frame representation with question guidance $p_f$, the clip representation with question guidance $p_c$, the audio representation with question guidance $p_a$ and their respective weight scores $s_f$, $s_c$, $s_a$ (i.e. the respective second weight scores) can be fused with the multi-modal compact bilinear model MCB according to the following formula to obtain the video fusion representation $v_u$:

$$v_u=\mathrm{MCBFusion}(s_f p_f,\, s_c p_c,\, s_a p_a),$$

where $\mathrm{MCBFusion}(\cdot)$ in the above formula denotes the multi-modal fusion algorithm function corresponding to the multi-modal compact bilinear model MCB.
Thus, as shown in fig. 2, the processing of step S4 can be completed by the multi-modal fusion algorithm module as shown in the figure according to the modal representation and the weight score obtained in step S3, so as to obtain a multi-modal fusion representation of the video.
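For reference, MCB-style fusion is usually implemented with count sketches multiplied in the Fourier domain; the sketch below follows that recipe, and its extension from the original pairwise MCB to three inputs, as well as the output dimension, are assumptions.

```python
import torch

class CompactBilinearFusion(torch.nn.Module):
    """Sketch of MCB-style fusion: each input is count-sketched to out_dim, the
    sketches are multiplied in the frequency domain, and the product is
    transformed back.  Extending the pairwise MCB to three inputs this way is
    an illustrative assumption."""
    def __init__(self, in_dim, out_dim=8000):
        super().__init__()
        self.out_dim = out_dim
        for name in ("f", "c", "a"):
            self.register_buffer(f"h_{name}", torch.randint(out_dim, (in_dim,)))
            self.register_buffer(f"s_{name}", 2 * torch.randint(2, (in_dim,)).float() - 1)

    def _sketch(self, x, h, s):                       # count-sketch projection
        y = x.new_zeros(x.size(0), self.out_dim)
        return y.index_add_(1, h, x * s)

    def forward(self, x_f, x_c, x_a):                 # each: (B, in_dim), already s_m * p_m
        ffts = [torch.fft.rfft(self._sketch(x, getattr(self, f"h_{n}"), getattr(self, f"s_{n}")))
                for n, x in (("f", x_f), ("c", x_c), ("a", x_a))]
        prod = ffts[0] * ffts[1] * ffts[2]
        return torch.fft.irfft(prod, n=self.out_dim)  # v_u: (B, out_dim)
```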
In step S5, the question is used to attend to the video fusion representation of the video to obtain the video fusion representation with question guidance.
For example, step S5 may include steps S51 to S52 described below.
In step S51, the obtained video fusion representation $v_u$ and the hidden state $h^q_t$ output at time $t$ by the long short-term memory network $\mathrm{LSTM}_q$ that encodes the question are combined according to the following formula, and the result is used as the input at time $t$ of the bidirectional long short-term memory network $\mathrm{Bi\_LSTM}_a$:

$$h^a_t=\mathrm{Bi\_LSTM}_a(v_u\odot h^q_t),$$

where $\odot$ denotes element-wise multiplication, $\mathrm{Bi\_LSTM}_a(\cdot)$ denotes the bidirectional long short-term memory network, and $h^a_t$ denotes the hidden state of $\mathrm{Bi\_LSTM}_a$ at the $t$-th time step of the encoding process.
Next, in step S52, the hidden state $h^a_t$ of $\mathrm{Bi\_LSTM}_a$ at time $t$ is used to attend to the obtained video fusion representation $v_u$ according to the following formulas, giving the video fusion representation with question guidance $v_o$:

$$e_t=W_6\tanh(W_4 v_u+W_5 h^a_t+b_5)+b_6,$$
$$\alpha_t=\frac{\exp(e_t)}{\sum_{i_4=1}^{T}\exp(e_{i_4})},$$
$$v_o=\sum_{t=1}^{T}\alpha_t h^a_t,$$

where $W_4$, $W_5$ and $W_6$ denote trainable weight matrices, $b_5$ and $b_6$ denote trainable bias vectors, $e_t$ denotes the weight computed from the similarity between the video fusion feature and the video features (the video features being the frame features, clip features and audio features described above), $\alpha_t$ denotes the attention weight distribution after normalization, and $e_{i_4}$ corresponds to $e_t$ with $t=i_4$, the index $i_4$ ranging from 1 to $T$.
Thus, as shown in fig. 2, the above processing of step S5 can be accomplished by the secondary attention module shown in the figure, based on the multi-modal fusion representation of the video obtained in step S4, to find the important video features related to the question (i.e. the video fusion representation with question guidance).
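Steps S51 and S52 could be sketched as below; the exact scoring form follows the reconstruction above (and is therefore an assumption), as are the hidden sizes and the requirement that v_u and the question hidden states share a dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondaryAttention(nn.Module):
    """Bi-LSTM over (v_u * h_t^q), then additive attention back onto the Bi-LSTM
    states to form the question-guided fusion representation v_o.  The scoring
    form W6*tanh(W4*v_u + W5*h_t^a + b5) + b6 mirrors the parameter list in the
    text and is otherwise an assumption."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.W4 = nn.Linear(dim, hidden, bias=False)
        self.W5 = nn.Linear(2 * hidden, hidden)       # carries b5
        self.W6 = nn.Linear(hidden, 1)                # carries b6

    def forward(self, v_u, h_q_steps):                # v_u: (B, dim); h_q_steps: (B, T, dim)
        x = v_u.unsqueeze(1) * h_q_steps              # element-wise gating, (B, T, dim)
        h_a, (h_n, c_n) = self.bilstm(x)              # h_a: (B, T, 2*hidden)
        e = self.W6(torch.tanh(self.W4(v_u).unsqueeze(1) + self.W5(h_a)))  # (B, T, 1)
        alpha = F.softmax(e, dim=1)
        v_o = (alpha * h_a).sum(dim=1)                # (B, 2*hidden)
        c_a = torch.cat([c_n[0], c_n[1]], dim=-1)     # Bi-LSTM memory cell state
        return v_o, c_a
```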
In step S6, the question is used to perform multi-step attention on the features of the key modality, and the key modality features most relevant to the question are located through multiple rounds of iteration.
The processing of step S6 described above may be realized by, for example, steps S61 to S64 described below.
In step S61, the query condition is initialized according to the following formula:

$$z_0=h_q,$$

where $h_q$ denotes the question feature.
Next, in step S62, the query condition $z_r$ is used to attend to the obtained key modality $p$ according to the following formulas, giving the key modality representation with question guidance $\tilde{p}^{(r)}$:

$$e_r=\tanh(W_7[p;z_r]+b_7),$$
$$\alpha_r=\mathrm{softmax}(W_8 e_r+b_8),$$
$$\tilde{p}^{(r)}=W_9(\alpha_r\odot p)+b_9,$$

where $W_7$, $W_8$ and $W_9$ denote trainable weight matrices, $b_7$, $b_8$ and $b_9$ denote trainable bias vectors, $p$ denotes the modality most relevant to the question, $p\in\{p_f,p_c,p_a\}$, $z_r$ denotes the query condition updated at the $r$-th iteration, $r=0,1,2,\ldots,R$, where $R$ denotes the total number of iterations and is a natural number, $e_r$ denotes the weight computed from the similarity between the question feature and the key modality feature, and $\alpha_r$ denotes the attention weight distribution after normalization.
Next, in step S63, the query condition is iteratively updated according to the following formula:

$$z_r=z_{r-1}+\tilde{p}^{(r-1)},$$

where $z_{r-1}$ denotes the query condition updated at the $(r-1)$-th iteration and $\tilde{p}^{(r-1)}$ denotes the key modality feature with question guidance obtained by the $(r-1)$-th query, $r=1,2,\ldots,R$.
In this way, in step S64, step S62 is executed using the query condition updated in step S63 to perform multi-step iterative attention on the key modality $p$, obtaining the key modality feature $\tilde{p}^{(R)}$ most relevant to the question.
Thus, as shown in fig. 2, the processing of step S6 can be accomplished by the iterative attention module shown in the figure, according to the key modality found in step S3, to locate the key modality features most relevant to the question.
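A sketch of the multi-step attention of steps S61 to S64 follows; combining p and z_r by concatenation, weighting p element-wise and the number of rounds are assumptions consistent with, but not dictated by, the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeAttention(nn.Module):
    """Multi-step attention on the key modality: the query starts from the
    question feature and is refined for R rounds, as in steps S61-S64."""
    def __init__(self, dim, hidden=512, rounds=2):
        super().__init__()
        self.rounds = rounds
        self.W7 = nn.Linear(2 * dim, hidden)   # carries b7
        self.W8 = nn.Linear(hidden, dim)       # carries b8
        self.W9 = nn.Linear(dim, dim)          # carries b9

    def forward(self, h_q, p):                 # h_q, p: (B, dim)
        z = h_q                                # S61: initialise the query condition
        p_tilde = torch.zeros_like(p)
        for _ in range(self.rounds):
            e = torch.tanh(self.W7(torch.cat([p, z], dim=-1)))   # (B, hidden)
            alpha = F.softmax(self.W8(e), dim=-1)                # attention weights
            p_tilde = self.W9(alpha * p)                         # question-guided key modality
            z = z + p_tilde                                      # S63: update the query
        return p_tilde
```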
Step S7, obtaining a predicted answer based on at least the question features, the video fusion representation with question guidance, and the results of the multi-step attention and the multiple rounds of iteration.
In step S7, the following four parts of information are fused to generate the answer: the memory cell state output by the long short-term memory network LSTM_q of step S2, the memory cell state output by Bi_LSTM_a of step S5, the video representation with question guidance v_o obtained in step S5, and the iterative attention result obtained in step S6:

[Formula rendered as an image in the original: the four parts are fused through the weight matrix W_10 and scored against the vocabulary through W_answer, and the highest-scoring entry is taken as the answer.]

wherein W_10 denotes a trainable weight matrix, W_answer denotes the weight matrix over the vocabulary, Answer denotes the generated answer, and argmax denotes selecting the highest-scoring entry as the prediction result.
Thus, as shown in FIG. 2, the processing of step S7 described above can be performed by the answer generation module shown in the figure: the relevant outputs of the preceding steps are fused and fed into this module to generate the predicted answer. The portion enclosed by the dashed box in FIG. 2 is the multi-modal progressive attention model according to the embodiment of the present invention, which performs the above steps.
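As an illustration only, the following sketch fuses the four parts of information named above and scores a vocabulary to pick the answer. The concatenation-based fusion, the tanh nonlinearity, and the vocabulary size are assumptions made for the sketch; the formula of the embodiment is rendered as an image in the original.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(2)
D, V = 256, 4000                      # feature dimension and vocabulary size (illustrative)
c_q = rng.standard_normal(D)          # memory cell state output by LSTM_q (step S2)
c_a = rng.standard_normal(D)          # memory cell state output by Bi_LSTM_a (step S5)
v_o = rng.standard_normal(D)          # video fusion representation with question guidance (step S5)
p_itr = rng.standard_normal(D)        # iterative attention result on the key modality (step S6)

W10 = rng.standard_normal((4 * D, D)) / np.sqrt(4 * D)   # fusion weights (concatenation assumed)
W_answer = rng.standard_normal((D, V)) / np.sqrt(D)      # vocabulary scoring matrix

fused = np.tanh(np.concatenate([c_q, c_a, v_o, p_itr]) @ W10)
scores = softmax(fused @ W_answer)
answer_id = int(np.argmax(scores))    # index of the predicted answer word in the vocabulary
print(answer_id)
```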
As shown in FIG. 3, given a video and a question, a predicted answer can be obtained as illustrated. The method can therefore be used for video question-answering and can predict more accurate answers.
In addition, the embodiment of the invention also provides a storage medium storing a program, and the program realizes the method for solving the video question-answering task based on the multi-modal progressive attention model when being executed by a processor.
In addition, the embodiment of the invention also provides a computing device which comprises the storage medium.
FIG. 4 illustrates a block diagram of an exemplary computer system/server 50 suitable for use in implementing embodiments of the present invention. The computer system/server 50 shown in FIG. 4 is only an example and should not be taken to limit the scope of use and functionality of embodiments of the present invention.
As shown in FIG. 4, computer system/server 50 is in the form of a general purpose computing device. Components of computer system/server 50 may include, but are not limited to: one or more processors or processing units 501, a system memory 502, and a bus 503 that couples the various system components (including the system memory 502 and the processing unit 501).
Computer system/server 50 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 50 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 502 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 5021 and/or cache memory 5022. The computer system/server 50 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, the storage system 5023 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 503 by one or more data media interfaces. At least one program product may be included in system memory 502 with a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 5025 having a set (at least one) of program modules 5024 may be stored in, for example, system memory 502, and such program modules 5024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. The program modules 5024 generally perform the functions and/or methodologies of the described embodiments of the invention.
The computer system/server 50 may also communicate with one or more external devices 504 (e.g., keyboard, pointing device, display, etc.). Such communication may occur through input/output (I/O) interfaces 505. Also, the computer system/server 50 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 506. As shown in FIG. 4, the network adapter 506 communicates with the other modules of the computer system/server 50 (e.g., the processing unit 501) via the bus 503. It should be appreciated that, although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with the computer system/server 50.
The processing unit 501 executes various functional applications and data processing, for example, executes and implements the steps of the above-described method, by running a program stored in the system memory 502.
A specific example of a computer-readable storage medium embodying the present invention is shown in fig. 5.
The computer-readable storage medium of fig. 5 is an optical disc 600, and a computer program (i.e., a program product) is stored thereon, and when the program is executed by a processor, the program implements the steps described in the above method embodiments, and specific implementations of the steps are not repeated here.
PREFERRED EMBODIMENTS
In the preferred embodiment, experiments were performed on the ZJL experimental dataset, which contains a total of 13161 short videos and 197415 question-answer pairs. In order to objectively evaluate the performance of the method of the present invention, the effect of the invention is evaluated using the Accuracy criterion on the selected test set, which reflects the accuracy of model prediction. The experimental results obtained following the procedure described above are shown in Table 1.
TABLE 1
[Table 1 is rendered as an image in the original; it reports the Accuracy of the proposed method on the ZJL test set.]
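For reference, the Accuracy criterion used here is simply the fraction of questions whose predicted answer matches the ground truth; a toy illustration with hypothetical predictions (not data from the experiments) is given below.

```python
# Hypothetical predictions and ground-truth answers for four question-answer pairs.
predictions  = ["dog", "kitchen", "two", "running"]
ground_truth = ["dog", "bedroom", "two", "running"]

accuracy = sum(p == g for p, g in zip(predictions, ground_truth)) / len(ground_truth)
print(f"Accuracy = {accuracy:.2%}")   # 75.00% for this toy example
```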
The present invention performs an ablation study to evaluate the effectiveness of each modality, wherein Q (Question only) denotes predicting answers based only on the question features, V + Q (Video and Question) denotes predicting answers based on the video and the question, A + Q (Audio and Question) denotes predicting answers based on the audio and the question, and V + A + Q (Video, Audio and Question) denotes predicting answers based on the video, the audio and the question. The experimental results obtained are shown in Table 2.
TABLE 2
[Table 2 is rendered as an image in the original; it reports the Accuracy of the Q, V + Q, A + Q and V + A + Q ablation settings.]
It should be noted that although several units, modules or sub-modules are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the modules described above may be embodied in one module. Conversely, the features and functions of one module described above may be further divided so as to be embodied by a plurality of modules.
Further, while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects, which is made for convenience of description only, imply that the features in these aspects cannot be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for solving a video question-answering task based on a multi-modal progressive attention model, comprising the following steps:
step S1, obtaining a video and a question to be processed;
step S2, extracting frame features, clip features and audio features of the video as a plurality of modal features of the video, and extracting text features of the question;
step S3, respectively attending to the plurality of modal features of the video by using the question to obtain a plurality of modal representations with question guidance, respectively calculating a weight score for each modality by using the question, and selecting the modality with the highest weight score among the plurality of modalities as the key modality;
step S4, fusing the plurality of modal representations based on a multi-modal fusion algorithm according to the obtained modal representations and weight scores, to obtain a video fusion representation of the video;
step S5, attending to the video fusion representation of the video by using the question to obtain a video fusion representation with question guidance;
step S6, performing multi-step attention on the features of the key modality by using the question, and locating the key-modality features more relevant to the question in a multi-round iterative manner;
step S7, obtaining a predicted answer based on at least the question features, the video fusion representation with question guidance, and the results of the multi-step attention and the multiple rounds of iteration.
2. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein the step of extracting the frame feature, the clip feature and the audio feature of the video in the step S2 comprises:
step S21, extracting the frame features of the video v_f = {f_1, f_2, ..., f_N1}, v_f ∈ R^{N1×d}, by using a pre-trained ResNet model, wherein f_i denotes the frame feature of the i-th frame in the video, f_i ∈ R^d, i = 1, 2, 3, …, N1, N1 denotes the number of frames, and d denotes the dimension of the frame features;
step S22, extracting the clip features of the video v_c = {c_1, c_2, ..., c_N2}, v_c ∈ R^{N2×d}, by using a pre-trained TSN network, wherein c_j denotes the clip feature of the j-th clip in the video, c_j ∈ R^d, j = 1, 2, 3, …, N2, N2 denotes the number of clips, and the dimension of the clip features is the same as the dimension of the frame features;
step S23, converting the audio in the video into a spectrogram based on Mel-frequency cepstral coefficients as the input of a pre-trained GoogLeNet model, and then extracting the audio features of the video v_a = {a_1, a_2, ..., a_N3}, v_a ∈ R^{N3×d}, by using the pre-trained GoogLeNet model, wherein a_k denotes the audio feature of the k-th audio segment in the video, a_k ∈ R^d, k = 1, 2, 3, …, N3, N3 denotes the number of audio segments, and the dimension of the audio features is the same as the dimension of the frame features;
the step of extracting the question features in step S2 comprises:
step S24, performing one-hot encoding on all words in the question to obtain the question representation q = {q_1, q_2, ..., q_T}, wherein q_t is the one-hot encoded representation of the t-th word in the question, t = 1, 2, ..., T, and T denotes the length of the question;
step S25, obtaining a word embedding matrix E ∈ R^{|N_vocab|×300} by using the pre-trained word embedding model GloVe, wherein |N_vocab| denotes the vocabulary size of the dataset and the value 300 denotes the feature dimension of each word vector;
step S26, embedding the question q into a low-dimensional continuous vector space through the word embedding matrix E to obtain the word embedding vectors x_t = E * q_t, t = 1, 2, ..., T;
step S27, encoding the word embedding vectors by using an LSTM to obtain the text features of the question h_t^q = LSTM_q(x_t), wherein LSTM_q(·) denotes the long short-term memory network that processes the word embedding vectors.
3. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1 or 2, wherein the plurality of modal representations with question guidance obtained in step S3 includes a frame representation with question guidance obtained by the steps of:
S31, performing dimension scaling on the question feature h^q and the frame features v_f = (f_1, f_2, ..., f_N1) by using a compatibility function, that is, mapping the question feature and the frame features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, so as to obtain a corresponding frame vector group e_f, wherein each frame vector is calculated as follows:

[Formula rendered as an image in the original: the i-th frame vector e_i^f is the scaled dot product of the question feature h^q and the frame feature f_i, scaled by the factor d.]

wherein the compatibility function used is a scaled dot-product function, e_f = (e_1^f, e_2^f, ..., e_N1^f) denotes the resulting frame vector group, e_i^f denotes the i-th frame vector in the frame vector group, f_i denotes the frame feature of the i-th frame in the video, i = 1, 2, 3, …, N1, and d denotes a preset scaling factor;

S32, converting each frame vector e_i^f in the frame vector group e_f into a corresponding frame attention weight score α_i^f by using an alignment function, so as to obtain the normalized similarity between the question feature and the frame features, wherein the frame attention weight score α_i^f corresponding to each frame vector is calculated as follows:

α_i^f = exp(e_i^f) / Σ_{i'=1..N1} exp(e_{i'}^f),

wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e;

S33, performing a weighted-sum calculation on each frame feature f_i and its corresponding frame attention weight score α_i^f by using a generate-context function to obtain the frame representation with question guidance p_f, as shown in the following formula:

[Formula rendered as an image in the original: p_f is obtained from the weighted sum Σ_i α_i^f f_i through the trainable parameters W_1 and b_1.]

wherein W_1 denotes a trainable weight matrix and b_1 denotes a trainable bias vector.
4. The method for solving the video question-answering task based on the multi-modal progressive attention model according to any one of claims 1-3, wherein the plurality of modal representations with question guidance obtained in step S3 includes a clip representation with question guidance obtained by:
S34, performing dimension scaling on the question feature h^q and the clip features v_c = (c_1, c_2, ..., c_N2) by using a compatibility function, that is, mapping the question feature and the clip features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, so as to obtain a corresponding clip vector group e_c, wherein each clip vector is calculated as follows:

[Formula rendered as an image in the original: the j-th clip vector e_j^c is the scaled dot product of the question feature h^q and the clip feature c_j, scaled by the factor d.]

wherein the compatibility function used is a scaled dot-product function, e_c = (e_1^c, e_2^c, ..., e_N2^c) denotes the resulting clip vector group, e_j^c denotes the j-th clip vector in the clip vector group, c_j denotes the clip feature of the j-th clip in the video, j = 1, 2, 3, …, N2, and d denotes a preset scaling factor;

S35, converting each clip vector e_j^c in the clip vector group e_c into a corresponding clip attention weight score α_j^c by using an alignment function, so as to obtain the normalized similarity between the question feature and the clip features, wherein the clip attention weight score α_j^c corresponding to each clip vector is calculated as follows:

α_j^c = exp(e_j^c) / Σ_{j'=1..N2} exp(e_{j'}^c),

wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e;

S36, performing a weighted-sum calculation on each clip feature c_j and its corresponding clip attention weight score α_j^c by using a generate-context function to obtain the clip representation with question guidance p_c, as shown in the following formula:

[Formula rendered as an image in the original: p_c is obtained from the weighted sum Σ_j α_j^c c_j through the trainable parameters W_2 and b_2.]

wherein W_2 denotes a trainable weight matrix and b_2 denotes a trainable bias vector.
5. The method for solving video question-answering task based on multi-modal progressive attention model according to any one of claims 1-4, wherein the plurality of modal representations with question guidance obtained in step S3 includes an audio representation with question guidance obtained by the steps of:
S37, performing dimension scaling on the question feature h^q and the audio features v_a = (a_1, a_2, ..., a_N3) by using a compatibility function, that is, mapping the question feature and the audio features from a high-dimensional feature space to the same low-dimensional feature space for similarity calculation, so as to obtain a corresponding audio vector group e_a, wherein each audio vector is calculated as follows:

[Formula rendered as an image in the original: the k-th audio vector e_k^a is the scaled dot product of the question feature h^q and the audio feature a_k, scaled by the factor d.]

wherein the compatibility function used is a scaled dot-product function, e_a = (e_1^a, e_2^a, ..., e_N3^a) denotes the resulting audio vector group, e_k^a denotes the k-th audio vector in the audio vector group, a_k denotes the audio feature of the k-th audio segment in the video, k = 1, 2, 3, …, N3, and d denotes a preset scaling factor;

S38, converting each audio vector e_k^a in the audio vector group e_a into a corresponding audio attention weight score α_k^a by using an alignment function, so as to obtain the normalized similarity between the question feature and the audio features, wherein the audio attention weight score α_k^a corresponding to each audio vector is calculated as follows:

α_k^a = exp(e_k^a) / Σ_{k'=1..N3} exp(e_{k'}^a),

wherein the normalization function is the softmax function and exp(·) denotes the exponential function with the natural base e;

S39, performing a weighted-sum calculation on each audio feature a_k and its corresponding audio attention weight score α_k^a by using a generate-context function to obtain the audio representation with question guidance p_a, as shown in the following formula:

[Formula rendered as an image in the original: p_a is obtained from the weighted sum Σ_k α_k^a a_k through the trainable parameters W_3 and b_3.]

wherein W_3 denotes a trainable weight matrix and b_3 denotes a trainable bias vector.
6. The method for solving the video question-answering task based on the multi-modal progressive attention model according to any one of claims 3-5, wherein the step S3 further comprises:
calculating, by using the question, weight scores for the frame representation with question guidance p_f, the clip representation with question guidance p_c, and the audio representation with question guidance p_a, respectively, according to the following formulas, to obtain the weight score results s_f, s_c, s_a, and selecting, from s_f, s_c, s_a, the modality with the highest weight score as the key modality p:

[Formulas rendered as images in the original: the similarities H = {H_f, H_c, H_a} between the question feature and the modal representations with question guidance, computed by cosine similarity, and the weight scores S = {s_f, s_c, s_a} obtained from these similarities.]

wherein <·,·> denotes the cosine similarity calculation, P = {p_f, p_c, p_a} denotes the modal features with question guidance, H = {H_f, H_c, H_a} denotes the similarity between the question feature h^q and the different modal features with question guidance P = {p_f, p_c, p_a}, S = {s_f, s_c, s_a} denotes the weight score results after the question feature h^q attends to the different modal features with question guidance P = {p_f, p_c, p_a}, and p denotes the modality most relevant to the question, p ∈ {p_f, p_c, p_a}.
7. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein the multi-modal fusion representation of the video in the step S4 is obtained by:
fusing, by a multi-modal compact bilinear (MCB) model according to the following formula, the frame representation with question guidance p_f, the clip representation with question guidance p_c, the audio representation with question guidance p_a, and their respective weight scores s_f, s_c, s_a, to obtain the video fusion representation v_u:
v_u = MCBFusion(s_f p_f, s_c p_c, s_a p_a).
8. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein the step S5 comprises:
step S51, calculating, according to the following formula, with the obtained video fusion representation v_u and the hidden state h_t^q output at time t by the long short-term memory network LSTM_q that encodes the question, and taking the calculation result as the input of the bidirectional long short-term memory network Bi_LSTM_a at time t:

[Formula rendered as an image in the original: the hidden state h_t^a of Bi_LSTM_a at time t is obtained by feeding the element-wise product of v_u and h_t^q into Bi_LSTM_a.]

wherein ⊙ denotes element-wise multiplication, Bi_LSTM_a(·) denotes a bidirectional long short-term memory network, and h_t^a denotes the hidden state of Bi_LSTM_a at time t during encoding;

step S52, using, according to the following formulas, the hidden state h_t^a of Bi_LSTM_a at time t to attend to the obtained video fusion representation v_u, so as to obtain the video fusion representation with question guidance v_o:

[Three formulas rendered as images in the original: the similarity weight e_t, the normalized attention weights α_t = softmax(·), and the attended video fusion representation with question guidance v_o.]

wherein W_4, W_5 and W_6 denote trainable weight matrices, b_5 and b_6 denote trainable bias vectors, e_t denotes the weight obtained by computing the similarity between the video fusion feature and the video features, and α_t denotes the attention weight distribution after weight normalization.
9. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein the step S6 comprises:
step S61, initializing the query condition according to the following formula:

z_0 = h^q,

wherein h^q denotes the question feature;

step S62, using, according to the following formulas, the query condition z_r to attend to the obtained key modality p, so as to obtain a key-modality representation with question guidance:

[Formula rendered as an image in the original: the similarity weight e_r between the query condition z_r and the key-modality features.]
α_r = softmax(W_8 e_r + b_8),
[Formula rendered as an image in the original: the key-modality representation with question guidance, obtained as the α_r-weighted combination of the key-modality features.]

wherein W_7, W_8 and W_9 denote trainable weight matrices, and b_7, b_8 and b_9 denote trainable bias vectors; p denotes the modality most relevant to the question, p ∈ {p_f, p_c, p_a}; z_r denotes the query condition after the r-th iterative update, r = 0, 1, 2, …, R; e_r denotes the weight obtained by computing the similarity between the question features and the key-modality features; and α_r denotes the attention weight distribution after weight normalization;

step S63, iteratively updating the query condition according to the following formula:

[Formula rendered as an image in the original: z_r is obtained from z_{r-1} and the key-modality feature with question guidance of the (r-1)-th query.]

wherein z_{r-1} denotes the query condition after the (r-1)-th iterative update, the key-modality feature with question guidance obtained by the (r-1)-th query is as defined in step S62, and r = 1, 2, …, R;

step S64, executing step S62 with the query condition updated in step S63, performing multi-step iterative attention on the key modality p, so as to obtain the key-modality feature more relevant to the question.
10. The method for solving the video question-answering task based on the multi-modal progressive attention model according to claim 1, wherein the predicted answer is obtained in step S7 according to the following steps:
fusing, according to the following formula, four parts of information to generate the answer: the memory cell state output by LSTM_q in step S2, the memory cell state output by Bi_LSTM_a in step S5, the video representation with question guidance v_o obtained in step S5, and the iterative attention result obtained in step S6:

[Formula rendered as an image in the original: the four parts are fused through the weight matrix W_10 and scored against the vocabulary through W_answer, and the highest-scoring entry is taken as the answer.]

wherein W_10 denotes a trainable weight matrix, W_answer denotes the weight matrix over the vocabulary, Answer denotes the generated answer, and argmax denotes selecting the highest-scoring entry as the prediction result.
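As a purely illustrative, non-limiting sketch of the attention scheme recited in claims 3 to 6, the snippet below computes a representation with question guidance for each modality via scaled dot-product attention and then selects the key modality by cosine similarity. The shared parameters W1 and b1, the square-root scaling, and the softmax over the cosine similarities are assumptions of the sketch, not limitations of the claims; the MCB fusion of claim 7 is not reproduced here.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def guided_representation(h_q, feats, W, b, scale):
    """Scaled dot-product attention of the question over one modality (the idea of claims 3-5)."""
    e = feats @ h_q / np.sqrt(scale)                      # compatibility: scaled dot product (sqrt scaling assumed)
    alpha = softmax(e)                                    # alignment: softmax normalization
    return (alpha[:, None] * feats).sum(axis=0) @ W + b   # generate-context: weighted sum plus a linear map

rng = np.random.default_rng(3)
D = 256                                               # shared feature dimension (illustrative)
h_q = rng.standard_normal(D)                          # question feature
modal_feats = {
    "frame": rng.standard_normal((30, D)),            # v_f
    "clip":  rng.standard_normal((16, D)),            # v_c
    "audio": rng.standard_normal((10, D)),            # v_a
}

W1 = rng.standard_normal((D, D)) / np.sqrt(D)         # shared here for brevity; the claims use separate W_1, W_2, W_3
b1 = np.zeros(D)
guided = {name: guided_representation(h_q, feats, W1, b1, D) for name, feats in modal_feats.items()}

# Claim 6 idea: cosine similarity between the question and each guided representation,
# normalized into weight scores; the highest-scoring modality becomes the key modality.
names = list(guided)
cos = np.array([h_q @ guided[m] / (np.linalg.norm(h_q) * np.linalg.norm(guided[m])) for m in names])
scores = softmax(cos)
key_modality = names[int(np.argmax(scores))]
print(key_modality, dict(zip(names, scores.round(3))))
```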
CN202110915934.8A 2021-08-10 2021-08-10 Method for solving video question-answering task based on multi-mode progressive attention model Active CN113688296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110915934.8A CN113688296B (en) 2021-08-10 2021-08-10 Method for solving video question-answering task based on multi-mode progressive attention model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110915934.8A CN113688296B (en) 2021-08-10 2021-08-10 Method for solving video question-answering task based on multi-mode progressive attention model

Publications (2)

Publication Number Publication Date
CN113688296A true CN113688296A (en) 2021-11-23
CN113688296B CN113688296B (en) 2022-05-31

Family

ID=78579588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110915934.8A Active CN113688296B (en) 2021-08-10 2021-08-10 Method for solving video question-answering task based on multi-mode progressive attention model

Country Status (1)

Country Link
CN (1) CN113688296B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822515A (en) * 2023-06-21 2023-09-29 Harbin University of Science and Technology Multi-mode named entity recognition method and system based on entity span positioning visual area

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9767349B1 (en) * 2016-05-09 2017-09-19 Xerox Corporation Learning emotional states using personalized calibration tasks
CN107463609A (en) * 2017-06-27 2017-12-12 浙江大学 It is a kind of to solve the method for video question and answer using Layered Space-Time notice codec network mechanism
CN109947991A (en) * 2017-10-31 2019-06-28 腾讯科技(深圳)有限公司 A kind of extraction method of key frame, device and storage medium
CN109829049A (en) * 2019-01-28 2019-05-31 杭州一知智能科技有限公司 The method for solving video question-answering task using the progressive space-time attention network of knowledge base
CN110727824A (en) * 2019-10-11 2020-01-24 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism
CN111414845A (en) * 2020-03-18 2020-07-14 浙江大学 Method for solving polymorphic sentence video positioning task by using space-time graph reasoning network
CN112860945A (en) * 2021-01-07 2021-05-28 国网浙江省电力有限公司 Method for multi-mode video question-answering by using frame-subtitle self-supervision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUNYEONG KIM ET AL.: ""Modality Shifting Attention Network for Multi-Modal Video Question Answering"", 《IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
ZHANG Yingying et al.: "Question Answering Method Based on Multi-modal Knowledge-aware Attention Mechanism", Journal of Computer Research and Development *

Also Published As

Publication number Publication date
CN113688296B (en) 2022-05-31

Similar Documents

Publication Publication Date Title
US11113479B2 (en) Utilizing a gated self-attention memory network model for predicting a candidate answer match to a query
CN112000818B (en) Text and image-oriented cross-media retrieval method and electronic device
CN108959246A (en) Answer selection method, device and electronic equipment based on improved attention mechanism
CN110019471A (en) Text is generated from structural data
CN109635197B (en) Searching method, searching device, electronic equipment and storage medium
CN114676234A (en) Model training method and related equipment
US11461613B2 (en) Method and apparatus for multi-document question answering
CN113688296B (en) Method for solving video question-answering task based on multi-mode progressive attention model
CN111488455A (en) Model training method, text classification method, system, device and medium
Mocialov et al. Transfer learning for british sign language modelling
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN113822125A (en) Processing method and device of lip language recognition model, computer equipment and storage medium
KR20210044697A (en) Ai based question and answer system and method
Liu et al. Cross-domain slot filling as machine reading comprehension: A new perspective
Tian et al. Tod-da: Towards boosting the robustness of task-oriented dialogue modeling on spoken conversations
CN112100360B (en) Dialogue response method, device and system based on vector retrieval
JP2013117683A (en) Voice recognizer, error tendency learning method and program
CN113204679B (en) Code query model generation method and computer equipment
CN114791950A (en) Method and device for classifying aspect-level emotions based on part-of-speech position and graph convolution network
WO2022029839A1 (en) Text generation program, text generation device and machine learning method
Petrovski et al. Embedding individual table columns for resilient SQL chatbots
Li et al. Dense semantic matching network for multi-turn conversation
Kipyatkova et al. Experimenting with attention mechanisms in joint CTC-attention models for Russian speech recognition
Yu et al. Empirical study on deep learning models for question answering
JP2020140674A (en) Answer selection device and program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant