CN114625849A - Context-aware progressive attention video question-answering method and system - Google Patents

Context-aware progressive attention video question-answering method and system

Info

Publication number
CN114625849A
CN114625849A (Application CN202210192397.3A)
Authority
CN
China
Prior art keywords
video
feature
attention
features
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210192397.3A
Other languages
Chinese (zh)
Inventor
周凡
张富为
林谋广
王若梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210192397.3A priority Critical patent/CN114625849A/en
Publication of CN114625849A publication Critical patent/CN114625849A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a context-aware progressive attention video question-answering method and system. The method comprises the following steps: a feature coding unit extracts the feature information in the video; a context-aware unit is constructed, and three attention modules perform multi-modal fusion and alignment; a model training unit is constructed and trained to generate answer prediction scores; finally, a target video is input to obtain the predicted answer. By constructing these three units, the method extracts multi-modal features, progressively fuses the question-related multi-modal information in the video through an attention mechanism, builds the context information among the modalities with a BiLSTM network, obtains the multi-modal information most relevant to the question with an attention mechanism, updates the multi-modal information in the memory unit with a BiLSTM network, and predicts the answer with a Softmax function, thereby improving the performance and precise localization capability of the video question-answering model.

Description

Context-aware progressive attention video question-answering method and system
Technical Field
The invention relates to the field of video question answering, in particular to a context-aware progressive attention video question answering method and system.
Background
Video question answering (VideoQA) is a fine-grained video understanding task that follows video description. Compared with the generalized descriptions produced by the video description task, video question answering must not only understand the visual content, text information and speech information, but also establish connections among the three modalities and reason over them, so the question-answering process requires more detailed descriptive information and a more complex reasoning process than video description; it is therefore important to study how to extract effective information from the large and growing volume of video. Video question-answering methods can be divided into rule-based methods and deep-learning-based methods. Rule-based methods first appeared in 2003: early methods treated the video as the content to be queried and the question as the query, located the relevant video content by retrieval, concentrated mainly on news video, modeled the video content by structuring it, and built a reasoning mechanism with a hidden Markov model (HMM); acquiring information from video in this way remains important and valuable, especially given the large amount of video produced today. Deep-learning-based video question-answering methods first appeared in 2016, with research concentrated on the corresponding data sets; because of the spatio-temporal characteristics of video question answering, constructing video question-answering data sets is itself a challenging task, which delayed progress in the field. As the data sets have gradually improved in recent years, video question-answering research has made new progress: some works explore spatial attention and temporal attention, some make breakthroughs in fusing static and dynamic features, and the dynamic memory network model from visual question answering has been extended. These networks extract useful video information and model its interactions well, achieving good performance. However, because of the complexity of the task, there is still considerable room to improve overall performance, and most current work in the field still focuses on integrating dynamic temporal information of the video and fusing multi-modal video features.
In one current prior art, Pendurkar S et al. construct an extensible attention-based multimodal fusion framework. To capture the relation between the dynamic information in video frames and the words in the question, it uses dense bidirectional LSTMs to repeatedly attend to the dynamic information and the question information and perform multimodal fusion, then adds additional weighting to keep the dimensions consistent, and finally uses an LSTM to attend to and filter the output words again and softmax to perform answer classification. Through the dense bidirectional LSTMs, the model can alleviate the long-standing problem of insufficient modeling of video information. Its disadvantage is that, because the attention features of the different modalities are extracted separately, the noise in the later fused vectors increases, which affects the final question-answering result.
In another current prior art, Kim J constructs a modality-shifting attention network (MSAN) consisting of the following components: (a) BERT-based embedded video and text representations; (b) a moment proposal network that localizes the temporal moment of interest for answering the question; (c) heterogeneous reasoning that infers the correct answer based on the localized moment; and (d) a modulation step that weights the outputs of (b) and (c) according to their importance. The model builds a heterogeneous attention reasoning mechanism by combining self-attention and co-attention, which alleviates the intra-modal and inter-modal fusion problem, and uses importance modulation to dynamically handle the fact that different question types rely on different modal information. Its disadvantage is that, because the modal information is combined by vector dot product or addition, the problem of blurred fusion of modal semantic features remains.
A third prior art, by Kim J, Ma M, Kim K, et al., constructs a progressive attention memory network (PAMN) consisting of the following components: (a) the question and candidate answers are first embedded into a common space, and the video and subtitles are embedded into a dual memory, which supports independent memory for each modality; (b) to infer the answer, a progressive attention mechanism determines the portion of time relevant to answering the question; (c) a dynamic modality fusion mechanism adaptively integrates the output of each memory by considering the contribution of each modality; (d) a belief-correction answering mechanism successively corrects the confidence of each candidate answer starting from its initial confidence. The model uses the question and the answers to filter the modal information of the video and subtitles and then dynamically fuses the filtered modal information with a soft attention mechanism, where the progressive attention uses the answer and question information to correct the confidence of the video and subtitle memories. By representing modal semantics through questions and answers, the model greatly alleviates inaccurate modal semantic representation, and the dynamic fusion module designed with soft attention, combined with the confidence-correction model, forms a reasoning mechanism that greatly improves the model's reasoning ability. Its disadvantage is that using both questions and answers for progressive attention introduces bias into the model, i.e., the model generalizes poorly.
Disclosure of Invention
The invention aims to overcome the defects of the existing methods and provides a context-aware progressive attention video question-answering method. The invention mainly addresses three problems: first, existing models suffer from bias and weak generalization ability; second, the fusion of modal semantic features is blurred; third, extracting attention features separately for different modalities increases the noise in the later fused vectors.
In order to solve the above problem, the present invention provides a method for video question answering with context-aware progressive attention, comprising:
the method comprises the steps of constructing a feature coding unit, inputting a video question and answer related data set, and outputting static features, dynamic features, question text features and caption text features in a video;
a context-aware unit is constructed, the static features, dynamic features, question text features and caption text features are input, and the corresponding multi-modal context enhancement features are output;
a model training unit is constructed, the multi-modal context enhancement features are input, and a cross-entropy loss function is used for training to obtain a pre-trained model;
and an answer prediction unit is constructed, the target question-answer video is input into the pre-trained model, and the corresponding predicted answer information is output.
Preferably, the feature coding unit takes a video question-answering data set as input and outputs the static features, dynamic features, question text features and caption text features of the video, specifically:
the feature coding unit consists of a video dynamic feature extraction module, a video static feature extraction module, a caption text feature extraction module and a question text feature extraction module; text features are word-embedded with a pre-trained BERT network model and encoded with a bidirectional recurrent network BiLSTM, static frame information in the video is extracted with a pre-trained VGG16 network model and encoded with the bidirectional recurrent network BiLSTM, and dynamic information in the video is extracted with a pre-trained C3D network model and encoded with the bidirectional recurrent network BiLSTM;
the video dynamic feature extraction module is constructed as follows: a video is first defined as V, its dynamic features are extracted by a pre-trained C3D network, and the dynamic feature m_i is taken from the last fully connected layer, where m_i (i = 1, 2, ..., N) is the ith video dynamic feature, giving the video dynamic features V_m = [m_1, m_2, ..., m_N] ∈ R^(4096×N), where N is the number of frames of the video; the video dynamic features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^m = BiLSTM(V_m), giving the encoded video dynamic features B^m = [b_1^m, b_2^m, ..., b_N^m], where b_i^m is the feature vector of the ith video dynamic feature after encoding, N is the number of frames of the video, and m denotes the video dynamic features;
the video static feature extraction module is constructed as follows: the static features of the video are extracted with a pre-trained VGG16 network model, sampling the static frames at 1 FPS, and the static feature a_i is taken from the seventh fully connected layer fc7 of the VGG16 network model, where a_i (i = 1, 2, ..., N) is the ith video static feature, giving the video static features V_a = [a_1, a_2, ..., a_N] ∈ R^(4096×N), where N is the number of frames of the video; the video static features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^a = BiLSTM(V_a), giving the encoded video static features B^a = [b_1^a, b_2^a, ..., b_N^a], where b_i^a is the feature vector of the ith video static feature after encoding, N is the number of frames of the video, and a denotes the video static features;
the caption text feature extraction module is constructed as follows: caption text features are extracted with a 12-layer pre-trained BERT network model, and the caption text feature c_i is taken from the second-to-last layer of the BERT network model, where c_i (i = 1, 2, ..., N) is the ith caption text feature, giving the video caption text features V_c = [c_1, c_2, ..., c_N] ∈ R^(768×N), where N is the number of frames of the video; the caption text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^c = BiLSTM(V_c), giving the encoded caption text features B^c = [b_1^c, b_2^c, ..., b_N^c], where b_i^c is the feature vector of the ith caption text feature after encoding, N is the number of frames of the video, and c denotes the video caption text features;
the question text feature extraction module is constructed as follows: a question is first defined as q, a 12-layer pre-trained BERT network model is selected to extract the question text features, and the question text feature q_j is taken from the second-to-last layer of the BERT network model, where q_j (j = 1, 2, ..., n) is the text feature of the jth question and n is the number of questions corresponding to each video in the data set, giving the video question text features V_q = [q_1, q_2, ..., q_n] ∈ R^(768×n), where n is the number of questions corresponding to each video in the different data sets; the question text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^q = BiLSTM(V_q), giving the encoded question text features B^q = [b_1^q, b_2^q, ..., b_n^q], where b_j^q is the feature vector of the jth question after encoding, n is the number of questions corresponding to each video in the different data sets, and q denotes the question text features.
Preferably, the context-aware unit takes the dynamic features, caption text features, static features and question text features as input and outputs the corresponding multi-modal context enhancement features, specifically:
a context-aware unit is constructed, and three Attention modules Q2A-Attention, QA2C-Attention and QAC2M-Attention are built with a Co-Attention mechanism;
the Attention module Q2A-Attention takes the question text feature vectors b_j^q and the static feature vectors b_i^a as input, where j denotes the jth question of each video and i denotes the ith static feature of each video, and the attention model is expressed as Att^qa = Attention(B^q, B^a); the Co-Attention mechanism aligns the multi-modal information and performs multi-modal fusion and alignment with a soft alignment matrix, where each entry S_(r,c) of the soft alignment matrix is a product of ReLU-transformed features, r being the row of the matrix and c the column; correlation attention is then applied to the video static features, the attention weights are used to generate the features u_a^qa and u_q^qa, and finally the generated features are concatenated into u_aq; the attention mechanism is shown by the following formulas:
S_(r,c) = ReLU(w_a b_r^a) · ReLU(w_q b_c^q)
u_a^qa = softmax(S) B^q
u_q^qa = softmax(S^T) B^a
u_aq = [u_a^qa, u_q^qa]
where S_(r,c) is the soft alignment matrix with the softmax taken over the aligned dimension, r is the row of the matrix, c is the column of the matrix, w_a and w_q are jointly learned training weights, b_i^a and b_j^q are the input feature vectors, u_a^qa and u_q^qa are the corresponding feature vectors after correlation attention, and [·,·] is the concatenation operation; correlation attention is applied to the feature vectors through the soft alignment matrix to obtain the attended feature vectors, which are then joined by the concatenation operation;
the Attention module QA2C-Attention takes the features u_aq after the previous attention layer and the caption text feature vectors b_i^c as input, where i denotes the ith first-layer attended feature and the ith caption text feature of each video, and the attention model is expressed as Att^aqc = Attention(u_aq, B^c); correlation attention is then learned over the caption text features, the attention weights are used to generate the features u_aq^aqc and u_c^aqc, and finally the generated features are concatenated, which can be expressed as:
u_aqc = [u_aq^aqc, u_c^aqc]
the Attention module QAC2M-Attention takes the output features u_aqc of the second layer and the dynamic feature vectors b_i^m as input, where i denotes the ith attended output feature and the ith dynamic feature of each video, and the attention model is expressed as Att^aqcm = Attention(u_aqc, B^m); correlation attention weights are then learned over the video dynamic features, the attention weights are used to generate the features u_aqc^aqcm and u_m^aqcm, and finally the generated features are concatenated, which can be expressed as:
u_aqcm = [u_aqc^aqcm, u_m^aqcm]
the multi-modal features in the video are attended to progressively to obtain their correlation features, and finally a bidirectional recurrent network BiLSTM is used to model the context information among the modalities, giving the multi-modal context enhancement features;
the context information among the modalities is modeled with the bidirectional recurrent network BiLSTM by first feeding the correlation information of the different modalities into the corresponding BiLSTM for self-correlation updating, and then using the updated information as the input of the BiLSTMs to update the multi-modal context information, which can be expressed as:
d_aq = BiLSTM_aq(u_aq), d_aqc = BiLSTM_aqc(u_aqc)
d_aqcm = BiLSTM_aqcm(u_aqcm), d_aq;aqc = BiLSTM_aqc(d_aq)
d_aq;aqcm = BiLSTM_aqcm(d_aq), d_aqc;aq = BiLSTM_aq(d_aqc)
d_aqc;aqcm = BiLSTM_aqcm(d_aqc), d_aqcm;aq = BiLSTM_aq(d_aqcm)
d_aqcm;aqc = BiLSTM_aqc(d_aqcm)
and finally the outputs of the bidirectional recurrent network BiLSTM encoding layers are joined with a concatenation operation, as follows:
H = Concatenate([d_aq;aqc, d_aq;aqcm, d_aqc;aqcm, d_aqc;aq, d_aqcm;aqc, d_aqcm;aq]).
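For illustration only, one such co-attention block could be sketched as follows, assuming the soft alignment matrix is the inner product of ReLU-transformed linear projections and that the two attended sequences are concatenated along the sequence dimension before the BiLSTM context layers; the class name CoAttention and the dimensional choices are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Soft-alignment co-attention between two encoded feature sequences."""
    def __init__(self, dim: int = 500):
        super().__init__()
        self.w_x = nn.Linear(dim, dim, bias=False)  # projection for the first input sequence
        self.w_y = nn.Linear(dim, dim, bias=False)  # projection for the second input sequence

    def forward(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        # X: (batch, N, dim), Y: (batch, n, dim)
        S = torch.relu(self.w_x(X)) @ torch.relu(self.w_y(Y)).transpose(1, 2)  # soft alignment matrix (batch, N, n)
        u_x = torch.softmax(S, dim=2) @ Y                   # Y-attended features for each row of X
        u_y = torch.softmax(S, dim=1).transpose(1, 2) @ X   # X-attended features for each row of Y
        # Concatenate the two attended sequences along the sequence axis (an assumption),
        # so that the result can be fed to the later BiLSTM context layers.
        return torch.cat([u_x, u_y], dim=1)

# Hypothetical usage for the first module Q2A-Attention.
B_a = torch.randn(1, 32, 500)   # encoded static features (N = 32 frames)
B_q = torch.randn(1, 5, 500)    # encoded question features (n = 5 questions)
u_aq = CoAttention(dim=500)(B_a, B_q)   # (1, 37, 500)
```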
Preferably, the model training unit takes the multi-modal context enhancement features as input and is trained with a cross-entropy loss function to obtain a pre-trained model, specifically:
a model training unit is constructed and a fourth Attention module is built with the Co-Attention mechanism; the question feature vectors b_j^q and the concatenated features H_i are taken as the input of this Attention module Q2A_Attention, where j denotes the jth question of each video and i denotes the ith output feature after the context-aware unit, and the attention model is expressed as Att^qH = Attention(B^q, H); correlation attention weights are then learned over the multi-modal context information, the attention weights are used to generate the features u_q^qH and u_H^qH, the generated features are concatenated, and the memory unit is updated with the bidirectional recurrent network BiLSTM, which can be expressed as:
u_qH = [u_q^qH, u_H^qH]
B_qH = BiLSTM(u_qH)
finally, the final modal vector h_FC = FC(B_qH) is obtained after a fully connected layer, the softmax function is used to convert the modal vector h_FC into answer prediction scores, the largest prediction score among the final answer prediction scores is selected as the answer, and the prediction scores of the candidate answers are obtained by the formula Pro = softmax(h_FC); finally, training with a cross-entropy loss L = -Σ_k y_k log(Pro_k) yields the pre-trained model, where y_k denotes the label of the kth candidate answer.
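For illustration only, the answer-prediction head and the cross-entropy training step could be sketched as follows, under the assumption that the BiLSTM output is mean-pooled before the fully connected layer (the pooling is not specified above); the names AnswerHead and num_answers are illustrative.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Memory update with a BiLSTM, then a fully connected layer producing answer scores."""
    def __init__(self, dim: int = 500, num_answers: int = 5):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(dim, num_answers)

    def forward(self, u_qH: torch.Tensor) -> torch.Tensor:
        B_qH, _ = self.bilstm(u_qH)        # (batch, seq, dim)
        h_FC = self.fc(B_qH.mean(dim=1))   # pooled modal vector -> answer logits
        return h_FC

head = AnswerHead(dim=500, num_answers=5)
u_qH = torch.randn(8, 42, 500)       # attended question/context features (dummy batch)
labels = torch.randint(0, 5, (8,))   # index of the correct candidate answer

h_FC = head(u_qH)
loss = nn.CrossEntropyLoss()(h_FC, labels)   # cross-entropy training objective
Pro = torch.softmax(h_FC, dim=-1)            # Pro = softmax(h_FC)
prediction = Pro.argmax(dim=-1)              # answer with the largest prediction score
loss.backward()
```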
Accordingly, the present invention also provides a video question-answering system of context-aware progressive attention, comprising:
the feature coding unit is used for extracting video dynamic features, video static features, caption text features and question text features and encoding them with a bidirectional recurrent network BiLSTM;
the context enhancement unit is used for performing multi-modal fusion and alignment on the extracted features to obtain multi-modal context enhancement features;
and the model training and predicting unit is used for training the model and predicting the answer.
The implementation of the invention has the following beneficial effects:
first, the present invention provides a context-aware progressive attention video question-answering method. Capturing relevance among the multiple modes through progressive attention to the multiple modes of information in the video, and modeling context information among the multiple modes through bidirectional BilSTM; secondly, the invention utilizes the correlation among multi-mode information to improve the performance of the video question-answering model; third, inspired by the attention mechanism and the two-way LSTM network, the present invention employs the attention mechanism to build a fusion alignment module between the multiple modalities and utilizes the two-way LSTM network to model context information between the modalities.
Drawings
FIG. 1 is a general flow diagram of a method for context-aware progressive attention video question answering in accordance with an embodiment of the present invention;
FIG. 2 is a diagram of a context-aware progressive-attention video question-answer model according to an embodiment of the present invention;
FIG. 3 is a block diagram of a context-aware progressive attention video question-answering system in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings; it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention fall within the scope of protection of the present invention.
Fig. 1 is a general flowchart of a video question answering method for context-aware progressive attention according to an embodiment of the present invention, as shown in fig. 1, the method includes:
s1, constructing a feature coding unit, inputting the feature coding unit into a video question and answer related data set, and outputting the feature coding unit into a static feature, a dynamic feature, a question text feature and a caption text feature in a video;
s2, constructing a context-aware unit, inputting the static features, dynamic features, question text features and caption text features, and outputting the corresponding multi-modal context enhancement features;
s3, constructing a model training unit, inputting the multi-modal context enhancement features, and training with a cross-entropy loss function to obtain a pre-trained model;
s4, constructing an answer prediction unit, inputting the target question-answer video into the pre-trained model, and outputting the corresponding predicted answer information.
Step S1 is specifically as follows:
S1-1: FIG. 2 shows the context-aware progressive attention video question-answering model. First, a video question-answering data set from the current video question-answering field, such as TVQA or LifeQA, is input, and the encoded features of the different modalities are output. The invention embeds the text information with a pre-trained BERT network model and performs feature encoding with a BiLSTM, which constitutes the text feature extraction network of the invention; at the same time, the static feature information and dynamic feature information in the video are extracted with the pre-trained VGG16 and C3D network models respectively and encoded with a BiLSTM, which constitutes the video static feature extraction network and the video dynamic feature extraction network of the invention.
S1-2: a video dynamic feature extraction module is constructed. A video is first defined as V, its dynamic features are extracted by a pre-trained C3D network, and the dynamic feature m_i is taken from the last fully connected layer, where m_i (i = 1, 2, ..., N) is the ith video dynamic feature, giving the video dynamic features V_m = [m_1, m_2, ..., m_N] ∈ R^(4096×N), where N is the number of frames of the video. The video dynamic features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^m = BiLSTM(V_m), giving the encoded video dynamic features B^m = [b_1^m, b_2^m, ..., b_N^m], where b_i^m is the feature vector of the ith video dynamic feature after encoding, N is the number of frames of the video, and m denotes the video dynamic features.
S1-3: a video static feature extraction module is constructed. The static features of the video are extracted with a pre-trained VGG16 network model, sampling the static frames at 1 FPS, and the static feature a_i is taken from the seventh fully connected layer fc7 of the VGG16 network model, where a_i (i = 1, 2, ..., N) is the ith video static feature, giving the video static features V_a = [a_1, a_2, ..., a_N] ∈ R^(4096×N), where N is the number of frames of the video. The video static features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^a = BiLSTM(V_a), giving the encoded video static features B^a = [b_1^a, b_2^a, ..., b_N^a], where b_i^a is the feature vector of the ith video static feature after encoding, N is the number of frames of the video, and a denotes the video static features.
S1-4: a caption text feature extraction module is constructed. Caption text features are extracted with a 12-layer pre-trained BERT network model, and the caption text feature c_i is taken from the second-to-last layer of the BERT network model, where c_i (i = 1, 2, ..., N) is the ith caption text feature, giving the video caption text features V_c = [c_1, c_2, ..., c_N] ∈ R^(768×N), where N is the number of frames of the video. The caption text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^c = BiLSTM(V_c), giving the encoded caption text features B^c = [b_1^c, b_2^c, ..., b_N^c], where b_i^c is the feature vector of the ith caption text feature after encoding, N is the number of frames of the video, and c denotes the video caption text features.
S1-5: a question text feature extraction module is constructed. A question is first defined as q, a 12-layer pre-trained BERT network model is selected to extract the question text features, and the question text feature q_j is taken from the second-to-last layer of the BERT network model, where q_j (j = 1, 2, ..., n) is the text feature of the jth question and n is the number of questions corresponding to each video in the data set, giving the video question text features V_q = [q_1, q_2, ..., q_n] ∈ R^(768×n), where n is the number of questions corresponding to each video in the different data sets. The question text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^q = BiLSTM(V_q), giving the encoded question text features B^q = [b_1^q, b_2^q, ..., b_n^q], where b_j^q is the feature vector of the jth question after encoding, n is the number of questions corresponding to each video in the different data sets, and q denotes the question text features.
Step S2 is specifically as follows:
S2-1: the context-aware unit takes the features encoded in S1 as input and outputs the context-enhanced attended features. In order to capture the multi-modal correlations in the video, the invention designs three Attention modules in this unit with the Co-Attention mechanism. To attend to the multi-modal features progressively, the question is first attended jointly with the video static features, corresponding to the module Q2A-Attention; the attended correlation information is then used to attend to the video captions, corresponding to QA2C-Attention; finally the attended correlation information is used to attend to the video dynamic features, corresponding to QAC2M-Attention. The correlation information among the modalities is thereby obtained, a bidirectional BiLSTM encoding layer is then constructed to model the context information among the multi-modal information, and finally all the outputs are concatenated.
S2-2: for the first Attention module Q2A-Attention, the invention takes the question feature vectors b_j^q and the video static features b_i^a as the input of the Attention module Q2A-Attention, where j denotes the jth question of each video and i denotes the ith static feature of each video, and the attention model is expressed as Att^qa = Attention(B^q, B^a). The Co-Attention mechanism aligns the multi-modal information and performs multi-modal fusion and alignment with a soft alignment matrix, where each entry S_(r,c) of the soft alignment matrix is a product of ReLU-transformed features, r being the row of the matrix and c the column; correlation attention is then applied to the video static features, the attention weights are used to generate the features u_a^qa and u_q^qa, and finally the generated features are concatenated into u_aq. The attention mechanism is shown by the following formulas:
S_(r,c) = ReLU(w_a b_r^a) · ReLU(w_q b_c^q)
u_a^qa = softmax(S) B^q
u_q^qa = softmax(S^T) B^a
u_aq = [u_a^qa, u_q^qa]
where S_(r,c) is the soft alignment matrix with the softmax taken over the aligned dimension, r is the row of the matrix, c is the column of the matrix, w_a and w_q are jointly learned training weights, b_i^a and b_j^q are the input feature vectors, u_a^qa and u_q^qa are the corresponding feature vectors after correlation attention, and [·,·] is the concatenation operation; correlation attention is applied to the feature vectors through the soft alignment matrix to obtain the attended feature vectors, which are then joined by the concatenation operation to keep the outputs of the feature vectors aligned.
S2-3: the second attention module of the invention uses the same attention mechanism to perform the fusion and alignment operation on the caption text features. The features u_aq after the previous attention layer and the caption text features b_i^c are first taken as the input of QA2C_Attention, where i denotes the ith first-layer attended feature and the ith caption text feature of each video, and the attention model is expressed as Att^aqc = Attention(u_aq, B^c); correlation attention is then learned over the caption text features, the attention weights are used to generate the features u_aq^aqc and u_c^aqc, and finally the generated features are concatenated, which can be expressed as:
u_aqc = [u_aq^aqc, u_c^aqc]
S2-4: the invention designs the third Attention module in the same way and performs the fusion and alignment operation on the video dynamic features with the Co-Attention mechanism. The output features u_aqc of the second layer and the video dynamic features b_i^m are first taken as the input of QAC2M_Attention, where i denotes the ith attended output feature and the ith dynamic feature of each video, and the attention model is expressed as Att^aqcm = Attention(u_aqc, B^m); correlation attention weights are then learned over the video dynamic features, the attention weights are used to generate the features u_aqc^aqcm and u_m^aqcm, and finally the generated features are concatenated, which can be expressed as:
u_aqcm = [u_aqc^aqcm, u_m^aqcm]
S2-5: by progressively attending to the multi-modal features in the video to obtain their correlation features, the invention models the context information among the modalities with BiLSTMs: the correlation information of the different modalities is first fed into the corresponding BiLSTMs for self-correlation updating, and the updated information is then used as the input of the BiLSTMs to update the multi-modal context information, which can be expressed as:
d_aq = BiLSTM_aq(u_aq), d_aqc = BiLSTM_aqc(u_aqc)
d_aqcm = BiLSTM_aqcm(u_aqcm), d_aq;aqc = BiLSTM_aqc(d_aq)
d_aq;aqcm = BiLSTM_aqcm(d_aq), d_aqc;aq = BiLSTM_aq(d_aqc)
d_aqc;aqcm = BiLSTM_aqcm(d_aqc), d_aqcm;aq = BiLSTM_aq(d_aqcm)
d_aqcm;aqc = BiLSTM_aqc(d_aqcm)
and finally the outputs of the BiLSTM encoding layers are joined with a concatenation operation, as follows:
H = Concatenate([d_aq;aqc, d_aq;aqcm, d_aqc;aqcm, d_aqc;aq, d_aqcm;aqc, d_aqcm;aq])
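For illustration only, steps S2-2 through S2-5 could be wired together roughly as follows, repeating the soft-alignment co-attention block sketched earlier for completeness; the module names ContextAwareUnit and CoAttention, the shared feature dimension and the sequence-axis concatenations are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Soft-alignment co-attention block (repeated from the earlier sketch for completeness)."""
    def __init__(self, dim: int = 500):
        super().__init__()
        self.w_x = nn.Linear(dim, dim, bias=False)
        self.w_y = nn.Linear(dim, dim, bias=False)

    def forward(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
        S = torch.relu(self.w_x(X)) @ torch.relu(self.w_y(Y)).transpose(1, 2)  # soft alignment matrix
        u_x = torch.softmax(S, dim=2) @ Y
        u_y = torch.softmax(S, dim=1).transpose(1, 2) @ X
        return torch.cat([u_x, u_y], dim=1)   # concatenation along the sequence axis (assumed)

class ContextAwareUnit(nn.Module):
    """Progressive attention (Q2A -> QA2C -> QAC2M) followed by cross-modal context BiLSTMs."""
    def __init__(self, dim: int = 500):
        super().__init__()
        self.q2a, self.qa2c, self.qac2m = CoAttention(dim), CoAttention(dim), CoAttention(dim)
        self.bilstm_aq = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.bilstm_aqc = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.bilstm_aqcm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, B_a, B_q, B_c, B_m):
        u_aq = self.q2a(B_a, B_q)        # question <-> static frames
        u_aqc = self.qa2c(u_aq, B_c)     # previous output <-> captions
        u_aqcm = self.qac2m(u_aqc, B_m)  # previous output <-> dynamic features
        run = lambda lstm, x: lstm(x)[0]
        d_aq, d_aqc, d_aqcm = run(self.bilstm_aq, u_aq), run(self.bilstm_aqc, u_aqc), run(self.bilstm_aqcm, u_aqcm)
        # H = Concatenate([d_aq;aqc, d_aq;aqcm, d_aqc;aqcm, d_aqc;aq, d_aqcm;aqc, d_aqcm;aq])
        H = torch.cat([run(self.bilstm_aqc, d_aq), run(self.bilstm_aqcm, d_aq),
                       run(self.bilstm_aqcm, d_aqc), run(self.bilstm_aq, d_aqc),
                       run(self.bilstm_aqc, d_aqcm), run(self.bilstm_aq, d_aqcm)], dim=1)
        return H

unit = ContextAwareUnit(dim=500)
B_a, B_q = torch.randn(1, 32, 500), torch.randn(1, 5, 500)
B_c, B_m = torch.randn(1, 32, 500), torch.randn(1, 32, 500)
H = unit(B_a, B_q, B_c, B_m)   # multi-modal context enhancement features
```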
step S3 is specifically as follows:
S3-1: the model training unit takes the attention-enhanced context features of S2 as input and outputs the corresponding candidate answer index. In order to obtain the multi-modal context information most relevant to the question, the invention constructs a fourth Attention module in this unit with the Co-Attention mechanism. The question feature vectors b_j^q and the concatenated features H_i are taken as the input of the Attention module Q2A_Attention, where j denotes the jth question of each video and i denotes the ith output feature after the context-aware unit, and the attention model is expressed as Att^qH = Attention(B^q, H); correlation attention weights are then learned over the multi-modal context information, the attention weights are used to generate the features u_q^qH and u_H^qH, the generated features are concatenated, and the memory unit is updated with the BiLSTM, which can be expressed as:
u_qH = [u_q^qH, u_H^qH]
B_qH = BiLSTM(u_qH)
S3-2: finally, the final modal vector h_FC = FC(B_qH) is obtained after a fully connected layer, softmax converts the feature vector h_FC into answer prediction scores, the largest prediction score among the final answer prediction scores is selected as the answer, and the prediction scores of the candidate answers are obtained by the formula Pro = softmax(h_FC); finally, the invention trains with a cross-entropy loss L = -Σ_k y_k log(Pro_k), where y_k denotes the label of the kth candidate answer.
Step S4 is specifically as follows:
S4-1: the answer prediction unit obtains the pre-trained model through the above training and predicts the final answer index with argmax.
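As a minimal usage illustration of this step, with a dummy logit vector standing in for h_FC:

```python
import torch

h_FC = torch.tensor([[0.2, 1.7, -0.3, 0.9, 0.1]])   # dummy scores for five candidate answers
Pro = torch.softmax(h_FC, dim=-1)                    # Pro = softmax(h_FC)
answer_index = int(Pro.argmax(dim=-1))               # predicted answer index via argmax
```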
Accordingly, the present invention also provides a video question-answering system with context-aware progressive attention, as shown in fig. 3, comprising:
and the feature coding unit 1 is used for extracting video dynamic features, video static features, subtitle text features and problem text features and coding by using a bidirectional circulation network BilSTM.
The method specifically comprises a video dynamic feature extraction module, a video static feature extraction module, a caption text feature extraction module and a problem text feature extraction module, wherein word embedding is carried out on text features by utilizing a pre-trained BERT network model, coding is carried out by utilizing a bidirectional circulation network BilSTM, static frame information in the video is extracted by utilizing a pre-trained VGG16 network model, coding is carried out by utilizing the bidirectional circulation network BilSTM, dynamic information in the video is extracted by utilizing a pre-trained C3D network model, and coding is carried out by utilizing the bidirectional circulation network BilSTM.
And the context enhancement unit 2 is used for performing multi-mode fusion and alignment on the extracted features to obtain multi-mode context enhancement features.
Specifically, a context sensing unit is constructed, three Attention modules Q2A-Attention, QA2C-Attention and QAC2M-Attention are constructed by utilizing a Co-Attention mechanism, the dynamic feature, the subtitle text feature, the static feature and the question text feature are input, and the dynamic feature, the subtitle text feature, the static feature and the question text feature are output as corresponding multi-modal context enhancement features.
And the model training and predicting unit 3 is used for training the model and predicting the answer.
Specifically, a model training unit is constructed, a fourth Attention module is constructed by using a Co-Attention mechanism to obtain a final modal vector, a softmax function is used for converting the modal vector into an answer prediction score, the maximum prediction score is selected from the final answer prediction scores to serve as an answer, and a formula is calculated: pro ═ softmax (h)FC) And finally, training by adopting a cross entropy loss function to obtain a pre-training model, inputting the target question-answer video into the pre-training model, and outputting the target question-answer video as corresponding predicted answer information.
Therefore, the invention firstly uses a pre-trained network to respectively extract static information and dynamic information characteristics in a video, utilizes a BERT network to embed caption information and question-answer pair information in a data set, then respectively utilizes a BilSTM network to encode multi-mode information, then utilizes a Co-Attention mechanism to construct a progressive Attention module, performs fusion alignment operation on the multi-mode information, utilizes a bidirectional BilSTM network to construct a context coding layer, acquires correlation information between modes, uses the Attention mode again to obtain the multi-mode information most relevant to a question, and finally utilizes the BilSTM to update a memory unit and connects a full connection layer and a Softmax layer to predict an answer.
The context-aware progressive attention video question-answering method and system provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the above description of the embodiments is only intended to help understand the method and its core ideas; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (8)

1. A context-aware progressive attention video question-answering method, the method comprising:
the method comprises the steps of constructing a feature coding unit, inputting a video question and answer related data set, and outputting static features, dynamic features, question text features and caption text features in a video;
a context-aware unit is constructed, the static features, dynamic features, question text features and caption text features are input, and the corresponding multi-modal context enhancement features are output;
a model training unit is constructed, the multi-modal context enhancement features are input, and a cross-entropy loss function is used for training to obtain a pre-trained model;
and an answer prediction unit is constructed, the target question-answer video is input into the pre-trained model, and the corresponding predicted answer information is output.
2. The method as claimed in claim 1, wherein the feature coding unit takes a video question-answering data set as input and outputs the static features, dynamic features, question text features and caption text features of the video, specifically:
the feature coding unit consists of a video dynamic feature extraction module, a video static feature extraction module, a caption text feature extraction module and a question text feature extraction module; text features are word-embedded with a pre-trained BERT network model and encoded with a bidirectional recurrent network BiLSTM, static frame information in the video is extracted with a pre-trained VGG16 network model and encoded with the bidirectional recurrent network BiLSTM, and dynamic information in the video is extracted with a pre-trained C3D network model and encoded with the bidirectional recurrent network BiLSTM;
the video dynamic feature extraction module is constructed as follows: a video is defined as V, its dynamic features are extracted by a pre-trained C3D network, and the dynamic feature m_i is taken from the last fully connected layer, where m_i (i = 1, 2, ..., N) is the ith video dynamic feature, giving the video dynamic features V_m = [m_1, m_2, ..., m_N] ∈ R^(4096×N), where N is the number of frames of the video; the video dynamic features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^m = BiLSTM(V_m), giving the encoded video dynamic features B^m = [b_1^m, b_2^m, ..., b_N^m], where b_i^m is the feature vector of the ith video dynamic feature after encoding, N is the number of frames of the video, and m denotes the video dynamic features;
the video static feature extraction module is constructed as follows: the static features of the video are extracted with a pre-trained VGG16 network model, sampling the static frames at 1 FPS, and the static feature a_i is taken from the seventh fully connected layer fc7 of the VGG16 network model, where a_i (i = 1, 2, ..., N) is the ith video static feature, giving the video static features V_a = [a_1, a_2, ..., a_N] ∈ R^(4096×N), where N is the number of frames of the video; the video static features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^a = BiLSTM(V_a), giving the encoded video static features B^a = [b_1^a, b_2^a, ..., b_N^a], where b_i^a is the feature vector of the ith video static feature after encoding, N is the number of frames of the video, and a denotes the video static features;
the caption text feature extraction module is constructed as follows: caption text features are extracted with a 12-layer pre-trained BERT network model, and the caption text feature c_i is taken from the second-to-last layer of the BERT network model, where c_i (i = 1, 2, ..., N) is the ith caption text feature, giving the video caption text features V_c = [c_1, c_2, ..., c_N] ∈ R^(768×N), where N is the number of frames of the video; the caption text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^c = BiLSTM(V_c), giving the encoded caption text features B^c = [b_1^c, b_2^c, ..., b_N^c], where b_i^c is the feature vector of the ith caption text feature after encoding, N is the number of frames of the video, and c denotes the video caption text features;
the question text feature extraction module is constructed as follows: a question is first defined as q, a 12-layer pre-trained BERT network model is selected to extract the question text features, and the question text feature q_j is taken from the second-to-last layer of the BERT network model, where q_j (j = 1, 2, ..., n) is the text feature of the jth question and n is the number of questions corresponding to each video in the data set, giving the video question text features V_q = [q_1, q_2, ..., q_n] ∈ R^(768×n), where n is the number of questions corresponding to each video in the different data sets; the question text features are encoded with a 250-dimensional bidirectional recurrent network BiLSTM, B^q = BiLSTM(V_q), giving the encoded question text features B^q = [b_1^q, b_2^q, ..., b_n^q], where b_j^q is the feature vector of the jth question after encoding, n is the number of questions corresponding to each video in the different data sets, and q denotes the question text features.
3. The method as claimed in claim 2, wherein the context-aware unit takes the dynamic features, caption text features, static features and question text features as input and outputs the corresponding multi-modal context enhancement features, specifically:
a context-aware unit is constructed, and three Attention modules Q2A-Attention, QA2C-Attention and QAC2M-Attention are built with a Co-Attention mechanism;
wherein the Attention module Q2A-Attention takes the question text feature vectors b_j^q and the static feature vectors b_i^a as input, where j denotes the jth question of each video and i denotes the ith static feature of each video, and the attention model is expressed as Att^qa = Attention(B^q, B^a); the Co-Attention mechanism aligns the multi-modal information and performs multi-modal fusion and alignment with a soft alignment matrix, where each entry S_(r,c) of the soft alignment matrix is a product of ReLU-transformed features, r being the row of the matrix and c the column; correlation attention is then applied to the video static features, the attention weights are used to generate the features u_a^qa and u_q^qa, and finally the generated features are concatenated into u_aq; the attention mechanism is shown by the following formulas:
S_(r,c) = ReLU(w_a b_r^a) · ReLU(w_q b_c^q)
u_a^qa = softmax(S) B^q
u_q^qa = softmax(S^T) B^a
u_aq = [u_a^qa, u_q^qa]
where S_(r,c) is the soft alignment matrix with the softmax taken over the aligned dimension, r is the row of the matrix, c is the column of the matrix, w_a and w_q are jointly learned training weights, b_i^a and b_j^q are the input feature vectors, u_a^qa and u_q^qa are the corresponding feature vectors after correlation attention, and [·,·] is the concatenation operation; correlation attention is applied to the feature vectors through the soft alignment matrix to obtain the attended feature vectors, which are then joined by the concatenation operation;
wherein the Attention module QA2C-Attention takes the features u_aq after the previous attention layer and the caption text feature vectors b_i^c as input, where i denotes the ith first-layer attended feature and the ith caption text feature of each video, and the attention model is expressed as Att^aqc = Attention(u_aq, B^c); correlation attention is then learned over the caption text features, the attention weights are used to generate the features u_aq^aqc and u_c^aqc, and finally the generated features are concatenated, which can be expressed as:
u_aqc = [u_aq^aqc, u_c^aqc]
wherein the Attention module QAC2M-Attention takes the output features u_aqc of the second layer and the dynamic feature vectors b_i^m as input, where i denotes the ith attended output feature and the ith dynamic feature of each video, and the attention model is expressed as Att^aqcm = Attention(u_aqc, B^m); correlation attention weights are then learned over the video dynamic features, the attention weights are used to generate the features u_aqc^aqcm and u_m^aqcm, and finally the generated features are concatenated, which can be expressed as:
u_aqcm = [u_aqc^aqcm, u_m^aqcm]
the multi-modal features in the video are attended to progressively to obtain their correlation features, and finally a bidirectional recurrent network BiLSTM is used to model the context information among the modalities, giving the multi-modal context enhancement features;
the context information among the modalities is modeled with the bidirectional recurrent network BiLSTM by first feeding the correlation information of the different modalities into the corresponding BiLSTM for self-correlation updating, and then using the updated information as the input of the BiLSTMs to update the multi-modal context information, which can be expressed as:
d_aq = BiLSTM_aq(u_aq), d_aqc = BiLSTM_aqc(u_aqc)
d_aqcm = BiLSTM_aqcm(u_aqcm), d_aq;aqc = BiLSTM_aqc(d_aq)
d_aq;aqcm = BiLSTM_aqcm(d_aq), d_aqc;aq = BiLSTM_aq(d_aqc)
d_aqc;aqcm = BiLSTM_aqcm(d_aqc), d_aqcm;aq = BiLSTM_aq(d_aqcm)
d_aqcm;aqc = BiLSTM_aqc(d_aqcm)
and finally the outputs of the bidirectional recurrent network BiLSTM encoding layers are joined with a concatenation operation, as follows:
H = Concatenate([d_aq;aqc, d_aq;aqcm, d_aqc;aqcm, d_aqc;aq, d_aqcm;aqc, d_aqcm;aq]).
4. the method according to claim 3, wherein the input of the model building training unit is the multi-modal context enhancement feature, and the model building training unit is trained by using a cross entropy loss function to obtain a pre-training model, specifically:
constructing a model training unit, constructing a fourth attention module with the Co-Attention mechanism, and taking the question feature vector q_j and the concatenated feature H_i as input of the attention module Q2A-Attention, where j denotes the j-th question of each video and i denotes the i-th output feature after the context-aware unit; the attention model is expressed as A_qH = Attention(q_j, H_i); a correlation attention weight is then learned over the multi-modal context information, the attention weights are used to generate the attended features q'_j and H'_i, the generated features are concatenated, and the memory cells are updated with the bidirectional recurrent network BiLSTM, which can be expressed as:
u_qH = Concatenate([q'_j, H'_i]),    B_qH = BiLSTM(u_qH)
finally, the final modal vector h_FC = FC(B_qH) is obtained after a fully connected layer; the softmax function converts the modal vector h_FC into answer prediction scores via Prob = softmax(h_FC), the maximum prediction score among the final answer prediction scores is selected as the answer, and a cross-entropy loss function over the candidate answers a_k is used for training to obtain the pre-trained model, where a_k denotes the k-th candidate answer.
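The snippet below is a hedged sketch of this prediction and training step in PyTorch; the layer sizes, the use of the last BiLSTM time step as h_FC, and the number of candidate answers are assumptions.

import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    # Sketch: BiLSTM over the fused question/context feature u_qH, a fully
    # connected layer producing h_FC, and softmax scores over candidate answers.
    def __init__(self, dim=1024, hidden=256, num_answers=5):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_answers)

    def forward(self, u_qH):                                # (batch, seq, dim)
        b_qH, _ = self.bilstm(u_qH)
        return self.fc(b_qH[:, -1])                         # logits playing the role of h_FC

head = AnswerHead()
logits = head(torch.randn(8, 20, 1024))                     # toy batch of fused features
prob = torch.softmax(logits, dim=-1)                        # Prob = softmax(h_FC)
answer = prob.argmax(dim=-1)                                # highest-scoring candidate answer
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 5, (8,)))  # cross-entropy training loss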
5. A context-aware progressive attention video question-answering system, the system comprising:
a feature encoding unit, configured to extract video dynamic features, video static features, subtitle text features and question text features, and to encode them with a bidirectional recurrent network BiLSTM;
a context enhancement unit, configured to perform multi-modal fusion and alignment on the extracted features to obtain multi-modal context-enhanced features; and
a model training and prediction unit, configured to train the model and predict the answer.
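As a rough composition of the three claimed units, the skeleton below chains a feature encoder, a context enhancer and a prediction head; the interfaces are hypothetical and only indicate how data would flow between the units.

import torch.nn as nn

class VideoQASystem(nn.Module):
    # Skeleton of the claimed system: feature encoding unit -> context
    # enhancement unit -> model training and prediction unit.
    def __init__(self, encoder, context_enhancer, answer_head):
        super().__init__()
        self.encoder = encoder
        self.context_enhancer = context_enhancer
        self.answer_head = answer_head

    def forward(self, frames, clips, subtitles, question):
        motion, static, caption, query = self.encoder(frames, clips, subtitles, question)
        fused = self.context_enhancer(query, static, caption, motion)  # multi-modal context feature
        return self.answer_head(fused)                                 # answer logits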
6. The system according to claim 5, wherein the feature encoding unit comprises a video dynamic feature extraction module, a video static feature extraction module, a subtitle text feature extraction module and a question text feature extraction module; the feature encoding unit performs word embedding on the text features with a pre-trained BERT network model and encodes them with a bidirectional recurrent network BiLSTM, extracts static frame information from the video with a pre-trained VGG16 network model and encodes it with a BiLSTM, and extracts dynamic information from the video with a pre-trained C3D network model and encodes it with a BiLSTM.
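A minimal sketch of such a feature encoding unit is shown below, assuming the Hugging Face transformers package for BERT and torchvision for VGG16; no standard pretrained C3D ships with torchvision, so the C3D backbone is passed in as a placeholder, and all hidden sizes are illustrative.

import torch
import torch.nn as nn
import torchvision.models as models
from transformers import BertModel, BertTokenizer

class FeatureEncoder(nn.Module):
    # Sketch of the feature encoding unit: BERT embeddings for text, VGG16 for
    # static frames, a placeholder C3D backbone for clips, each followed by its
    # own BiLSTM encoder.
    def __init__(self, c3d_backbone, hidden=256):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.vgg_features = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                          *list(vgg.classifier.children())[:-1])  # 4096-d fc7
        self.c3d = c3d_backbone                       # placeholder producing 4096-d clip features
        self.text_bilstm   = nn.LSTM(768,  hidden, batch_first=True, bidirectional=True)
        self.static_bilstm = nn.LSTM(4096, hidden, batch_first=True, bidirectional=True)
        self.motion_bilstm = nn.LSTM(4096, hidden, batch_first=True, bidirectional=True)

    def encode_text(self, sentences):
        tokens = self.tokenizer(sentences, return_tensors="pt", padding=True)
        emb = self.bert(**tokens).last_hidden_state   # (batch, seq, 768)
        out, _ = self.text_bilstm(emb)
        return out

    def encode_frames(self, frames):                  # (num_frames, 3, 224, 224)
        feats = self.vgg_features(frames).unsqueeze(0)  # (1, num_frames, 4096)
        out, _ = self.static_bilstm(feats)
        return out

    def encode_clips(self, clips):                    # clips pre-sampled for the C3D backbone
        feats = self.c3d(clips).unsqueeze(0)          # (1, num_clips, 4096)
        out, _ = self.motion_bilstm(feats)
        return out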
7. The context-aware progressive attention video question-answering system according to claim 5, wherein the context enhancement unit builds a context-aware unit consisting of three attention modules Q2A-Attention, QA2C-Attention and QAC2M-Attention constructed with a Co-Attention mechanism; it takes the dynamic feature, the subtitle text feature, the static feature and the question text feature as input, and outputs the corresponding multi-modal context-enhanced features.
8. The system according to claim 5, wherein the model training and prediction unit is configured to build a model training unit, build a fourth attention module with a Co-Attention mechanism to obtain the final modal vector h_FC, convert the modal vector into answer prediction scores with the softmax function Prob = softmax(h_FC), and select the maximum prediction score among the final answer prediction scores as the answer; a cross-entropy loss function is then used for training to obtain a pre-trained model, the target question-answering video is input into the pre-trained model, and the corresponding predicted answer information is output.
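For the prediction step of claim 8, the hedged snippet below loads a saved checkpoint and maps the highest softmax score to a candidate answer; the checkpoint name pretrained_videoqa.pt and the VideoQASystem interface are assumptions carried over from the earlier sketches.

import torch

model = VideoQASystem(encoder, context_enhancer, answer_head)   # hypothetical sub-modules
model.load_state_dict(torch.load("pretrained_videoqa.pt", map_location="cpu"))
model.eval()

with torch.no_grad():
    logits = model(frames, clips, subtitles, question)           # target question-answering video
    prob = torch.softmax(logits, dim=-1)                         # Prob = softmax(h_FC)
    predicted_answer = prob.argmax(dim=-1).item()                # index of the predicted answer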
CN202210192397.3A 2022-02-28 2022-02-28 Context-aware progressive attention video question-answering method and system Pending CN114625849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210192397.3A CN114625849A (en) 2022-02-28 2022-02-28 Context-aware progressive attention video question-answering method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210192397.3A CN114625849A (en) 2022-02-28 2022-02-28 Context-aware progressive attention video question-answering method and system

Publications (1)

Publication Number Publication Date
CN114625849A true CN114625849A (en) 2022-06-14

Family

ID=81900412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210192397.3A Pending CN114625849A (en) 2022-02-28 2022-02-28 Context-aware progressive attention video question-answering method and system

Country Status (1)

Country Link
CN (1) CN114625849A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860152A (en) * 2023-02-20 2023-03-28 南京星耀智能科技有限公司 Cross-modal joint learning method oriented to character military knowledge discovery
CN116681087A (en) * 2023-07-25 2023-09-01 云南师范大学 Automatic problem generation method based on multi-stage time sequence and semantic information enhancement
CN116681087B (en) * 2023-07-25 2023-10-10 云南师范大学 Automatic problem generation method based on multi-stage time sequence and semantic information enhancement

Similar Documents

Publication Publication Date Title
WO2023035610A1 (en) Video question-answering method and system based on keyword perception multi-modal attention
WO2019205562A1 (en) Attention regression-based method and device for positioning sentence in video timing sequence
CN114625849A (en) Context-aware progressive attention video question-answering method and system
CN114911914A (en) Cross-modal image-text retrieval method
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN113486669B (en) Semantic recognition method for emergency rescue input voice
CN113792177B (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN111368142B (en) Video intensive event description method based on generation countermeasure network
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN114020891A (en) Double-channel semantic positioning multi-granularity attention mutual enhancement video question-answering method and system
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN111723779B (en) Chinese sign language recognition system based on deep learning
CN113486190A (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN114969298A (en) Video question-answering method based on cross-modal heterogeneous graph neural network
CN115019142B (en) Image title generation method and system based on fusion characteristics and electronic equipment
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN117635964A (en) Multimode aesthetic quality evaluation method based on transducer
CN116894085A (en) Dialog generation method and device, electronic equipment and storage medium
CN116561305A (en) False news detection method based on multiple modes and transformers
Peng et al. Temporal pyramid transformer with multimodal interaction for video question answering
CN114911930A (en) Global and local complementary bidirectional attention video question-answering method and system
CN116704198A (en) Knowledge enhancement visual question-answering method based on multi-mode information guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination