CN112860945B - Method for multi-modal video question answering using frame-subtitle self-supervision
- Publication number: CN112860945B
- Application number: CN202110017595.1A
- Authority: CN (China)
- Prior art keywords: caption, question, attention, answer, features
- Legal status: Active
Classifications
- G06F16/7844 — Retrieval of video data characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
- G06F16/783 — Retrieval of video data characterised by using metadata automatically derived from the content
- G06F18/2135 — Feature extraction, e.g. by transforming the feature space, based on approximation criteria, e.g. principal component analysis
- G06F18/253 — Fusion techniques of extracted features
- G06N3/08 — Neural networks; learning methods
Abstract
The invention belongs to the field of video question answering, and particularly relates to a method for multi-modal video question answering using frame-subtitle self-supervision. The method comprises the following steps: extracting video frame features, question-answer features, caption features and caption suggestion features; computing attended frame features and attended caption features and combining them into fused features; calculating a temporal attention score from the fused features; calculating the time boundary of the question using the temporal attention score; calculating the question answer using the fused features and the temporal attention score; training a neural network with the time boundary of the question and the question answer; and optimizing the network parameters of the neural network, then using the optimal neural network to answer video questions and localize time boundaries. Instead of using expensive temporal annotations, the invention generates question-related time boundaries from a purpose-designed temporal attention score. In addition, the invention obtains more accurate answers by mining the relation between subtitles and the corresponding video content.
Description
Technical Field
The invention belongs to the field of video question answering, and particularly relates to a method for multi-modal video question answering using frame-subtitle self-supervision.
Background
The multi-modal video question-answering task is challenging and currently attracts wide attention. The task spans computer vision and natural language processing, and requires a system that can answer a question about a specific video and localize the time span in the video that corresponds to the question. Video question answering is still a young task, and research on it is not yet mature.
Existing multi-modal video question-answering methods generally encode the video with a convolutional neural network, encode the question-answer pairs and the in-video subtitles with a recurrent neural network, and fuse the question-answer encoding with the subtitle encoding and the video encoding, respectively, to obtain fused features. A decoder is then designed and trained with answer labels and timestamp labels to produce question answers and time boundaries.
Such a scheme requires timestamp annotations to train the decoder effectively, but timestamp labeling depends on annotator judgment and is expensive. In addition, it treats video frames and subtitles separately, ignoring the correspondence between frames and subtitles.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for multi-modal video question answering using frame-subtitle self-supervision.
In order to solve this technical problem, the invention adopts the following technical scheme: a method for multi-modal video question answering using frame-subtitle self-supervision, comprising the following steps:
S1: extracting video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text;
S2: fusing the video frame features and the question-answer features through an attention mechanism to obtain attended frame features; fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain attended caption features; stacking all attended frame features and attended caption features to obtain fused features; calculating a temporal attention score from the fused features;
S3: calculating the time boundary of the question using the temporal attention score;
S4: calculating the question answer using the fused features and the temporal attention score;
S5: training a neural network with the time boundary of the question and the question answer;
S6: optimizing the network parameters of the neural network to obtain an optimal neural network, and using the optimal neural network to answer video questions and localize time boundaries.
Preferably, extracting the video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text comprises:
for the input video, frames are sampled at a set frequency; for each frame, 20 candidate objects are segmented with a Faster R-CNN pre-trained model to obtain feature expressions of the candidate objects, which are reduced to 300 dimensions by principal component analysis and then projected into a 128-dimensional space by one fully-connected layer, giving video frame features v ∈ R^(T×N_o×128), where T is the number of video frames and N_o is the number of object regions per frame;
for a video question-answer input comprising a question q and 5 candidate answers a_1, ..., a_5, 5 question-answer pairs h_k = [q, a_k] are formed; the question-answer pairs and the subtitles are embedded into 768-dimensional vectors with a BERT word-embedding model and projected into a 128-dimensional space by one fully-connected layer, giving question-answer features h_k ∈ R^(L_h×128) and caption features s ∈ R^(T×L_s×128), where L_h is the number of words in each question-answer pair, L_s is the number of caption words and T is the total number of frames;
the features of two consecutive caption segments are spliced to serve as a caption suggestion; the video frames covered by a suggestion are computed from the caption timestamps, the labels of those frames are set to 1 and all other labels to 0, and this 0/1 string serves as the caption suggestion label; the caption suggestions are embedded into 768-dimensional vectors with the BERT word-embedding model and projected into a 128-dimensional space by one fully-connected layer, giving caption suggestion features s^p ∈ R^(T_sp×L_sp×128), where T_sp is the total number of caption suggestions and L_sp is the number of words in each suggestion.
Preferably, fusing the video frame features and the question-answer features through an attention mechanism to obtain the attended frame features comprises:
for a given encoded video frame feature v ∈ R^(N_o×128) and a single question-answer feature h ∈ R^(L_h×128), where N_o is the number of object regions and L_h is the number of words, the similarity matrix over all word-region pairs is computed with a word-level attention mechanism:
Sim_v = h v^T, Sim_v ∈ R^(L_h×N_o),
where Sim_v is the similarity matrix of the frame feature; Sim_v and its transpose Sim_v^T are multiplied with v and h respectively, and max pooling is used to reduce the dimension:
v_att = max(Sim_v v), v_att ∈ R^128,
h_att = max(Sim_v^T h), h_att ∈ R^128,
and the results are fused by the following formula to obtain the attended frame feature v_f ∈ R^128:
v_f = ([v_att; h_att; v_att ⊙ h_att; v_att + h_att]) W_1 + b_1,
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
Preferably, fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain the attended caption features comprises:
for a single caption suggestion feature s^p ∈ R^(L_sp×128) and a single question-answer feature h ∈ R^(L_h×128), the similarity matrix over all word pairs is computed with a word-level attention mechanism:
Sim_s = h (s^p)^T, Sim_s ∈ R^(L_h×L_sp),
where Sim_s is the similarity matrix of the caption feature; Sim_s and its transpose Sim_s^T are multiplied with s^p and h respectively, and max pooling over words is used to reduce the dimension:
s_att = max(Sim_s s^p), s_att ∈ R^128,
h_att = max(Sim_s^T h), h_att ∈ R^128,
and the results are fused by the following formula to obtain the attended caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1,
where W_1 and b_1 are the parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
Preferably, stacking all attended frame features and attended caption features to obtain the fused features comprises:
all attended frame features and attended caption features are stacked to obtain V_f ∈ R^(T×128) and S_f ∈ R^(T×128);
V_f and S_f are multiplied to obtain a similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^(T×T);
the similarity matrix is multiplied with the attended frame features V_f and the attended caption features S_f, respectively:
V_fatt = Sim_f V_f, V_fatt ∈ R^(T×128),
S_fatt = Sim_f S_f, S_fatt ∈ R^(T×128),
and the results are fused to obtain the fused features F ∈ R^(T×128):
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2,
where W_2 and b_2 denote the weight matrix and bias to be trained.
Preferably, calculating the temporal attention score from the fused features comprises:
for the fused features F ∈ R^(T×128), the temporal attention score A_k ∈ R^T is computed through a fully-connected layer and a sigmoid function; it reflects the degree of correlation between each video frame and the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F from R^(T×128) to R^T, b is a bias term, and sigmoid denotes the sigmoid function.
Preferably, calculating the time boundary of the question using the temporal attention score comprises:
given the temporal attention score A_k ∈ R^T, a threshold A_t is set, and spans of frames whose scores are greater than A_t are taken as time-boundary candidates;
for each time-boundary candidate, a refinement score A_P is computed over the span from its start time st to its end time ed, and the candidate with the highest refinement score is selected as the time-boundary scheme {st, ed} for the question.
Preferably, calculating the question answer using the fused features and the temporal attention score comprises:
given the start and end times st and ed, the temporal attention score A_k is applied to the fused features; max pooling over the frames within [st, ed] yields a local question-answer representation, while max pooling over all time steps yields a global representation; the local and global representations are concatenated and passed through a softmax layer to obtain answer probabilities, and the candidate answer with the highest probability is taken as the question answer.
Preferably, optimizing the network parameters of the neural network comprises:
obtaining the caption-feature temporal attention A and the caption suggestion labels based on the question-answer features, the caption features and the caption suggestion features;
calculating a binary cross-entropy loss and a ranking loss from the caption-feature temporal attention A and the caption suggestion labels, where the ranking loss is:
loss_rank = 1 + avg(A_out) − avg(A_in)
where A_in is the caption-feature temporal attention of frames whose caption suggestion label is 1 and A_out is that of frames whose caption suggestion label is 0;
the binary cross-entropy loss is:
loss_bce = −(1/T_in) Σ_{t: y_t = 1} log A_t − (1/T_out) Σ_{t: y_t = 0} log(1 − A_t)
where y_t is the caption suggestion label of frame t, T_in is the number of frames with label 1 and T_out is the number of frames with label 0;
the total self-supervised loss function is established as:
loss_self = loss_rank + loss_bce;
after the total self-supervised loss is computed from this function, the gradients are obtained by the chain rule (backpropagation), the parameters are updated by a gradient-descent step, and forward propagation is run again to compute a new total self-supervised loss; this is repeated until the loss converges.
The technical scheme adopted by the invention has the following beneficial effects: instead of using expensive temporal annotations, the invention generates question-related time boundaries from a purpose-designed temporal attention score. In addition, the invention obtains more accurate answers by mining the relation between subtitles and the corresponding video content.
The following detailed description of the present invention will be provided in conjunction with the accompanying drawings.
Drawings
The invention is further described with reference to the accompanying drawings and the detailed description below:
fig. 1 is a flow chart of the method for multi-modal video question answering using frame-subtitle self-supervision according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method for multi-modal video question answering using frame-subtitle self-supervision; referring to fig. 1, the method comprises the following steps:
S1: extracting video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text.
For the input video, frames are sampled at a set frequency. For each frame, 20 candidate objects are segmented with a Faster R-CNN pre-trained model to obtain feature expressions of the candidate objects; these are reduced to 300 dimensions by principal component analysis and then projected into a 128-dimensional space by one fully-connected layer, giving video frame features v ∈ R^(T×N_o×128), where T is the number of video frames and N_o is the number of object regions per frame.
For a video question-answer input comprising a question q and 5 candidate answers a_1, ..., a_5, 5 question-answer pairs h_k = [q, a_k] are formed. The question-answer pairs and the subtitles are embedded into 768-dimensional vectors with a BERT word-embedding model and projected into a 128-dimensional space by one fully-connected layer, giving question-answer features h_k ∈ R^(L_h×128) and caption features s ∈ R^(T×L_s×128), where L_h is the number of words in each question-answer pair, L_s is the number of caption words and T is the total number of frames.
The features of two consecutive caption segments are spliced to serve as a caption suggestion. The video frames covered by a suggestion are computed from the caption timestamps; the labels of those frames are set to 1 and all others to 0, and this 0/1 string serves as the caption suggestion label. The caption suggestions are embedded into 768-dimensional vectors with the BERT word-embedding model and projected into a 128-dimensional space by one fully-connected layer, giving caption suggestion features s^p ∈ R^(T_sp×L_sp×128), where T_sp is the total number of caption suggestions and L_sp is the number of words in each suggestion.
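As a concrete illustration of S1, the following sketch shows how the projection pipeline could look in PyTorch. The 2048-dimensional Faster R-CNN input size, the random stand-in for the offline-fitted PCA basis, and all names are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class FrameFeatureProjector(nn.Module):
    """Per-object Faster R-CNN features -> PCA (300 dims) -> FC (128 dims), as in S1.
    The PCA basis is assumed to be fitted offline; a random matrix stands in here."""
    def __init__(self, in_dim=2048, pca_dim=300, out_dim=128, pca_basis=None):
        super().__init__()
        if pca_basis is None:  # hypothetical placeholder for the fitted PCA projection
            pca_basis = torch.randn(in_dim, pca_dim)
        self.register_buffer("pca_basis", pca_basis)
        self.fc = nn.Linear(pca_dim, out_dim)

    def forward(self, obj_feats):                      # obj_feats: (T, N_o, in_dim)
        return self.fc(obj_feats @ self.pca_basis)     # -> (T, N_o, 128)

# Question-answer pairs and captions: 768-dim BERT embeddings projected to 128 dims.
text_proj = nn.Linear(768, 128)

T, N_o, L_h = 60, 20, 24                # frames, object regions, QA-pair words
v = FrameFeatureProjector()(torch.randn(T, N_o, 2048))  # video frame features
h = text_proj(torch.randn(5, L_h, 768))                 # 5 candidate QA pairs
```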
S2: fusing the video frame features and the question-answer features through an attention mechanism to obtain attended frame features; fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain attended caption features; stacking all attended frame features and attended caption features to obtain fused features; and calculating a temporal attention score from the fused features.
For a given encoded video frame feature v ∈ R^(N_o×128) and a single question-answer feature h ∈ R^(L_h×128), where N_o is the number of object regions and L_h is the number of words, the similarity matrix over all word-region pairs is computed with a word-level attention mechanism:
Sim_v = h v^T, Sim_v ∈ R^(L_h×N_o),
where Sim_v is the similarity matrix of the frame feature. Sim_v and its transpose Sim_v^T are multiplied with v and h respectively, and max pooling is used to reduce the dimension:
v_att = max(Sim_v v), v_att ∈ R^128,
h_att = max(Sim_v^T h), h_att ∈ R^128,
and the results are fused by the following formula to obtain the attended frame feature v_f ∈ R^128:
v_f = ([v_att; h_att; v_att ⊙ h_att; v_att + h_att]) W_1 + b_1,
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
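A minimal sketch of this word-level attention fusion follows, assuming the dot-product similarity Sim_v = h v^T reconstructed above (the patent's printed similarity formula is not legible in this text):

```python
import torch
import torch.nn as nn

def attend_fuse(v, h, fuse):
    """Word-level attention fusion for one frame.
    v: (N_o, 128) object-region features; h: (L_h, 128) QA-pair word features."""
    sim = h @ v.T                          # (L_h, N_o) word-region similarities
    v_att = (sim @ v).max(dim=0).values    # max-pool over words   -> (128,)
    h_att = (sim.T @ h).max(dim=0).values  # max-pool over regions -> (128,)
    # [v_att; h_att; v_att ⊙ h_att; v_att + h_att] W_1 + b_1
    return fuse(torch.cat([v_att, h_att, v_att * h_att, v_att + h_att]))

fuse1 = nn.Linear(4 * 128, 128)            # W_1 and b_1
v_f = attend_fuse(torch.randn(20, 128), torch.randn(24, 128), fuse1)  # (128,)
```

The caption branch described next reuses the same routine, with the caption-suggestion word features taking the place of the object regions.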
For a single caption suggestion feature s^p ∈ R^(L_sp×128) and a single question-answer feature h ∈ R^(L_h×128), the similarity matrix over all word pairs is computed with a word-level attention mechanism:
Sim_s = h (s^p)^T, Sim_s ∈ R^(L_h×L_sp),
where Sim_s is the similarity matrix of the caption feature. Sim_s and its transpose Sim_s^T are multiplied with s^p and h respectively, and max pooling over words is used to reduce the dimension:
s_att = max(Sim_s s^p), s_att ∈ R^128,
h_att = max(Sim_s^T h), h_att ∈ R^128,
and the results are fused by the following formula to obtain the attended caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1,
where W_1 and b_1 are the parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
All attended frame features and attended caption features are stacked to obtain V_f ∈ R^(T×128) and S_f ∈ R^(T×128);
V_f and S_f are multiplied to obtain a similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^(T×T);
the similarity matrix is multiplied with the attended frame features V_f and the attended caption features S_f, respectively:
V_fatt = Sim_f V_f, V_fatt ∈ R^(T×128),
S_fatt = Sim_f S_f, S_fatt ∈ R^(T×128),
and the results are fused to obtain the fused features F ∈ R^(T×128):
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2,
where W_2 and b_2 denote the weight matrix and bias to be trained.
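A sketch of this frame-caption fusion, under the same reconstruction caveats (multiplying both streams by Sim_f follows the formulas as printed):

```python
import torch
import torch.nn as nn

def cross_fuse(V_f, S_f, fuse):
    """V_f, S_f: (T, 128) stacked attended frame / caption features."""
    sim_f = V_f @ S_f.T                         # (T, T) frame-caption similarity
    V_fatt, S_fatt = sim_f @ V_f, sim_f @ S_f   # (T, 128) each
    cat = torch.cat([V_fatt, S_fatt, V_fatt * S_fatt, V_fatt + S_fatt], dim=-1)
    return fuse(cat)                            # W_2, b_2 -> fused features F: (T, 128)

fuse2 = nn.Linear(4 * 128, 128)                 # W_2 and b_2
F = cross_fuse(torch.randn(60, 128), torch.randn(60, 128), fuse2)
```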
For the resulting fused features F ∈ R^(T×128), the temporal attention score A_k ∈ R^T is computed through a fully-connected layer and a sigmoid function; it reflects the degree of correlation between each video frame and the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F from R^(T×128) to R^T, b is a bias term, and sigmoid denotes the sigmoid function.
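The temporal attention score thus reduces F to one relevance value per frame; a minimal sketch:

```python
import torch
import torch.nn as nn

F = torch.randn(60, 128)                        # fused features from S2
score_head = nn.Linear(128, 1)                  # parameter matrix W and bias b
A_k = torch.sigmoid(score_head(F)).squeeze(-1)  # (T,) per-frame relevance in (0, 1)
```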
S3: calculating the time boundary of the question using the temporal attention score.
Given the temporal attention score A_k ∈ R^T, a threshold A_t is set, and spans of frames whose scores are greater than A_t are taken as time-boundary candidates; for each candidate, a refinement score A_P is computed over the span from its start time st to its end time ed, and the candidate with the highest refinement score is selected as the time-boundary scheme {st, ed} for the question.
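A sketch of S3 follows. Since the refinement-score formula itself is not legible in this text, the mean attention over each candidate span is used here as an assumed stand-in for A_P:

```python
import torch

def localize(A_k, A_t=0.5):
    """Threshold A_k, collect runs of consecutive above-threshold frames as
    candidates, and return the candidate with the highest (assumed) score."""
    above = (A_k > A_t).float()
    padded = torch.cat([torch.zeros(1), above, torch.zeros(1)])
    diff = padded[1:] - padded[:-1]
    starts = (diff == 1).nonzero().flatten()      # run starts
    ends = (diff == -1).nonzero().flatten() - 1   # run ends (inclusive)
    if len(starts) == 0:
        return 0, len(A_k) - 1                    # fall back to the whole video
    scores = torch.stack([A_k[st:ed + 1].mean() for st, ed in zip(starts, ends)])
    best = scores.argmax()
    return starts[best].item(), ends[best].item()

st, ed = localize(torch.sigmoid(torch.randn(60)))
```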
S4: calculating the question answer using the fused features and the temporal attention score.
Given the start and end times st and ed, the temporal attention score A_k is applied to the fused features; max pooling over the frames within [st, ed] yields a local question-answer representation, while max pooling over all time steps yields a global representation. The local and global representations are concatenated and passed through a softmax layer to obtain answer probabilities, and the candidate answer with the highest probability is taken as the question answer.
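A sketch of S4 under the same caveats; the linear classifier head and the exact way A_k weights the features are assumptions where the printed formulas are missing:

```python
import torch
import torch.nn as nn

def answer_scores(F_per_answer, A_per_answer, spans, classifier):
    """For each of the 5 candidate answers: attention-weight the fused features,
    max-pool inside the predicted span (local) and over all frames (global),
    concatenate, and score; softmax over the 5 scores gives answer probabilities."""
    logits = []
    for F, A_k, (st, ed) in zip(F_per_answer, A_per_answer, spans):
        weighted = A_k.unsqueeze(-1) * F               # (T, 128)
        local = weighted[st:ed + 1].max(dim=0).values  # (128,)
        glob = weighted.max(dim=0).values              # (128,)
        logits.append(classifier(torch.cat([local, glob])))
    return torch.softmax(torch.stack(logits).squeeze(-1), dim=0)

classifier = nn.Linear(2 * 128, 1)
probs = answer_scores([torch.randn(60, 128)] * 5, [torch.rand(60)] * 5,
                      [(10, 25)] * 5, classifier)
answer = probs.argmax().item()     # candidate with the highest probability
```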
S5: training the neural network with the time boundary of the question and the question answer.
S6: optimizing the network parameters of the neural network to obtain an optimal neural network, and using the optimal neural network to answer video questions and localize time boundaries.
The caption-feature temporal attention A and the caption suggestion labels are obtained based on the question-answer features, the caption features and the caption suggestion features;
a binary cross-entropy loss and a ranking loss are calculated from the caption-feature temporal attention A and the caption suggestion labels, where the ranking loss is:
loss_rank = 1 + avg(A_out) − avg(A_in)
where A_in is the caption-feature temporal attention of frames whose caption suggestion label is 1 and A_out is that of frames whose caption suggestion label is 0;
the binary cross-entropy loss is:
loss_bce = −(1/T_in) Σ_{t: y_t = 1} log A_t − (1/T_out) Σ_{t: y_t = 0} log(1 − A_t)
where y_t is the caption suggestion label of frame t, T_in is the number of frames with label 1 and T_out is the number of frames with label 0;
the total self-supervised loss function is established as:
loss_self = loss_rank + loss_bce;
after the total self-supervised loss is computed from this function, the gradients are obtained by the chain rule (backpropagation), the parameters are updated by a gradient-descent step, and forward propagation is run again to compute a new total self-supervised loss; this is repeated until the loss converges.
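A sketch of this self-supervised objective; the BCE term is the standard form reconstructed above, since the patent's printed formula is not legible:

```python
import torch

def self_supervised_loss(A, labels, eps=1e-8):
    """loss_self = loss_rank + loss_bce over the caption-suggestion labels.
    A: (T,) caption-feature temporal attention; labels: (T,) 0/1 suggestion labels."""
    A_in, A_out = A[labels == 1], A[labels == 0]
    loss_rank = 1 + A_out.mean() - A_in.mean()        # push A_in above A_out
    loss_bce = -(torch.log(A_in + eps).mean()         # frames inside a suggestion
                 + torch.log(1 - A_out + eps).mean()) # frames outside
    return loss_rank + loss_bce

A = torch.sigmoid(torch.randn(60))
labels = (torch.rand(60) > 0.5).long()
loss = self_supervised_loss(A, labels)   # backpropagated with standard SGD/Adam
```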
Instead of using expensive temporal annotations, the invention generates question-related time boundaries from a purpose-designed temporal attention score. In addition, the invention obtains more accurate answers by mining the relation between subtitles and the corresponding video content.
The embodiment of the invention was tested on the TVQA and TVQA+ data sets. TVQA is a large-scale multi-modal video data set comprising 152,500 question-answer pairs and 21,800 videos. TVQA+ is a subset of TVQA that augments the data with frame-level, question-related bounding-box labels. This embodiment uses only question-answer pairs for supervision; results compared with the prior art STAGE are shown in Tables 1 and 2 (where "-" indicates that the method cannot report that number). For question answering, classification accuracy is reported, i.e., the proportion of samples answered correctly. For temporal localization, the mean temporal intersection-over-union (T.mIoU) is reported, i.e., the average over all samples of the IoU between the predicted and ground-truth time boundaries, together with the answer-span joint accuracy (ASA), i.e., the proportion of samples that are both answered correctly and meet the temporal IoU criterion.
Table 1. Results of the invention (WSQG) and the prior art (STAGE) on the TVQA data set

Method | Acc. | T.mIoU | ASA
---|---|---|---
STAGE | 66.38% | - | -
WSQG | 69.13% | 29.28% | 20.78%
Table 2. Results of the invention (WSQG) and the prior art (STAGE) on the TVQA+ data set

Method | Acc. | T.mIoU | ASA
---|---|---|---
STAGE | 68.31% | - | -
WSQG | 67.88% | 30.30% | 21.98%
As can be seen from Tables 1 and 2, the method of the invention achieves higher accuracy than the prior art STAGE on TVQA and comparable accuracy on TVQA+, while additionally providing time boundaries (T.mIoU and ASA) that STAGE cannot report under the same supervision.
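For reference, the two temporal metrics can be computed as follows (a sketch; the IoU threshold used for ASA is an assumption, as the patent text does not state it):

```python
def temporal_iou(pred, gt):
    """IoU between predicted and ground-truth time spans (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def evaluate(pred_spans, gt_spans, answer_correct, iou_thresh=0.5):
    """T.mIoU: mean IoU over all samples.
    ASA: fraction answered correctly AND meeting the IoU threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(pred_spans, gt_spans)]
    t_miou = sum(ious) / len(ious)
    asa = sum(ok and iou >= iou_thresh
              for ok, iou in zip(answer_correct, ious)) / len(ious)
    return t_miou, asa

t_miou, asa = evaluate([(2.0, 7.5)], [(3.0, 8.0)], [True])
```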
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that the invention is not limited thereto, and may be embodied in other forms without departing from the spirit or essential characteristics thereof. Any modifications which do not depart from the functional and structural principles of the present invention are intended to be included within the scope of the claims.
Claims (6)
1. A method for multi-modal video question answering using frame-subtitle self-supervision, characterized by comprising the following steps:
S1: extracting video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text;
S2: fusing the video frame features and the question-answer features through an attention mechanism to obtain attended frame features; fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain attended caption features; stacking all attended frame features and attended caption features to obtain fused features; calculating a temporal attention score from the fused features;
S3: calculating the time boundary of the question using the temporal attention score;
S4: calculating the question answer using the fused features and the temporal attention score;
S5: training a neural network with the time boundary of the question and the question answer;
S6: optimizing the network parameters of the neural network to obtain an optimal neural network, and using the optimal neural network to answer video questions and localize time boundaries;
wherein stacking all attended frame features and attended caption features to obtain the fused features comprises:
stacking all attended frame features and attended caption features to obtain V_f ∈ R^(T×128) and S_f ∈ R^(T×128);
multiplying V_f and S_f to obtain a similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^(T×T);
multiplying the similarity matrix with the attended frame features V_f and the attended caption features S_f, respectively:
V_fatt = Sim_f V_f, V_fatt ∈ R^(T×128),
S_fatt = Sim_f S_f, S_fatt ∈ R^(T×128),
and fusing the results to obtain the fused features F ∈ R^(T×128):
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2,
where W_2 and b_2 denote the weight matrix and bias to be trained;
wherein calculating the temporal attention score from the fused features comprises:
for the fused features F ∈ R^(T×128), computing the temporal attention score A_k ∈ R^T through a fully-connected layer and a sigmoid function, the temporal attention score reflecting the degree of correlation between each video frame and the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F from R^(T×128) to R^T, b is a bias term, and sigmoid denotes the sigmoid function;
and wherein calculating the time boundary of the question using the temporal attention score comprises:
given the temporal attention score A_k ∈ R^T, setting a threshold A_t and taking spans of frames whose scores are greater than A_t as time-boundary candidates;
for each time-boundary candidate, computing a refinement score A_P over the span from its start time st to its end time ed, and selecting the candidate with the highest refinement score as the time-boundary scheme {st, ed} for the question.
2. The method of claim 1, wherein extracting the video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text comprises:
for the input video, sampling frames at a set frequency; for each frame, segmenting 20 candidate objects with a Faster R-CNN pre-trained model to obtain feature expressions of the candidate objects, reducing them to 300 dimensions by principal component analysis, and projecting them into a 128-dimensional space by one fully-connected layer, giving video frame features v ∈ R^(T×N_o×128), where T is the number of video frames and N_o is the number of object regions per frame;
for a video question-answer input comprising a question q and 5 candidate answers a_1, ..., a_5, forming 5 question-answer pairs h_k = [q, a_k]; embedding the question-answer pairs and the subtitles into 768-dimensional vectors with a BERT word-embedding model and projecting them into a 128-dimensional space by one fully-connected layer, giving question-answer features h_k ∈ R^(L_h×128) and caption features s ∈ R^(T×L_s×128), where L_h is the number of words in each question-answer pair, L_s is the number of caption words and T is the total number of frames;
splicing the features of two consecutive caption segments to serve as a caption suggestion; computing the video frames covered by a suggestion from the caption timestamps, setting the labels of those frames to 1 and all other labels to 0, and using this 0/1 string as the caption suggestion label; embedding the caption suggestions into 768-dimensional vectors with the BERT word-embedding model and projecting them into a 128-dimensional space by one fully-connected layer, giving caption suggestion features s^p ∈ R^(T_sp×L_sp×128), where T_sp is the total number of caption suggestions and L_sp is the number of words in each suggestion.
3. The method for multi-modal video question answering using frame-subtitle self-supervision according to claim 1, wherein fusing the video frame features and the question-answer features through an attention mechanism to obtain the attended frame features comprises:
for a given encoded video frame feature v ∈ R^(N_o×128) and a single question-answer feature h ∈ R^(L_h×128), where N_o is the number of object regions and L_h is the number of words, computing the similarity matrix over all word-region pairs with a word-level attention mechanism:
Sim_v = h v^T, Sim_v ∈ R^(L_h×N_o),
where Sim_v is the similarity matrix of the frame feature; multiplying Sim_v and its transpose Sim_v^T with v and h respectively, and reducing the dimension with max pooling:
v_att = max(Sim_v v), v_att ∈ R^128,
h_att = max(Sim_v^T h), h_att ∈ R^128,
and fusing the results by the following formula to obtain the attended frame feature v_f ∈ R^128:
v_f = ([v_att; h_att; v_att ⊙ h_att; v_att + h_att]) W_1 + b_1,
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
4. The method for multi-modal video question answering using frame-subtitle self-supervision according to claim 1, wherein fusing the caption suggestion features and the question-answer features through an attention mechanism to obtain the attended caption features comprises:
for a single caption suggestion feature s^p ∈ R^(L_sp×128) and a single question-answer feature h ∈ R^(L_h×128), computing the similarity matrix over all word pairs with a word-level attention mechanism:
Sim_s = h (s^p)^T, Sim_s ∈ R^(L_h×L_sp),
where Sim_s is the similarity matrix of the caption feature; multiplying Sim_s and its transpose Sim_s^T with s^p and h respectively, and reducing the dimension with max pooling over words:
s_att = max(Sim_s s^p), s_att ∈ R^128,
h_att = max(Sim_s^T h), h_att ∈ R^128,
and fusing the results by the following formula to obtain the attended caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1,
where W_1 and b_1 are the parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
5. The method of claim 1, wherein calculating the question answer using the fused features and the temporal attention score comprises:
given the start and end times st and ed, applying the temporal attention score A_k to the fused features; max pooling over the frames within [st, ed] yields a local question-answer representation, and max pooling over all time steps yields a global representation; the local and global representations are concatenated and passed through a softmax layer to obtain answer probabilities, and the candidate answer with the highest probability is taken as the question answer.
6. The method for multi-modal video question answering using frame-subtitle self-supervision according to claim 1, wherein optimizing the network parameters of the neural network comprises:
obtaining the caption-feature temporal attention A and the caption suggestion labels based on the question-answer features, the caption features and the caption suggestion features;
calculating a binary cross-entropy loss and a ranking loss from the caption-feature temporal attention A and the caption suggestion labels, where the ranking loss is:
loss_rank = 1 + avg(A_out) − avg(A_in)
where A_in is the caption-feature temporal attention of frames whose caption suggestion label is 1 and A_out is that of frames whose caption suggestion label is 0;
the binary cross-entropy loss is:
loss_bce = −(1/T_in) Σ_{t: y_t = 1} log A_t − (1/T_out) Σ_{t: y_t = 0} log(1 − A_t)
where y_t is the caption suggestion label of frame t, T_in is the number of frames with label 1 and T_out is the number of frames with label 0;
the total self-supervised loss function is established as:
loss_self = loss_rank + loss_bce;
after the total self-supervised loss is computed from this function, the gradients are obtained by the chain rule (backpropagation), the parameters are updated by a gradient-descent step, and forward propagation is run again to compute a new total self-supervised loss; this is repeated until the loss converges.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110017595.1A | 2021-01-07 | 2021-01-07 | Method for multi-modal video question answering using frame-subtitle self-supervision
Publications (2)

Publication Number | Publication Date
---|---
CN112860945A | 2021-05-28
CN112860945B | 2022-07-08
Family

Family ID: 76004712

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202110017595.1A (CN112860945B, active) | Method for multi-modal video question answering using frame-subtitle self-supervision | 2021-01-07 | 2021-01-07

Country Status (1)

Country | Link
---|---
CN | CN112860945B
Families Citing this family (5)

Publication | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN113590879B | 2021-08-05 | 2022-05-31 | Harbin University of Science and Technology | System, method, computer and storage medium for shortening timestamps and solving multi-event video question answering through a network
CN113688296B | 2021-08-10 | 2022-05-31 | Harbin University of Science and Technology | Method for solving video question-answering tasks based on a multi-modal progressive attention model
CN113423004B | 2021-08-23 | 2021-11-30 | Hangzhou Yizhi Intelligent Technology Co., Ltd. | Video subtitle generation method and system based on decoupled decoding
CN113837259B | 2021-09-17 | 2023-05-30 | The Sixth Affiliated Hospital of Sun Yat-sen University | Education video question-answering method and system fusing graph, text and semantics through modal interaction
CN114707022B | 2022-05-31 | 2022-09-06 | Zhejiang University | Video question-answer data set labeling method and device, storage medium and electronic equipment
Citations (5)

Publication | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
US9330084B1 | 2014-12-10 | 2016-05-03 | International Business Machines Corporation | Automatically generating question-answer pairs during content ingestion by a question answering computing system
CN107766447A | 2017-09-25 | 2018-03-06 | Zhejiang University | Method for solving video question answering using a multi-layer attention network mechanism
CN107948730A | 2017-10-30 | 2018-04-20 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device and storage medium for generating video based on pictures
CN110704601A | 2019-10-11 | 2020-01-17 | Zhejiang University | Method for solving video question-answering tasks requiring common knowledge using a question-knowledge guided progressive spatio-temporal attention network
CN111625660A | 2020-05-27 | 2020-09-04 | Tencent Technology (Shenzhen) Co., Ltd. | Dialog generation method, video comment method, apparatus, device and storage medium
Family Cites Families (1)

Publication | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
WO2018081751A1 | 2016-10-28 | 2018-05-03 | Vilynx, Inc. | Video tagging system and method
Non-Patent Citations (2)

- Junyeong Kim et al. "Gaining Extra Supervision via Multi-task Learning for Multi-Modal Video Question Answering." arXiv, 2019.
- Wang Bo. "Research and Application of Visual Semantic Representation Models in Video Question Answering." China Master's Theses Full-text Database, Information Science and Technology, 2020.
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant