CN112860945B - Method for multi-modal video question answering using frame-subtitle self-supervision - Google Patents

Method for multi-modal video question answering using frame-subtitle self-supervision

Info

Publication number
CN112860945B
CN112860945B (application CN202110017595.1A; also published as CN112860945A)
Authority
CN
China
Prior art keywords
caption
question
attention
answer
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110017595.1A
Other languages
Chinese (zh)
Other versions
CN112860945A (en)
Inventor
张宏达
胡若云
沈然
叶上维
丁麒
王庆娟
陈金威
熊剑峰
丁莹
赵洲
陈哲乾
李一夫
丁丹翔
姜伟昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd, Zhejiang University ZJU, State Grid Zhejiang Electric Power Co Ltd, Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202110017595.1A
Publication of CN112860945A
Application granted
Publication of CN112860945B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention belongs to the field of video question answering, and particularly relates to a method for multi-modal video question answering using frame-subtitle self-supervision. The method comprises the following steps: extracting video frame features, question-answer features, caption features and caption suggestion features; obtaining attention frame features and attention caption features and fusing them to obtain fusion features; calculating a temporal attention score from the fusion features; calculating the time boundary of the question from the temporal attention score; calculating the answer to the question from the fusion features and the temporal attention score; training a neural network with the time boundary of the question and the answer to the question; and optimizing the network parameters of the neural network, so that video question answering and time-boundary localization are performed with the optimal neural network. The invention does not use expensive temporal annotations; instead, it generates question-related time boundaries from the designed temporal attention score. In addition, the invention obtains more accurate answers by mining the relationship between the captions and the corresponding video content.

Description

Method for multi-modal video question answering using frame-subtitle self-supervision
Technical Field
The invention belongs to the field of video question answering, and particularly relates to a method for multi-modal video question answering using frame-subtitle self-supervision.
Background
The multi-modal video question-answering task is challenging and currently attracts considerable attention. It involves both computer vision and natural language processing, and requires a system that can answer a question about a given video and also localize the time boundary in the video that corresponds to the question. Video question answering is still a relatively new task, and research on it is not yet mature.
Existing multi-modal video question-answering methods generally encode the video with a convolutional neural network and encode the question-answer text and the subtitles with a recurrent neural network, fuse the question-answer encoding and the subtitle encoding with the video encoding to obtain fusion features, and then design a decoder that is trained with answer labels and temporal labels to produce the question answer and the time boundary.
Such schemes require temporal labels to train the decoder effectively, but temporal annotation is expensive. In addition, they treat video frames and subtitles separately and ignore the correspondence between frames and subtitles.
Disclosure of Invention
The invention aims to solve the above technical problem by providing a method for multi-modal video question answering using frame-subtitle self-supervision.
In order to solve this technical problem, the invention adopts the following technical scheme: the method for multi-modal video question answering using frame-subtitle self-supervision comprises the following steps:
S1: extracting video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text;
S2: fusing the video frame features and the question-answer features with an attention mechanism to obtain attention frame features; fusing the caption suggestion features and the question-answer features with an attention mechanism to obtain attention caption features; stacking all attention frame features and attention caption features to obtain fusion features; calculating a temporal attention score from the fusion features;
S3: calculating the time boundary of the question from the temporal attention score;
S4: calculating the answer to the question from the fusion features and the temporal attention score;
S5: training a neural network with the time boundary of the question and the answer to the question;
S6: optimizing the network parameters of the neural network to obtain the optimal neural network, and using it to answer video questions and localize the corresponding time boundary.
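To make the S1-S6 pipeline concrete, the following skeleton shows how the steps could be wired together for one training example. It is a sketch only: the component names (extract_features, fuse_with_attention, time_boundary, answer, total_loss) are hypothetical and are not identifiers from the patent.

```python
# Hypothetical wiring of steps S1-S6 for one training example (PyTorch style).
def train_step(model, optimizer, video, qa_text, captions, answer_label):
    feats = model.extract_features(video, qa_text, captions)     # S1: frame / QA / caption features
    fused, attn = model.fuse_with_attention(feats)               # S2: fusion features + temporal attention score
    st, ed = model.time_boundary(attn)                           # S3: question-related time boundary
    answer_logits = model.answer(fused, attn, st, ed)            # S4: answer prediction
    loss = model.total_loss(answer_logits, answer_label, feats)  # S5: answer loss + self-supervised loss
    optimizer.zero_grad()
    loss.backward()                                              # S6: gradient step on the network parameters
    optimizer.step()
    return answer_logits.argmax().item(), (st, ed)
```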
Preferably, extracting the video frame features, the question-answer features, the caption features and the caption suggestion features from the input video, the question-answer text and the caption text comprises:
for the input video, sampling frames at a set frequency; for each frame, detecting 20 candidate objects with a Faster R-CNN pre-trained model to obtain their feature expressions, reducing them to 300 dimensions by principal component analysis and projecting them into a 128-dimensional space through a fully connected layer, giving the video frame features V ∈ R^{T×N_o×128}, where T is the number of video frames and N_o is the number of object regions;
for a video question-answer input consisting of a question q and 5 candidate answers {a_k}_{k=1}^{5}, forming 5 question-answer pairs h_k = [q, a_k]; embedding the question-answer pairs and the captions into 768-dimensional vectors with a BERT word-embedding model and projecting them into a 128-dimensional space through a fully connected layer, giving the question-answer features H and the caption features S, where L_s is the number of words of each question-answer pair and T is the total number of frames;
concatenating two segments of caption features to form a caption suggestion; computing the video frames covered by the captions' timestamps, setting the labels of these frames to 1 and all other labels to 0, and using this 0/1 string as the caption suggestion label; embedding the caption suggestions into 768-dimensional vectors with the BERT word-embedding model and projecting them into a 128-dimensional space through a fully connected layer, giving the caption suggestion features S^{sp}, where T_sp is the total number of caption suggestions and L_sp is the number of words in a caption suggestion.
Preferably, fusing the video frame features and the question-answer features with an attention mechanism to obtain the attention frame features comprises:
for a given encoded video frame feature v ∈ R^{N_o×128} and a single question-answer feature h ∈ R^{L_h×128}, where N_o is the number of object regions and L_h is the number of words, computing the similarity matrix Sim_v of all word-region pairs with a word-level attention mechanism; multiplying Sim_v and its transpose Sim_v^T with v and h respectively and reducing the dimension with max pooling:
v_att = max(Sim_v v), v_att ∈ R^128
h_att = max(Sim_v^T h), h_att ∈ R^128
and fusing the results by the following formula to obtain the attention frame feature v_f ∈ R^128:
v_f = ([v_att; h_att; v_att ⊙ h_att; v_att + h_att]) W_1 + b_1
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
Preferably, fusing the caption suggestion features and the question-answer features with an attention mechanism to obtain the attention caption features comprises:
for a single caption suggestion feature s^{sp} ∈ R^{L_sp×128} and a single question-answer feature h ∈ R^{L_h×128}, computing the similarity matrix Sim_s of all word pairs with a word-level attention mechanism; multiplying Sim_s and its transpose Sim_s^T with s^{sp} and h respectively and reducing the dimension with max pooling over the words:
s_att = max(Sim_s s^{sp}), s_att ∈ R^128
h_att = max(Sim_s^T h), h_att ∈ R^128
and fusing the results by the following formula to obtain the attention caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
Preferably, stacking all the attention frame features and the attention caption features to obtain the fusion features comprises:
stacking all the attention frame features and attention caption features to obtain V_f ∈ R^{T×128} and S_f ∈ R^{T×128};
multiplying V_f and S_f to obtain the similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^{T×T};
multiplying the similarity matrix corresponding to the video frames and the similarity matrix corresponding to the captions with the attention frame features V_f and the attention caption features S_f respectively:
V_fatt = Sim_f^T V_f, V_fatt ∈ R^{T×128}
S_fatt = Sim_f S_f, S_fatt ∈ R^{T×128}
and fusing the results to obtain the fusion features F ∈ R^{T×128}:
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2
where W_2 and b_2 are weight matrices to be trained.
Preferably, calculating the temporal attention score from the fusion features comprises:
for the fusion features F ∈ R^{T×128}, obtaining the temporal attention score A_k ∈ R^T through a fully connected layer and a sigmoid function, the temporal attention score reflecting how relevant each video frame is to the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F ∈ R^{T×128} to the space R^T, b is a bias term, and sigmoid denotes the sigmoid function.
Preferably, calculating the time boundary of the question from the temporal attention score comprises:
given the temporal attention score A_k ∈ R^T, setting a threshold A_t and taking the spans whose scores exceed A_t as time-boundary candidates;
for each time-boundary candidate, computing a refinement score A_P from its start to its end, where st denotes the start time and ed the end time; and selecting the candidate with the highest refinement score as the time boundary {st, ed} of the question.
Preferably, calculating the answer to the question from the fusion features and the temporal attention score comprises:
given the start and end times st and ed, using the temporal attention score A_k together with max pooling to compute a local question-answer representation, and using max pooling over all time steps to compute a global representation; concatenating the local and global representations, obtaining the answer probabilities through a softmax layer, and taking the candidate answer with the highest probability as the answer to the question.
Preferably, optimizing the network parameters of the neural network comprises:
obtaining the caption-feature temporal attention A and the caption suggestion labels from the question-answer features, the caption features and the caption suggestion features;
computing a binary cross-entropy loss and a ranking loss from the caption-feature temporal attention A and the caption suggestion labels, the ranking loss being
loss_rank = 1 + avg(A_out) - avg(A_in)
where A_in is the caption-feature temporal attention of the frames whose caption suggestion label is 1 and A_out is that of the frames whose label is 0;
computing the binary cross-entropy loss loss_bce between the caption-feature temporal attention A and the caption suggestion labels, where T_in is the number of frames whose caption suggestion label is 1 and T_out is the number of frames whose label is 0;
establishing the total self-supervised loss function as
loss_self = loss_rank + loss_bce
after the total self-supervised loss is computed, the gradients are obtained by the chain rule, the parameters are updated according to the gradient-descent formula, a new total self-supervised loss is computed by another forward pass, and this procedure is repeated until the loss converges.
The technical scheme adopted by the invention has the following beneficial effects: the invention does not use expensive temporal annotations, but generates question-related time boundaries from the designed temporal attention score. In addition, the invention obtains more accurate answers by mining the relationship between the captions and the corresponding video content.
The following detailed description of the present invention will be provided in conjunction with the accompanying drawings.
Drawings
The invention is further described with reference to the accompanying drawings and the detailed description below:
Fig. 1 is a flow chart of the method for multi-modal video question answering using frame-subtitle self-supervision according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of the invention, not all of them; the following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or uses. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The embodiment of the invention provides a method for multi-modal video question answering using frame-subtitle self-supervision which, referring to Fig. 1, comprises the following steps:
S1: video frame features, question-answer features, caption features and caption suggestion features are extracted from the input video, question-answer text and caption text.
For the input video, frames are sampled at a set frequency; for each frame, 20 candidate objects are detected with a Faster R-CNN pre-trained model to obtain their feature expressions, which are reduced to 300 dimensions by principal component analysis and then projected into a 128-dimensional space through a fully connected layer, giving the video frame features V ∈ R^{T×N_o×128}, where T is the number of video frames and N_o is the number of object regions.
For a video question-answer input consisting of a question q and 5 candidate answers {a_k}_{k=1}^{5}, 5 question-answer pairs h_k = [q, a_k] are formed; the question-answer pairs and the captions are embedded into 768-dimensional vectors with a BERT word-embedding model and projected into a 128-dimensional space through a fully connected layer, giving the question-answer features H and the caption features S, where L_s is the number of words of each question-answer pair and T is the total number of frames.
Two segments of caption features are concatenated to form a caption suggestion; the video frames covered by the captions' timestamps are computed, the labels of these frames are set to 1 and all other labels to 0, and this 0/1 string serves as the caption suggestion label; the caption suggestions are embedded into 768-dimensional vectors with the BERT word-embedding model and projected into a 128-dimensional space through a fully connected layer, giving the caption suggestion features S^{sp}, where T_sp is the total number of caption suggestions and L_sp is the number of words in a caption suggestion.
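As an illustration of S1, the following sketch projects the three streams into the shared 128-dimensional space. It assumes the per-frame object features have already been extracted by Faster R-CNN and reduced to 300 dimensions by PCA (a random placeholder tensor stands in for them here), and it uses the HuggingFace bert-base-uncased model for the 768-dimensional word embeddings; none of these names come from the patent itself.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

class FeatureExtractor(nn.Module):
    def __init__(self, obj_dim=300, hidden=128):
        super().__init__()
        self.frame_proj = nn.Linear(obj_dim, hidden)  # 300-d PCA object features -> 128-d
        self.text_proj = nn.Linear(768, hidden)       # 768-d BERT embeddings -> 128-d
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")

    def encode_text(self, sentences):
        # sentences: question-answer pairs, captions, or caption suggestions
        toks = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            emb = self.bert(**toks).last_hidden_state  # (B, L, 768)
        return self.text_proj(emb)                     # (B, L, 128)

    def encode_frames(self, obj_feats):
        # obj_feats: (T, N_o, 300) PCA-reduced region features
        return self.frame_proj(obj_feats)              # (T, N_o, 128)

# placeholder inputs for illustration only
question = "What is Sheldon holding when he walks in?"
candidate_answers = ["a laptop", "a book", "a cup", "a phone", "a flag"]
captions = ["Sheldon: I need my laptop.", "Leonard: It is on the couch."]
pca_object_feats = torch.randn(16, 20, 300)            # T=16 frames, N_o=20 objects

extractor = FeatureExtractor()
H = extractor.encode_text([f"{question} {a}" for a in candidate_answers])  # 5 question-answer pairs h_k = [q, a_k]
S = extractor.encode_text(captions)                                        # caption features
V = extractor.encode_frames(pca_object_feats)                              # video frame features V ∈ R^{T×N_o×128}
```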
S2: the video frame characteristics and the question-answer characteristics are introduced into an attention mechanism to be fused to obtain frame characteristics with attention; the caption suggestion features and the question and answer features are introduced into an attention mechanism to be fused to obtain the caption features with attention; stacking all the attention frame features and the attention subtitle features to obtain fusion features; a temporal attention score is calculated based on the fused features.
For a given coded video frame characteristic
Figure GDA0003589607050000076
And individual question-answer characteristics
Figure GDA0003589607050000077
Wherein N isoNumber of representative object regions, LhRepresenting the number of words, the similarity matrix for all word region pairs is calculated using a word-level attention mechanism:
Figure GDA0003589607050000078
wherein SimvSimilarity matrix representing frame features, simvAnd
Figure GDA0003589607050000079
multiplying v and h respectively, and using maximum pooling dimension reduction:
vatt=max(Simvv),vatt∈R128
Figure GDA0003589607050000081
fusing the obtained results through the following formula to obtain the attention frame characteristic vf∈R128
vf=([vatt;hatt;vatt⊙hatt;vatt+hatt])W1+b1
Wherein W1And b1Is the parameter to be trained and reduces the fused feature dimension to 128 dimensions, which represents a Hadamard product.
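A minimal sketch of this word-level attention fusion follows. The exact form of the similarity matrix is not reproduced in the text above, so a plain dot-product similarity is assumed here; everything else follows the pooling and fusion formulas just given.

```python
import torch
import torch.nn as nn

class WordLevelAttentionFusion(nn.Module):
    """Fuses one modality (frame objects or caption words) with one question-answer pair."""
    def __init__(self, hidden=128):
        super().__init__()
        # W_1, b_1: project [x_att; h_att; x_att*h_att; x_att+h_att] back to 128 dims
        self.fuse = nn.Linear(4 * hidden, hidden)

    def forward(self, x, h):
        # x: (N, 128), e.g. the object regions of one frame; h: (L_h, 128), the words of one QA pair
        sim = h @ x.t()                               # assumed dot-product similarity, (L_h, N)
        x_att = (sim @ x).max(dim=0).values           # max(Sim x)   -> (128,)
        h_att = (sim.t() @ h).max(dim=0).values       # max(Sim^T h) -> (128,)
        cat = torch.cat([x_att, h_att, x_att * h_att, x_att + h_att], dim=-1)
        return self.fuse(cat)                         # v_f (or s_f) ∈ R^128
```

Running the module over every frame's object features against the selected question-answer pair, and stacking the outputs, yields the attention frame features V_f ∈ R^{T×128} used below.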
For a single caption suggestion feature s^{sp} ∈ R^{L_sp×128} and a single question-answer feature h ∈ R^{L_h×128}, the similarity matrix Sim_s of all word pairs is computed with a word-level attention mechanism; Sim_s and its transpose Sim_s^T are multiplied with s^{sp} and h respectively, and max pooling over the words is used to reduce the dimension:
s_att = max(Sim_s s^{sp}), s_att ∈ R^128
h_att = max(Sim_s^T h), h_att ∈ R^128
The results are fused by the following formula to obtain the attention caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
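Because the caption branch mirrors the frame branch, the same fusion module can be reused on each caption suggestion; the tensors below are placeholders for illustration.

```python
# reusing WordLevelAttentionFusion from the sketch above for one caption suggestion
caption_suggestion_words = torch.randn(24, 128)   # L_sp words of one caption suggestion, 128-d
qa_words = torch.randn(15, 128)                   # L_h words of one question-answer pair, 128-d
fusion = WordLevelAttentionFusion()
s_f = fusion(caption_suggestion_words, qa_words)  # attention caption feature s_f ∈ R^128
```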
All the attention frame features and attention caption features are stacked to obtain V_f ∈ R^{T×128} and S_f ∈ R^{T×128}.
V_f and S_f are multiplied to obtain the similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^{T×T}.
The similarity matrix corresponding to the video frames and the similarity matrix corresponding to the captions are multiplied with the attention frame features V_f and the attention caption features S_f respectively:
V_fatt = Sim_f^T V_f, V_fatt ∈ R^{T×128}
S_fatt = Sim_f S_f, S_fatt ∈ R^{T×128}
The results are fused to obtain the fusion features F ∈ R^{T×128}:
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2
where W_2 and b_2 are weight matrices to be trained.
For the fusion features F ∈ R^{T×128}, the temporal attention score A_k ∈ R^T is obtained through a fully connected layer and a sigmoid function; it reflects how relevant each video frame is to the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F ∈ R^{T×128} to the space R^T, b is a bias term, and sigmoid denotes the sigmoid function.
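A sketch of the frame-caption fusion and the temporal attention head. The plain matrix product for Sim_f and the use of its transpose for the frame branch follow the reconstruction above and are assumptions; the layer shapes follow the stated dimensions.

```python
import torch
import torch.nn as nn

class FrameCaptionFusion(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.fuse = nn.Linear(4 * hidden, hidden)  # W_2, b_2
        self.att_head = nn.Linear(hidden, 1)       # W, b of the temporal attention score

    def forward(self, V_f, S_f):
        # V_f, S_f: (T, 128) stacked attention frame / caption features
        sim = V_f @ S_f.t()                        # Sim_f ∈ R^{T×T} (assumed plain product)
        V_att = sim.t() @ V_f                      # (T, 128)
        S_att = sim @ S_f                          # (T, 128)
        F = self.fuse(torch.cat([V_att, S_att, V_att * S_att, V_att + S_att], dim=-1))
        A_k = torch.sigmoid(self.att_head(F)).squeeze(-1)  # temporal attention score A_k ∈ R^T
        return F, A_k
```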
S3: the time boundary of the problem is calculated using the time attention score.
The resulting temporal attention score is Ak∈RTSetting a threshold AtIs greater than a threshold value AtAs candidates for time boundaries;
for each time-bounded candidate, compute from start to finishRefinement score of bundle AP
Figure GDA0003589607050000092
Where st represents the start time and ed represents the end time; and selecting the highest refined partition score as a problem time boundary partition scheme { st, ed }.
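A sketch of S3. The patent's exact refinement-score formula is not reproduced above, so the mean attention inside a candidate span is used here as an assumed stand-in; the thresholding follows the text.

```python
import torch

def propose_time_boundary(A_k, threshold=0.5):
    """Return (st, ed) of the candidate span with the highest refinement score.

    A_k: (T,) temporal attention scores; threshold: the value A_t from the text.
    The refinement score used here (mean attention inside the span) is an assumption.
    """
    above = (A_k > threshold).tolist()
    candidates, start = [], None
    for t, flag in enumerate(above):                 # collect maximal runs of frames above A_t
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            candidates.append((start, t - 1))
            start = None
    if start is not None:
        candidates.append((start, len(above) - 1))
    if not candidates:                               # fall back to the single highest-scoring frame
        t = int(torch.argmax(A_k))
        return t, t
    scores = [A_k[st:ed + 1].mean().item() for st, ed in candidates]
    return candidates[int(torch.tensor(scores).argmax())]
```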
S4: and calculating to obtain the answer of the question by utilizing the fusion characteristics and the time attention score.
Set the start and end times st, ed, score A for time attentionkComputing local question-answer representations using maximally pooled computational representations
Figure GDA0003589607050000093
Computing global representations at all times using maximal pooling simultaneously
Figure GDA0003589607050000094
Stitching a local representation and a global representation
Figure GDA0003589607050000095
Obtaining answer probability through the softmax layer, and taking the candidate answer with the highest answer probability as the question answer.
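A sketch of S4 under one plausible reading: the fusion features are weighted by the temporal attention score, max-pooled inside the predicted span (local) and over all frames (global), concatenated, and scored per candidate answer. The attention weighting and the classifier shape are assumptions, since the local/global formulas are not reproduced above.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden, 1)   # scores one fused question-answer representation

    def forward(self, F_all, A_all, st, ed):
        # F_all: (5, T, 128) fusion features, one per candidate answer
        # A_all: (5, T) temporal attention scores, one per candidate answer
        weighted = F_all * A_all.unsqueeze(-1)                    # assumed attention weighting
        local = weighted[:, st:ed + 1].max(dim=1).values          # (5, 128) span (local) representation
        global_ = F_all.max(dim=1).values                         # (5, 128) whole-video (global) representation
        logits = self.classifier(torch.cat([local, global_], dim=-1)).squeeze(-1)
        probs = torch.softmax(logits, dim=0)                      # answer probabilities over the 5 candidates
        return int(torch.argmax(probs)), probs
```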
S5: the neural network is trained using the time boundaries of the questions and answers to the questions.
S6: optimizing network parameters of the neural network to obtain an optimal neural network, and performing video question answering and time boundary planning by using the optimal neural network.
Acquiring caption feature time attention A and a caption suggestion tag based on the question and answer features, the caption features and the caption suggestion features;
calculating binary cross entropy loss and sequencing loss according to the caption feature time attention A and the caption suggestion label, wherein the sequencing loss formula is as follows:
lossrank=1+avg(Aout)-avg(Ain)
wherein A isoutSuggesting a tag of 1 correspondence for subtitlesFeature time attention of subtitles, AinSuggesting the caption feature time attention corresponding to the caption with the label of 0 for the caption;
the binary cross entropy loss formula is:
Figure GDA0003589607050000101
wherein T isinIs the number of frames with caption recommended label of 1, ToutThe caption suggested label is 0 frame number;
the total unsupervised loss function is established as follows:
lossself=lossrank+lossbce
after the total self-supervision loss value is calculated through the total self-supervision loss function, the gradient corresponding to each input data is calculated through a chain rule, then the gradient of the next moment is obtained through a gradient descending formula, the new total self-supervision loss value is calculated and output through forward propagation again by combining the gradient of the next moment, and the steps are repeated until the function is converged.
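A sketch of the self-supervised objective. The ranking term follows the formula above (with A_in taken over frames whose caption suggestion label is 1); the binary cross-entropy term is written in its standard per-frame form, which is an assumption since the exact formula image is not reproduced here.

```python
import torch
import torch.nn.functional as nnf

def self_supervised_loss(A, labels):
    # A: (T,) caption-feature temporal attention in (0, 1); labels: (T,) 0/1 caption suggestion labels
    inside = A[labels == 1]     # attention on the T_in frames inside the suggested caption span
    outside = A[labels == 0]    # attention on the T_out frames outside it
    loss_rank = 1.0 + outside.mean() - inside.mean()            # ranking loss from the text
    loss_bce = nnf.binary_cross_entropy(A, labels.float())      # assumed standard BCE form
    return loss_rank + loss_bce

# one optimization step (repeated until the loss converges):
# loss = self_supervised_loss(A, labels); loss.backward(); optimizer.step(); optimizer.zero_grad()
```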
The invention does not use expensive temporal annotations; instead, it generates question-related time boundaries from the designed temporal attention score. In addition, it obtains more accurate answers by mining the relationship between the captions and the corresponding video content.
The embodiment of the invention was tested on the TVQA and TVQA+ datasets. TVQA is a large-scale multi-modal video dataset containing 152,500 question-answer pairs and 21,800 videos. TVQA+ is a subset of TVQA that augments it with frame-level, question-related bounding-box labels. This embodiment uses question-answer pairs for supervision, and the results compared with the prior-art STAGE are shown in Tables 1 and 2 (where "-" indicates that the metric is not available for that method). For question-answering performance, the classification accuracy is reported, i.e., the proportion of samples whose answer is correct. For temporal localization, the average temporal intersection-over-union (T.mIoU) is reported, i.e., the mean over all samples of the intersection of the predicted and ground-truth time boundaries divided by their union, together with the answer-span joint accuracy (ASA), i.e., the proportion of samples whose answer is correct and whose temporal IoU meets the requirement.
Table 1: Test results of the present invention and the prior art on the TVQA dataset

Method   Acc.     T.mIoU   ASA
STAGE    66.38%   -        -
WSQG     69.13%   29.28%   20.78%

Table 2: Test results of the present invention and the prior art on the TVQA+ dataset

Method   Acc.     T.mIoU   ASA
STAGE    68.31%   -        -
WSQG     67.88%   30.30%   21.98%
As can be seen from tables 1 and 2, the method of the present invention has higher accuracy than the prior art STAGE.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that the invention is not limited thereto, and may be embodied in other forms without departing from the spirit or essential characteristics thereof. Any modifications which do not depart from the functional and structural principles of the present invention are intended to be included within the scope of the claims.

Claims (6)

1. A method for multi-modal video question answering using frame-subtitle self-supervision, characterized by comprising the following steps:
S1: extracting video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text;
S2: fusing the video frame features and the question-answer features with an attention mechanism to obtain attention frame features; fusing the caption suggestion features and the question-answer features with an attention mechanism to obtain attention caption features; stacking all attention frame features and attention caption features to obtain fusion features; calculating a temporal attention score from the fusion features;
S3: calculating the time boundary of the question from the temporal attention score;
S4: calculating the answer to the question from the fusion features and the temporal attention score;
S5: training a neural network with the time boundary of the question and the answer to the question;
S6: optimizing the network parameters of the neural network to obtain the optimal neural network, and using it to answer video questions and localize the corresponding time boundary;
wherein stacking all the attention frame features and the attention caption features to obtain the fusion features comprises:
stacking all the attention frame features and attention caption features to obtain V_f ∈ R^{T×128} and S_f ∈ R^{T×128};
multiplying V_f and S_f to obtain the similarity matrix Sim_f = V_f S_f^T, Sim_f ∈ R^{T×T};
multiplying the similarity matrix corresponding to the video frames and the similarity matrix corresponding to the captions with the attention frame features V_f and the attention caption features S_f respectively:
V_fatt = Sim_f^T V_f, V_fatt ∈ R^{T×128}
S_fatt = Sim_f S_f, S_fatt ∈ R^{T×128}
and fusing the results to obtain the fusion features F ∈ R^{T×128}:
F = ([V_fatt; S_fatt; V_fatt ⊙ S_fatt; V_fatt + S_fatt]) W_2 + b_2
where W_2 and b_2 are weight matrices to be trained;
wherein calculating the temporal attention score from the fusion features comprises:
for the fusion features F ∈ R^{T×128}, obtaining the temporal attention score A_k ∈ R^T through a fully connected layer and a sigmoid function, the temporal attention score reflecting how relevant each video frame is to the question:
A_k = sigmoid(W F + b)
where W is a parameter matrix that projects F ∈ R^{T×128} to the space R^T, b is a bias term, and sigmoid denotes the sigmoid function;
and wherein calculating the time boundary of the question from the temporal attention score comprises:
given the temporal attention score A_k ∈ R^T, setting a threshold A_t and taking the spans whose scores exceed A_t as time-boundary candidates;
for each time-boundary candidate, computing a refinement score A_P from its start to its end, where st denotes the start time and ed the end time; and selecting the candidate with the highest refinement score as the time boundary {st, ed} of the question.
2. The method of claim 1, wherein extracting the video frame features, question-answer features, caption features and caption suggestion features from the input video, question-answer text and caption text comprises:
for the input video, sampling frames at a set frequency; for each frame, detecting 20 candidate objects with a Faster R-CNN pre-trained model to obtain their feature expressions, reducing them to 300 dimensions by principal component analysis and projecting them into a 128-dimensional space through a fully connected layer, giving the video frame features V ∈ R^{T×N_o×128}, where T is the number of video frames and N_o is the number of object regions;
for a video question-answer input consisting of a question q and 5 candidate answers {a_k}_{k=1}^{5}, forming 5 question-answer pairs h_k = [q, a_k]; embedding the question-answer pairs and the captions into 768-dimensional vectors with a BERT word-embedding model and projecting them into a 128-dimensional space through a fully connected layer, giving the question-answer features H and the caption features S, where L_s is the number of words of each question-answer pair and T is the total number of frames;
concatenating two segments of caption features to form a caption suggestion; computing the video frames covered by the captions' timestamps, setting the labels of these frames to 1 and all other labels to 0, and using this 0/1 string as the caption suggestion label; embedding the caption suggestions into 768-dimensional vectors with the BERT word-embedding model and projecting them into a 128-dimensional space through a fully connected layer, giving the caption suggestion features S^{sp}, where T_sp is the total number of caption suggestions and L_sp is the number of words in a caption suggestion.
3. The method for multi-modal video question answering using frame-subtitle self-supervision of claim 1, wherein fusing the video frame features and the question-answer features with an attention mechanism to obtain the attention frame features comprises:
for a given encoded video frame feature v ∈ R^{N_o×128} and a single question-answer feature h ∈ R^{L_h×128}, where N_o is the number of object regions and L_h is the number of words, computing the similarity matrix Sim_v of all word-region pairs with a word-level attention mechanism; multiplying Sim_v and its transpose Sim_v^T with v and h respectively and reducing the dimension with max pooling:
v_att = max(Sim_v v), v_att ∈ R^128
h_att = max(Sim_v^T h), h_att ∈ R^128
and fusing the results by the following formula to obtain the attention frame feature v_f ∈ R^128:
v_f = ([v_att; h_att; v_att ⊙ h_att; v_att + h_att]) W_1 + b_1
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
4. The method for multi-modal video question answering using frame-subtitle self-supervision of claim 1, wherein fusing the caption suggestion features and the question-answer features with an attention mechanism to obtain the attention caption features comprises:
for a single caption suggestion feature s^{sp} ∈ R^{L_sp×128} and a single question-answer feature h ∈ R^{L_h×128}, computing the similarity matrix Sim_s of all word pairs with a word-level attention mechanism; multiplying Sim_s and its transpose Sim_s^T with s^{sp} and h respectively and reducing the dimension with max pooling over the words:
s_att = max(Sim_s s^{sp}), s_att ∈ R^128
h_att = max(Sim_s^T h), h_att ∈ R^128
and fusing the results by the following formula to obtain the attention caption feature s_f ∈ R^128:
s_f = ([s_att; h_att; s_att ⊙ h_att; s_att + h_att]) W_1 + b_1
where W_1 and b_1 are parameters to be trained that reduce the fused feature to 128 dimensions, and ⊙ denotes the Hadamard product.
5. The method of claim 1, wherein calculating the answer to the question from the fusion features and the temporal attention score comprises:
given the start and end times st and ed, using the temporal attention score A_k together with max pooling to compute a local question-answer representation, and using max pooling over all time steps to compute a global representation; concatenating the local and global representations, obtaining the answer probabilities through a softmax layer, and taking the candidate answer with the highest probability as the answer to the question.
6. The method for multi-modal video question answering using frame-subtitle self-supervision of claim 1, wherein optimizing the network parameters of the neural network comprises:
obtaining the caption-feature temporal attention A and the caption suggestion labels from the question-answer features, the caption features and the caption suggestion features;
computing a binary cross-entropy loss and a ranking loss from the caption-feature temporal attention A and the caption suggestion labels, the ranking loss being
loss_rank = 1 + avg(A_out) - avg(A_in)
where A_in is the caption-feature temporal attention of the frames whose caption suggestion label is 1 and A_out is that of the frames whose label is 0;
computing the binary cross-entropy loss loss_bce between the caption-feature temporal attention A and the caption suggestion labels, where T_in is the number of frames whose caption suggestion label is 1 and T_out is the number of frames whose label is 0;
establishing the total self-supervised loss function as
loss_self = loss_rank + loss_bce
and, after the total self-supervised loss is computed, obtaining the gradients by the chain rule, updating the parameters according to the gradient-descent formula, computing a new total self-supervised loss by another forward pass, and repeating until the loss converges.
CN202110017595.1A 2021-01-07 2021-01-07 Method for multi-mode video question answering by using frame-subtitle self-supervision Active CN112860945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110017595.1A CN112860945B (en) 2021-01-07 2021-01-07 Method for multi-mode video question answering by using frame-subtitle self-supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110017595.1A CN112860945B (en) 2021-01-07 2021-01-07 Method for multi-mode video question answering by using frame-subtitle self-supervision

Publications (2)

Publication Number Publication Date
CN112860945A CN112860945A (en) 2021-05-28
CN112860945B true CN112860945B (en) 2022-07-08

Family

ID=76004712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110017595.1A Active CN112860945B (en) 2021-01-07 2021-01-07 Method for multi-mode video question answering by using frame-subtitle self-supervision

Country Status (1)

Country Link
CN (1) CN112860945B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590879B (en) * 2021-08-05 2022-05-31 哈尔滨理工大学 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
CN113688296B (en) * 2021-08-10 2022-05-31 哈尔滨理工大学 Method for solving video question-answering task based on multi-mode progressive attention model
CN113423004B (en) * 2021-08-23 2021-11-30 杭州一知智能科技有限公司 Video subtitle generating method and system based on decoupling decoding
CN113837259B (en) * 2021-09-17 2023-05-30 中山大学附属第六医院 Education video question-answering method and system for graph-note-meaning fusion of modal interaction
CN114707022B (en) * 2022-05-31 2022-09-06 浙江大学 Video question-answer data set labeling method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9330084B1 (en) * 2014-12-10 2016-05-03 International Business Machines Corporation Automatically generating question-answer pairs during content ingestion by a question answering computing system
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 It is a kind of to solve the method for video question and answer using multilayer notice network mechanism
CN107948730A (en) * 2017-10-30 2018-04-20 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on picture generation video
CN110704601A (en) * 2019-10-11 2020-01-17 浙江大学 Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018081751A1 (en) * 2016-10-28 2018-05-03 Vilynx, Inc. Video tagging system and method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9330084B1 (en) * 2014-12-10 2016-05-03 International Business Machines Corporation Automatically generating question-answer pairs during content ingestion by a question answering computing system
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 It is a kind of to solve the method for video question and answer using multilayer notice network mechanism
CN107948730A (en) * 2017-10-30 2018-04-20 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on picture generation video
CN110704601A (en) * 2019-10-11 2020-01-17 浙江大学 Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN111625660A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Dialog generation method, video comment method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Junyeong Kim et al. Gaining Extra Supervision via Multi-task Learning for Multi-Modal Video Question Answering. arXiv, 2019. *
Wang Bo. Research and Application of Visual Semantic Representation Models in Video Question Answering. China Master's Theses Full-text Database, Information Science and Technology Series, 2020. *

Also Published As

Publication number Publication date
CN112860945A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN112860945B (en) Method for multi-mode video question answering by using frame-subtitle self-supervision
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Gao et al. Hierarchical representation network with auxiliary tasks for video captioning and video question answering
Torabi et al. Learning language-visual embedding for movie understanding with natural-language
CN112528676B (en) Document-level event argument extraction method
Dilawari et al. ASoVS: abstractive summarization of video sequences
US11120268B2 (en) Automatically evaluating caption quality of rich media using context learning
CN111274804A (en) Case information extraction method based on named entity recognition
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN112580362A (en) Visual behavior recognition method and system based on text semantic supervision and computer readable medium
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN116029305A (en) Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning
Chen et al. Image captioning with memorized knowledge
CN114065738B (en) Chinese spelling error correction method based on multitask learning
CN116049557A (en) Educational resource recommendation method based on multi-mode pre-training model
CN114048290A (en) Text classification method and device
CN112651225B (en) Multi-item selection machine reading understanding method based on multi-stage maximum attention
CN113051904B (en) Link prediction method for small-scale knowledge graph
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
Mazaheri et al. Video fill in the blank using lr/rl lstms with spatial-temporal attentions
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN117195892B (en) Classroom teaching evaluation method and system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant