CN113590879B - System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network - Google Patents

System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network

Info

Publication number
CN113590879B
Authority
CN
China
Prior art keywords
level
event
answer
video
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110896068.2A
Other languages
Chinese (zh)
Other versions
CN113590879A (en)
Inventor
孙广路
梁丽丽
李天麟
刘昕雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110896068.2A priority Critical patent/CN113590879B/en
Publication of CN113590879A publication Critical patent/CN113590879A/en
Application granted granted Critical
Publication of CN113590879B publication Critical patent/CN113590879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Studio Circuits (AREA)

Abstract

The invention provides a system, a method, a computer and a storage medium for solving multi-event video question answering with a shortened-timestamp network, belonging to the intersection of computer vision and natural language processing. Event embeddings are extracted from the video and its subtitles at multiple levels, and the features of the question and the candidate answers are extracted. Question-guided attention produces attention weights for the different events, and an intercept (λ-cut) matrix from fuzzy theory is used to extract the key event embeddings in the video. The question and the answers each attend to the key event embeddings of the different modalities to generate question-guided and answer-guided context information, which is adaptively fused to generate the answer. Compared with general video question-answering schemes, the method extracts multi-modal embeddings of multiple events from the video, screens out the key events using the intercept matrix and related tools from fuzzy mathematics, and improves answer accuracy by removing redundant information. The method performs better on video question answering than traditional methods.

Description

System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
Technical Field
The invention relates to a video question-answering method, and in particular to a system, method, computer and storage medium for solving multi-event video question answering with a shortened-timestamp network, belonging to the intersection of computer vision and natural language processing.
Background
Video question answering is an important problem in the fields of artificial intelligence and deep learning: given an input video containing audio-visual information and a corresponding text question, the task is to automatically select, from several given candidate answers, the choice that answers the question and best matches the video content as the predicted answer.
Video usually consists of consecutive frames and therefore contains more temporal information than still images, such as scene transitions and object motion. In addition, the questions and candidate answers in video question answering are typically continuous text sequences. The core techniques for solving the video question-answering task in the prior art therefore derive mainly from natural language processing: the encoder-decoder structure, the attention mechanism and the memory network. The encoder-decoder structure encodes the temporal information of the video and the information of the question with an encoder and then generates the answer with a decoder. The attention model computes the similarity between the question and the video, assigns higher weights to the video information related to the question, and generates the answer on that basis. The memory network stores all the information of a longer video in a memory array to prevent information loss during encoding. The mainstream approach to video question answering is to combine the encoder-decoder structure, the attention mechanism and the memory network organically, assisted by techniques such as reinforcement learning and generative adversarial networks, to generate more accurate answers.
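As a concrete illustration of the attention idea described above, the following PyTorch sketch weights per-frame video features by their similarity to a question vector. It is a minimal, hypothetical example: the tensor shapes and the dot-product scoring function are assumptions for illustration, not taken from any particular prior-art system.

```python
import torch
import torch.nn.functional as F

def question_guided_attention(video_feats: torch.Tensor, question_vec: torch.Tensor) -> torch.Tensor:
    """Weight video features by their similarity to the question.

    video_feats:  (N, D) tensor, one feature vector per frame or clip.
    question_vec: (D,)   tensor, pooled question representation.
    Returns an attended (D,) summary of the video.
    """
    scores = video_feats @ question_vec              # (N,) similarity scores
    weights = F.softmax(scores, dim=0)               # higher weight for question-related frames
    return (weights.unsqueeze(1) * video_feats).sum(dim=0)

# toy usage
v = torch.randn(8, 300)    # 8 frames, 300-d features
q = torch.randn(300)
summary = question_guided_attention(v, q)
print(summary.shape)       # torch.Size([300])
```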
However, as research on video question answering has developed, the videos being queried have grown longer: the average durations of videos in question-answering datasets based on movies and on television shows reach roughly 200 seconds and 60 seconds respectively, and such videos may contain many events rather than a single one. Solving multi-event video question answering over long videos therefore requires two additional capabilities: the ability to discover and locate question-related events in a lengthy video, and the ability to accurately reason about the relationships between question-related events and the information within those events. Existing video question-answering techniques encode the information of the whole video and reason over it with an attention mechanism, which forces the model to consider excessive redundant information, hampering the understanding of the video and reducing the accuracy of answer prediction. Experiments show that if the key event related to the question can be accurately located in the long video, i.e. its start time and end time are determined, the accuracy of video question-answer prediction can be effectively improved.
To solve the above problems, the present invention processes the event information in the video with an attention mechanism and fuzzy theory, shortens the timestamp of the information related to the question to precisely locate the key event, and uses the attention mechanism to infer question-guided context information and answer-guided context information respectively in order to predict the answer.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
The invention provides a system for solving multi-event video question answering with a shortened-timestamp network, which comprises a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module, wherein:
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event characteristics of the video;
the subtitle level extraction module is used for extracting event characteristics of a subtitle level;
the question and answer extraction module is used for extracting the input question and candidate answer features;
the event embedding module is used for respectively embedding the event characteristics into a memory array to obtain event embedding of a frame level, a clipping level and a subtitle level;
the key event embedding module is used for screening out key event embedding at a frame level, a clipping level and a subtitle level from event features;
the context information module is used for generating context information with question guidance and answer guidance;
the answer selection module is used for obtaining a predicted answer.
A method for solving multi-event video question answering with a shortened-timestamp network comprises the following steps:
S1, extracting the frame-level and clip-level event features of the video for the input video;
S2, extracting subtitle-level event features for the subtitles corresponding to the input video;
S3, extracting the question and answer features for the input question and candidate answers;
S4, embedding the event features into memory arrays respectively to obtain frame-level, clip-level and subtitle-level event embeddings;
S5, designing a timestamp-shortening module based on the attention mechanism and fuzzy theory, and screening frame-level, clip-level and subtitle-level key event embeddings out of the event features according to the extracted question features;
S6, designing a context-information generation module based on the attention mechanism, and generating question-guided and answer-guided context information according to the extracted question features and answer features;
S7, designing an adaptive answer selection module, and obtaining the predicted answer according to the extracted answer features.
Preferably, the specific method for extracting the event features at the frame level and the clip level of the video in step S1 is as follows:
extracting frame-level event features:
for the input video, a pre-trained residual neural network is used to extract its frame-level event features f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, a pre-trained three-dimensional convolutional network is used to extract its clip-level event features c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
Preferably, the specific method for extracting the subtitle-level event features in step S2 is as follows: for the subtitles corresponding to the input video, a pre-trained BERT model is used to extract the subtitle-level event features s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video.
Preferably, the specific method for extracting the question and answer features in step S3 and obtaining the event embedding at the frame level, the clip level and the subtitle level in step S4 is as follows:
extracting problem features:
for the input question, question features u = {q_1, q_2, ..., q_M} are extracted using the pre-trained word embedding model GloVe, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
and (3) answer features are extracted:
for the input candidate answers, the features g of the candidate answers are extracted using a pre-trained word embedding model, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer.
The specific method for obtaining event embedding at the frame level, clip level and subtitle level is as follows:
a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s.
Preferably, the specific method for screening key event embedding at frame level, clip level and subtitle level from event features in step S5 is as follows:
S51, an attention layer is designed to calculate the correlations r_uf, r_uc and r_us between the event embeddings at the different levels and the question feature u:

r_uf = softmax(u M_f)
r_uc = softmax(u M_c)
r_us = softmax(u M_s)

where softmax(·) is the soft normalization function

softmax(x_i) = exp(x_i) / Σ_j exp(x_j), j = 1, ..., n

where x denotes any vector with n elements (x_1, x_2, ..., x_n), x_i and x_j denote the i-th and j-th elements of x, and exp(·) denotes the exponential function with base e;
S52, the frame-level correlation fuzzy matrix R_uf, the clip-level correlation fuzzy matrix R_uc and the subtitle-level correlation fuzzy matrix R_us are constructed as follows:

R_uf = (r_uf)_{N×3}, R_uc = (r_uc)_{N×3}, R_us = (r_us)_{N×3}, 0 ≤ r_uf, r_uc, r_us ≤ 1

S53, the Boolean λ intercept (λ-cut) matrices R_uf^λ, R_uc^λ and R_us^λ of the frame-level, clip-level and subtitle-level correlation fuzzy matrices R_uf, R_uc and R_us are calculated: each element of an intercept matrix is 1 if the corresponding element of the fuzzy matrix is greater than or equal to the threshold λ, and 0 otherwise;
S54, for the Boolean λ intercept matrices R_uf^λ, R_uc^λ and R_us^λ, the valid clue R_tmin related to the question is locked onto across the multiple events by taking their union (following the basic operation rules of fuzzy sets), where t_min = [t_start, t_end] denotes the shortest timestamp of the events related to the question, t_start denotes the start time of the shortest timestamp and t_end denotes its end time;
S55, the valid clue R_tmin related to the question is used to update the frame-level, clip-level and subtitle-level event embeddings M_f, M_c and M_s obtained in step S4 into key event embeddings: the valid clue is fused with each event embedding by element-wise multiplication, and the result is passed through a linear layer with weight matrix W_f, W_c or W_s and bias parameter b_f, b_c or b_s respectively, so that the updated embeddings contain only information related to the question.
Preferably, the specific method for generating the question-guided and answer-guided context information in step S6 is as follows:
S61, generating question-guided frame-level, clip-level and subtitle-level context information;
S62, generating answer-guided frame-level, clip-level and subtitle-level context information;
S63, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u;
S64, fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g.
Preferably, the specific method for obtaining the predicted answer from the extracted answer features in step S7 is as follows:
S71, combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, and dynamically calculating the confidence z of each candidate answer using an adaptive weight α;
S72, selecting the answer with the highest confidence among the candidate answers as the predicted answer;
S73, comparing the predicted answer with the true answer in the training data, and updating the parameters of the shortened-timestamp network according to the difference.
A computer comprising a memory storing a computer program and a processor, the processor implementing the steps of the above method for solving multi-event video question answering with a shortened-timestamp network when executing the computer program.
A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the above method for solving multi-event video question answering with a shortened-timestamp network.
The invention has the following beneficial effects. The invention extracts frame-level, clip-level and subtitle-level features for the multiple events in a video, which improves the efficiency of acquiring information such as scenes, appearance and motion within the events and strengthens the ability to acquire video information. To address the excessive redundant information in long videos, the invention designs a timestamp-shortening module based on the attention mechanism and a fuzzy matrix, selects the key events related to the question from the multiple events of the video, and removes redundant information, thereby improving the accuracy of reasoning. The invention also designs a method that generates context information separately from the question features and the candidate-answer features, fully integrating the acquired event embeddings with the question and the candidate answers and improving the proposed method's ability to understand video information.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of a system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the timestamp-shortening module in accordance with an embodiment of the present invention;
FIG. 4 is the overall framework of the shortened-timestamp network according to an embodiment of the present invention.
Detailed Description
To make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. Clearly, the described embodiments are only some of the embodiments of the present application and are not exhaustive of all embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.
The first embodiment is as follows:
This embodiment is described with reference to FIG. 1. The system of this embodiment for solving multi-event video question answering with a shortened-timestamp network includes a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module;
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event characteristics of the video;
the caption level extraction module is used for extracting the event characteristics of the caption level;
the question and answer extraction module is used for extracting the input question and candidate answer features;
the event embedding module is used for respectively embedding the event characteristics into a memory array to obtain event embedding of a frame level, a clipping level and a subtitle level;
the key event embedding module is used for screening key event embedding at a frame level, a clipping level and a subtitle level from event features;
the context information module is used for generating context information with question guidance and answer guidance;
the answer selection module is used for obtaining a predicted answer.
Example two:
This embodiment is described with reference to FIGS. 2 to 4. The method of this embodiment for solving multi-event video question answering with a shortened-timestamp network includes the following steps:
S1, for the input video, extracting the frame-level and clip-level event features of the video using a residual neural network and a three-dimensional convolutional network respectively;
extracting frame-level event features:
for the input video, a pre-trained residual neural network is used to extract its frame-level event features f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, a pre-trained three-dimensional convolutional network is used to extract its clip-level event features c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
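A minimal sketch of this visual feature extraction is given below. It assumes torchvision's pre-trained ResNet-152 as the residual network and its R3D-18 model as the three-dimensional convolutional network (with the torchvision ≥ 0.13 weight-enum API); the specific backbones, input resolutions and per-event sampling strategy are assumptions for illustration, not prescribed by the patent.

```python
import torch
from torchvision.models import resnet152, ResNet152_Weights
from torchvision.models.video import r3d_18, R3D_18_Weights

# Pre-trained 2D residual network for frame-level features (classification head removed).
resnet = resnet152(weights=ResNet152_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

# Pre-trained 3D convolutional network for clip-level features.
c3d = r3d_18(weights=R3D_18_Weights.DEFAULT)
c3d.fc = torch.nn.Identity()
c3d.eval()

with torch.no_grad():
    # One representative frame per event: N events, 3 x 224 x 224 each (toy data, N = 4).
    frames = torch.randn(4, 3, 224, 224)
    f = resnet(frames)                     # frame-level event features f, shape (N, 2048)

    # One short clip per event: N events, 3 channels, 16 frames, 112 x 112.
    clips = torch.randn(4, 3, 16, 112, 112)
    c = c3d(clips)                         # clip-level event features c, shape (N, 512)
```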
S2, for the subtitles corresponding to the input video, extracting subtitle-level event features using a pre-trained model: a pre-trained BERT model is used to extract the subtitle-level event features s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video.
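A sketch of the subtitle-level extraction, assuming the Hugging Face transformers implementation of BERT and using each event's [CLS] vector as its subtitle-level feature; the choice of checkpoint and pooling is an assumption, and the subtitle strings are toy examples.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

subtitles = [
    "Sheldon knocks on the door three times.",    # subtitle text of event 1 (toy example)
    "Leonard answers a question about physics.",  # subtitle text of event 2 (toy example)
]

with torch.no_grad():
    batch = tokenizer(subtitles, padding=True, truncation=True, return_tensors="pt")
    out = bert(**batch)
    # Use the [CLS] token representation as the subtitle-level event feature s_i.
    s = out.last_hidden_state[:, 0, :]             # shape (N, 768)
```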
S3, for the input question and candidate answers, extracting the question and answer features using word embedding models;
extracting problem features:
for the input question, question features u = {q_1, q_2, ..., q_M} are extracted using the pre-trained word embedding model GloVe, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
and (3) answer features are extracted:
for the 5 input candidate answers, the features of the candidate answers are extracted using a pre-trained word embedding model, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer.
Providing 5 candidate answers reflects the usual multiple-choice setting in video question answering, which typically offers 4 to 5 candidate answers per question.
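A sketch of the question and candidate-answer feature extraction, assuming the torchtext GloVe vectors (6B, 300-d); whitespace tokenization and the example question/answers are simplifications and toy data, not the dataset's actual preprocessing.

```python
import torch
from torchtext.vocab import GloVe

glove = GloVe(name="6B", dim=300)   # pre-trained GloVe word embeddings

def embed_text(text: str) -> torch.Tensor:
    """Return a (num_words, 300) matrix of GloVe vectors (q_1..q_M or g^j_1..g^j_T)."""
    tokens = text.lower().split()
    return torch.stack([glove[t] for t in tokens])  # unknown words map to zero vectors

u = embed_text("what does Sheldon do before Leonard opens the door")   # question features
g = [embed_text(ans) for ans in [
    "he knocks three times", "he rings the bell", "he waits silently",
    "he calls Penny", "he leaves a note",
]]                                                                      # 5 candidate answers
```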
S4, embedding the event features into memory arrays respectively to obtain frame-level, clip-level and subtitle-level event embeddings: a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s.
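The patent does not specify the exact architecture of the "linear activation layer plus convolution layer" network, so the module below is a hypothetical sketch of one way to embed per-event features into a memory array M_f, M_c or M_s; the hidden size, kernel size and ReLU activation are assumptions.

```python
import torch
import torch.nn as nn

class MemoryEmbedding(nn.Module):
    """Embed per-event features (N, d_in) into a memory array (N, d_mem)."""
    def __init__(self, d_in: int, d_mem: int = 512):
        super().__init__()
        self.linear = nn.Linear(d_in, d_mem)                               # linear activation layer
        self.conv = nn.Conv1d(d_mem, d_mem, kernel_size=3, padding=1)      # 1-D conv over the event axis

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.linear(feats))     # (N, d_mem)
        x = x.t().unsqueeze(0)                 # (1, d_mem, N) for Conv1d
        x = self.conv(x).squeeze(0).t()        # back to (N, d_mem)
        return x

# One embedding network per modality (weights not shared), e.g. for frame-level features:
embed_f = MemoryEmbedding(d_in=2048)           # 2048-d features from the residual network
M_f = embed_f(torch.randn(4, 2048))            # frame-level event embedding, shape (4, 512)
```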
S5, according to the extracted frame-level, clip-level and subtitle-level event embeddings, designing a timestamp-shortening module based on the attention mechanism and fuzzy theory, and screening frame-level, clip-level and subtitle-level key event embeddings out of the event features according to the extracted question features;
S51, according to the extracted frame-level, clip-level and subtitle-level event embeddings M_f, M_c and M_s, an attention layer is designed to calculate the correlations r_uf, r_uc and r_us between the event embeddings at the different levels and the question feature u:

r_uf = softmax(u M_f)
r_uc = softmax(u M_c)
r_us = softmax(u M_s)

where softmax(·) is the soft normalization function

softmax(x_i) = exp(x_i) / Σ_j exp(x_j), j = 1, ..., n

where x denotes any vector with n elements (x_1, x_2, ..., x_n), x_i and x_j denote the i-th and j-th elements of x, and exp(·) denotes the exponential function with base e;
S52, for the correlations r_uf, r_uc and r_us between the event embeddings at the different levels and the question feature, the concept of the fuzzy matrix from fuzzy theory is introduced to construct the frame-level correlation fuzzy matrix R_uf, the clip-level correlation fuzzy matrix R_uc and the subtitle-level correlation fuzzy matrix R_us:

R_uf = (r_uf)_{N×3}, R_uc = (r_uc)_{N×3}, R_us = (r_us)_{N×3}, 0 ≤ r_uf, r_uc, r_us ≤ 1

where N represents the number of events in the video;
S53, according to the matrix-cut operation of fuzzy theory, the Boolean λ intercept (λ-cut) matrices R_uf^λ, R_uc^λ and R_us^λ of the frame-level, clip-level and subtitle-level correlation fuzzy matrices R_uf, R_uc and R_us are calculated: each element of an intercept matrix is 1 if the corresponding element of the fuzzy matrix is greater than or equal to the threshold λ, and 0 otherwise;
S54, for the Boolean λ intercept matrices R_uf^λ, R_uc^λ and R_us^λ, the valid clue R_tmin related to the question is locked onto across the multiple events by taking their union (following the basic operation rules of fuzzy sets), where t_min = [t_start, t_end] denotes the shortest timestamp of the events related to the question, t_start denotes the start time of the shortest timestamp and t_end denotes its end time;
S55, the valid clue R_tmin related to the question is used to update the frame-level, clip-level and subtitle-level event embeddings M_f, M_c and M_s obtained in step S4 into key event embeddings: the valid clue is fused with each event embedding by element-wise multiplication, and the result is passed through a linear layer with weight matrix W_f, W_c or W_s and bias parameter b_f, b_c or b_s respectively, so that the updated embeddings contain only information related to the question.
S6, designing a context-information generation module based on the attention mechanism, and generating question-guided and answer-guided context information according to the extracted question features and answer features;
according to the extracted frame-level, clip-level and subtitle-level key event embeddings M_f, M_c and M_s, the question feature u and the answer features g are used to attend to them respectively, generating question-guided frame-level, clip-level and subtitle-level context information and answer-guided frame-level, clip-level and subtitle-level context information.
S61, generating the question-guided frame-level, clip-level and subtitle-level context information by attending to the key event embeddings M_f, M_c and M_s with the question feature u;
S62, generating the answer-guided frame-level, clip-level and subtitle-level context information by attending to the key event embeddings M_f, M_c and M_s with the answer features g;
S63, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u: the obtained question-guided frame-level, clip-level and subtitle-level context information is fused using concatenation and element-level product operations, where the Concat(·) function concatenates all the information inside it;
S64, fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g: the obtained answer-guided frame-level, clip-level and subtitle-level context information is fused using concatenation and element-level product operations, where the Concat(·) function concatenates all the information inside it.
S7, designing an adaptive answer selection module, and obtaining the predicted answer according to the extracted answer features.
S71, combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, the confidence z of each candidate answer is dynamically calculated using an adaptive weight α;
S72, the answer with the highest confidence among the 5 candidate answers is selected as the predicted answer ŷ:

ŷ = a_j, where z_j = argmax_{i∈[1,5]}(z_i)

where argmax(·) denotes finding the largest element among several elements, a_j represents the j-th candidate answer, z_j represents the confidence of the j-th candidate answer, and z_i represents the confidence of the i-th candidate answer.
S73, the predicted answer ŷ is compared with the true answer in the training data, and the parameters of the shortened-timestamp network are updated according to the resulting loss.
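A sketch of the adaptive answer-selection step follows. The published formulas for the confidence z and the adaptive weight α are images, so the sigmoid-gated fusion, bilinear scoring and cross-entropy loss below are illustrative assumptions rather than the patented formulas; dimensions match the toy values used in the earlier sketches.

```python
import torch
import torch.nn as nn

class AnswerSelector(nn.Module):
    def __init__(self, d_ctx: int, d_ans: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # adaptive fusion weight α (learned)
        self.score = nn.Bilinear(d_ctx, d_ans, 1)      # scores a (context, answer) pair

    def forward(self, O_u, O_g, answers):
        """answers: (5, d_ans) pooled candidate-answer features; returns confidences z of shape (5,)."""
        alpha = torch.sigmoid(self.alpha)
        fused = alpha * O_u + (1.0 - alpha) * O_g      # adaptively fuse the two contexts
        fused = fused.unsqueeze(0).expand(answers.size(0), -1)
        return self.score(fused, answers).squeeze(1)

# Toy training step: compare the prediction with the ground-truth index and update the network.
selector = AnswerSelector(d_ctx=2048, d_ans=300)       # 2048 = 4 * 512 from the fused context sketch
z = selector(torch.randn(2048), torch.randn(2048), torch.randn(5, 300))
pred = z.argmax().item()                               # S72: highest-confidence answer
loss = nn.CrossEntropyLoss()(z.unsqueeze(0), torch.tensor([2]))   # S73: loss against the true answer
loss.backward()
```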
The method provided by the invention is subjected to experimental analysis:
The invention was experimentally verified on a self-constructed dataset that uses the American television series The Big Bang Theory as its video source. The dataset contains 461 hours of video in total, covering 925 scenes, from which 21,793 clips containing 118,974 events and 152,545 question-answer pairs were generated; the training set contains 122,039 question-answer pairs, the validation set 15,252 and the test set 7,623.
Question-answer pairs were generated from the video and subtitles using templates: a question template first locates the video segment associated with the question, with start and end timestamps, according to sentence patterns of the form "when ...", "before ..." or "after ...", and then poses one of four question types, "what", "how", "where" or "why", about that segment. The questions in this dataset are multiple-choice; each question has five candidate answers, only one of which is correct.
The video clips are 60 to 90 seconds long on average, contain a large amount of information about character activities and scenes, and exhibit rich dynamics and realistic social interaction. In addition, the dataset provides start and end timestamps for every event in every video clip, so the critical part of a clip can be accurately located according to the question.
To objectively evaluate the performance of the proposed method, the shortened-timestamp network is evaluated with classification accuracy, i.e. the ratio of the number of correctly answered questions to the total number of questions, a metric commonly used for classification tasks:

Accuracy = (1/M) Σ_{q∈Q_t} 1(ŷ = y)

where M represents the number of question-answer pairs, Q_t represents the set of questions asked, ŷ represents the predicted answer and y represents the true answer.
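The accuracy metric above can be computed as in the short sketch below, a straightforward implementation of the stated ratio with toy inputs.

```python
def accuracy(predicted: list[int], truth: list[int]) -> float:
    """Ratio of correctly predicted answers to the total number of question-answer pairs M."""
    assert len(predicted) == len(truth)
    correct = sum(int(p == y) for p, y in zip(predicted, truth))
    return correct / len(truth)

print(accuracy([2, 0, 4, 1], [2, 0, 3, 1]))   # 0.75
```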
In order to evaluate the performance of the algorithm, the invention respectively sets three experimental tasks by controlling input data:
the "S + Q" task: answering the given questions only according to the subtitle information of the video;
the "V + Q" task: answering the given question only according to the visual information of the video;
the "S + V + Q" task: and simultaneously, answering the given questions according to the visual information and the subtitle information of the video.
First, the method was tested according to the procedure described in the detailed embodiment. The method is denoted STN, results are measured as accuracy (%), and the test results are shown in Table 1.

Table 1. Test results of the method of the invention on the three experimental tasks

Method | "S+Q" task | "V+Q" task | "S+V+Q" task
STN    | 68.90      | 50.87      | 70.05
To verify the effectiveness of step S5 of the detailed method, five ablation schemes were designed and ablation tests were carried out on the three experimental tasks; results are measured as accuracy (%) and are shown in Table 2. The five ablation schemes are as follows:
(1) Step S5 is removed from the test procedure, and the event embeddings generated in step S4 are used directly to generate the context information and predict the answer.
(2) In steps S5 and S6, all operations related to the clip-level and subtitle-level event embeddings M_c and M_s are removed; key event embeddings and context information are generated from the frame-level event embedding M_f only, and the answer is predicted from that context information.
(3) In steps S5 and S6, all operations related to the frame-level and subtitle-level event embeddings M_f and M_s are removed; key event embeddings and context information are generated from the clip-level event embedding M_c only, and the answer is predicted from that context information.
(4) In steps S5 and S6, all operations related to the clip-level and frame-level event embeddings M_c and M_f are removed; key event embeddings and context information are generated from the subtitle-level event embedding M_s only, and the answer is predicted from that context information.
(5) STN, i.e. testing with the unmodified STN.

Table 2. Results of the ablation tests of the invention on the three experimental tasks for step S5 of the proposed method
Analysis of the experimental results shows that the proposed method effectively improves the accuracy of answer prediction.
The working principle of the invention is as follows:
video and subtitles are extracted as event embedding at multiple levels, and features of questions and candidate answers are extracted. Attention weights of different events are obtained by using problem-oriented attention, and key event embedding in the video is extracted by using an intercept matrix in a fuzzy theory. And respectively focusing on key event embedding of different modalities by using the question and the answer, and generating context information with question guidance and answer guidance. The context of question-oriented and answer-oriented are adaptively fused to generate answers. Compared with a general video question-answering scheme, the method extracts multi-mode embedding of a plurality of events from the video, screens out key events by using theories such as an intercept matrix in fuzzy mathematics and the like, and improves the accuracy of answering by removing redundant information. Compared with the traditional method, the effect obtained by the method in the video question answering is better.
The computer device of the present invention may be a device comprising a processor, a memory and the like, for example a single-chip microcomputer comprising a central processing unit. The processor is configured to implement the steps of the above method for solving multi-event video question answering with a shortened-timestamp network when executing the computer program stored in the memory.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
Computer-readable storage medium embodiments
The computer-readable storage medium of the present invention may be any form of storage medium readable by the processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory and the like, on which a computer program is stored; when the computer program stored in the memory is read and executed by the processor of the computer device, the above-described steps of the method for solving multi-event video question answering with a shortened-timestamp network can be implemented.
The computer program comprises computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention described herein. It should also be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, not restrictive, and the scope of the invention is defined by the appended claims.

Claims (9)

1. A shortened-timestamp network system for multi-event video question answering, characterized by comprising a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module;
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event features of a video, and the specific method comprises the following steps:
extracting frame-level event features:
for the input video, a pre-trained residual neural network is used to extract its frame-level event features f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, a pre-trained three-dimensional convolutional network is used to extract its clip-level event features c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video;
the subtitle-level extraction module is used for extracting subtitle-level event features, and the specific method is as follows: for the subtitles corresponding to the input video, a pre-trained BERT model is used to extract the subtitle-level event features s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video;
the question and answer extraction module is used for extracting input question and candidate answer characteristics;
the specific method for extracting the question features is as follows:
for the input question, question features u = {q_1, q_2, ..., q_M} are extracted using the pre-trained word embedding model GloVe, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
the specific method for extracting the candidate answer features is as follows:
for the input candidate answers, the features g of the candidate answers are extracted using a pre-trained word embedding model, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer;
the event embedding module is used for embedding the event features into memory arrays respectively to obtain frame-level, clip-level and subtitle-level event embeddings, and the specific method is as follows: a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s;
The key event embedding module is used for screening key event embedding of a frame level, a clip level and a subtitle level from event features, and the specific method comprises the following steps:
a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s;
the context information module is used for generating question-guided and answer-guided context information, and the specific method is as follows: generating question-guided frame-level, clip-level and subtitle-level context information and answer-guided frame-level, clip-level and subtitle-level context information, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u, and fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g;
the answer selection module is used for obtaining the predicted answer, and the specific method is as follows: combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, the confidence z of each candidate answer is dynamically calculated using an adaptive weight α; the answer with the highest confidence among the candidate answers is selected as the predicted answer; and the predicted answer is compared with the true answer in the training data, and the parameters of the shortened-timestamp network are updated according to the difference.
2. A method for solving multi-event video question answering with a shortened-timestamp network, characterized by comprising the following steps:
s1, extracting the event characteristics of the frame level and the clip level of the video for the input video;
s2, extracting event features of a caption level for a caption corresponding to an input video;
s3, extracting the characteristics of the questions and the answers for the input questions and candidate answers;
s4, respectively embedding the event characteristics into a memory array to obtain frame-level, clip-level and subtitle-level event embedding;
S5, designing a timestamp-shortening module based on the attention mechanism and fuzzy theory, and screening frame-level, clip-level and subtitle-level key event embeddings out of the event features according to the extracted question features; the specific method comprises the following steps:
S51, an attention layer is designed to calculate the correlations r_uf, r_uc and r_us between the event embeddings at the different levels and the question feature u:

r_uf = softmax(u M_f)
r_uc = softmax(u M_c)
r_us = softmax(u M_s)

where softmax(·) is the soft normalization function

softmax(x_i) = exp(x_i) / Σ_j exp(x_j), j = 1, ..., n

where x denotes any vector with n elements (x_1, x_2, ..., x_n), x_i and x_j denote the i-th and j-th elements of x, and exp(·) denotes the exponential function with base e;
S52, the frame-level correlation fuzzy matrix R_uf, the clip-level correlation fuzzy matrix R_uc and the subtitle-level correlation fuzzy matrix R_us are constructed as follows:

R_uf = (r_uf)_{N×3}, R_uc = (r_uc)_{N×3}, R_us = (r_us)_{N×3}, 0 ≤ r_uf, r_uc, r_us ≤ 1

S53, the Boolean λ intercept (λ-cut) matrices R_uf^λ, R_uc^λ and R_us^λ of the frame-level, clip-level and subtitle-level correlation fuzzy matrices R_uf, R_uc and R_us are calculated: each element of an intercept matrix is 1 if the corresponding element of the fuzzy matrix is greater than or equal to the threshold λ, and 0 otherwise;
S54, for the Boolean λ intercept matrices R_uf^λ, R_uc^λ and R_us^λ, the valid clue R_tmin related to the question is locked onto across the multiple events by taking their union (following the basic operation rules of fuzzy sets), where t_min = [t_start, t_end] denotes the shortest timestamp of the events related to the question, t_start denotes the start time of the shortest timestamp and t_end denotes its end time;
S55, the valid clue R_tmin related to the question is used to update the frame-level, clip-level and subtitle-level event embeddings M_f, M_c and M_s obtained in step S4 into key event embeddings: the valid clue is fused with each event embedding by element-wise multiplication, and the result is passed through a linear layer with weight matrix W_f, W_c or W_s and bias parameter b_f, b_c or b_s respectively, so that the updated embeddings contain only information related to the question;
s6, designing a module for generating context information based on an attention mechanism, and generating context information with question guidance and answer guidance according to the extracted question features and answer features;
and S7, designing a self-adaptive answer selection module, and obtaining a predicted answer according to the extracted answer features.
3. The method according to claim 2, wherein the specific method for extracting the event features at the frame level and the clip level of the video at step S1 is as follows:
extracting frame-level event features:
for the input video, a pre-trained residual neural network is used to extract its frame-level event features f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, a pre-trained three-dimensional convolutional network is used to extract its clip-level event features c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
4. The method according to claim 3, wherein the specific method for extracting the subtitle-level event features in step S2 is as follows: for the subtitles corresponding to the input video, a pre-trained BERT model is used to extract the subtitle-level event features s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video.
5. The method according to claim 4, wherein the specific methods for extracting the question and answer features in step S3 and obtaining the frame-level, clip-level and subtitle-level event embeddings in step S4 are as follows:
extracting problem features:
for the input question, question features u = {q_1, q_2, ..., q_M} are extracted using the pre-trained word embedding model GloVe, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
and (3) answer features are extracted:
for the input candidate answers, the features g of the candidate answers are extracted using a pre-trained word embedding model, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer;
the specific method for obtaining the frame-level, clip-level and subtitle-level event embeddings is as follows:
a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s.
6. The method according to claim 5, wherein the specific method for generating the question-guided and answer-guided context information in step S6 is as follows:
S61, generating question-guided frame-level, clip-level and subtitle-level context information;
S62, generating answer-guided frame-level, clip-level and subtitle-level context information;
S63, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u;
S64, fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g.
7. The method according to claim 6, wherein the specific method for obtaining the predicted answer from the extracted answer features in step S7 is as follows:
S71, combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, and dynamically calculating the confidence z of each candidate answer using an adaptive weight α;
S72, selecting the answer with the highest confidence among the candidate answers as the predicted answer;
S73, comparing the predicted answer with the true answer in the training data, and updating the parameters of the shortened-timestamp network according to the difference.
8. A computer comprising a memory storing a computer program and a processor, wherein the processor, when executing the computer program, implements the steps of the method for solving multi-event video question answering with a shortened-timestamp network according to any one of claims 2 to 7.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for solving multi-event video question answering with a shortened-timestamp network according to any one of claims 2 to 7.
CN202110896068.2A 2021-08-05 2021-08-05 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network Active CN113590879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110896068.2A CN113590879B (en) 2021-08-05 2021-08-05 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110896068.2A CN113590879B (en) 2021-08-05 2021-08-05 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network

Publications (2)

Publication Number Publication Date
CN113590879A CN113590879A (en) 2021-11-02
CN113590879B (en) 2022-05-31

Family

ID=78255354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110896068.2A Active CN113590879B (en) 2021-08-05 2021-08-05 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network

Country Status (1)

Country Link
CN (1) CN113590879B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712760B (en) * 2022-11-29 2023-04-21 哈尔滨理工大学 Binary code abstract generation method and system based on BERT model and deep equal-length convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763444A (en) * 2018-05-25 2018-11-06 杭州知智能科技有限公司 The method for solving video question and answer using hierarchical coding decoder network mechanism
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN112488316A (en) * 2020-12-11 2021-03-12 合肥讯飞数码科技有限公司 Event intention reasoning method, device, equipment and storage medium
CN112559698A (en) * 2020-11-02 2021-03-26 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN112860945A (en) * 2021-01-07 2021-05-28 国网浙江省电力有限公司 Method for multi-mode video question-answering by using frame-subtitle self-supervision

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160350653A1 (en) * 2015-06-01 2016-12-01 Salesforce.Com, Inc. Dynamic Memory Network
US10375237B1 (en) * 2016-09-12 2019-08-06 Verint Americas Inc. Virtual communications assessment system in a multimedia environment
CN110516791B (en) * 2019-08-20 2022-04-22 北京影谱科技股份有限公司 Visual question-answering method and system based on multiple attention
CN110990628A (en) * 2019-12-06 2020-04-10 浙江大学 Method for solving video question and answer by utilizing multi-granularity convolutional network self-attention context network mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763444A (en) * 2018-05-25 2018-11-06 杭州知智能科技有限公司 The method for solving video question and answer using hierarchical coding decoder network mechanism
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN112559698A (en) * 2020-11-02 2021-03-26 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN112488316A (en) * 2020-12-11 2021-03-12 合肥讯飞数码科技有限公司 Event intention reasoning method, device, equipment and storage medium
CN112860945A (en) * 2021-01-07 2021-05-28 国网浙江省电力有限公司 Method for multi-mode video question-answering by using frame-subtitle self-supervision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Video Question Answering: a Survey of Models and Datasets; Guanglu Sun et al.; Springer; 2021-01-25; pp. 1904-1937 *
Deep memory fusion model for long-video question answering; 孙广路 et al.; Journal of Harbin University of Science and Technology; 2021-02-28; Vol. 26, No. 1; pp. 1-8 *

Also Published As

Publication number Publication date
CN113590879A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN109544524B (en) Attention mechanism-based multi-attribute image aesthetic evaluation system
CN110097094B (en) Multiple semantic fusion few-sample classification method for character interaction
CN110929515A (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN111159414B (en) Text classification method and system, electronic equipment and computer readable storage medium
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
CN115861995B (en) Visual question-answering method and device, electronic equipment and storage medium
CN114218379A (en) Intelligent question-answering system-oriented method for attributing questions which cannot be answered
CN112163560A (en) Video information processing method and device, electronic equipment and storage medium
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN113590879B (en) System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
CN116881470A (en) Method and device for generating question-answer pairs
CN115908613A (en) Artificial intelligence-based AI model generation method, system and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN117315249A (en) Image segmentation model training and segmentation method, system, equipment and medium
CN117112743A (en) Method, system and storage medium for evaluating answers of text automatic generation questions
CN113609330B (en) Video question-answering system, method, computer and storage medium based on text attention and fine-grained information
CN115599954B (en) Video question-answering method based on scene graph reasoning
CN112861580A (en) Video information processing method and device based on video information processing model
CN113609355B (en) Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning
CN116186312A (en) Multi-mode data enhancement method for data sensitive information discovery model
CN116127085A (en) Question rewriting method and equipment for dialogue-oriented knowledge graph questions and answers
CN113516182B (en) Visual question-answering model training and visual question-answering method and device
CN113392722A (en) Method and device for recognizing emotion of object in video, electronic equipment and storage medium
CN112446206A (en) Menu title generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant