CN113590879B - System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network - Google Patents
- Publication number: CN113590879B (application CN202110896068.2A)
- Authority
- CN
- China
- Prior art keywords: level, event, answer, video, frame
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/732—Query formulation
Abstract
The invention provides a system, a method, a computer and a storage medium for solving multi-event video question answering with a shortened-timestamp network, and belongs to the intersection of computer vision and natural language processing. The video and subtitles are encoded as event embeddings at multiple levels, and features of the question and the candidate answers are extracted. Question-guided attention yields attention weights for the different events, and a λ-cut matrix from fuzzy theory extracts the key event embeddings in the video. The question and the answers then attend separately to the key event embeddings of the different modalities, producing question-guided and answer-guided context information, which is adaptively fused to generate the answer. Compared with general video question-answering schemes, the method extracts multi-modal embeddings of multiple events from the video, screens out the key events using tools from fuzzy mathematics such as the λ-cut matrix, and improves answer accuracy by removing redundant information. Its performance on video question answering is better than that of conventional methods.
Description
Technical Field
The invention relates to a video question-answering method, and in particular to a system, a method, a computer and a storage medium for solving multi-event video question answering with a shortened-timestamp network, belonging to the intersection of computer vision and natural language processing.
Background
Video question answering is an important problem in the fields of artificial intelligence and deep learning: given an input video containing audio-visual information and a corresponding textual question, the task is to automatically select, from several given candidate answers, the choice that answers the question and best accords with the content of the video.
Video is usually composed of consecutive frames and therefore contains more temporal information than still images, such as scene transitions and object motion. In addition, the questions and candidate answers in video question answering are typically continuous text sequences. The core techniques for solving the video question-answering task in the prior art therefore derive mainly from natural language processing: the encoder-decoder structure, the attention mechanism and the memory network. The encoder-decoder structure encodes the temporal information of the video and the information of the question with an encoder, then generates the answer with a decoder. The attention model computes the similarity between the question and the video, assigns higher weights to the video information related to the question, and generates an answer on that basis. The memory network stores the information of a longer video in a memory array, preventing information loss during encoding. The prevailing approach to video question answering is to combine the encoder-decoder structure, the attention mechanism and the memory network organically, assisted by techniques such as reinforcement learning and generative adversarial networks, to produce more accurate answers.
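As a minimal background illustration (not the patented network itself), the question-guided attention idea described above can be sketched in NumPy: score each video feature against the question and form a weighted summary. All dimensions and values are illustrative.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

def question_guided_attention(question, video_feats):
    """Weight video features by their similarity to the question.

    question:    (d,) question embedding
    video_feats: (n, d) one feature vector per video segment
    Returns the attention weights (n,) and the attended summary (d,).
    """
    scores = video_feats @ question   # (n,) dot-product similarity
    weights = softmax(scores)         # higher weight = more question-relevant
    context = weights @ video_feats   # (d,) weighted sum of video features
    return weights, context

rng = np.random.default_rng(0)
q = rng.normal(size=8)
v = rng.normal(size=(5, 8))
w, ctx = question_guided_attention(q, v)
print(w.sum())   # sanity check: weights are a probability distribution
```

Segments similar to the question dominate the summary, which is the mechanism the prior-art methods above (and the invention's later steps) build on.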
However, as video question-answering research has developed, the length of the queried videos has gradually increased: the average video duration in question-answering datasets based on movies and on television shows reaches 200 seconds and 60 seconds respectively, and such videos may contain many events rather than a single event. Solving multi-event video question answering over long videos therefore requires two additional capabilities: the ability to discover and locate question-related events in a lengthy video, and the ability to accurately reason about the question-related inter-event relationships and intra-event information. Existing video question-answering techniques encode the information of the whole video and reason over it with an attention mechanism, which forces them to consider excessive redundant information, harming the understanding of the video and reducing the accuracy of answer prediction. Experiments show that accurately locating the key event related to the question in a long video, that is, determining its start and end times, effectively improves the accuracy of video question-answer prediction.
To solve the above problems, the present invention uses an attention mechanism and fuzzy theory to process the event information in a video, shorten the timestamp of the question-related information, and precisely locate the key event related to the question; it then uses the attention mechanism to infer question-guided and answer-guided context information respectively and predict the answer.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
The invention provides a system for solving multi-event video question answering with a shortened-timestamp network, comprising a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module;
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event characteristics of the video;
the subtitle level extraction module is used for extracting event characteristics of a subtitle level;
the question and answer extraction module is used for extracting features of the input question and candidate answers;
the event embedding module is used for respectively embedding the event characteristics into a memory array to obtain event embedding of a frame level, a clipping level and a subtitle level;
the key event embedding module is used for screening out key event embedding at a frame level, a clipping level and a subtitle level from event features;
the context information module is used for generating context information with question guidance and answer guidance;
the answer selection module is used for obtaining a predicted answer.
A method for shortening timestamp and solving multi-event video question-answering by a network comprises the following steps:
s1, extracting the event characteristics of the frame level and the clip level of the video for the input video;
s2, extracting event features of a caption level for a caption corresponding to an input video;
s3, extracting the characteristics of the questions and the answers for the input questions and candidate answers;
s4, respectively embedding the event characteristics into a memory array to obtain frame-level, clip-level and subtitle-level event embedding;
s5, designing a timestamp-shortening module based on an attention mechanism and fuzzy theory, and screening out frame-level, clip-level and subtitle-level key event embeddings from the event features respectively according to the extracted question features;
s6, designing a module for generating context information based on an attention mechanism, and generating context information with question guidance and answer guidance according to the extracted question features and answer features;
and S7, designing a self-adaptive answer selection module, and obtaining a predicted answer according to the extracted answer characteristics.
Preferably, the specific method for extracting the event features at the frame level and the clip level of the video in step S1 is as follows:
extracting frame-level event features:
for the input video, frame-level event features are extracted using a pre-trained residual neural network: f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, clip-level event features are extracted using a pre-trained three-dimensional convolutional network: c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
Preferably, the specific method for extracting the subtitle-level event features in step S2 is as follows: for the subtitles corresponding to the input video, subtitle-level event features are extracted using a pre-trained BERT model: s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video.
Preferably, the specific method for extracting the question and answer features in step S3 and obtaining the event embedding at the frame level, the clip level and the subtitle level in step S4 is as follows:
extracting question features:
for the input question, question features are extracted using the pre-trained word embedding model GloVe: u = {q_1, q_2, ..., q_M}, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
extracting answer features:
for the input candidate answers, the features of the candidate answers are extracted using a pre-trained word embedding model: g = {g_i^j}, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer.
The specific method for obtaining event embedding at the frame level, clip level and subtitle level is as follows:
a network combining a linear activation layer and a convolutional layer is trained to embed the frame-level event features f, the clip-level event features c and the subtitle-level event features s into memory arrays respectively, obtaining the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s.
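The "linear activation layer plus convolutional layer" network is not specified in detail, so the following PyTorch sketch is a hypothetical instantiation: a per-event linear projection followed by a temporal 1-D convolution that writes the event sequence into a memory array. Layer sizes and the class name `MemoryEmbedding` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MemoryEmbedding(nn.Module):
    """Illustrative sketch of the linear + convolution embedding network."""
    def __init__(self, in_dim, mem_dim, kernel=3):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, mem_dim), nn.ReLU())
        self.conv = nn.Conv1d(mem_dim, mem_dim, kernel, padding=kernel // 2)

    def forward(self, events):           # events: (N, in_dim)
        h = self.proj(events)            # (N, mem_dim) linear activation
        h = h.t().unsqueeze(0)           # (1, mem_dim, N) for Conv1d over time
        m = self.conv(h)                 # temporal mixing across events
        return m.squeeze(0).t()          # (N, mem_dim) memory array M

embed = MemoryEmbedding(in_dim=512, mem_dim=256)
f = torch.randn(8, 512)                  # frame-level event features
Mf = embed(f)                            # frame-level event embedding M_f
print(Mf.shape)
```

The same module would be applied to c and s to obtain M_c and M_s; in the patent the network is trained end-to-end with the rest of the model.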
Preferably, the specific method for screening key event embedding at frame level, clip level and subtitle level from event features in step S5 is as follows:
s51, an attention layer is designed to calculate the correlations r_uf, r_uc, r_us between the event embeddings at the different levels and the question feature u, with the following formulas:
r_uf = softmax(u M_f)
r_uc = softmax(u M_c)
r_us = softmax(u M_s)
where softmax(·) is the normalization function softmax(x)_i = exp(x_i) / Σ_{j=1}^{n} exp(x_j), in which x denotes any vector of n elements (x_1, x_2, ..., x_n), x_i and x_j denote the i-th and j-th elements of x respectively, and exp(·) denotes the exponential function with base e;
s52, the frame-level correlation fuzzy matrix R_uf, the clip-level correlation fuzzy matrix R_uc and the subtitle-level correlation fuzzy matrix R_us are constructed as follows:
R_uf = (r_uf)_{N×3}, R_uc = (r_uc)_{N×3}, R_us = (r_us)_{N×3}, with 0 ≤ r_uf, r_uc, r_us ≤ 1
s53, the Boolean λ-cut matrices R_uf^λ, R_uc^λ, R_us^λ of the frame-level, clip-level and subtitle-level correlation fuzzy matrices R_uf, R_uc, R_us are calculated; following the standard λ-cut definition, each element of a cut matrix is 1 if the corresponding fuzzy membership value is at least λ, and 0 otherwise:
(R^λ)_{ij} = 1 if r_{ij} ≥ λ, else 0
s54, for the Boolean λ-cut matrices, the valid clues related to the question across the multiple events are locked out using the union operation that follows the basic operation rules of fuzzy sets (element-wise maximum, which on Boolean matrices is a logical OR), giving the clue matrix R_tmin; t_min = [t_start, t_end] denotes the shortest timestamp covering the question-related events, where t_start represents the start time of the shortest timestamp and t_end represents the end time of the shortest timestamp;
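Steps s52–s54 can be sketched in NumPy. The relevance values, the λ threshold and the event timestamps below are illustrative; the λ-cut and union operations follow the standard fuzzy-set definitions.

```python
import numpy as np

def lambda_cut(R, lam):
    """Boolean λ-cut of a fuzzy matrix: 1 where membership ≥ λ, else 0."""
    return (R >= lam).astype(int)

def shortest_timestamp(relevance_per_level, lam, starts, ends):
    """Locate the shortest timestamp covering all question-relevant events.

    relevance_per_level: list of (N,) relevance vectors (frame/clip/subtitle)
    starts, ends:        (N,) start/end times of the N events
    """
    cuts = [lambda_cut(r, lam) for r in relevance_per_level]
    # fuzzy union = element-wise maximum; on Boolean matrices this is OR
    clue = np.maximum.reduce(cuts)
    idx = np.flatnonzero(clue)        # events flagged relevant at any level
    return starts[idx.min()], ends[idx.max()], clue

r_uf = np.array([0.05, 0.40, 0.30, 0.15, 0.10])   # frame-level relevance
r_uc = np.array([0.10, 0.15, 0.45, 0.20, 0.10])   # clip-level relevance
r_us = np.array([0.05, 0.10, 0.50, 0.25, 0.10])   # subtitle-level relevance
starts = np.array([0, 10, 20, 30, 40])
ends = np.array([10, 20, 30, 40, 50])

t_start, t_end, clue = shortest_timestamp([r_uf, r_uc, r_us], 0.25, starts, ends)
print(t_start, t_end)   # → 10 40
```

With λ = 0.25, events 1–3 survive the cut at some level, so the timestamp is shortened from [0, 50] to [10, 40], discarding the redundant head and tail of the video.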
s55, the valid clues R_tmin related to the question are used to update the frame-level, clip-level and subtitle-level event embeddings M_f, M_c, M_s obtained in step S4:
M'_f = W_f(R_tmin ⊙ M_f) + b_f,  M'_c = W_c(R_tmin ⊙ M_c) + b_c,  M'_s = W_s(R_tmin ⊙ M_s) + b_s
where ⊙ denotes element-by-element multiplication, fusing the valid clues with the event embeddings obtained in step S4; linear layers with weight matrices W_f, W_c, W_s and bias parameters b_f, b_c, b_s update each event embedding into a key event embedding containing only question-related information, and M' denotes the updated embedding.
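The step-s55 update can be sketched as follows. The right-multiplying linear form (clue ⊙ M)W + b and all dimensions are illustrative assumptions for row-major event matrices; the effect is the same: events outside the shortened timestamp are masked out before the linear layer.

```python
import numpy as np

def update_key_events(M, clue, W, b):
    """Mask events outside the shortened timestamp, then apply a linear layer
    so only question-relevant information remains in the key event embedding.
    Assumed form: M' = (clue ⊙ M) W + b."""
    masked = clue[:, None] * M        # element-wise fuse clue with events
    return masked @ W + b

rng = np.random.default_rng(1)
N, d = 5, 4
M_f = rng.normal(size=(N, d))         # frame-level event embedding
clue = np.array([0, 1, 1, 1, 0])      # valid-clue mask from the λ-cut union
W_f = rng.normal(size=(d, d))
b_f = np.zeros(d)

M_key = update_key_events(M_f, clue, W_f, b_f)
print(M_key.shape)   # rows 0 and 4 (masked events) carry only the bias
```

Applying the same update with W_c, b_c and W_s, b_s yields the clip-level and subtitle-level key event embeddings.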
Preferably, the specific method for generating the context information with question guidance and with answer guidance in step S6 is as follows:
s61, generating frame-level, clip-level and subtitle-level context information with question guidance;
s62, generating frame-level, clip-level and subtitle-level context information with answer guidance;
S63, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u;
S64, fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g.
Preferably, the specific method for obtaining the predicted answer according to the extracted answer features in step S7 is as follows:
s71, combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, and dynamically calculating a confidence z for each candidate answer using an adaptive weight α;
s72, selecting the answer with the highest confidence among the candidate answers as the predicted answer;
S73, comparing the predicted answer with the true answer in the training data, and updating the parameters of the shortened-timestamp network according to the resulting loss.
A computer comprising a memory storing a computer program and a processor, the processor implementing the steps of the method for solving multi-event video question answering with a shortened-timestamp network when executing the computer program.
A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for solving multi-event video question answering with a shortened-timestamp network.
The invention has the following beneficial effects: it extracts frame-level, clip-level and subtitle-level features for the multiple events in a video, improving the efficiency of acquiring information such as scenes, appearance and motion within events and strengthening the ability to acquire video information. Addressing the excess of redundant information in long videos, it designs a timestamp-shortening module based on an attention mechanism and a fuzzy matrix, which selects the key events related to the question from the many events of the video and removes the redundant information, thereby improving the accuracy of reasoning. It further designs a method that generates context information separately from the question features and the candidate answer features, fully integrating the acquired event embeddings with the question and the candidate answers and improving the proposed method's comprehension of video information.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of a system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a module for using shortened timestamps in accordance with an embodiment of the present invention;
fig. 4 is an overall framework of a network for reducing timestamps according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and are not exhaustive of all embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The first embodiment is as follows:
This embodiment is described with reference to fig. 1. The system of this embodiment for solving multi-event video question answering with a shortened-timestamp network includes a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module;
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event characteristics of the video;
the caption level extraction module is used for extracting the event characteristics of the caption level;
the question and answer extraction module is used for extracting features of the input question and candidate answers;
the event embedding module is used for respectively embedding the event characteristics into a memory array to obtain event embedding of a frame level, a clipping level and a subtitle level;
the key event embedding module is used for screening key event embedding at a frame level, a clipping level and a subtitle level from event features;
the context information module is used for generating context information with question guidance and answer guidance;
the answer selection module is used for obtaining a predicted answer.
Example two:
This embodiment is described with reference to figs. 2 to 4. The method of this embodiment for solving multi-event video question answering with a shortened-timestamp network includes the following steps:
s1, for the input video, extracting the frame-level and clip-level event features of the video using a residual neural network and a three-dimensional convolutional network respectively;
extracting frame-level event features:
for the input video, frame-level event features are extracted using a pre-trained residual neural network: f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, clip-level event features are extracted using a pre-trained three-dimensional convolutional network: c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
S2, extracting subtitle-level event features using a pre-trained model for the subtitles corresponding to the input video: subtitle-level event features are extracted with a pre-trained BERT model, s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video.
S3, extracting the features of the question and the answers using embedding models for the input question and candidate answers;
extracting question features:
for the input question, question features are extracted using the pre-trained word embedding model GloVe: u = {q_1, q_2, ..., q_M}, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
extracting answer features:
for the 5 input candidate answers, their features are extracted using a pre-trained word embedding model: g = {g_i^j}, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer.
Using 5 candidate answers follows the common multiple-choice format in video question answering, which typically offers 4 to 5 candidate answers.
S4, embedding the event features into memory arrays respectively to obtain frame-level, clip-level and subtitle-level event embeddings: a network combining a linear activation layer and a convolutional layer is trained to embed the frame-level event features f, the clip-level event features c and the subtitle-level event features s into memory arrays, obtaining the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s.
S5, designing a timestamp-shortening module based on an attention mechanism and fuzzy theory for the extracted frame-level, clip-level and subtitle-level event embedding representations, and screening out frame-level, clip-level and subtitle-level key event embeddings from them according to the extracted question features;
s51, according to the extracted frame-level, clip-level and subtitle-level event embeddings M_f, M_c, M_s, an attention layer is designed to calculate the correlations r_uf, r_uc, r_us between the event embeddings at the different levels and the question feature u, with the following formulas:
r_uf = softmax(u M_f)
r_uc = softmax(u M_c)
r_us = softmax(u M_s)
where softmax(·) is the normalization function softmax(x)_i = exp(x_i) / Σ_{j=1}^{n} exp(x_j), in which x denotes any vector of n elements (x_1, x_2, ..., x_n), x_i and x_j denote the i-th and j-th elements of x respectively, and exp(·) denotes the exponential function with base e;
s52, for the correlations r_uf, r_uc, r_us between the event embeddings at the different levels and the question feature, the concept of the fuzzy matrix from fuzzy theory is introduced to construct the frame-level correlation fuzzy matrix R_uf, the clip-level correlation fuzzy matrix R_uc and the subtitle-level correlation fuzzy matrix R_us as follows:
R_uf = (r_uf)_{N×3}, R_uc = (r_uc)_{N×3}, R_us = (r_us)_{N×3}, with 0 ≤ r_uf, r_uc, r_us ≤ 1
where N represents the number of events in the video
S53, according to the matrix-cut operation of fuzzy theory, the Boolean λ-cut matrices R_uf^λ, R_uc^λ, R_us^λ of the frame-level, clip-level and subtitle-level fuzzy matrices R_uf, R_uc, R_us are calculated; each element of a cut matrix is 1 if the corresponding fuzzy membership value is at least λ, and 0 otherwise:
(R^λ)_{ij} = 1 if r_{ij} ≥ λ, else 0
s54, for the Boolean λ-cut matrices, the valid clues related to the question across the multiple events are locked out using the union operation that follows the basic operation rules of fuzzy sets (element-wise maximum, which on Boolean matrices is a logical OR), giving the clue matrix R_tmin; t_min = [t_start, t_end] denotes the shortest timestamp covering the question-related events, where t_start represents the start time of the shortest timestamp and t_end represents the end time of the shortest timestamp;
s55, the valid clues R_tmin related to the question are used to update the frame-level, clip-level and subtitle-level event embeddings M_f, M_c, M_s obtained in step S4:
M'_f = W_f(R_tmin ⊙ M_f) + b_f,  M'_c = W_c(R_tmin ⊙ M_c) + b_c,  M'_s = W_s(R_tmin ⊙ M_s) + b_s
where ⊙ denotes element-by-element multiplication, fusing the valid clues with the event embeddings obtained in step S4; linear layers with weight matrices W_f, W_c, W_s and bias parameters b_f, b_c, b_s update each event embedding into a key event embedding containing only question-related information, and M' denotes the updated embedding.
S6, designing a module for generating context information based on an attention mechanism, and generating context information with question guidance and answer guidance according to the extracted question features and answer features;
according to the extracted frame-level, clip-level and subtitle-level key event embeddings M_f, M_c, M_s, the question features u and the answer features g attend to them respectively, generating question-guided frame-level, clip-level and subtitle-level context information o_u^f, o_u^c, o_u^s and answer-guided frame-level, clip-level and subtitle-level context information o_g^f, o_g^c, o_g^s.
S61 generating frame level, clip level and subtitle level context information with question guidanceThe calculation formula is as follows:
s62 frame-level, clip-level, and subtitle-level contextual information with answer guidanceThe calculation formula is as follows:
s63, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u: for the question-guided frame-level, clip-level and subtitle-level context information obtained above, concatenation and element-level product operations are used to fuse the three levels, where the Concat(·) function denotes concatenation of all the information inside it.
s64, fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g: for the answer-guided frame-level, clip-level and subtitle-level context information obtained above, concatenation and element-level product operations are likewise used to fuse the three levels.
And S7, designing a self-adaptive answer selection module, and obtaining a predicted answer according to the extracted answer features.
S71, combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, a confidence z is dynamically calculated for each candidate answer using an adaptive weight α;
s72, selecting the answer with the highest confidence among the 5 candidate answers as the predicted answer, i.e. the answer a_j with j = argmax_i z_i, where argmax(·) returns the index of the largest among the elements, a_j represents the j-th candidate answer, z_j represents the confidence of the j-th candidate answer, and z_i represents the confidence of the i-th candidate answer.
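Steps S71–S72 can be sketched as follows. The confidence form z_j = α·(O_u·g_j) + (1−α)·(O_g·g_j) is an assumed instantiation: the patent states only that α adaptively weights the two contexts, so this similarity-based scoring and the fixed α are illustrative.

```python
import numpy as np

def select_answer(O_u, O_g, answers, alpha=0.5):
    """Score each candidate answer against both contexts and pick the best.
    Assumed form: z_j = alpha * (O_u . g_j) + (1 - alpha) * (O_g . g_j)."""
    z = np.array([alpha * (O_u @ g) + (1 - alpha) * (O_g @ g) for g in answers])
    return int(np.argmax(z)), z

rng = np.random.default_rng(2)
answers = rng.normal(size=(5, 8))                # 5 candidate answer features
answers /= np.linalg.norm(answers, axis=1, keepdims=True)  # unit-normalize
O_u = answers[3]   # question-guided context, aligned with answer 3 for the demo
O_g = answers[3]   # answer-guided context, likewise aligned for the demo
best, z = select_answer(O_u, O_g, answers)
print(best)   # → 3
```

In training, the predicted index is compared with the true answer and the resulting loss drives the parameter update of step S73.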
S73, comparing the predicted answer with the true answer in the training data, and updating the parameters of the shortened-timestamp network according to the resulting loss.
Experimental analysis of the method provided by the invention:
The invention was experimentally verified on a self-constructed dataset that uses the American sitcom The Big Bang Theory as its video source. The contained video totals 461 hours and covers 925 scenes, from which 21,793 clips containing 118,974 events and 152,545 question-answer pairs were generated; the training set contains 122,039 question-answer pairs, the validation set contains 15,252, and the test set contains 7,623.
Question-answer pairs are generated from the video and subtitles using templates: a question template first locates the video segment associated with the question, with its start and end timestamps, according to the sentence patterns "when …/before …/after …", and then poses one of the four question types "what/how/where/why" about that segment. The questions in this dataset are multiple choice, each with five candidate answers of which only one is correct.
The video clips average 60 to 90 seconds in length, contain abundant information about character activities and scenes, and exhibit rich dynamics and real social interaction. In addition, the dataset timestamps the beginning and end of every event in each video clip, so the key parts of a clip can be accurately located for a given question.
To objectively evaluate the performance of the proposed method, the shortened-timestamp network is evaluated by classification accuracy, the ratio of the number of correctly answered questions to the total number of questions, which is commonly used to evaluate classification tasks:
Acc = (1/M) Σ_{q ∈ Q_t} 1(â_q = y_q)
where M represents the number of question-answer pairs, Q_t represents the set of questions to be asked, â represents the predicted answer, y represents the true answer, and 1(·) is the indicator function.
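The accuracy metric is straightforward to compute; a minimal sketch with made-up answer letters:

```python
def accuracy(predicted, truth):
    """Classification accuracy: fraction of question-answer pairs where the
    predicted answer equals the true answer."""
    assert len(predicted) == len(truth)
    correct = sum(p == y for p, y in zip(predicted, truth))
    return correct / len(predicted)

pred = ["B", "C", "A", "D", "B"]
true = ["B", "C", "E", "D", "A"]
print(accuracy(pred, true))   # → 0.6 (3 of 5 correct)
```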
To evaluate the performance of the algorithm under different inputs, three experimental tasks are defined by controlling the input data:

The "S+Q" task: answer the given question using only the subtitle information of the video;

The "V+Q" task: answer the given question using only the visual information of the video;

The "S+V+Q" task: answer the given question using both the visual information and the subtitle information of the video.
First, the method was tested following the procedure described in the detailed embodiment; the results are shown in Table 1. The method is denoted STN, and results are measured in accuracy (%):
TABLE 1 test results of the method of the invention on three experimental tasks
Method | "S+Q" task | "V+Q" task | "S+V+Q" task |
---|---|---|---|
STN | 68.90 | 50.87 | 70.05 |
To verify the effectiveness of step S5 of the detailed method, five ablation schemes were designed and tested on the three experimental tasks. The five schemes are as follows; the test results are shown in Table 2, measured in accuracy (%):
The first scheme removes step S5 from the test procedure; the event embeddings generated in step S4 are used directly to generate the context information and predict the answer.

The second scheme removes, in steps S5 and S6, all operations involving the clip-level event embedding Mc and the subtitle-level event embedding Ms; key-event embeddings are extracted only from the frame-level event embedding Mf to generate the context information, and the answer is predicted from that context information.

The third scheme removes, in steps S5 and S6, all operations involving the frame-level event embedding Mf and the subtitle-level event embedding Ms; key-event embeddings are extracted only from the clip-level event embedding Mc to generate the context information, and the answer is predicted from that context information.

The fourth scheme removes, in steps S5 and S6, all operations involving the clip-level event embedding Mc and the frame-level event embedding Mf; key-event embeddings are extracted only from the subtitle-level event embedding Ms to generate the context information, and the answer is predicted from that context information.

The fifth scheme, STN, runs the test with the unmodified shortened-timestamp network.
Table 2 results of ablation tests of the invention on three experimental tasks for step S5 of the proposed method
Analysis of the experimental results shows that the proposed network clearly improves the accuracy of answer prediction.
The working principle of the invention is as follows:
Multi-level event embeddings are extracted from the video and its subtitles, together with features of the question and the candidate answers. Question-guided attention yields attention weights for the different events, and the intercept (λ-cut) matrix from fuzzy theory is used to extract the key-event embeddings in the video. The question and the answers then attend separately to the key-event embeddings of the different modalities, generating question-guided and answer-guided context information, and the two kinds of context are adaptively fused to generate the answer. Compared with generic video question-answering schemes, the method extracts multi-modal embeddings of multiple events from the video, screens out the key events using tools from fuzzy mathematics such as the intercept matrix, and improves answer accuracy by removing redundant information. The method therefore achieves better results on video question answering than traditional approaches.
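The working principle above can be sketched end-to-end on toy tensors (a hedged illustration in which random features stand in for the pretrained ResNet / 3D-CNN / BERT outputs; all dimensions, the choice of λ, and the helper names are assumptions rather than the patent's actual implementation, and the learned linear layers of step S55 are omitted for brevity):

```python
import numpy as np
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

N, d = 6, 16                      # 6 events, 16-dim embeddings (assumed)
M_f = rng.normal(size=(N, d))     # frame-level event embeddings
M_c = rng.normal(size=(N, d))     # clip-level event embeddings
M_s = rng.normal(size=(N, d))     # subtitle-level event embeddings
u = rng.normal(size=d)            # question feature

# Question-guided attention over events at each level (cf. step S51).
r = np.stack([softmax(M @ u) for M in (M_f, M_c, M_s)], axis=1)  # N x 3

# Lambda-cut: keep events whose relevance reaches the threshold (cf. S53),
# then take the element-wise fuzzy-set union across levels (cf. S54).
lam = 1.0 / N
cut = (r >= lam).astype(float)    # Boolean lambda-cut matrix, N x 3
clue = cut.max(axis=1)            # union across the three levels, length N

# Key-event embeddings: zero out events outside the clue (simplified S55;
# the patent additionally applies a learned linear layer here).
K_f, K_c, K_s = (clue[:, None] * M for M in (M_f, M_c, M_s))

# Question-guided context: attention-weighted sum of key events (sketch of S6).
O_u = sum(softmax(K @ u) @ K for K in (K_f, K_c, K_s)) / 3.0
print(O_u.shape)  # (16,)
```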
The computer device of the present invention may be a device including a processor and a memory, for example a single-chip microcomputer including a central processing unit. The processor, when executing the computer program stored in the memory, implements the steps of the above method for shortening timestamps and solving multi-event video question-answering through a network.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function); the data storage area may store data created according to the use of the device (such as audio data or a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
Computer-readable storage medium embodiments
The computer-readable storage medium of the present invention may be any form of storage medium that can be read by a processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory, etc. The computer-readable storage medium stores a computer program; when the computer program stored on the medium is read and executed by the processor of a computer device, the steps of the above method for shortening timestamps and solving multi-event video question-answering through a network can be implemented.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. The disclosure is to be considered illustrative and not restrictive, the scope of the invention being defined by the appended claims; many modifications and variations will be apparent to those of ordinary skill in the art without departing from their scope and spirit.
Claims (9)
1. A network multi-event video question-answering system capable of shortening time stamps is characterized by comprising a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module;
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event features of a video, and the specific method comprises the following steps:
extracting frame-level event features:
extracting the frame-level event features of the input video using a pre-trained residual neural network: f = {f1, f2, ..., fN}, where f represents the frame-level event features of the whole video, fi represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
extracting the clip-level event features of the input video using a pre-trained three-dimensional convolutional network: c = {c1, c2, ..., cN}, where c represents the clip-level event features of the whole video, ci represents the i-th clip-level event feature in the video, and N represents the number of events in the video;
the subtitle-level extraction module is used for extracting the subtitle-level event features, and the specific method is as follows: extracting the subtitle-level event features from the subtitles corresponding to the input video using a pre-trained BERT model: s = {s1, s2, ..., sN}, where s represents the subtitle-level event features of the whole video, si represents the i-th subtitle-level event feature, and N represents the number of events in the video;
the question and answer extraction module is used for extracting input question and candidate answer characteristics;
the specific method for extracting the question features is as follows:
for the input question, the question features are extracted using the pre-trained word-embedding model GloVe: u = {q1, q2, ..., qM}, where u represents the features of the whole question, qi represents the feature of the i-th word, and M represents the length of the question;
the specific method for extracting the candidate answer features comprises the following steps:
for the input candidate answers, the features of the candidate answers are extracted using a pre-trained word-embedding model: g = {g_1^j, g_2^j, ..., g_T^j}, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer;
the event embedding module is used for embedding the event features into memory arrays to obtain the frame-level, clip-level and subtitle-level event embeddings, and the specific method is as follows: training a network combining a linear activation layer and a convolution layer, and embedding the frame-level event features f, the clip-level event features c and the subtitle-level event features s into memory arrays respectively to obtain the frame-level event embedding Mf, the clip-level event embedding Mc and the subtitle-level event embedding Ms;
The key event embedding module is used for screening key event embedding of a frame level, a clip level and a subtitle level from event features, and the specific method comprises the following steps:
based on an attention mechanism and the intercept (λ-cut) matrix from fuzzy theory, attention weights between the extracted question features and the frame-level, clip-level and subtitle-level event embeddings Mf, Mc, Ms are computed, and the embeddings of the events relevant to the question are screened out as the frame-level, clip-level and subtitle-level key-event embeddings;
the context information module is used for generating question-guided and answer-guided context information, and the specific method is as follows: generating frame-level, clip-level and subtitle-level context information with question guidance and frame-level, clip-level and subtitle-level context information with answer guidance; fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information Ou, and fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information Og;
the answer selection module is used for obtaining the predicted answer, and the specific method is as follows: combining the question-guided context information Ou, the answer-guided context information Og and the answer features g, the confidence z of each candidate answer is calculated dynamically using an adaptive weight α; the answer with the highest confidence among the candidate answers is selected as the predicted answer; the predicted answer is compared with the true answer in the training data, and the parameters of the shortened-timestamp network are updated according to the difference.
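The answer-selection step of the module above can be sketched as follows (a hedged reading: the fusion z = softmax(g · (α·Ou + (1 - α)·Og)) is one plausible interpretation of "dynamically calculating the confidence with an adaptive weight α"; the patent does not spell out the exact formula, and all names here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def select_answer(O_u, O_g, g, alpha):
    """Fuse question-guided and answer-guided context with an adaptive
    weight alpha, score each candidate-answer feature, pick the best."""
    context = alpha * O_u + (1.0 - alpha) * O_g      # fused context, (d,)
    z = softmax(g @ context)                         # confidence per candidate
    return int(np.argmax(z)), z

rng = np.random.default_rng(1)
d, n_candidates = 16, 5
O_u = rng.normal(size=d)                             # question-guided context
O_g = rng.normal(size=d)                             # answer-guided context
g = rng.normal(size=(n_candidates, d))               # 5 candidate answers
idx, z = select_answer(O_u, O_g, g, alpha=0.5)
print(idx, z.shape)                                  # chosen index; z sums to 1
```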
2. A method for shortening timestamp and solving multi-event video question-answering by a network is characterized by comprising the following steps:
S1, extracting the frame-level and clip-level event features of the input video;

S2, extracting the subtitle-level event features from the subtitles corresponding to the input video;

S3, extracting the question and answer features from the input question and candidate answers;

S4, embedding the event features into memory arrays respectively to obtain the frame-level, clip-level and subtitle-level event embeddings;

S5, designing a timestamp-shortening module based on an attention mechanism and fuzzy theory, and screening the frame-level, clip-level and subtitle-level key-event embeddings from the event features respectively according to the extracted question features; the specific method is as follows:
S51, design an attention layer to calculate the correlations ruf, ruc and rus between the event embeddings at each level and the question feature u; the calculation formulas are as follows:
ruf=softmax(uMf)
ruc=softmax(uMc)
rus=softmax(uMs)
wherein softmax(·) is the soft normalization function, calculated as follows:

softmax(x)i = exp(xi) / Σj exp(xj)

where x represents any vector containing n elements (x1, x2, ..., xn), xi and xj represent the i-th and j-th elements of x respectively, and exp(·) represents the exponential function with base e;
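The softmax function just defined can be implemented directly (a minimal NumPy version; subtracting the maximum before exponentiating is a standard numerical-stability trick not mentioned in the claim):

```python
import numpy as np

def softmax(x):
    """softmax(x)_i = exp(x_i) / sum_j exp(x_j)."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())        # shift by max for numerical stability
    return e / e.sum()

w = softmax([1.0, 2.0, 3.0])
print(w.round(4))                  # weights sum to 1; largest input dominates
```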
S52, construct the frame-level correlation fuzzy matrix Ruf, the clip-level correlation fuzzy matrix Ruc and the subtitle-level correlation fuzzy matrix Rus; the construction formulas are as follows:

Ruf = (ruf)N×3, Ruc = (ruc)N×3, Rus = (rus)N×3, with 0 ≤ ruf, ruc, rus ≤ 1;
S53, calculate the Boolean λ intercept (λ-cut) matrices of the frame-level, clip-level and subtitle-level correlation fuzzy matrices Ruf, Ruc, Rus; following the standard λ-cut of fuzzy mathematics, each element of an intercept matrix equals 1 if the corresponding correlation is greater than or equal to λ, and 0 otherwise;
S54, from the Boolean λ intercept matrices, lock onto the valid clues related to the question among the multiple events using a union operation that follows the basic operation rules of fuzzy sets, where tmin = [tstart, tend] represents the shortest timestamp of the event associated with the question, tstart represents the start time of the shortest timestamp, and tend represents its end time;
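Steps S53 and S54 can be sketched as follows (the λ-cut sets each element of a fuzzy matrix to 1 where the membership value reaches the threshold λ and to 0 otherwise, and the fuzzy-set union is taken element-wise as a maximum; the value of λ, the toy relevance values, and the per-event timestamp lookup are illustrative assumptions):

```python
import numpy as np

def lambda_cut(R, lam):
    """Boolean lambda-cut matrix: 1 where membership >= lambda, else 0."""
    return (np.asarray(R) >= lam).astype(int)

# Toy correlation fuzzy values for N = 4 events at the three levels.
R_uf = np.array([0.05, 0.40, 0.45, 0.10])   # frame-level relevance
R_uc = np.array([0.10, 0.30, 0.50, 0.10])   # clip-level relevance
R_us = np.array([0.20, 0.20, 0.55, 0.05])   # subtitle-level relevance

lam = 0.25
cuts = np.stack([lambda_cut(R, lam) for R in (R_uf, R_uc, R_us)])

# Fuzzy-set union across levels: element-wise maximum.
clue = cuts.max(axis=0)

# Shortest timestamp covering the question-relevant events
# (event_times is an assumed lookup of per-event [start, end] seconds).
event_times = np.array([[0, 15], [15, 30], [30, 45], [45, 60]])
relevant = event_times[clue == 1]
t_min = [int(relevant[:, 0].min()), int(relevant[:, 1].max())]
print(clue.tolist(), t_min)                 # [0, 1, 1, 0] [15, 45]
```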
S55, use the valid clues associated with the question to update the frame-level, clip-level and subtitle-level event embeddings Mf, Mc, Ms obtained in step S4 into key-event embeddings: element-by-element multiplication (⊙) fuses the valid clues with the event embeddings obtained in step S4, and a linear layer with weight matrices Wf, Wc, Ws and bias parameters bf, bc, bs updates each event embedding into a key-event embedding that contains only information related to the question;
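Step S55 can be sketched as follows (a hedged reading of the description: the valid-clue mask is fused with each event embedding by element-wise multiplication, then a linear layer with weight matrix W and bias b maps the result to the key-event embedding; dimensions and initialization are illustrative):

```python
import numpy as np
rng = np.random.default_rng(2)

def update_key_events(M, clue, W, b):
    """M_key = (clue ⊙ M) W + b : fuse the question clue with the event
    embedding, then apply a linear layer to keep only relevant content."""
    gated = clue[:, None] * M          # element-wise fusion with the clue
    return gated @ W + b               # linear-layer update

N, d = 4, 8
M_f = rng.normal(size=(N, d))          # frame-level event embedding
clue = np.array([0.0, 1.0, 1.0, 0.0])  # valid-clue mask from step S54
W_f = rng.normal(size=(d, d))          # weight matrix (randomly initialized)
b_f = np.zeros(d)                      # bias parameter

K_f = update_key_events(M_f, clue, W_f, b_f)
print(K_f.shape)                       # (4, 8); masked rows reduce to the bias
```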
S6, designing a context-information generation module based on an attention mechanism, and generating question-guided and answer-guided context information according to the extracted question features and answer features;

S7, designing an adaptive answer-selection module, and obtaining the predicted answer according to the extracted answer features.
3. The method according to claim 2, wherein the specific method for extracting the event features at the frame level and the clip level of the video at step S1 is as follows:
extracting frame-level event features:
extracting the frame-level event features of the input video using a pre-trained residual neural network: f = {f1, f2, ..., fN}, where f represents the frame-level event features of the whole video, fi represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
extracting the clip-level event features of the input video using a pre-trained three-dimensional convolutional network: c = {c1, c2, ..., cN}, where c represents the clip-level event features of the whole video, ci represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
4. The method according to claim 3, wherein the specific method of step S2 for extracting the subtitle-level event features is as follows: extracting the subtitle-level event features from the subtitles corresponding to the input video using a pre-trained BERT model: s = {s1, s2, ..., sN}, where s represents the subtitle-level event features of the whole video, si represents the i-th subtitle-level event feature, and N represents the number of events in the video.
5. The method of claim 4, wherein the step S3 of extracting question and answer features and the step S4 of obtaining event embedding at frame level, clip level and subtitle level are specific methods of:
extracting the question features:
for the input question, the question features are extracted using the pre-trained word-embedding model GloVe: u = {q1, q2, ..., qM}, where u represents the features of the whole question, qi represents the feature of the i-th word, and M represents the length of the question;
extracting the answer features:
for the input candidate answers, the features of the candidate answers are extracted using a pre-trained word-embedding model: g = {g_1^j, g_2^j, ..., g_T^j}, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer;
the specific method for obtaining event embedding at the frame level, clip level and subtitle level is as follows:
training a network combining a linear activation layer and a convolution layer, and embedding the frame-level event features f, the clip-level event features c and the subtitle-level event features s into memory arrays respectively to obtain the frame-level event embedding Mf, the clip-level event embedding Mc and the subtitle-level event embedding Ms.
6. The method according to claim 5, wherein the specific method for generating the context information with question guidance and with answer guidance in step S6 is as follows:
S61, generate frame-level, clip-level and subtitle-level context information with question guidance;

S62, generate frame-level, clip-level and subtitle-level context information with answer guidance;
S63, fuse the frame-level, clip-level and subtitle-level information to generate the question-guided context information Ou;

S64, fuse the frame-level, clip-level and subtitle-level information to generate the answer-guided context information Og.
7. The method of claim 6, wherein the step S7 of obtaining the predicted answer according to the extracted answer features comprises:
S71, combining the question-guided context information Ou, the answer-guided context information Og and the answer features g, dynamically calculate the confidence z of each candidate answer using an adaptive weight α;

S72, select the answer with the highest confidence among the candidate answers as the predicted answer.
8. A computer comprising a memory storing a computer program and a processor, wherein the processor, when executing the computer program, implements the steps of the method for shortening timestamps and solving multi-event video question-answering through a network according to any one of claims 2 to 7.
9. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for shortening timestamps and solving multi-event video question-answering through a network according to any one of claims 2 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110896068.2A CN113590879B (en) | 2021-08-05 | 2021-08-05 | System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113590879A CN113590879A (en) | 2021-11-02 |
CN113590879B true CN113590879B (en) | 2022-05-31 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115712760B (en) * | 2022-11-29 | 2023-04-21 | 哈尔滨理工大学 | Binary code abstract generation method and system based on BERT model and deep equal-length convolutional neural network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763444A (en) * | 2018-05-25 | 2018-11-06 | 杭州知智能科技有限公司 | The method for solving video question and answer using hierarchical coding decoder network mechanism |
CN110322423A (en) * | 2019-04-29 | 2019-10-11 | 天津大学 | A kind of multi-modality images object detection method based on image co-registration |
CN112488316A (en) * | 2020-12-11 | 2021-03-12 | 合肥讯飞数码科技有限公司 | Event intention reasoning method, device, equipment and storage medium |
CN112559698A (en) * | 2020-11-02 | 2021-03-26 | 山东师范大学 | Method and system for improving video question-answering precision based on multi-mode fusion model |
CN112860945A (en) * | 2021-01-07 | 2021-05-28 | 国网浙江省电力有限公司 | Method for multi-mode video question-answering by using frame-subtitle self-supervision |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160350653A1 (en) * | 2015-06-01 | 2016-12-01 | Salesforce.Com, Inc. | Dynamic Memory Network |
US10375237B1 (en) * | 2016-09-12 | 2019-08-06 | Verint Americas Inc. | Virtual communications assessment system in a multimedia environment |
CN110516791B (en) * | 2019-08-20 | 2022-04-22 | 北京影谱科技股份有限公司 | Visual question-answering method and system based on multiple attention |
CN110990628A (en) * | 2019-12-06 | 2020-04-10 | 浙江大学 | Method for solving video question and answer by utilizing multi-granularity convolutional network self-attention context network mechanism |
Non-Patent Citations (2)
Title |
---|
Video Question Answering:a Survey of Models and Datasets;Guanglu Sun等;《Springer》;20210125;第1904-1937页 * |
A deep memory fusion model for long-video question answering; Sun Guanglu et al.; Journal of Harbin University of Science and Technology; Feb. 2021; Vol. 26, No. 1; pp. 1-8 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||