CN113590879B - System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network - Google Patents

System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network

Info

Publication number
CN113590879B
Authority
CN
China
Prior art keywords
level
event
answer
video
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110896068.2A
Other languages
Chinese (zh)
Other versions
CN113590879A (en)
Inventor
孙广路
梁丽丽
李天麟
刘昕雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110896068.2A priority Critical patent/CN113590879B/en
Publication of CN113590879A publication Critical patent/CN113590879A/en
Application granted granted Critical
Publication of CN113590879B publication Critical patent/CN113590879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Studio Circuits (AREA)

Abstract

The invention provides a system, a method, a computer and a storage medium for solving multi-event video question answering with a shortened-timestamp network, belonging to the intersection of computer vision and natural language processing. Event embeddings are extracted from the video and its subtitles at multiple levels, and the features of the question and the candidate answers are extracted. Question-guided attention produces attention weights for the different events, and an intercept (λ-cut) matrix from fuzzy theory is used to extract the key event embeddings in the video. The question and the answers each attend to the key event embeddings of the different modalities to generate question-guided and answer-guided context information, which is adaptively fused to generate the answer. Compared with general video question-answering schemes, the method extracts multi-modal embeddings of multiple events from the video, screens out the key events using the intercept matrix and related tools from fuzzy mathematics, and improves answer accuracy by removing redundant information. The method performs better on video question answering than traditional methods.

Description

System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
Technical Field
The invention relates to a video question-answering method, and in particular to a system, method, computer and storage medium for solving multi-event video question answering with a shortened-timestamp network, belonging to the intersection of computer vision and natural language processing.
Background
Video question answering is an important problem in the fields of artificial intelligence and deep learning: given an input video containing audio-visual information and a corresponding text question, the task is to automatically select, from several given candidate answers, the choice that answers the question and best matches the video content as the predicted answer.
Video usually consists of consecutive frames and therefore contains more temporal information than still images, such as scene transitions and object motion. In addition, the questions and candidate answers in video question answering are typically continuous text sequences. The core techniques for solving the video question-answering task in the prior art therefore derive mainly from natural language processing: the encoder-decoder structure, the attention mechanism and the memory network. The encoder-decoder structure encodes the temporal information of the video and the information of the question with an encoder and then generates the answer with a decoder. The attention model computes the similarity between the question and the video, assigns higher weights to the video information related to the question, and generates the answer on that basis. The memory network stores all the information of a longer video in a memory array to prevent information loss during encoding. The mainstream approach to video question answering is to combine the encoder-decoder structure, the attention mechanism and the memory network organically, assisted by techniques such as reinforcement learning and generative adversarial networks, to generate more accurate answers.
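As a concrete illustration of the attention idea described above, the following PyTorch sketch weights per-frame video features by their similarity to a question vector. It is a minimal, hypothetical example: the tensor shapes and the dot-product scoring function are assumptions for illustration, not taken from any particular prior-art system.

```python
import torch
import torch.nn.functional as F

def question_guided_attention(video_feats: torch.Tensor, question_vec: torch.Tensor) -> torch.Tensor:
    """Weight video features by their similarity to the question.

    video_feats:  (N, D) tensor, one feature vector per frame or clip.
    question_vec: (D,)   tensor, pooled question representation.
    Returns an attended (D,) summary of the video.
    """
    scores = video_feats @ question_vec              # (N,) similarity scores
    weights = F.softmax(scores, dim=0)               # higher weight for question-related frames
    return (weights.unsqueeze(1) * video_feats).sum(dim=0)

# toy usage
v = torch.randn(8, 300)    # 8 frames, 300-d features
q = torch.randn(300)
summary = question_guided_attention(v, q)
print(summary.shape)       # torch.Size([300])
```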
However, as research on video question answering has developed, the videos being queried have grown longer: the average durations of videos in question-answering datasets based on movies and on television shows reach roughly 200 seconds and 60 seconds respectively, and such videos may contain many events rather than a single one. Solving multi-event video question answering over long videos therefore requires two additional capabilities: the ability to discover and locate question-related events in a lengthy video, and the ability to accurately reason about the relationships between question-related events and the information within those events. Existing video question-answering techniques encode the information of the whole video and reason over it with an attention mechanism, which forces the model to consider excessive redundant information, hampering the understanding of the video and reducing the accuracy of answer prediction. Experiments show that if the key event related to the question can be accurately located in the long video, i.e. its start time and end time are determined, the accuracy of video question-answer prediction can be effectively improved.
To solve the above problems, the present invention processes the event information in the video with an attention mechanism and fuzzy theory, shortens the timestamp of the information related to the question to precisely locate the key event, and uses the attention mechanism to infer question-guided context information and answer-guided context information respectively in order to predict the answer.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
The invention provides a system for solving multi-event video question answering with a shortened-timestamp network, which comprises a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module, wherein:
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event characteristics of the video;
the subtitle level extraction module is used for extracting event characteristics of a subtitle level;
the question and answer extraction module is used for extracting the input question and candidate answer features;
the event embedding module is used for respectively embedding the event characteristics into a memory array to obtain event embedding of a frame level, a clipping level and a subtitle level;
the key event embedding module is used for screening out key event embedding at a frame level, a clipping level and a subtitle level from event features;
the context information module is used for generating context information with question guidance and answer guidance;
the answer selection module is used for obtaining a predicted answer.
A method for solving multi-event video question answering with a shortened-timestamp network comprises the following steps:
S1, extracting the frame-level and clip-level event features of the video for the input video;
S2, extracting subtitle-level event features for the subtitles corresponding to the input video;
S3, extracting the question and answer features for the input question and candidate answers;
S4, embedding the event features into memory arrays respectively to obtain frame-level, clip-level and subtitle-level event embeddings;
S5, designing a timestamp-shortening module based on the attention mechanism and fuzzy theory, and screening frame-level, clip-level and subtitle-level key event embeddings out of the event features according to the extracted question features;
S6, designing a context-information generation module based on the attention mechanism, and generating question-guided and answer-guided context information according to the extracted question features and answer features;
S7, designing an adaptive answer selection module, and obtaining the predicted answer according to the extracted answer features.
Preferably, the specific method for extracting the event features at the frame level and the clip level of the video in step S1 is as follows:
extracting frame-level event features:
for the input video, a pre-trained residual neural network is used to extract its frame-level event features f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, a pre-trained three-dimensional convolutional network is used to extract its clip-level event features c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
Preferably, the specific method for extracting the subtitle-level event features in step S2 is as follows: for the subtitles corresponding to the input video, a pre-trained BERT model is used to extract the subtitle-level event features s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video.
Preferably, the specific method for extracting the question and answer features in step S3 and obtaining the event embedding at the frame level, the clip level and the subtitle level in step S4 is as follows:
extracting problem features:
for the input question, question features u = {q_1, q_2, ..., q_M} are extracted using the pre-trained word embedding model GloVe, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
and (3) answer features are extracted:
for the input candidate answers, the features g of the candidate answers are extracted using a pre-trained word embedding model, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer.
The specific method for obtaining event embedding at the frame level, clip level and subtitle level is as follows:
a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s.
Preferably, the specific method for screening key event embedding at frame level, clip level and subtitle level from event features in step S5 is as follows:
S51, an attention layer is designed to calculate the correlations r_uf, r_uc and r_us between the event embeddings at the different levels and the question feature u:

r_uf = softmax(u M_f)
r_uc = softmax(u M_c)
r_us = softmax(u M_s)

where softmax(·) is the soft normalization function

softmax(x_i) = exp(x_i) / Σ_j exp(x_j), j = 1, ..., n

where x denotes any vector with n elements (x_1, x_2, ..., x_n), x_i and x_j denote the i-th and j-th elements of x, and exp(·) denotes the exponential function with base e;
S52, the frame-level correlation fuzzy matrix R_uf, the clip-level correlation fuzzy matrix R_uc and the subtitle-level correlation fuzzy matrix R_us are constructed as follows:

R_uf = (r_uf)_{N×3}, R_uc = (r_uc)_{N×3}, R_us = (r_us)_{N×3}, 0 ≤ r_uf, r_uc, r_us ≤ 1

S53, the Boolean λ intercept (λ-cut) matrices R_uf^λ, R_uc^λ and R_us^λ of the frame-level, clip-level and subtitle-level correlation fuzzy matrices R_uf, R_uc and R_us are calculated: each element of an intercept matrix is 1 if the corresponding element of the fuzzy matrix is greater than or equal to the threshold λ, and 0 otherwise;
S54, for the Boolean λ intercept matrices R_uf^λ, R_uc^λ and R_us^λ, the valid clue R_tmin related to the question is locked onto across the multiple events by taking their union (following the basic operation rules of fuzzy sets), where t_min = [t_start, t_end] denotes the shortest timestamp of the events related to the question, t_start denotes the start time of the shortest timestamp and t_end denotes its end time;
S55, the valid clue R_tmin related to the question is used to update the frame-level, clip-level and subtitle-level event embeddings M_f, M_c and M_s obtained in step S4 into key event embeddings: the valid clue is fused with each event embedding by element-wise multiplication, and the result is passed through a linear layer with weight matrix W_f, W_c or W_s and bias parameter b_f, b_c or b_s respectively, so that the updated embeddings contain only information related to the question.
Preferably, the specific method for generating the question-guided and answer-guided context information in step S6 is as follows:
S61, generating question-guided frame-level, clip-level and subtitle-level context information;
S62, generating answer-guided frame-level, clip-level and subtitle-level context information;
S63, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u;
S64, fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g.
Preferably, the specific method for obtaining the predicted answer from the extracted answer features in step S7 is as follows:
S71, combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, and dynamically calculating the confidence z of each candidate answer using an adaptive weight α;
S72, selecting the answer with the highest confidence among the candidate answers as the predicted answer;
S73, comparing the predicted answer with the true answer in the training data, and updating the parameters of the shortened-timestamp network according to the difference.
A computer comprising a memory storing a computer program and a processor, the processor implementing the steps of the above method for solving multi-event video question answering with a shortened-timestamp network when executing the computer program.
A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the above method for solving multi-event video question answering with a shortened-timestamp network.
The invention has the following beneficial effects. The invention extracts frame-level, clip-level and subtitle-level features for the multiple events in a video, which improves the efficiency of acquiring information such as scenes, appearance and motion within the events and strengthens the ability to acquire video information. To address the excessive redundant information in long videos, the invention designs a timestamp-shortening module based on the attention mechanism and a fuzzy matrix, selects the key events related to the question from the multiple events of the video, and removes redundant information, thereby improving the accuracy of reasoning. The invention also designs a method that generates context information separately from the question features and the candidate-answer features, fully integrating the acquired event embeddings with the question and the candidate answers and improving the proposed method's ability to understand video information.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of a system according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the timestamp-shortening module in accordance with an embodiment of the present invention;
FIG. 4 is the overall framework of the shortened-timestamp network according to an embodiment of the present invention.
Detailed Description
To make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. Clearly, the described embodiments are only some of the embodiments of the present application and are not exhaustive of all embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.
The first embodiment is as follows:
This embodiment is described with reference to FIG. 1. The system of this embodiment for solving multi-event video question answering with a shortened-timestamp network includes a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module;
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event characteristics of the video;
the caption level extraction module is used for extracting the event characteristics of the caption level;
the question and answer extraction module is used for extracting the input question and candidate answer features;
the event embedding module is used for respectively embedding the event characteristics into a memory array to obtain event embedding of a frame level, a clipping level and a subtitle level;
the key event embedding module is used for screening key event embedding at a frame level, a clipping level and a subtitle level from event features;
the context information module is used for generating context information with question guidance and answer guidance;
the answer selection module is used for obtaining a predicted answer.
Example two:
This embodiment is described with reference to FIGS. 2 to 4. The method of this embodiment for solving multi-event video question answering with a shortened-timestamp network includes the following steps:
S1, for the input video, extracting the frame-level and clip-level event features of the video using a residual neural network and a three-dimensional convolutional network respectively;
extracting frame-level event features:
for the input video, a pre-trained residual neural network is used to extract its frame-level event features f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, a pre-trained three-dimensional convolutional network is used to extract its clip-level event features c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
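A minimal sketch of this visual feature extraction is given below. It assumes torchvision's pre-trained ResNet-152 as the residual network and its R3D-18 model as the three-dimensional convolutional network (with the torchvision ≥ 0.13 weight-enum API); the specific backbones, input resolutions and per-event sampling strategy are assumptions for illustration, not prescribed by the patent.

```python
import torch
from torchvision.models import resnet152, ResNet152_Weights
from torchvision.models.video import r3d_18, R3D_18_Weights

# Pre-trained 2D residual network for frame-level features (classification head removed).
resnet = resnet152(weights=ResNet152_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

# Pre-trained 3D convolutional network for clip-level features.
c3d = r3d_18(weights=R3D_18_Weights.DEFAULT)
c3d.fc = torch.nn.Identity()
c3d.eval()

with torch.no_grad():
    # One representative frame per event: N events, 3 x 224 x 224 each (toy data, N = 4).
    frames = torch.randn(4, 3, 224, 224)
    f = resnet(frames)                     # frame-level event features f, shape (N, 2048)

    # One short clip per event: N events, 3 channels, 16 frames, 112 x 112.
    clips = torch.randn(4, 3, 16, 112, 112)
    c = c3d(clips)                         # clip-level event features c, shape (N, 512)
```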
S2, for the subtitles corresponding to the input video, extracting subtitle-level event features using a pre-trained model: a pre-trained BERT model is used to extract the subtitle-level event features s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video.
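A sketch of the subtitle-level extraction, assuming the Hugging Face transformers implementation of BERT and using each event's [CLS] vector as its subtitle-level feature; the choice of checkpoint and pooling is an assumption, and the subtitle strings are toy examples.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

subtitles = [
    "Sheldon knocks on the door three times.",    # subtitle text of event 1 (toy example)
    "Leonard answers a question about physics.",  # subtitle text of event 2 (toy example)
]

with torch.no_grad():
    batch = tokenizer(subtitles, padding=True, truncation=True, return_tensors="pt")
    out = bert(**batch)
    # Use the [CLS] token representation as the subtitle-level event feature s_i.
    s = out.last_hidden_state[:, 0, :]             # shape (N, 768)
```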
S3, for the input question and candidate answers, extracting the question and answer features using word embedding models;
extracting problem features:
for the input question, question features u = {q_1, q_2, ..., q_M} are extracted using the pre-trained word embedding model GloVe, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
and (3) answer features are extracted:
for the 5 input candidate answers, the features of the candidate answers are extracted using a pre-trained word embedding model, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer.
Providing 5 candidate answers reflects the usual multiple-choice setting in video question answering, which typically offers 4 to 5 candidate answers per question.
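A sketch of the question and candidate-answer feature extraction, assuming the torchtext GloVe vectors (6B, 300-d); whitespace tokenization and the example question/answers are simplifications and toy data, not the dataset's actual preprocessing.

```python
import torch
from torchtext.vocab import GloVe

glove = GloVe(name="6B", dim=300)   # pre-trained GloVe word embeddings

def embed_text(text: str) -> torch.Tensor:
    """Return a (num_words, 300) matrix of GloVe vectors (q_1..q_M or g^j_1..g^j_T)."""
    tokens = text.lower().split()
    return torch.stack([glove[t] for t in tokens])  # unknown words map to zero vectors

u = embed_text("what does Sheldon do before Leonard opens the door")   # question features
g = [embed_text(ans) for ans in [
    "he knocks three times", "he rings the bell", "he waits silently",
    "he calls Penny", "he leaves a note",
]]                                                                      # 5 candidate answers
```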
S4, embedding the event features into memory arrays respectively to obtain frame-level, clip-level and subtitle-level event embeddings: a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s.
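The patent does not specify the exact architecture of the "linear activation layer plus convolution layer" network, so the module below is a hypothetical sketch of one way to embed per-event features into a memory array M_f, M_c or M_s; the hidden size, kernel size and ReLU activation are assumptions.

```python
import torch
import torch.nn as nn

class MemoryEmbedding(nn.Module):
    """Embed per-event features (N, d_in) into a memory array (N, d_mem)."""
    def __init__(self, d_in: int, d_mem: int = 512):
        super().__init__()
        self.linear = nn.Linear(d_in, d_mem)                               # linear activation layer
        self.conv = nn.Conv1d(d_mem, d_mem, kernel_size=3, padding=1)      # 1-D conv over the event axis

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.linear(feats))     # (N, d_mem)
        x = x.t().unsqueeze(0)                 # (1, d_mem, N) for Conv1d
        x = self.conv(x).squeeze(0).t()        # back to (N, d_mem)
        return x

# One embedding network per modality (weights not shared), e.g. for frame-level features:
embed_f = MemoryEmbedding(d_in=2048)           # 2048-d features from the residual network
M_f = embed_f(torch.randn(4, 2048))            # frame-level event embedding, shape (4, 512)
```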
S5, according to the extracted frame-level, clip-level and subtitle-level event embeddings, designing a timestamp-shortening module based on the attention mechanism and fuzzy theory, and screening frame-level, clip-level and subtitle-level key event embeddings out of the event features according to the extracted question features;
S51, according to the extracted frame-level, clip-level and subtitle-level event embeddings M_f, M_c and M_s, an attention layer is designed to calculate the correlations r_uf, r_uc and r_us between the event embeddings at the different levels and the question feature u:

r_uf = softmax(u M_f)
r_uc = softmax(u M_c)
r_us = softmax(u M_s)

where softmax(·) is the soft normalization function

softmax(x_i) = exp(x_i) / Σ_j exp(x_j), j = 1, ..., n

where x denotes any vector with n elements (x_1, x_2, ..., x_n), x_i and x_j denote the i-th and j-th elements of x, and exp(·) denotes the exponential function with base e;
S52, for the correlations r_uf, r_uc and r_us between the event embeddings at the different levels and the question feature, the concept of the fuzzy matrix from fuzzy theory is introduced to construct the frame-level correlation fuzzy matrix R_uf, the clip-level correlation fuzzy matrix R_uc and the subtitle-level correlation fuzzy matrix R_us:

R_uf = (r_uf)_{N×3}, R_uc = (r_uc)_{N×3}, R_us = (r_us)_{N×3}, 0 ≤ r_uf, r_uc, r_us ≤ 1

where N represents the number of events in the video;
S53, according to the matrix-cut operation of fuzzy theory, the Boolean λ intercept (λ-cut) matrices R_uf^λ, R_uc^λ and R_us^λ of the frame-level, clip-level and subtitle-level correlation fuzzy matrices R_uf, R_uc and R_us are calculated: each element of an intercept matrix is 1 if the corresponding element of the fuzzy matrix is greater than or equal to the threshold λ, and 0 otherwise;
S54, for the Boolean λ intercept matrices R_uf^λ, R_uc^λ and R_us^λ, the valid clue R_tmin related to the question is locked onto across the multiple events by taking their union (following the basic operation rules of fuzzy sets), where t_min = [t_start, t_end] denotes the shortest timestamp of the events related to the question, t_start denotes the start time of the shortest timestamp and t_end denotes its end time;
S55, the valid clue R_tmin related to the question is used to update the frame-level, clip-level and subtitle-level event embeddings M_f, M_c and M_s obtained in step S4 into key event embeddings: the valid clue is fused with each event embedding by element-wise multiplication, and the result is passed through a linear layer with weight matrix W_f, W_c or W_s and bias parameter b_f, b_c or b_s respectively, so that the updated embeddings contain only information related to the question.
S6, designing a context-information generation module based on the attention mechanism, and generating question-guided and answer-guided context information according to the extracted question features and answer features;
according to the extracted frame-level, clip-level and subtitle-level key event embeddings M_f, M_c and M_s, the question feature u and the answer features g are used to attend to them respectively, generating question-guided frame-level, clip-level and subtitle-level context information and answer-guided frame-level, clip-level and subtitle-level context information.
S61, generating the question-guided frame-level, clip-level and subtitle-level context information by attending to the key event embeddings M_f, M_c and M_s with the question feature u;
S62, generating the answer-guided frame-level, clip-level and subtitle-level context information by attending to the key event embeddings M_f, M_c and M_s with the answer features g;
S63, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u: the obtained question-guided frame-level, clip-level and subtitle-level context information is fused using concatenation and element-level product operations, where the Concat(·) function concatenates all the information inside it;
S64, fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g: the obtained answer-guided frame-level, clip-level and subtitle-level context information is fused using concatenation and element-level product operations, where the Concat(·) function concatenates all the information inside it.
S7, designing an adaptive answer selection module, and obtaining the predicted answer according to the extracted answer features.
S71, combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, the confidence z of each candidate answer is dynamically calculated using an adaptive weight α;
S72, the answer with the highest confidence among the 5 candidate answers is selected as the predicted answer ŷ:

ŷ = a_j, where z_j = argmax_{i∈[1,5]}(z_i)

where argmax(·) denotes finding the largest element among several elements, a_j represents the j-th candidate answer, z_j represents the confidence of the j-th candidate answer, and z_i represents the confidence of the i-th candidate answer.
S73, the predicted answer ŷ is compared with the true answer in the training data, and the parameters of the shortened-timestamp network are updated according to the resulting loss.
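A sketch of the adaptive answer-selection step follows. The published formulas for the confidence z and the adaptive weight α are images, so the sigmoid-gated fusion, bilinear scoring and cross-entropy loss below are illustrative assumptions rather than the patented formulas; dimensions match the toy values used in the earlier sketches.

```python
import torch
import torch.nn as nn

class AnswerSelector(nn.Module):
    def __init__(self, d_ctx: int, d_ans: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # adaptive fusion weight α (learned)
        self.score = nn.Bilinear(d_ctx, d_ans, 1)      # scores a (context, answer) pair

    def forward(self, O_u, O_g, answers):
        """answers: (5, d_ans) pooled candidate-answer features; returns confidences z of shape (5,)."""
        alpha = torch.sigmoid(self.alpha)
        fused = alpha * O_u + (1.0 - alpha) * O_g      # adaptively fuse the two contexts
        fused = fused.unsqueeze(0).expand(answers.size(0), -1)
        return self.score(fused, answers).squeeze(1)

# Toy training step: compare the prediction with the ground-truth index and update the network.
selector = AnswerSelector(d_ctx=2048, d_ans=300)       # 2048 = 4 * 512 from the fused context sketch
z = selector(torch.randn(2048), torch.randn(2048), torch.randn(5, 300))
pred = z.argmax().item()                               # S72: highest-confidence answer
loss = nn.CrossEntropyLoss()(z.unsqueeze(0), torch.tensor([2]))   # S73: loss against the true answer
loss.backward()
```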
The method provided by the invention is subjected to experimental analysis:
The invention was experimentally verified on a self-constructed dataset that uses the American television series The Big Bang Theory as its video source. The dataset contains 461 hours of video in total, covering 925 scenes, from which 21,793 clips containing 118,974 events and 152,545 question-answer pairs were generated; the training set contains 122,039 question-answer pairs, the validation set 15,252 and the test set 7,623.
Question-answer pairs were generated from the video and subtitles using templates: a question template first locates the video segment associated with the question, with start and end timestamps, according to sentence patterns of the form "when ...", "before ..." or "after ...", and then poses one of four question types, "what", "how", "where" or "why", about that segment. The questions in this dataset are multiple-choice; each question has five candidate answers, only one of which is correct.
The video clips are 60 to 90 seconds long on average, contain a large amount of information about character activities and scenes, and exhibit rich dynamics and realistic social interaction. In addition, the dataset provides start and end timestamps for every event in every video clip, so the critical part of a clip can be accurately located according to the question.
To objectively evaluate the performance of the proposed method, the shortened-timestamp network is evaluated with classification accuracy, i.e. the ratio of the number of correctly answered questions to the total number of questions, a metric commonly used for classification tasks:

Accuracy = (1/M) Σ_{q∈Q_t} 1(ŷ = y)

where M represents the number of question-answer pairs, Q_t represents the set of questions asked, ŷ represents the predicted answer and y represents the true answer.
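The accuracy metric above can be computed as in the short sketch below, a straightforward implementation of the stated ratio with toy inputs.

```python
def accuracy(predicted: list[int], truth: list[int]) -> float:
    """Ratio of correctly predicted answers to the total number of question-answer pairs M."""
    assert len(predicted) == len(truth)
    correct = sum(int(p == y) for p, y in zip(predicted, truth))
    return correct / len(truth)

print(accuracy([2, 0, 4, 1], [2, 0, 3, 1]))   # 0.75
```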
In order to evaluate the performance of the algorithm, the invention respectively sets three experimental tasks by controlling input data:
the "S + Q" task: answering the given questions only according to the subtitle information of the video;
the "V + Q" task: answering the given question only according to the visual information of the video;
the "S + V + Q" task: and simultaneously, answering the given questions according to the visual information and the subtitle information of the video.
First, the method was tested according to the procedure described in the detailed embodiment. The method is denoted STN, results are measured as accuracy (%), and the test results are shown in Table 1.

Table 1. Test results of the method of the invention on the three experimental tasks

Method | "S+Q" task | "V+Q" task | "S+V+Q" task
STN    | 68.90      | 50.87      | 70.05
To verify the effectiveness of step S5 of the detailed method, five ablation schemes were designed and ablation tests were carried out on the three experimental tasks; results are measured as accuracy (%) and are shown in Table 2. The five ablation schemes are as follows:
(1) Step S5 is removed from the test procedure, and the event embeddings generated in step S4 are used directly to generate the context information and predict the answer.
(2) In steps S5 and S6, all operations related to the clip-level and subtitle-level event embeddings M_c and M_s are removed; key event embeddings and context information are generated from the frame-level event embedding M_f only, and the answer is predicted from that context information.
(3) In steps S5 and S6, all operations related to the frame-level and subtitle-level event embeddings M_f and M_s are removed; key event embeddings and context information are generated from the clip-level event embedding M_c only, and the answer is predicted from that context information.
(4) In steps S5 and S6, all operations related to the clip-level and frame-level event embeddings M_c and M_f are removed; key event embeddings and context information are generated from the subtitle-level event embedding M_s only, and the answer is predicted from that context information.
(5) STN, i.e. testing with the unmodified STN.

Table 2. Results of the ablation tests of the invention on the three experimental tasks for step S5 of the proposed method
Analysis of the experimental results shows that the proposed method effectively improves the accuracy of answer prediction.
The working principle of the invention is as follows:
video and subtitles are extracted as event embedding at multiple levels, and features of questions and candidate answers are extracted. Attention weights of different events are obtained by using problem-oriented attention, and key event embedding in the video is extracted by using an intercept matrix in a fuzzy theory. And respectively focusing on key event embedding of different modalities by using the question and the answer, and generating context information with question guidance and answer guidance. The context of question-oriented and answer-oriented are adaptively fused to generate answers. Compared with a general video question-answering scheme, the method extracts multi-mode embedding of a plurality of events from the video, screens out key events by using theories such as an intercept matrix in fuzzy mathematics and the like, and improves the accuracy of answering by removing redundant information. Compared with the traditional method, the effect obtained by the method in the video question answering is better.
The computer device of the present invention may be a device comprising a processor, a memory and the like, for example a single-chip microcomputer comprising a central processing unit. The processor is configured to implement the steps of the above method for solving multi-event video question answering with a shortened-timestamp network when executing the computer program stored in the memory.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
Computer-readable storage medium embodiments
The computer-readable storage medium of the present invention may be any form of storage medium readable by the processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory and the like, on which a computer program is stored; when the computer program stored in the memory is read and executed by the processor of the computer device, the above-described steps of the method for solving multi-event video question answering with a shortened-timestamp network can be implemented.
The computer program comprises computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention described herein. It should also be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, not restrictive, and the scope of the invention is defined by the appended claims.

Claims (9)

1. A shortened-timestamp network system for multi-event video question answering, characterized by comprising a frame-level and clip-level extraction module, a subtitle-level extraction module, a question and answer extraction module, an event embedding module, a key event embedding module, a context information module and an answer selection module;
the frame-level and clip-level extraction module is used for extracting the frame-level and clip-level event features of a video, and the specific method comprises the following steps:
extracting frame-level event features:
for the input video, a pre-trained residual neural network is used to extract its frame-level event features f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, a pre-trained three-dimensional convolutional network is used to extract its clip-level event features c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video;
the subtitle-level extraction module is used for extracting subtitle-level event features, and the specific method is as follows: for the subtitles corresponding to the input video, a pre-trained BERT model is used to extract the subtitle-level event features s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video;
the question and answer extraction module is used for extracting input question and candidate answer characteristics;
the specific method for extracting the question features is as follows:
for the input question, question features u = {q_1, q_2, ..., q_M} are extracted using the pre-trained word embedding model GloVe, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
the specific method for extracting the candidate answer features is as follows:
for the input candidate answers, the features g of the candidate answers are extracted using a pre-trained word embedding model, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer;
the event embedding module is used for embedding the event features into memory arrays respectively to obtain frame-level, clip-level and subtitle-level event embeddings, and the specific method is as follows: a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s;
The key event embedding module is used for screening key event embedding of a frame level, a clip level and a subtitle level from event features, and the specific method comprises the following steps:
a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s;
the context information module is used for generating question-guided and answer-guided context information, and the specific method is as follows: generating question-guided frame-level, clip-level and subtitle-level context information and answer-guided frame-level, clip-level and subtitle-level context information, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u, and fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g;
the answer selection module is used for obtaining the predicted answer, and the specific method is as follows: combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, the confidence z of each candidate answer is dynamically calculated using an adaptive weight α; the answer with the highest confidence among the candidate answers is selected as the predicted answer; and the predicted answer is compared with the true answer in the training data, and the parameters of the shortened-timestamp network are updated according to the difference.
2. A method for solving multi-event video question answering with a shortened-timestamp network, characterized by comprising the following steps:
s1, extracting the event characteristics of the frame level and the clip level of the video for the input video;
s2, extracting event features of a caption level for a caption corresponding to an input video;
s3, extracting the characteristics of the questions and the answers for the input questions and candidate answers;
s4, respectively embedding the event characteristics into a memory array to obtain frame-level, clip-level and subtitle-level event embedding;
S5, designing a timestamp-shortening module based on the attention mechanism and fuzzy theory, and screening frame-level, clip-level and subtitle-level key event embeddings out of the event features according to the extracted question features; the specific method comprises the following steps:
S51, an attention layer is designed to calculate the correlations r_uf, r_uc and r_us between the event embeddings at the different levels and the question feature u:

r_uf = softmax(u M_f)
r_uc = softmax(u M_c)
r_us = softmax(u M_s)

where softmax(·) is the soft normalization function

softmax(x_i) = exp(x_i) / Σ_j exp(x_j), j = 1, ..., n

where x denotes any vector with n elements (x_1, x_2, ..., x_n), x_i and x_j denote the i-th and j-th elements of x, and exp(·) denotes the exponential function with base e;
S52, the frame-level correlation fuzzy matrix R_uf, the clip-level correlation fuzzy matrix R_uc and the subtitle-level correlation fuzzy matrix R_us are constructed as follows:

R_uf = (r_uf)_{N×3}, R_uc = (r_uc)_{N×3}, R_us = (r_us)_{N×3}, 0 ≤ r_uf, r_uc, r_us ≤ 1

S53, the Boolean λ intercept (λ-cut) matrices R_uf^λ, R_uc^λ and R_us^λ of the frame-level, clip-level and subtitle-level correlation fuzzy matrices R_uf, R_uc and R_us are calculated: each element of an intercept matrix is 1 if the corresponding element of the fuzzy matrix is greater than or equal to the threshold λ, and 0 otherwise;
S54, for the Boolean λ intercept matrices R_uf^λ, R_uc^λ and R_us^λ, the valid clue R_tmin related to the question is locked onto across the multiple events by taking their union (following the basic operation rules of fuzzy sets), where t_min = [t_start, t_end] denotes the shortest timestamp of the events related to the question, t_start denotes the start time of the shortest timestamp and t_end denotes its end time;
S55, the valid clue R_tmin related to the question is used to update the frame-level, clip-level and subtitle-level event embeddings M_f, M_c and M_s obtained in step S4 into key event embeddings: the valid clue is fused with each event embedding by element-wise multiplication, and the result is passed through a linear layer with weight matrix W_f, W_c or W_s and bias parameter b_f, b_c or b_s respectively, so that the updated embeddings contain only information related to the question;
s6, designing a module for generating context information based on an attention mechanism, and generating context information with question guidance and answer guidance according to the extracted question features and answer features;
and S7, designing a self-adaptive answer selection module, and obtaining a predicted answer according to the extracted answer features.
3. The method according to claim 2, wherein the specific method for extracting the event features at the frame level and the clip level of the video at step S1 is as follows:
extracting frame-level event features:
for the input video, a pre-trained residual neural network is used to extract its frame-level event features f = {f_1, f_2, ..., f_N}, where f represents the frame-level event features of the entire video, f_i represents the i-th frame-level event feature of the video, and N represents the number of events in the video;
extracting clipping-level event features:
for the input video, a pre-trained three-dimensional convolutional network is used to extract its clip-level event features c = {c_1, c_2, ..., c_N}, where c represents the clip-level event features of the entire video, c_i represents the i-th clip-level event feature in the video, and N represents the number of events in the video.
4. The method according to claim 3, wherein the specific method for extracting the subtitle-level event features in step S2 is as follows: for the subtitles corresponding to the input video, a pre-trained BERT model is used to extract the subtitle-level event features s = {s_1, s_2, ..., s_N}, where s represents the subtitle-level event features of the entire video, s_i represents the i-th subtitle-level event feature, and N represents the number of events in the video.
5. The method according to claim 4, wherein the specific methods for extracting the question and answer features in step S3 and obtaining the frame-level, clip-level and subtitle-level event embeddings in step S4 are as follows:
extracting problem features:
for the input question, question features u = {q_1, q_2, ..., q_M} are extracted using the pre-trained word embedding model GloVe, where u represents the features of the entire question, q_i represents the feature of the i-th word, and M represents the length of the question;
and (3) answer features are extracted:
for the input candidate answers, the features g of the candidate answers are extracted using a pre-trained word embedding model, where g represents the features of all candidate answers, g_i^j represents the i-th word feature of the j-th candidate answer, and T represents the maximum number of words in a candidate answer;
the specific method for obtaining the frame-level, clip-level and subtitle-level event embeddings is as follows:
a network combining a linear activation layer and a convolution layer is trained, and the frame-level event features f, clip-level event features c and subtitle-level event features s are embedded into memory arrays respectively to obtain the frame-level event embedding M_f, the clip-level event embedding M_c and the subtitle-level event embedding M_s.
6. The method according to claim 5, wherein the specific method for generating the question-guided and answer-guided context information in step S6 is as follows:
S61, generating question-guided frame-level, clip-level and subtitle-level context information;
S62, generating answer-guided frame-level, clip-level and subtitle-level context information;
S63, fusing the frame-level, clip-level and subtitle-level information to generate the question-guided context information O_u;
S64, fusing the frame-level, clip-level and subtitle-level information to generate the answer-guided context information O_g.
7. The method according to claim 6, wherein the specific method for obtaining the predicted answer from the extracted answer features in step S7 is as follows:
S71, combining the question-guided context information O_u, the answer-guided context information O_g and the answer features g, and dynamically calculating the confidence z of each candidate answer using an adaptive weight α;
S72, selecting the answer with the highest confidence among the candidate answers as the predicted answer;
S73, comparing the predicted answer with the true answer in the training data, and updating the parameters of the shortened-timestamp network according to the difference.
8. A computer comprising a memory storing a computer program and a processor, wherein the processor, when executing the computer program, implements the steps of the method for solving multi-event video question answering with a shortened-timestamp network according to any one of claims 2 to 7.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for solving multi-event video question answering with a shortened-timestamp network according to any one of claims 2 to 7.
CN202110896068.2A 2021-08-05 2021-08-05 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network Active CN113590879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110896068.2A CN113590879B (en) 2021-08-05 2021-08-05 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110896068.2A CN113590879B (en) 2021-08-05 2021-08-05 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network

Publications (2)

Publication Number Publication Date
CN113590879A CN113590879A (en) 2021-11-02
CN113590879B (en) 2022-05-31

Family

ID=78255354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110896068.2A Active CN113590879B (en) 2021-08-05 2021-08-05 System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network

Country Status (1)

Country Link
CN (1) CN113590879B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712760B (en) * 2022-11-29 2023-04-21 哈尔滨理工大学 Binary code abstract generation method and system based on BERT model and deep equal-length convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763444A (en) * 2018-05-25 2018-11-06 杭州知智能科技有限公司 The method for solving video question and answer using hierarchical coding decoder network mechanism
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN112488316A (en) * 2020-12-11 2021-03-12 合肥讯飞数码科技有限公司 Event intention reasoning method, device, equipment and storage medium
CN112559698A (en) * 2020-11-02 2021-03-26 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN112860945A (en) * 2021-01-07 2021-05-28 国网浙江省电力有限公司 Method for multi-mode video question-answering by using frame-subtitle self-supervision

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160350653A1 (en) * 2015-06-01 2016-12-01 Salesforce.Com, Inc. Dynamic Memory Network
US10375237B1 (en) * 2016-09-12 2019-08-06 Verint Americas Inc. Virtual communications assessment system in a multimedia environment
CN110516791B (en) * 2019-08-20 2022-04-22 北京影谱科技股份有限公司 Visual question-answering method and system based on multiple attention
CN110990628A (en) * 2019-12-06 2020-04-10 浙江大学 Method for solving video question and answer by utilizing multi-granularity convolutional network self-attention context network mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763444A (en) * 2018-05-25 2018-11-06 杭州知智能科技有限公司 The method for solving video question and answer using hierarchical coding decoder network mechanism
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN112559698A (en) * 2020-11-02 2021-03-26 山东师范大学 Method and system for improving video question-answering precision based on multi-mode fusion model
CN112488316A (en) * 2020-12-11 2021-03-12 合肥讯飞数码科技有限公司 Event intention reasoning method, device, equipment and storage medium
CN112860945A (en) * 2021-01-07 2021-05-28 国网浙江省电力有限公司 Method for multi-mode video question-answering by using frame-subtitle self-supervision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Video Question Answering: a Survey of Models and Datasets; Guanglu Sun et al.; Springer; 2021-01-25; pp. 1904-1937 *
Deep memory fusion model for long-video question answering; 孙广路 et al.; Journal of Harbin University of Science and Technology; 2021-02-28; Vol. 26, No. 1; pp. 1-8 *

Also Published As

Publication number Publication date
CN113590879A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN109544524B (en) Attention mechanism-based multi-attribute image aesthetic evaluation system
CN110097094B (en) Multiple semantic fusion few-sample classification method for character interaction
CN110929515A (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN111159414B (en) Text classification method and system, electronic equipment and computer readable storage medium
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
CN115861995B (en) Visual question-answering method and device, electronic equipment and storage medium
CN114218379A (en) Intelligent question-answering system-oriented method for attributing questions which cannot be answered
CN112163560A (en) Video information processing method and device, electronic equipment and storage medium
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN113590879B (en) System, method, computer and storage medium for shortening timestamp and solving multi-event video question-answering through network
CN116881470A (en) Method and device for generating question-answer pairs
CN115908613A (en) Artificial intelligence-based AI model generation method, system and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN117315249A (en) Image segmentation model training and segmentation method, system, equipment and medium
CN117112743A (en) Method, system and storage medium for evaluating answers of text automatic generation questions
CN113609330B (en) Video question-answering system, method, computer and storage medium based on text attention and fine-grained information
CN115599954B (en) Video question-answering method based on scene graph reasoning
CN112861580A (en) Video information processing method and device based on video information processing model
CN113609355B (en) Video question-answering system, method, computer and storage medium based on dynamic attention and graph network reasoning
CN116186312A (en) Multi-mode data enhancement method for data sensitive information discovery model
CN116127085A (en) Question rewriting method and equipment for dialogue-oriented knowledge graph questions and answers
CN113516182B (en) Visual question-answering model training and visual question-answering method and device
CN113392722A (en) Method and device for recognizing emotion of object in video, electronic equipment and storage medium
CN112446206A (en) Menu title generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant