CN115391511A - Video question-answering method, device, system and storage medium - Google Patents

Video question-answering method, device, system and storage medium Download PDF

Info

Publication number
CN115391511A
CN115391511A (Application CN202211043431.7A)
Authority
CN
China
Prior art keywords
video
text
feature vector
question
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211043431.7A
Other languages
Chinese (zh)
Inventor
王炳乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202211043431.7A priority Critical patent/CN115391511A/en
Publication of CN115391511A publication Critical patent/CN115391511A/en
Priority to PCT/CN2023/111455 priority patent/WO2024046038A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video question-answering method, device, system and storage medium are provided. The method includes: extracting a video feature vector from an input video, and extracting text feature vectors from a question text and a candidate answer text; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, and learning cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism to obtain a second spliced feature vector; dividing the second spliced feature vector into a second video feature vector and a second text feature vector, inputting them into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and respectively pooling and fusing the video expression and the text expression to obtain a fused feature vector; and predicting the correct candidate answer according to the fused feature vector.

Description

Video question-answering method, device, system and storage medium
Technical Field
The disclosed embodiments relate to, but are not limited to, the field of natural language processing technologies, and in particular to a video question-answering method, device, system, and storage medium.
Background
In the current era of mobile internet and big data, video data on the network is growing explosively. As video becomes an increasingly rich information-bearing medium, understanding the semantics of videos is a foundational technology for many intelligent video applications and has important research significance and practical application value. Video question answering (Video QA) is the task of inferring the correct answer from a candidate set given a video clip and a question. With the progress of computer vision and natural language processing, the wide application of video question answering in video retrieval, intelligent question-answering systems, driver-assistance systems, automatic driving, and the like is receiving increasing attention.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the disclosure provides a video question-answering method, which comprises the following steps:
extracting video feature vectors for an input video, and extracting text feature vectors for a question text and a candidate answer text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, and learning cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism to obtain a coded second spliced feature vector;
dividing the second spliced feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and pooling and fusing the video expression and the text expression respectively to obtain a fused feature vector;
the fused feature vectors are input to a decoding layer to predict correct candidate answers.
In an exemplary embodiment, the extracting a video feature vector for an input video includes:
extracting frames from the input video at a preset rate, and extracting video feature vectors of the extracted frames by using a second pre-training model.
In an exemplary embodiment, the extracting text feature vectors for the question text and the candidate answer text includes:
generating a sequence string according to the question text and the candidate answer text, wherein the sequence string comprises a plurality of sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;
and inputting the sequence string into the first pre-training model to obtain a text feature vector.
In an exemplary embodiment, before the foregoing steps, the method further comprises:
constructing the first pre-training model and initializing;
pre-training the first pre-training model through a plurality of self-supervision tasks, wherein the plurality of self-supervision tasks comprise a label classification task, a mask language model task and a mask frame model task, the label classification task is used for performing multi-label classification on videos, the mask language model task is used for randomly masking words in the text and predicting the masked words, and the mask frame model task is used for randomly masking video frames and predicting the masked frames;
calculating a loss of the first pre-trained model by a weighted sum of losses of a plurality of the self-supervised tasks.
In an exemplary embodiment, the loss of the label classification task and the mask language model task is computed based on binary cross entropy, and the loss of the mask frame model task is computed based on noise contrastive estimation.
In an exemplary embodiment, the first pre-training model is a cascade neural network of 24 deep Transformer encoder layers, with a hidden layer dimension of 1024 and 16 attention heads, and is initialized with parameters pre-trained by the Bidirectional Encoder Representations from Transformers (BERT) model.
In an exemplary embodiment, the processing the second video feature vector and the second text feature vector through a mutual attention mechanism includes:
taking the second video feature vector as a query vector, taking the second text feature vector as a key vector and a value vector, and performing multi-head attention;
and taking the second text feature vector as a query vector, taking the second video feature vector as a key vector and a value vector, and performing multi-head attention.
In an exemplary embodiment, before the foregoing steps, the method further comprises:
receiving a voice input of a user;
converting the speech input into the question text by speech recognition.
In an exemplary embodiment, before the foregoing steps, the method further comprises:
acquiring the question text;
and generating the candidate answer text corresponding to the question text according to the question text.
In an exemplary embodiment, the generating the candidate answer text corresponding to the question text according to the question text includes:
searching the triples matched with the question text from the common sense knowledge graph through keyword matching or an attention mechanism model;
and generating the candidate answer text corresponding to the question text according to the matched triple.
In an exemplary embodiment, the method further comprises:
processing the video feature vector and/or the text feature vector such that a dimension of the video feature vector and a dimension of the text feature vector are the same when the video feature vector and the text feature vector are stitched.
The embodiment of the present disclosure further provides a video question answering device, which includes a memory; and a processor coupled to the memory, the processor configured to perform the steps of the video question answering method according to any one of the embodiments of the present disclosure based on instructions stored in the memory.
The embodiments of the present disclosure further provide a storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the video question answering method according to any one of the embodiments of the present disclosure.
The embodiment of the present disclosure further provides a video question-answering system, which includes a video question-answering device, a monitoring system, a voice recognition device, a voice input device, and a knowledge base, wherein:
the monitoring system is configured to acquire one or more monitoring videos, process the monitoring videos according to an instruction text and output the monitoring videos to the video question-answering device;
the voice input device is configured to receive voice input and output the voice input to the voice recognition device;
the voice recognition device is configured to convert voice input into an instruction text or a question text through voice recognition, input the instruction text into the monitoring system, and input the question text into the video question-answering device;
the knowledge base is configured to store a common sense knowledge graph;
the video question-answering device is configured to receive a question text and a monitoring video, and generate a candidate answer text according to the question text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; the video question-answering device extracts video feature vectors from the received monitoring video, extracts text feature vectors from the question text and the candidate answer text, splices the video feature vectors and the text feature vectors to obtain a spliced feature vector, and inputs the spliced feature vector into a first pre-training model, and the first pre-training model learns cross-modal information between the video feature vectors and the text feature vectors through a self-attention mechanism to obtain a coded second spliced feature vector; the second spliced feature vector is divided into a second video feature vector and a second text feature vector; the second video feature vector and the second text feature vector are input into a modal fusion model, the modal fusion model adopts a mutual attention mechanism to process the second video feature vector and the second text feature vector to obtain a video expression and a text expression, and the video expression and the text expression are respectively pooled and fused to obtain a fused feature vector; the fused feature vector is input to a decoding layer to predict the correct candidate answer.
Other aspects will become apparent upon reading the attached drawings and the detailed description.
Drawings
The accompanying drawings are included to provide an understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification; they illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure, not to limit it.
Fig. 1 is a schematic flow chart diagram of a video question answering method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a visual language cross-modal model created by an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a modality fusion model created by an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a traffic accident common sense knowledge graph according to an exemplary embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a video question answering device according to an exemplary embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a video question answering system according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Embodiments may be embodied in many different forms. One of ordinary skill in the art can readily appreciate the fact that the manner and content may be altered into one or more forms without departing from the spirit and scope of the present disclosure. Therefore, the present disclosure should not be construed as being limited to the contents described in the following embodiments. The embodiments and features of the embodiments in the present disclosure may be arbitrarily combined with each other without conflict.
In the drawings, the size of one or more constituent elements, the thickness of layers, or regions may be exaggerated for clarity. Therefore, one embodiment of the present disclosure is not necessarily limited to the dimensions, and the shapes and sizes of a plurality of components in the drawings do not reflect a true scale. Further, the drawings schematically show ideal examples, and one embodiment of the present disclosure is not limited to the shapes, numerical values, and the like shown in the drawings.
The ordinal numbers such as "first", "second", "third", and the like in the present disclosure are provided to avoid confusion of the constituent elements, and are not limited in number. "plurality" in this disclosure means two or more than two.
Computer Vision (CV) and Natural Language Processing (NLP) are two major branches of artificial intelligence that focus on simulating human intelligence in vision and language. In the past decade, deep learning has greatly advanced single-modality learning in both areas. Some visual question-answering techniques encode an input image with an image encoder, encode an input question with a question encoder, combine the encoded image and question features through a dot product, and then predict the probability of candidate answers through a fully connected layer. However, these question-answering techniques only consider single-modality information on each side and lack semantic comprehension ability, while real-world problems tend to be multi-modal. For example, an autonomous automobile should be able to handle human commands (speech), traffic signals (vision), and road conditions (vision and sound).
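The following is a minimal sketch (PyTorch) of such a baseline; the class name, feature dimensions and number of answers are illustrative assumptions rather than values taken from the disclosure. It only shows the dot-product-style fusion of separately encoded image and question features followed by a fully connected classifier.

```python
import torch
import torch.nn as nn

class DotProductVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hidden=1024, num_answers=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # image encoder output -> common space
        self.q_proj = nn.Linear(q_dim, hidden)       # question encoder output -> common space
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, img_feat, q_feat):
        fused = self.img_proj(img_feat) * self.q_proj(q_feat)  # element-wise (dot-product style) fusion
        return self.classifier(fused)                           # logits over candidate answers

logits = DotProductVQA()(torch.randn(2, 2048), torch.randn(2, 1024))  # (2, 1000)
```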
As shown in fig. 1, an embodiment of the present disclosure provides a video question answering method, including the following steps:
step 101: extracting video feature vectors from an input video, and extracting text feature vectors from a question text and a candidate answer text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, learning cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism, and obtaining a coded second spliced feature vector;
step 102: dividing the second splicing feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and pooling and fusing the video expression and the text expression respectively to obtain a fusion feature vector;
step 103: the fused feature vectors are input to the decoding layer to predict the correct candidate answer.
According to the video question-answering method, the video feature vector and the text feature vector are spliced and then input into the first pre-training model, so that vision-language cross-modal information improves the semantic representation capability for video question answering; the mutual attention mechanism enables deeper semantic interaction between the video and the question-option pairs, so that the video content is understood more deeply at the semantic level and the reasoning effect of the model is improved. The video question-answering method provided by the embodiment of the disclosure can be applied to various application fields such as video retrieval, intelligent question-answering systems, driver-assistance systems, and automatic driving.
In some exemplary embodiments, before step 101, the method further comprises:
constructing a first pre-training model and initializing;
pre-training the first pre-training model through a plurality of self-supervision tasks, wherein the plurality of self-supervision tasks comprise a label classification task, a mask language model task and a mask frame model task, the label classification task is used for performing multi-label classification on videos, the mask language model task is used for randomly masking words in the text and predicting the masked words, and the mask frame model task is used for randomly masking video frames and predicting the masked frames;
the loss of the first pre-trained model is calculated by a weighted sum of the losses of the self-supervised tasks.
Fig. 2 is a schematic structural diagram of a visual language cross-modal model created by an embodiment of the present disclosure. As shown in fig. 2, the visual language cross-modal model includes a first pre-training model, a modal fusion model, and a decoding layer in cascade.
In some exemplary embodiments, the first pre-training model may be a cascade neural network of 24 deep Transformer encoder layers, with a hidden layer dimension of 1024 and 16 attention heads, and the first pre-training model is initialized with parameters pre-trained by BERT.
The Transformer model is composed of an encoder and a decoder, each of which comprises a plurality of network blocks; each network block of the encoder is composed of a multi-head attention sub-layer and a feedforward neural network sub-layer. The structure of the decoder is similar to that of the encoder, except that each network block of the decoder contains one additional attention sub-layer.
Bidirectional Encoder Representations from Transformers (BERT) is a successful application of the Transformer: it uses the Transformer encoder and introduces a bidirectional masking technique that allows each language token to attend bidirectionally to the other tokens.
The first pre-training model of the embodiments of the present disclosure employs a cascade structure of multiple Transformer encoder layers. Illustratively, the first pre-training model may be a cascade neural network of 24 deep Transformer encoder layers, with a hidden layer dimension of 1024 and 16 attention heads; in this case, the network structure of the first pre-training model is consistent with BERT Large (a natural language pre-training model proposed by Google), and therefore the first pre-training model of the embodiment of the present disclosure may be directly initialized with the pre-training results of BERT and other natural language pre-training models.
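As a minimal sketch of this initialization, assuming the HuggingFace transformers library and the publicly released "bert-large-uncased" checkpoint (the disclosure does not name a specific checkpoint), a 24-layer encoder with the stated dimensions can be loaded directly:

```python
from transformers import BertModel

# BERT-Large already matches the structure described above:
# 24 encoder layers, hidden size 1024, 16 attention heads.
encoder = BertModel.from_pretrained("bert-large-uncased")
assert encoder.config.num_hidden_layers == 24
assert encoder.config.hidden_size == 1024
assert encoder.config.num_attention_heads == 16
```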
In some exemplary embodiments, extracting a video feature vector for an input video comprises:
and (3) extracting frames of the input video at a preset speed, and extracting video characteristic vectors of the extracted frames by adopting a second pre-training model.
The input of the video question answering is a video clip, a question and a candidate answer set. Unlike the processing of images, a continuous video stream can be understood as a set of fast-playing pictures, where each picture is defined as a frame. For the video information, the input video is framed at a preset rate (illustratively, 1 frame per second (fps)), at most N frames are extracted per video (illustratively, N may be 30), and video feature vectors (visual features) are extracted using a Computer Vision (CV) pre-training model, EfficientNetB3. Optionally, in the embodiment of the present disclosure, video feature vectors may also be extracted from the input video by using a pre-training model such as MobileNet or ResNet101.
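A minimal sketch of this step is shown below; it assumes the decord library for video decoding and timm for the EfficientNet-B3 backbone, both of which are illustrative substitutes for whatever frame reader and CV pre-training model an implementation actually uses.

```python
import torch
import timm
from decord import VideoReader

def extract_video_features(path, fps=1, max_frames=30):
    vr = VideoReader(path)
    step = max(int(round(vr.get_avg_fps() / fps)), 1)       # keep ~1 frame per second
    idx = list(range(0, len(vr), step))[:max_frames]        # at most N = 30 frames
    frames = torch.from_numpy(vr.get_batch(idx).asnumpy())  # (n, H, W, 3) uint8
    frames = frames.permute(0, 3, 1, 2).float() / 255.0     # (n, 3, H, W)
    frames = torch.nn.functional.interpolate(frames, size=(300, 300))  # EfficientNet-B3 input size
    backbone = timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)
    backbone.eval()
    with torch.no_grad():
        return backbone(frames)                             # (n, 1536) visual features
```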
In some exemplary embodiments, extracting the text feature vector for the question text and the candidate answer text comprises:
generating a sequence string according to the question text and the candidate answer text, wherein the sequence string comprises a plurality of sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;
and inputting the sequence string into a first pre-training model to obtain a text feature vector.
In the embodiment of the present disclosure, for the question text and the candidate answer text, a word list with a size of n_vocab is first used to tokenize the question text and the candidate answer text: each word or character is converted into one or more sequences, and all the words and/or characters in the question text and the candidate answer text form a sequence string, where each sequence may be a numerical ID. Illustratively, the numerical ID may be between 0 and n_vocab-1. For example, the question "what type of road is it currently" together with the candidates "lane", "highway" and "street" is converted into a sequence string such as "2560 2013 1300 100 567 …". The sequence string is input into the first pre-training model (which may here be the initialized first pre-training model) to obtain n_word × 1024 text feature vectors, where n_word is the number of sequences in the sequence string and 1024 is the dimension of each text feature vector.
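A minimal sketch of this tokenization, assuming a BERT-style tokenizer from the HuggingFace transformers library (the actual word list of size n_vocab used by the disclosure is not specified, and the checkpoint name is illustrative), could look as follows:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
question = "what type of road is it currently"
candidates = "lane highway street"
# each word is mapped to one or more numerical IDs in [0, n_vocab - 1]
ids = tokenizer.encode(question, candidates, add_special_tokens=False)
print(ids)  # the sequence string of numerical IDs fed to the first pre-training model
```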
In the embodiment of the present disclosure, when the video feature vector and the text feature vector are spliced and input into the first pre-training model, the input form of the spliced feature vector may be [CLS] Video_frame [SEP] Question-Answer [SEP], where Video_frame is the video feature vector, Question-Answer is the text feature vector, the [CLS] token is the head token of the spliced feature vector, and the [SEP] tokens are used to separate the video feature vector and the text feature vector. The same position embedding vectors and segment embedding vectors as in BERT are added to the input of the first pre-training model, where the position embedding vector indicates a position in the sequence, and the segment embedding vector indicates the segment to which each position belongs.
In some exemplary embodiments, before stitching the video feature vector with the text feature vector, the method further comprises: and processing the video feature vector and/or the text feature vector so that the dimensionality of the video feature vector is the same as the dimensionality of the text feature vector when the video feature vector is spliced with the text feature vector.
In the embodiment of the disclosure, when the video feature vector extracted by the pre-training model EfficientNetB3 has 1536 dimensions, it is reduced through a fully connected layer to 1024 dimensions, the same as the dimension of the text feature vector.
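A minimal sketch of this input preparation (PyTorch) is given below; only the 1536 → 1024 projection and the [CLS]/[SEP] splicing order with position and segment embeddings come from the description above, while details such as the maximum position count are assumptions.

```python
import torch
import torch.nn as nn

hidden = 1024
frame_proj = nn.Linear(1536, hidden)          # dimension reduction for EfficientNetB3 features
cls_emb = nn.Parameter(torch.randn(1, hidden))
sep_emb = nn.Parameter(torch.randn(1, hidden))
pos_emb = nn.Embedding(512, hidden)           # position embedding (assumes m + n + 3 <= 512)
seg_emb = nn.Embedding(2, hidden)             # segment embedding: 0 = video part, 1 = text part

def splice(video_feat, text_feat):            # (m, 1536), (n, 1024)
    v = frame_proj(video_feat)
    x = torch.cat([cls_emb, v, sep_emb, text_feat, sep_emb], dim=0)   # [CLS] V [SEP] QA [SEP]
    seg = torch.tensor([0] * (v.size(0) + 2) + [1] * (text_feat.size(0) + 1))
    pos = torch.arange(x.size(0))
    return x + pos_emb(pos) + seg_emb(seg)    # spliced feature vector for the first pre-training model

spliced = splice(torch.randn(30, 1536), torch.randn(40, 1024))        # (73, 1024)
```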
In the embodiment of the present disclosure, training a visual language cross-modal model includes two stages: and performing a pre-training phase on the first pre-training model and performing a fine-tuning phase on the visual language cross-modal model.
In the pre-training stage of the first pre-training model, three tasks, namely label classification (TC), mask language model (MLM), and mask frame model (MFM), are used to pre-train the first pre-training model. The label classification task is used for performing multi-label classification on the video, the mask language model task is used for randomly masking words in the text and predicting the masked words, and the mask frame model task is used for randomly masking video frames and predicting the masked frames.
For example, in the label classification task, the 100 tag classes with the highest frequency of occurrence may be used for a multi-label classification task, where the tags are manually labeled video tags, for example, whether an accident occurs, the accident cause, the accident location, the vehicle type, the weather condition, the time information, and the like. The vector corresponding to [CLS] in the last layer of BERT is connected to a fully connected layer to obtain the predicted tags, and the binary cross entropy loss (BCE loss) with respect to the real tags is calculated.
In the mask language model task, a random 15% of the text tokens are masked, and the masked tokens are predicted. In a multi-modal scenario, the masked words are predicted by also using the information of the video, which effectively fuses multi-modal information.
In the mask frame model task, a random 15% of the video frames are masked, and all-zero vectors are used for the masked video frames. Because the video feature vectors are continuous real-valued vectors without tokenization, it is difficult to formulate a classification task similar to the mask language model task. Since the prediction for a masked frame should be as similar as possible to the masked frame itself among all frames within the entire batch, the embodiments of the present disclosure calculate the loss of the mask frame model based on Noise Contrastive Estimation (NCE), maximizing the mutual information between the masked frame and the predicted frame.
Multi-task joint training is adopted: the loss of the total pre-training task is the weighted sum of the losses of the three pre-training tasks, Loss = a·Loss(TC) + b·Loss(MLM) + c·Loss(MFM), where a, b and c are respectively the weights of the losses of the three pre-training tasks.
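A minimal sketch of this weighted multi-task loss (PyTorch) follows; the BCE/NCE split follows the description above, while the label formats, the NCE temperature and the default weights a = b = c = 1 are assumptions.

```python
import torch
import torch.nn.functional as F

def nce_loss(pred_frames, true_frames, temperature=0.07):
    # Contrast each predicted masked frame against all true frames in the batch,
    # maximizing the mutual information between the masked frame and its prediction.
    pred = F.normalize(pred_frames, dim=-1)            # (n_mask, d)
    true = F.normalize(true_frames, dim=-1)            # (n_mask, d), row i is the real frame i
    logits = pred @ true.t() / temperature             # (n_mask, n_mask)
    target = torch.arange(pred.size(0))                # positive pair is on the diagonal
    return F.cross_entropy(logits, target)

def pretrain_loss(tc_logits, tc_labels, mlm_logits, mlm_labels,
                  mfm_pred, mfm_true, a=1.0, b=1.0, c=1.0):
    # tc_labels / mlm_labels are assumed to be multi-hot float tensors (BCE targets).
    loss_tc = F.binary_cross_entropy_with_logits(tc_logits, tc_labels)
    loss_mlm = F.binary_cross_entropy_with_logits(mlm_logits, mlm_labels)
    loss_mfm = nce_loss(mfm_pred, mfm_true)
    return a * loss_tc + b * loss_mlm + c * loss_mfm   # Loss = a*Loss(TC) + b*Loss(MLM) + c*Loss(MFM)
```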
According to the video question-answering method of the embodiment of the present disclosure, information of the two modalities, video and text, is represented through the first pre-training model: the video feature vector corresponding to the video and the text feature vectors corresponding to the question text and the candidate answer text are spliced and input into the first pre-training model to obtain the coded second spliced feature vector. The second spliced feature vector may be divided into the second video feature vector and the second text feature vector according to the tokens in the second spliced feature vector.
Assume that V = [v1, v2, …, vm], Q = [q1, q2, …, qn] and A = [a1, a2, …, ak] are the video, the question and an answer respectively, where vi, qi and ai are the corresponding sequences, and let VLP(·) denote the VLP model (i.e. the first pre-training model). The encoded representation E obtained by inputting [CLS] V [SEP] Q+A [SEP] into the VLP model can be expressed as:

E = VLP([CLS] V [SEP] Q+A [SEP]);

where E = [e1, e2, …, e(m+n+k)]. The encoded video-frame part and question-option part are divided into two parts, obtaining the second video feature vector and the second text feature vector:

E_V = [e1, …, em] and E_QA = [e(m+1), …, e(m+n+k)].
Then, the pre-trained first pre-training model is connected with the modal fusion model and the decoding layer, and the whole visual language cross-modal model is fine-tuned. The processing procedure of the first pre-training model is the same as described above and is not repeated here. The second video feature vector and the second text feature vector are input into the modal fusion model, the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and the video expression and the text expression are respectively pooled and fused to obtain a fused feature vector.
Fig. 3 is a schematic structural diagram of a modality fusion model created according to an embodiment of the present disclosure. As shown in fig. 3, in the modality fusion model, the Query vector (Query) comes from one modality, and the Key vector (Key) and the Value vector (Value) come from another modality. The video question-answering method provided by the embodiment of the disclosure further aligns and fuses the language and the image semantically by introducing a mutual attention mechanism.
In some exemplary embodiments, processing the second video feature vector and the second text feature vector by a mutual attention mechanism includes:
taking the second video feature vector as a query vector, taking the second text feature vector as a key vector and a value vector, and carrying out Multi-Head Attention (Multi Head Attention);
and taking the second text feature vector as a query vector and the second video feature vector as a key vector and a value vector to carry out multi-head attention.
In the embodiment of the present disclosure, the modal fusion model takes the second video feature vector as Query and the second text feature vector as Key and Value to perform multi-head attention, and takes the second text feature vector as Query and the second video feature vector as Key and Value to perform multi-head attention. Such mutual attention constitutes one layer; in some exemplary embodiments, multiple such layers may be cascaded in series. The video expression representing the video frame sequence and the text expression representing the question-option sequence are each pooled, and the two pooled vectors are fused to obtain the fused feature vector. The fused feature vector is then input into the decoding layer to predict the probability that each candidate is the correct candidate answer.
The above process is formulated as:

Attention(q, k, v) = softmax(q·k^T / √d_k)·v;

head_i = Attention(E_V·W_i^Q, E_QA·W_i^K, E_QA·W_i^V);

MHA(E_V, E_QA, E_QA) = concat(head_1, …, head_h);

MHA_1 = MHA(E_V, E_QA, E_QA);

MHA_2 = MHA(E_QA, E_V, E_V);

DUMA(E_V, E_QA) = Fuse(MHA_1, MHA_2);
where W_i^Q, W_i^K and W_i^V are parameter matrices, MHA(·) is the multi-head attention mechanism, DUMA(·) is the bidirectional multi-head co-attention network, and Fuse(·) denotes average pooling and fusion of the DUMA outputs; illustratively, the fusion may be performed by concatenation.
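A minimal sketch of this mutual-attention fusion (PyTorch nn.MultiheadAttention) is shown below; it implements a single mutual-attention layer with mean pooling and concatenation as the Fuse step, and the hidden size and head count are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class MutualAttentionFusion(nn.Module):
    def __init__(self, hidden=1024, heads=16):
        super().__init__()
        self.v2t = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, e_v, e_qa):                      # (B, m, H) video, (B, n+k, H) text
        mha1, _ = self.v2t(e_v, e_qa, e_qa)            # video as Query, text as Key/Value
        mha2, _ = self.t2v(e_qa, e_v, e_v)             # text as Query, video as Key/Value
        video_expr = mha1.mean(dim=1)                  # pooled video expression
        text_expr = mha2.mean(dim=1)                   # pooled text expression
        return torch.cat([video_expr, text_expr], dim=-1)  # fused feature vector (Fuse by concat)

fused = MutualAttentionFusion()(torch.randn(2, 30, 1024), torch.randn(2, 40, 1024))  # (2, 2048)
```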
For each <V, Q, A_i> triplet, the output through the DUMA network is:

O_i = DUMA(E_V, E_QA_i);
the loss function of the network is:
Figure BDA0003821514210000115
wherein, W T Is a parameter matrix, and s is the number of candidate answers.
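A minimal sketch of this decoding and scoring step (PyTorch) follows; the fused dimension, the number of candidates and the label index are illustrative assumptions consistent with the loss above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

s, fuse_dim = 4, 2048                    # s candidate answers, size of the fused vector O_i
W = nn.Linear(fuse_dim, 1, bias=False)   # decoding layer: W^T O_i -> scalar score

O = torch.randn(s, fuse_dim)             # fused vectors for the s <V, Q, A_i> triplets
scores = W(O).squeeze(-1)                # (s,) one score per candidate answer
label = torch.tensor(2)                  # index of the correct candidate (illustrative)
loss = F.cross_entropy(scores.unsqueeze(0), label.unsqueeze(0))  # softmax over the s candidates
predicted = scores.argmax().item()       # predicted correct candidate answer
```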
In some exemplary embodiments, the method further comprises:
receiving a voice input of a user;
the speech input is converted to a question text by speech recognition.
In some exemplary embodiments, the method further comprises:
receiving a voice input of a user;
the voice input is converted into voice instructions and/or question text through voice recognition.
For example, the voice command may be "help me switch to the surveillance video of the XXXX road", and the like, and the system performs a corresponding response operation according to the received voice command, for example, switching the display screen to the surveillance video of the XXXX road. The question text may be, for example, "is there an accident on the current road segment" or "what is the cause of the accident?"; the system generates a candidate answer text according to the received question text, and then predicts the correct candidate answer through the video question-answering method of the embodiment of the disclosure.
In some exemplary embodiments, the method further comprises:
and broadcasting the predicted correct candidate answer in a voice form through voice synthesis.
In the embodiment of the disclosure, the predicted correct candidate answer can be directly displayed on the display screen, or the predicted correct candidate answer can be broadcasted in a voice form.
In some exemplary embodiments, the method further comprises:
acquiring a question text;
and generating candidate answer texts corresponding to the question texts according to the question texts.
In some exemplary embodiments, generating a candidate answer text corresponding to the question text from the question text includes:
searching triples matched with the question text from the knowledge graph through keyword matching or an attention mechanism model;
and generating candidate answer texts corresponding to the question texts according to the matched triples.
With the continuous expansion of knowledge, the use of knowledge graphs is becoming more and more common. A knowledge graph can use an appropriate knowledge representation method to mine the connections between data through a relation layer, so that knowledge can be more easily circulated and cooperatively processed between computers. The knowledge graph represents the relationships between entities with a graph data structure; compared with plain text information, the graph representation is easier to understand and accept. A knowledge graph is formed by connecting entity-relationship-entity or entity-attribute-value edges, representing the relationships between entities or between entities and attributes. Relation search based on knowledge graphs is widely applied to entity matching and question-answering systems in various industries; it can process and store known knowledge and quickly perform knowledge matching and answer search.
The basic unit of a knowledge graph is a triple (S, P, O), where S and O are nodes in the knowledge graph and represent entities: S represents the subject and O represents the object. P is an edge in the knowledge graph connecting the two entities (S and O), representing the relationship between them. For example, in a traffic accident common sense knowledge graph, a triple may be (rear-end accident, accident type, traffic accident), and so on.
For example, in an intelligent traffic multi-screen monitoring system, the large screen can be scheduled and interactive question answering can be carried out through voice control, and the monitoring of an accident can be switched to the main screen for accident analysis; a corresponding interactive instruction may be, for example, "help me switch to the monitoring video of the XXXX road", "is there an accident on the current road section", or "what is the cause of the accident?". In a specific implementation, knowledge related to traffic accidents may be introduced through a knowledge base; for example, a traffic accident common sense knowledge graph as shown in fig. 4 may be constructed in the knowledge base, and the triples matching the question text are queried from the traffic accident common sense knowledge graph through keyword matching or an attention mechanism as the knowledge (i.e., the candidate answers) corresponding to the semantic information of the question. For example, according to the traffic accident common sense knowledge graph shown in fig. 4, for the user question "what is the cause of the accident?", candidate answers are automatically generated: the weather, the road, the driver, or the vehicle condition is not good.
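A minimal sketch of the keyword-matching variant (plain Python) is given below; the triples and the scoring rule are illustrative assumptions, not the actual content of the knowledge graph in fig. 4.

```python
# Illustrative triples (subject, predicate, object) from a traffic accident
# common sense knowledge graph; objects of matched triples become candidate answers.
TRIPLES = [
    ("accident cause", "includes", "bad weather"),
    ("accident cause", "includes", "poor road conditions"),
    ("accident cause", "includes", "driver error"),
    ("accident cause", "includes", "poor vehicle condition"),
    ("rear-end accident", "accident type", "traffic accident"),
]

def candidate_answers(question_text):
    keywords = [w for w in question_text.lower().split() if len(w) > 3]
    # score each triple by how many question keywords its subject contains
    scored = [(sum(k in s for k in keywords), o) for s, _, o in TRIPLES]
    best = max(score for score, _ in scored)
    return [o for score, o in scored if score == best and best > 0]

print(candidate_answers("what is the cause of the accident"))
# -> ['bad weather', 'poor road conditions', 'driver error', 'poor vehicle condition']
```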
The embodiment of the present disclosure further provides a video question answering device, which includes a memory; and a processor coupled to the memory, the processor configured to perform the steps of the video question answering method according to any one of the embodiments of the present disclosure based on instructions stored in the memory.
As shown in fig. 5, in one example, the video question answering device may include: a processor 510, a memory 520 and a bus system 530, wherein the processor 510 and the memory 520 are connected via the bus system 530, the memory 520 is configured to store instructions, and the processor 510 is configured to execute the instructions stored in the memory 520 to extract video feature vectors for an input video and text feature vectors for a question text describing a question and a candidate answer text providing a plurality of candidate answers; splice the video feature vector and the text feature vector to obtain a spliced feature vector, input the spliced feature vector into a first pre-training model, and learn cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism to obtain a coded second spliced feature vector; divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, process the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and pool and fuse the video expression and the text expression respectively to obtain a fused feature vector; and input the fused feature vector to a decoding layer to predict the correct candidate answer.
It should be understood that the processor 510 may be a Central Processing Unit (CPU), and the processor 510 may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Memory 520 may include both read-only memory and random access memory and provides instructions and data to processor 510. A portion of memory 520 may also include non-volatile random access memory. For example, the memory 520 may also store device type information.
The bus system 530 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 530 in FIG. 5.
In implementation, the processing performed by the processing device may be performed by instructions in the form of hardware, integrated logic circuits, or software in the processor 510. That is, the method steps of the embodiments of the present disclosure may be implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor. The software module may be located in a storage medium such as a random access memory, a flash memory, a read only memory, a programmable read only memory or an electrically erasable programmable memory, a register, etc. The storage medium is located in the memory 520, and the processor 510 reads the information in the memory 520, and combines the hardware to complete the steps of the method. To avoid repetition, it is not described in detail here.
As shown in fig. 6, an embodiment of the present disclosure further provides a video question-answering system, which includes the video question-answering apparatus according to any embodiment of the present disclosure, and further includes a monitoring system, a voice recognition apparatus, a voice input apparatus, and a knowledge base, where:
the monitoring system is configured to acquire one or more monitoring videos, process the monitoring videos according to the instruction text and output the monitoring videos to the video question-answering device;
a voice input device configured to receive a voice input and output to the voice recognition device;
the voice recognition device is configured to convert voice input into instruction text or question text through voice recognition, input the instruction text into the monitoring system, and input the question text into the video question-answering device;
a knowledge base configured to store a knowledge graph;
the video question-answering device is configured to receive a question text and a monitoring video, and generate a candidate answer text according to the question text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; the video feature vector is extracted from the received surveillance video, the text feature vector is extracted according to the question text and the candidate answer text, the video feature vector and the text feature vector are spliced to obtain a spliced feature vector, the spliced feature vector is input into a first pre-training model, the first pre-training model learns cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism, and a coded second spliced feature vector is obtained; dividing the second splicing feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model by adopting a mutual attention mechanism to obtain a video expression and a text expression, and pooling and fusing the video expression and the text expression respectively to obtain a fusion feature vector; the fused feature vectors are input to the decoding layer to predict the correct candidate answer.
In some exemplary embodiments, the video question-answering system further comprises a speech synthesis output device, wherein:
and the voice synthesis output device is used for broadcasting the predicted correct candidate answer in a voice form through voice synthesis.
The video question-answering system of the embodiment of the disclosure is composed of the several modules shown in fig. 6: a voice instruction or question is given by the user through the voice input device, the natural language is converted into text through the voice recognition device, the answer fed back by the system is broadcast in voice form through the voice synthesis output device, and the video question-answering device, combining the monitoring system and the knowledge base, performs semantic understanding and multi-modal interactive reasoning on the question or instruction transmitted by the user and gives the corresponding answer.
The embodiments of the present disclosure further provide a storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the video question answering method according to any one of the embodiments of the present disclosure.
In some possible embodiments, the aspects of the video question answering method provided in this application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps in the video question answering method according to various exemplary embodiments of this application described above in this specification when the program product runs on the computer device, for example, the computer device may perform the video question answering method described in this application.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to: an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The drawings in this disclosure relate only to the structures to which this disclosure relates and other structures may be referred to in the general design. Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
It will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the disclosed embodiments and it is intended to cover all modifications and equivalents included within the scope of the claims of the present disclosure.

Claims (14)

1. A video question answering method is characterized by comprising the following steps:
extracting video feature vectors for input videos, and extracting text feature vectors for question texts and candidate answer texts, wherein the question texts are used for describing questions, and the candidate answer texts are used for providing a plurality of candidate answers; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, and learning cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism to obtain a coded second spliced feature vector;
dividing the second spliced feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and pooling and fusing the video expression and the text expression respectively to obtain a fused feature vector;
the fused feature vectors are input to a decoding layer to predict correct candidate answers.
2. The video question-answering method according to claim 1, wherein the extracting video feature vectors for the input video comprises:
and extracting frames of the input video at a preset speed, and extracting video characteristic vectors of the extracted frames by adopting a second pre-training model.
3. The video question-answering method according to claim 1, wherein the extracting text feature vectors for the question text and the candidate answer text comprises:
generating a sequence string according to the question text and the candidate answer text, wherein the sequence string comprises a plurality of sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;
and inputting the sequence string into the first pre-training model to obtain a text feature vector.
4. The video question-answering method according to claim 1, characterized by further comprising before the method:
constructing the first pre-training model and initializing;
pre-training the first pre-training model through a plurality of self-supervision tasks, wherein the plurality of self-supervision tasks comprise a label classification task, a mask language model task and a mask frame model task, the label classification task is used for performing multi-label classification on videos, the mask language model task is used for randomly masking words in the text and predicting the masked words, and the mask frame model task is used for randomly masking video frames and predicting the masked frames;
calculating a loss of the first pre-trained model by a weighted sum of losses of a plurality of the self-supervised tasks.
5. The video question-answering method according to claim 4, wherein the loss of the label classification task and the mask language model task is calculated based on binary cross entropy, and the loss of the mask frame model task is calculated based on noise contrastive estimation.
6. The video question answering method according to claim 1, characterized in that the first pre-training model is a cascade neural network of 24 deep Transformer encoder layers, the hidden layer dimension is 1024, the number of attention heads is 16, and the first pre-training model is initialized with parameters pre-trained by the Bidirectional Encoder Representations from Transformers (BERT).
7. The video question-answering method according to claim 1, wherein the processing the second video feature vector and the second text feature vector through a mutual attention mechanism comprises:
taking the second video feature vector as a query vector, taking the second text feature vector as a key vector and a value vector, and performing multi-head attention;
and taking the second text feature vector as a query vector, taking the second video feature vector as a key vector and a value vector, and performing multi-head attention.
8. The video question-answering method according to claim 1, characterized in that the method further comprises, before:
receiving a voice input of a user;
converting the speech input into the question text by speech recognition.
9. The video question-answering method according to claim 1, characterized in that the method further comprises, before:
acquiring the question text;
and generating the candidate answer text corresponding to the question text according to the question text.
10. The video question-answering method according to claim 9, wherein the generating of the candidate answer text corresponding to the question text from the question text comprises:
searching the triples matched with the question text from the common sense knowledge graph through keyword matching or an attention mechanism model;
and generating the candidate answer text corresponding to the question text according to the matched triple.
11. The video question-answering method according to claim 1, characterized in that the method further comprises:
processing the video feature vector and/or the text feature vector such that a dimension of the video feature vector and a dimension of the text feature vector are the same when the video feature vector and the text feature vector are stitched.
12. A video question answering device, comprising a memory; and a processor coupled to the memory, the processor configured to perform the steps of the video question answering method according to any one of claims 1 to 11, based on instructions stored in the memory.
13. A storage medium, characterized in that a computer program is stored thereon, which when executed by a processor implements the video question-answering method according to any one of claims 1 to 11.
14. A video question-answering system is characterized by comprising a video question-answering device, a monitoring system, a voice recognition device, a voice input device and a knowledge base, wherein:
the monitoring system is configured to acquire one or more monitoring videos, process the monitoring videos according to an instruction text and output the monitoring videos to the video question-answering device;
the voice input device is configured to receive voice input and output the voice input to the voice recognition device;
the voice recognition device is configured to convert voice input into instruction text or question text through voice recognition, input the instruction text into the monitoring system, and input the question text into the video question-answering device;
the knowledge base is configured to store a common sense knowledge graph;
the video question-answering device is configured to receive a question text and a monitoring video, and generate a candidate answer text according to the question text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; the video question-answering device extracts video feature vectors from the received monitoring video, extracts text feature vectors from the question text and the candidate answer text, splices the video feature vectors and the text feature vectors to obtain a spliced feature vector, inputs the spliced feature vector into a first pre-training model, and the first pre-training model learns cross-modal information between the video feature vectors and the text feature vectors through a self-attention mechanism to obtain a coded second spliced feature vector; the second spliced feature vector is divided into a second video feature vector and a second text feature vector; the second video feature vector and the second text feature vector are input into a modal fusion model, the modal fusion model adopts a mutual attention mechanism to process the second video feature vector and the second text feature vector to obtain a video expression and a text expression, and the video expression and the text expression are respectively pooled and fused to obtain a fused feature vector; the fused feature vector is input to a decoding layer to predict the correct candidate answer.
CN202211043431.7A 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium Pending CN115391511A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211043431.7A CN115391511A (en) 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium
PCT/CN2023/111455 WO2024046038A1 (en) 2022-08-29 2023-08-07 Video question-answer method, device and system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211043431.7A CN115391511A (en) 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN115391511A true CN115391511A (en) 2022-11-25

Family

ID=84122684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211043431.7A Pending CN115391511A (en) 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium

Country Status (2)

Country Link
CN (1) CN115391511A (en)
WO (1) WO2024046038A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257611A (en) * 2023-01-13 2023-06-13 北京百度网讯科技有限公司 Question-answering model training method, question-answering processing device and storage medium
CN116883886A (en) * 2023-05-25 2023-10-13 中国科学院信息工程研究所 Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
WO2024046038A1 (en) * 2022-08-29 2024-03-07 京东方科技集团股份有限公司 Video question-answer method, device and system, and storage medium
CN117972138A (en) * 2024-04-02 2024-05-03 腾讯科技(深圳)有限公司 Training method and device for pre-training model and computer equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11769018B2 (en) * 2020-11-24 2023-09-26 Openstream Inc. System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system
CN113807222B (en) * 2021-09-07 2023-06-27 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling
CN114925703A (en) * 2022-06-14 2022-08-19 齐鲁工业大学 Visual question-answering method and system with multi-granularity text representation and image-text fusion
CN115391511A (en) * 2022-08-29 2022-11-25 京东方科技集团股份有限公司 Video question-answering method, device, system and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024046038A1 (en) * 2022-08-29 2024-03-07 京东方科技集团股份有限公司 Video question-answer method, device and system, and storage medium
CN116257611A (en) * 2023-01-13 2023-06-13 北京百度网讯科技有限公司 Question-answering model training method, question-answering processing device and storage medium
CN116257611B (en) * 2023-01-13 2023-11-10 北京百度网讯科技有限公司 Question-answering model training method, question-answering processing device and storage medium
CN116883886A (en) * 2023-05-25 2023-10-13 中国科学院信息工程研究所 Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
CN116883886B (en) * 2023-05-25 2024-05-28 中国科学院信息工程研究所 Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
CN117972138A (en) * 2024-04-02 2024-05-03 腾讯科技(深圳)有限公司 Training method and device for pre-training model and computer equipment

Also Published As

Publication number Publication date
WO2024046038A1 (en) 2024-03-07

Similar Documents

Publication Publication Date Title
Zhang et al. Multimodal intelligence: Representation learning, information fusion, and applications
CN109711463B (en) Attention-based important object detection method
CN115391511A (en) Video question-answering method, device, system and storage medium
EP3399460B1 (en) Captioning a region of an image
Zeng et al. Combining background subtraction algorithms with convolutional neural network
CN111930992A (en) Neural network training method and device and electronic equipment
CN114339450B (en) Video comment generation method, system, device and storage medium
CN113128431B (en) Video clip retrieval method, device, medium and electronic equipment
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
CN113792177A (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN114495129A (en) Character detection model pre-training method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN115223020A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
CN116543351A (en) Self-supervision group behavior identification method based on space-time serial-parallel relation coding
CN117609550B (en) Video title generation method and training method of video title generation model
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
CN115292439A (en) Data processing method and related equipment
CN117056474A (en) Session response method and device, electronic equipment and storage medium
CN116975347A (en) Image generation model training method and related device
CN116704433A (en) Self-supervision group behavior recognition method based on context-aware relationship predictive coding
CN116977701A (en) Video classification model training method, video classification method and device
Zhang et al. Vsam-based visual keyword generation for image caption
Shi et al. AdaFI-FCN: an adaptive feature integration fully convolutional network for predicting driver’s visual attention
Kong et al. Encoder–decoder with double spatial pyramid for semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination