WO2024046038A1 - Video question-answer method, device and system, and storage medium - Google Patents

Video question-answer method, device and system, and storage medium Download PDF

Info

Publication number
WO2024046038A1
WO2024046038A1 (PCT/CN2023/111455)
Authority
WO
WIPO (PCT)
Prior art keywords
text
video
feature vector
question
answer
Prior art date
Application number
PCT/CN2023/111455
Other languages
French (fr)
Chinese (zh)
Inventor
王炳乾
Original Assignee
BOE Technology Group Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co., Ltd.
Publication of WO2024046038A1 publication Critical patent/WO2024046038A1/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • Embodiments of the present disclosure relate to, but are not limited to, the technical field of natural language processing, and in particular, to a video question and answer method, device, system and storage medium.
  • Video question answering is the task of inferring the correct answer from a candidate set given a video clip and question.
  • Video question answering is widely used in video retrieval, intelligent question answering systems, assisted driving systems and autonomous driving, and its broad range of applications is attracting increasing attention.
  • the embodiment of the present disclosure provides a video question and answer method, including:
  • extracting a video feature vector from the input video, and extracting a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splicing the video feature vector with the text feature vector to obtain a spliced feature vector, and inputting the spliced feature vector into the first pre-training model;
  • the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector;
  • dividing the second spliced feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector;
  • inputting the fused feature vector into the decoding layer to predict the correct candidate answer.
  • Embodiments of the present disclosure also provide a video question and answer device, including a memory and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the steps of the video question and answer method described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure also provides a storage medium on which a computer program is stored.
  • when the program is executed by a processor, the video question and answer method as described in any embodiment of the present disclosure is implemented.
  • Embodiments of the present disclosure also provide a video question and answer system, including a video question and answer device, a monitoring system, a speech recognition device, a speech input device and a knowledge base, wherein:
  • the monitoring system is configured to acquire one or more surveillance videos, process the surveillance videos according to the instruction text, and output the surveillance videos to the video question and answer device;
  • the voice input device is configured to receive voice input and output it to the voice recognition device;
  • the speech recognition device is configured to convert speech input into instruction text or question text through speech recognition, input the instruction text into the monitoring system, and input the question text into the video question and answer device;
  • the knowledge base is configured to store a common sense knowledge graph
  • the video question and answer device is configured to receive the question text and the surveillance video, and to generate candidate answer text according to the question text, wherein the question text is used to describe a question and the candidate answer text is used to provide multiple candidate answers; it is also configured to extract a video feature vector from the received surveillance video, extract a text feature vector from the question text and the candidate answer text, splice the video feature vector and the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model;
  • the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector; the second spliced feature vector is divided into a second video feature vector and a second text feature vector, which are input into the modal fusion model;
  • the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and the video expression and the text expression are pooled and fused respectively to obtain a fused feature vector; the fused feature vector is input into the decoding layer to predict the correct candidate answer.
  • Figure 1 is a schematic flow chart of a video question and answer method according to an exemplary embodiment of the present disclosure
  • Figure 2 is a schematic structural diagram of the visual language cross-modal model created by an embodiment of the present disclosure
  • Figure 3 is a schematic structural diagram of a modal fusion model created according to an exemplary embodiment of the present disclosure
  • Figure 4 is a schematic structural diagram of a traffic accident common sense knowledge graph according to an exemplary embodiment of the present disclosure
  • Figure 5 is a schematic structural diagram of a video question and answer device according to an exemplary embodiment of the present disclosure
  • Figure 6 is a schematic structural diagram of a video question and answer system according to an exemplary embodiment of the present disclosure.
  • CV: computer vision
  • NLP: natural language processing
  • Some visual question answering techniques encode the input image through an image encoder, encode the input question through a question encoder, perform dot product merging of the encoded image and question features, and then predict the probability of candidate answers through a fully connected layer.
  • However, these video question and answer technologies only consider single-modality information and lack semantic understanding capability. Real-world problems often involve multiple modalities. For example, a self-driving car should be able to handle human commands (language), traffic signals (vision), and road conditions (vision and sound).
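For illustration, the following is a minimal sketch of the single-modality baseline described above (image encoder, question encoder, dot-product merging, fully connected classifier), assuming PyTorch; the encoder modules, names and dimensions are placeholders, not the implementation of any particular prior-art system.

```python
import torch
import torch.nn as nn

class BaselineVQA(nn.Module):
    """Dot-product fusion baseline: encode image and question separately,
    merge the two feature vectors element-wise, then classify over candidate answers."""
    def __init__(self, image_encoder, question_encoder, hidden_dim, num_answers):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a CNN returning [B, hidden_dim]
        self.question_encoder = question_encoder  # e.g. an RNN/Transformer returning [B, hidden_dim]
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question):
        img_feat = self.image_encoder(image)      # [B, hidden_dim]
        q_feat = self.question_encoder(question)  # [B, hidden_dim]
        fused = img_feat * q_feat                 # element-wise "dot-product" merging
        return self.classifier(fused).softmax(dim=-1)  # probability of each candidate answer
```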
  • an embodiment of the present disclosure provides a video question and answer method, which includes the following steps:
  • Step 101: Extract a video feature vector from the input video, and extract a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splice the video feature vector with the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model; the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector;
  • Step 102: Divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into the modal fusion model; the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector;
  • Step 103 Input the fused feature vector into the decoding layer to predict the correct candidate answer.
  • In the video question and answer method of the embodiments of the present disclosure, the video feature vector and the text feature vector are spliced and then input into the first pre-training model; visual-language cross-modal information is used to improve the semantic representation ability of video question answering, and the mutual attention mechanism is used to perform deeper semantic interaction between the video and the text option pairs, which is expected to yield a deeper semantic understanding of the video content and thereby improve the reasoning performance of the model.
  • the video question and answer method in the embodiment of the present disclosure can be applied to many application fields such as video retrieval, intelligent question and answer system, assisted driving system and automatic driving.
  • In some exemplary embodiments, before step 101, the method further includes: constructing the first pre-training model and initializing it;
  • the first pre-training model is pre-trained through multiple self-supervised tasks.
  • the multiple self-supervised tasks include a label classification task, a masked language model task and a masked frame model task.
  • the label classification task is used to perform multi-label classification on the video;
  • the masked language model task is used to randomly mask text and predict the masked words;
  • the masked frame model task is used to randomly mask video frames and predict the masked frames;
  • the loss of the first pre-trained model is calculated by the weighted sum of the losses of the self-supervised tasks.
  • Figure 2 is a schematic structural diagram of the visual language cross-modal model created by an embodiment of the present disclosure.
  • the visual language cross-modal model includes a cascaded first pre-training model, a modal fusion model and a decoding layer.
  • the first pre-trained model may be a 24-layer deep Transformer encoder cascade neural network, with a hidden layer dimension of 1024 and an attention head number of 16.
  • the first pre-training model is initialized with parameters pre-trained by BERT.
  • the Transformer model consists of an encoder and a decoder.
  • the encoder and decoder each include multiple network blocks.
  • Each encoder network block consists of a multi-head attention (Attention) sub-layer and a feed-forward neural network sub-layer.
  • the structure of the decoder is similar to that of the encoder, except that there is an additional multi-head attention layer in each network block of the decoder.
  • Bidirectional Encoder Representations from Transformers (BERT) is a successful application of the Transformer: it uses the Transformer encoder and introduces a bidirectional masking technique that allows each language token to attend bidirectionally to other tokens.
  • the first pre-training model in the embodiment of the present disclosure adopts a multi-layer Transformer encoder cascade structure.
  • the first pre-trained model can be a 24-layer deep Transformer encoder cascade neural network, the hidden layer dimension is 1024, and the number of attention heads is 16.
  • the network structure of the first pre-training model is consistent with BERT Large (a natural language pre-training model proposed by Google), so the pre-training results of natural language pre-training models such as BERT can be used directly to initialize the first pre-training model of the disclosed embodiments.
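As an illustration, the following is a minimal sketch of constructing such an encoder and initializing it from BERT Large, assuming the HuggingFace transformers library (the original filing names the architecture and BERT Large, but not a specific implementation; the English checkpoint below is an assumption).

```python
from transformers import BertConfig, BertModel

# The text states the first pre-training model is a 24-layer Transformer encoder
# with hidden size 1024 and 16 attention heads, matching BERT Large.
config = BertConfig(hidden_size=1024, num_hidden_layers=24,
                    num_attention_heads=16, intermediate_size=4096)
encoder = BertModel(config)

# Alternatively, initialize directly from BERT Large pre-trained weights
# (an English checkpoint is shown here; the original work may use a different one).
encoder = BertModel.from_pretrained("bert-large-uncased")
```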
  • extracting video feature vectors for input videos includes:
  • Frames are extracted from the input video at a preset speed, and a second pre-trained model is used to extract video feature vectors from the extracted frames.
  • the input of video question and answer is a collection of video clips, questions and candidate answers.
  • the continuous video stream can be understood as a set of rapidly played pictures, where each picture is defined as a frame.
  • For the video information, frames are extracted from the input video at a preset speed (for example, 1 frame per second (fps)), with at most N frames extracted per video (for example, N can be 30), and the pre-trained computer vision (CV) model EfficientNetB3 is used to extract the video feature vectors (visual features).
  • Optionally, embodiments of the present disclosure may also use pre-trained models such as MobileNet or ResNet101 to extract video feature vectors from the input video.
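A minimal sketch of this frame sampling and feature extraction step follows, assuming OpenCV and the timm library (not named in the original filing); preprocessing is deliberately simplified (no ImageNet normalization), so this is an illustration rather than a faithful reproduction.

```python
import cv2
import torch
import timm

def extract_video_features(video_path, fps=1, max_frames=30):
    """Sample frames at roughly `fps` frames per second (up to max_frames)
    and encode each frame with EfficientNetB3 (1536-d pooled features)."""
    model = timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)
    model.eval()
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(round(native_fps / fps)))
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.cvtColor(cv2.resize(frame, (300, 300)), cv2.COLOR_BGR2RGB)
            frames.append(torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0)
        idx += 1
    cap.release()
    with torch.no_grad():
        feats = model(torch.stack(frames))  # [n_frames, 1536]
    return feats
```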
  • extracting text feature vectors for question text and candidate answer text includes:
  • generating a sequence string from the question text and the candidate answer text, where the sequence string includes multiple sequences and each word or character in the question text and the candidate answer text corresponds to one or more sequences; inputting the sequence string into the first pre-training model to obtain the text feature vector.
  • In the embodiments of the present disclosure, the question text and the candidate answer text are first tokenized with a vocabulary of size n_vocab, converting each word or character into one or more sequences; all words and/or characters in the question text and the candidate answer text form a sequence string, where each sequence can be a numerical ID, for example between 0 and n_vocab-1. For example, "What type of road is it currently: a trail, a highway, a road, or a street?" is converted into a sequence string such as "2560 2013 1300 100 567 ...".
  • The sequence string is input into the first pre-training model (here, the initialized first pre-training model) to obtain a text feature vector of size n_word*1024, where n_word is the number of sequences in the sequence string and 1024 is the dimension of each text feature vector.
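A toy illustration of the tokenization step follows; the vocabulary entries and ID values merely echo the example sequence above and are not the actual vocabulary of the filing.

```python
# Hypothetical vocabulary of size n_vocab mapping words/characters to IDs in [0, n_vocab-1].
vocab = {"[PAD]": 0, "[UNK]": 1, "current": 2560, "road": 2013,
         "type": 1300, "trail": 100, "highway": 567}

def to_sequence_string(tokens):
    """Convert question + candidate answer tokens into a list of numeric IDs (the sequence string)."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

print(to_sequence_string(["current", "road", "type", "trail", "highway"]))
# -> [2560, 2013, 1300, 100, 567]
```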
  • In the embodiments of the present disclosure, when the video feature vector and the text feature vector are spliced and input into the first pre-training model, the input form of the spliced feature vector can be [CLS]Video_frame[SEP]Question-Answer[SEP], where Video_frame is the video feature vector, Question-Answer is the text feature vector, the [CLS] token marks the start of the spliced feature vector, and the [SEP] token is used to separate the video feature vector from the text feature vector; the same position embedding vectors and segment embedding vectors as in BERT are added to the input of the first pre-training model, where the position embedding vector specifies the position in the sequence and the segment embedding vector specifies the frame segment to which the input belongs.
  • In some exemplary embodiments, before splicing the video feature vector and the text feature vector, the method further includes: processing the video feature vector and/or the text feature vector so that, when the video feature vector and the text feature vector are spliced, their dimensions are the same.
  • In the embodiments of the present disclosure, when the video feature vector extracted by the pre-trained EfficientNetB3 model is a 1536-dimensional feature vector, that 1536-dimensional vector is passed through a fully connected layer to reduce its dimension to 1024, the same dimension as the text feature vector.
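A minimal sketch combining the dimension reduction and the [CLS]/[SEP] splicing with BERT-style position and segment embeddings, assuming PyTorch; module and parameter names are illustrative, not taken from the filing.

```python
import torch
import torch.nn as nn

class CrossModalInput(nn.Module):
    """Project 1536-d frame features to 1024-d and build the
    [CLS] Video_frame [SEP] Question-Answer [SEP] input sequence."""
    def __init__(self, hidden=1024, max_pos=512):
        super().__init__()
        self.video_proj = nn.Linear(1536, hidden)        # dimension reduction 1536 -> 1024
        self.pos_emb = nn.Embedding(max_pos, hidden)     # position embedding
        self.seg_emb = nn.Embedding(2, hidden)           # segment embedding: 0 = video, 1 = text
        self.cls = nn.Parameter(torch.randn(1, hidden))  # [CLS] token embedding
        self.sep = nn.Parameter(torch.randn(1, hidden))  # [SEP] token embedding

    def forward(self, video_feats, text_feats):
        # video_feats: [n_frames, 1536], text_feats: [n_word, 1024]
        v = self.video_proj(video_feats)
        tokens = torch.cat([self.cls, v, self.sep, text_feats, self.sep], dim=0)
        seg = torch.tensor([0] * (v.size(0) + 2) + [1] * (text_feats.size(0) + 1))
        pos = torch.arange(tokens.size(0))
        return tokens + self.pos_emb(pos) + self.seg_emb(seg)
```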
  • the training of the visual language cross-modal model includes two stages: a pre-training stage for the first pre-training model and a fine-tuning stage for the visual language cross-modal model.
  • the first pre-trained model is pre-trained.
  • the label classification task is used to classify videos with multiple labels;
  • the masked language model task is used to randomly mask text and predict the masked words;
  • the masked frame model task is used to randomly mask video frames and predict the masked frames.
  • In the embodiments of the present disclosure, the 100 most frequently occurring tags can be used for the multi-label classification task, where the tags are manually annotated video tags, for example whether an accident occurred, the cause of the accident, the location of the accident, the vehicle type, weather conditions, time information, etc. The vector corresponding to [CLS] in the last BERT layer is passed through a fully connected layer to obtain the predicted tag labels, and the binary cross entropy loss (BCE loss) is calculated against the ground-truth labels.
  • BCE loss: Binary Cross Entropy loss
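A minimal sketch of the tag classification head and its BCE loss, assuming PyTorch; the head and variable names are illustrative.

```python
import torch
import torch.nn as nn

num_tags = 100                       # top-100 most frequent manually annotated tags
tag_head = nn.Linear(1024, num_tags)
bce = nn.BCEWithLogitsLoss()

def tag_classification_loss(last_hidden_state, tag_labels):
    """last_hidden_state: [B, seq_len, 1024] output of the last encoder layer;
    tag_labels: [B, num_tags] multi-hot ground-truth tags.
    The [CLS] vector (position 0) is fed into a fully connected layer and
    compared with the real labels using binary cross entropy."""
    cls_vec = last_hidden_state[:, 0]   # [CLS] vector of the last layer
    logits = tag_head(cls_vec)
    return bce(logits, tag_labels.float())
```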
  • The embodiments of the present disclosure calculate the loss of the masked frame model task based on Noise Contrastive Estimation (NCE), so as to maximize the mutual information between the masked frame and the predicted frame.
  • NCE: Noise Contrastive Estimation
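The exact NCE formulation is not reproduced in the text; the sketch below assumes an InfoNCE-style contrastive loss between predicted and original frame vectors, together with the weighted sum of the three self-supervised task losses mentioned earlier (the weights are unspecified hyperparameters).

```python
import torch
import torch.nn.functional as F

def masked_frame_nce_loss(predicted, original, temperature=0.07):
    """InfoNCE-style contrastive loss: each predicted (masked-position) frame vector
    should be most similar to its own original frame among all frames in the batch,
    which maximizes the mutual information between masked and predicted frames."""
    pred = F.normalize(predicted, dim=-1)   # [n_masked, d]
    orig = F.normalize(original, dim=-1)    # [n_masked, d]
    logits = pred @ orig.t() / temperature  # pairwise similarities
    targets = torch.arange(pred.size(0))
    return F.cross_entropy(logits, targets)

def pretraining_loss(loss_tag, loss_mlm, loss_mfm, w1=1.0, w2=1.0, w3=1.0):
    """Pre-training loss as a weighted sum of the three self-supervised task losses."""
    return w1 * loss_tag + w2 * loss_mlm + w3 * loss_mfm
```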
  • The video question and answer method of the embodiments of the present disclosure uses the first pre-training model to jointly represent the information of the two modalities, video and text: the video feature vector corresponding to the video, the text feature vector corresponding to the question text and the text feature vector corresponding to the candidate answer text are spliced and input into the first pre-training model, which outputs the encoded second spliced feature vector.
  • the second splicing feature vector can be divided into a second video feature vector and a second text feature vector.
  • Let V = [v1, v2, ..., vm], Q = [q1, q2, ..., qn] and A = [a1, a2, ..., ak] denote the video, the question and the answer respectively, where vi, qi and ai are the corresponding sequences, and let VLP(·) denote the VLP model (i.e. the first pre-training model); then the encoded representation E obtained by inputting [CLS]V[SEP]Q+A[SEP] into the VLP model can be expressed as E = VLP([CLS], V, [SEP], Q+A, [SEP]).
  • The second video feature vector and the second text feature vector can be obtained by dividing the encoded representation into two parts, the video frame part and the question-option part.
  • the pre-trained first pre-training model is connected to the modal fusion model and the decoding layer to fine-tune the overall visual language cross-modal model.
  • the processing process of the first pre-trained model is the same as the aforementioned process and will not be described again here.
  • The second video feature vector and the second text feature vector are input into the modal fusion model; the modal fusion model processes them through a mutual attention mechanism to obtain a video expression and a text expression, then pools and fuses the video expression and the text expression respectively to obtain the fused feature vector.
  • Figure 3 is a schematic structural diagram of a modal fusion model created according to an embodiment of the present disclosure.
  • the query vector (Query) comes from one modality
  • the key vector (Key) and value vector (Value) come from another modality.
  • the video question and answer method in the embodiment of the present disclosure further semantically aligns and integrates language and images by introducing a mutual attention mechanism.
  • the second video feature vector and the second text feature vector are processed through a mutual attention mechanism, including:
  • the second text feature vector is used as a query vector
  • the second video feature vector is used as a key vector and a value vector to perform multi-head attention.
  • the modal fusion model uses the second video feature vector as Query and the second text feature vector as Key and Value to perform multi-head attention.
  • at the same time, the second text feature vector is used as Query and the second video feature vector as Key and Value to perform multi-head attention; this pair of exchanged attention operations forms one layer, and multiple such layers can be stacked.
  • The video expression representing the video frame sequence and the text expression representing the question-option sequence are then each pooled, and the two pooled vectors are fused to obtain the fused feature vector, which is fed into the decoding layer to predict the probability that each candidate is the correct answer.
  • MHA(·) is the multi-head attention mechanism;
  • DUMA(·) is the bidirectional multi-head co-attention network;
  • Fuse(·) represents the average pooling and fusion of the results output by DUMA.
  • the fusion can be performed by splicing.
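A minimal sketch of one bidirectional co-attention (DUMA-style) layer with average pooling and fusion by splicing, assuming PyTorch's nn.MultiheadAttention; class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class DUMALayer(nn.Module):
    """One layer of bidirectional multi-head co-attention: text attends to video
    (text as Query, video as Key/Value) and video attends to text (video as Query,
    text as Key/Value); both outputs are average-pooled and fused by concatenation."""
    def __init__(self, hidden=1024, heads=16):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, video_enc, text_enc):
        # video_enc: [B, n_frames, hidden], text_enc: [B, n_word, hidden]
        text_expr, _ = self.text_to_video(query=text_enc, key=video_enc, value=video_enc)
        video_expr, _ = self.video_to_text(query=video_enc, key=text_enc, value=text_enc)
        # average pooling over the sequence dimension, then fusion by splicing
        fused = torch.cat([video_expr.mean(dim=1), text_expr.mean(dim=1)], dim=-1)
        return fused  # [B, 2 * hidden]
```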
  • the loss function of the network is:
  • W^T is the parameter matrix;
  • s is the number of candidate answers.
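The loss formula itself did not survive extraction. A plausible reconstruction, assuming a standard multiple-choice softmax cross-entropy in which the parameter matrix W scores the fused feature vector F_i of each of the s candidate answers (the exact form in the original filing may differ), is:

```latex
P(a_i \mid V, Q) = \frac{\exp\!\left(W^{T} F_i\right)}{\sum_{j=1}^{s} \exp\!\left(W^{T} F_j\right)},
\qquad
\mathcal{L} = -\log P\!\left(a_{y} \mid V, Q\right)
```

where F_i is the fused feature vector of candidate i and y indexes the correct candidate answer.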
  • In some exemplary embodiments, the method further includes:
  • Speech recognition converts voice input into voice commands and/or question text.
  • the voice command can be "Help me switch to the surveillance video of XXXX Road” and so on.
  • the system performs corresponding response operations according to the received voice command, for example, switching the display screen to the surveillance video of XXXX Road.
  • the question text can be "Is there an accident on the current road section?”, "What is the cause of the accident?", etc.
  • The system generates candidate answer texts based on the received question text, and then predicts the correct candidate answer using the video question and answer method according to the embodiments of the present disclosure.
  • the method further includes:
  • the predicted correct candidate answers are broadcast in voice form.
  • the predicted correct candidate answer can be directly displayed on the display screen, or the predicted correct candidate answer can be broadcast in voice form.
  • In some exemplary embodiments, the method further includes: generating candidate answer text corresponding to the question text.
  • In some exemplary embodiments, generating the candidate answer text corresponding to the question text includes: querying a knowledge graph according to the question text, and generating the candidate answer text corresponding to the question text from the query results.
  • Knowledge graphs can use appropriate knowledge representation methods through the relationship layer to mine connections between data, making it easier for knowledge to circulate and be collaboratively processed between computers.
  • A knowledge graph uses a graph data structure to represent the relationships between entities; compared with plain text, the graph representation is easier to understand and accept.
  • the knowledge graph is composed of the relationship edges of "entity-relationship-entity" or "entity-attribute-attribute value”, focusing on expressing the relationship between entities or between entities and attributes.
  • Relational search based on knowledge graphs is widely used in entity matching and question answering systems in various industries. It can process and store known knowledge, and can quickly perform knowledge matching and answer search.
  • the basic composition of the knowledge graph is a triplet (S, P, O).
  • S and O are nodes in the knowledge graph, representing entities.
  • P is the edge connecting two entities (S and O) in the knowledge graph, indicating the relationship between the two entities.
  • a triplet can be (rear-end accident, accident type, traffic accident), etc.
  • the video question and answer method of the embodiment of the present disclosure can be used in a video surveillance system.
  • the large screen can be scheduled and interacted with through voice-controlled question answering, and the monitoring feed of an accident can be switched to the main screen for accident analysis.
  • the corresponding interactive instructions can be "Help me switch to the surveillance video of XXXX road”; "Is there an accident on the current road section?”; "What is the cause of the accident?".
  • The specific implementation process will involve traffic-accident-related common sense knowledge, which can be provided by introducing a knowledge base.
  • a traffic accident common sense knowledge graph as shown in Figure 4 can be constructed in the knowledge base, and questions can be queried from the traffic accident common sense knowledge graph through keyword matching or attention mechanisms.
  • the triplets matched against the text are used as the common sense knowledge (i.e. candidate answers) corresponding to the semantic information expressed by the question.
  • In this way, candidate answers are automatically generated: weather, road, driver, poor vehicle condition.
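A toy illustration of querying such a common sense knowledge graph by keyword matching to produce candidate answers; the triplets and keywords below are hypothetical and much smaller than the graph of Figure 4.

```python
# Hypothetical miniature common sense knowledge graph stored as (S, P, O) triplets.
triples = [
    ("rear-end accident", "accident type", "traffic accident"),
    ("traffic accident", "possible cause", "weather"),
    ("traffic accident", "possible cause", "road"),
    ("traffic accident", "possible cause", "driver"),
    ("traffic accident", "possible cause", "poor vehicle condition"),
]

def candidate_answers(question, keywords=("accident", "cause")):
    """Keyword-match the question against the graph and return the matched
    objects as candidate answer texts."""
    if not all(k in question.lower() for k in keywords):
        return []
    return [o for s, p, o in triples if p == "possible cause"]

print(candidate_answers("What is the cause of the accident?"))
# -> ['weather', 'road', 'driver', 'poor vehicle condition']
```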
  • An embodiment of the present disclosure also provides a video question and answer device, including a memory and a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the steps of the video question and answer method according to any embodiment of the present disclosure.
  • the video question and answer device may include: a processor 510, a memory 520, and a bus system 530.
  • the processor 510 and the memory 520 are connected through the bus system 530, and the memory 520 is used to store instructions.
  • the processor 510 is configured to execute instructions stored in the memory 520 to: extract a video feature vector from the input video, and extract a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splice the video feature vector and the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model, where the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector; divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector; and input the fused feature vector into the decoding layer to predict the correct candidate answer.
  • the processor 510 can be a central processing unit (Central Processing Unit, CPU).
  • the processor 510 can also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • Memory 520 may include read-only memory and random access memory and provides instructions and data to processor 510 .
  • a portion of memory 520 may also include non-volatile random access memory.
  • memory 520 may also store device type information.
  • bus system 530 may also include a power bus, a control bus, a status signal bus, etc.
  • the various buses are labeled as bus system 530 in FIG. 5 .
  • the processing performed by the processing device may be completed by hardware integrated logic circuits or by instructions in the form of software in the processor 510; that is, the method steps of the embodiments of the present disclosure can be embodied as being executed and completed by a hardware processor, or by a combination of hardware and software modules in the processor.
  • Software modules can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media.
  • the storage medium is located in the memory 520.
  • the processor 510 reads the information in the memory 520 and completes the steps of the above method in combination with its hardware. To avoid repetition, it will not be described in detail here.
  • an embodiment of the present disclosure also provides a video question and answer system, which includes the video question and answer device as described in any embodiment of the present disclosure, and also includes a monitoring system, a speech recognition device, a voice input device and a knowledge base, wherein:
  • a surveillance system configured to acquire one or more surveillance videos, process the surveillance videos according to the instruction text, and output the surveillance videos to the video question and answer device;
  • a voice input device configured to receive voice input and output it to the voice recognition device
  • a speech recognition device configured to convert the speech input into instruction text or question text through speech recognition, input the instruction text into the monitoring system, and input the question text into the video question and answer device;
  • Knowledge base configured to store knowledge graphs
  • a video question and answer device configured to receive question text and surveillance video and to generate candidate answer text based on the question text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; it is also configured to extract a video feature vector from the received surveillance video, extract a text feature vector from the question text and the candidate answer text, splice the video feature vector and the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model; the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector; the second spliced feature vector is divided into a second video feature vector and a second text feature vector, which are input into the modal fusion model; the modal fusion model uses a mutual attention mechanism to process the second video feature vector and the second text feature vector to obtain a video expression and a text expression, pools and fuses the video expression and the text expression respectively to obtain a fused feature vector, and inputs the fused feature vector into the decoding layer to predict the correct candidate answer.
  • the video question and answer system further includes a speech synthesis output device, wherein:
  • a speech synthesis output device is used to broadcast the predicted correct candidate answer in speech form through speech synthesis.
  • the video question and answer system of the embodiments of the present disclosure consists of the modules shown in Figure 6: voice instructions or questions are given by the user through the voice input device, the voice recognition device converts the natural language into text, the speech synthesis output device broadcasts the answers fed back by the system in voice form, and the video question and answer device, in combination with the monitoring system and the knowledge base, performs semantic understanding and multi-modal interactive reasoning to give corresponding answers based on the questions or instructions passed in by the user.
  • An embodiment of the present disclosure also provides a storage medium on which a computer program is stored.
  • when the program is executed by a processor, the video question and answer method as described in any embodiment of the present disclosure is implemented.
  • various aspects of the video question and answer method provided by this application can also be implemented in the form of a program product, which includes program code.
  • when the program product is run on a computer device, the program code is used to cause the computer device to execute the steps in the video question and answer method according to the various exemplary embodiments of the present application described above in this specification.
  • the computer device can execute the video question and answer method recorded in the embodiment of the present application.
  • the program product may take the form of any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video question-answer method, device and system, and a storage medium. The method comprises: extracting a video feature vector for an inputted video, and extracting a text feature vector for a question text and a candidate answer text; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, and the first pre-training model learning cross-modal information between the video feature vector and the text feature vector by means of a self-attention mechanism to obtain a second spliced feature vector; dividing the second spliced feature vector into a second video feature vector and a second text feature vector, inputting the second video feature vector and the second text feature vector into a modal fusion model, the modal fusion model processing the second video feature vector and the second text feature vector by means of a mutual attention mechanism to obtain a video expression and a text expression, and respectively pooling and fusing the video expression and the text expression to obtain a fused feature vector; predicting a correct candidate answer according to the fusion feature vector.

Description

Video question and answer method, device, system and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on August 29, 2022, with application number 202211043431.7 and the invention title "Video question and answer method, device, system and storage medium", the content of which is to be understood as incorporated into this application by reference.
Technical field
Embodiments of the present disclosure relate to, but are not limited to, the technical field of natural language processing, and in particular to a video question and answer method, device, system and storage medium.
Background art
In the current era of the mobile Internet and big data, video data on the Internet is growing explosively. As an increasingly rich information-carrying medium, understanding the semantics of video is a key technology for many intelligent video applications and has important research significance and practical application value. Video question answering (Video QA) is the task of inferring the correct answer from a candidate set given a video clip and a question. With the advancement of computer vision and natural language processing, video question answering is widely applied in video retrieval, intelligent question answering systems, assisted driving systems and autonomous driving, and is attracting increasing attention.
Summary of the invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the protection scope of the claims.
An embodiment of the present disclosure provides a video question and answer method, including:
extracting a video feature vector from the input video, and extracting a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splicing the video feature vector with the text feature vector to obtain a spliced feature vector, and inputting the spliced feature vector into a first pre-training model, where the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector;
dividing the second spliced feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector;
inputting the fused feature vector into a decoding layer to predict the correct candidate answer.
An embodiment of the present disclosure also provides a video question and answer device, including a memory and a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the steps of the video question and answer method according to any embodiment of the present disclosure.
An embodiment of the present disclosure also provides a storage medium on which a computer program is stored; when the program is executed by a processor, the video question and answer method according to any embodiment of the present disclosure is implemented.
An embodiment of the present disclosure also provides a video question and answer system, including a video question and answer device, a monitoring system, a speech recognition device, a voice input device and a knowledge base, wherein:
the monitoring system is configured to acquire one or more surveillance videos, process the surveillance videos according to instruction text, and output the surveillance videos to the video question and answer device;
the voice input device is configured to receive voice input and output it to the speech recognition device;
the speech recognition device is configured to convert the voice input into instruction text or question text through speech recognition, input the instruction text into the monitoring system, and input the question text into the video question and answer device;
the knowledge base is configured to store a common sense knowledge graph;
the video question and answer device is configured to receive the question text and the surveillance video and to generate candidate answer text according to the question text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; it is also configured to extract a video feature vector from the received surveillance video, extract a text feature vector from the question text and the candidate answer text, splice the video feature vector and the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model, where the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector; divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into the modal fusion model, where the modal fusion model uses a mutual attention mechanism to process the second video feature vector and the second text feature vector to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector; and input the fused feature vector into the decoding layer to predict the correct candidate answer.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Description of the drawings
The drawings are used to provide an understanding of the technical solution of the present disclosure and constitute a part of the specification. Together with the embodiments of the present disclosure, they are used to explain the technical solution of the present disclosure and do not constitute a limitation of the technical solution of the present disclosure.
Figure 1 is a schematic flow chart of a video question and answer method according to an exemplary embodiment of the present disclosure;
Figure 2 is a schematic structural diagram of the visual language cross-modal model created by an embodiment of the present disclosure;
Figure 3 is a schematic structural diagram of a modal fusion model created according to an exemplary embodiment of the present disclosure;
Figure 4 is a schematic structural diagram of a traffic accident common sense knowledge graph according to an exemplary embodiment of the present disclosure;
Figure 5 is a schematic structural diagram of a video question and answer device according to an exemplary embodiment of the present disclosure;
Figure 6 is a schematic structural diagram of a video question and answer system according to an exemplary embodiment of the present disclosure.
Detailed description
The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. The embodiments may be implemented in many different forms. Those of ordinary skill in the art can readily appreciate that the manner and content may be transformed into various forms without departing from the spirit and scope of the present disclosure. Therefore, the present disclosure should not be construed as being limited to the contents described in the following embodiments. The embodiments of the present disclosure and the features in the embodiments may be combined with each other arbitrarily provided there is no conflict.
In the drawings, the size of one or more constituent elements, or the thickness of a layer or region, is sometimes exaggerated for clarity. Therefore, one aspect of the present disclosure is not necessarily limited to such dimensions, and the shapes and sizes of the components in the drawings do not reflect true proportions. In addition, the drawings schematically show ideal examples, and one aspect of the present disclosure is not limited to the shapes, numerical values, etc. shown in the drawings.
Ordinal numbers such as "first", "second" and "third" in this disclosure are used to avoid confusion between constituent elements and are not intended to impose any limitation in terms of quantity. "A plurality" in this disclosure means a quantity of two or more.
Computer vision (CV) and natural language processing (NLP) are two major branches of artificial intelligence that focus on simulating human intelligence in vision and language, respectively. Over the past decade, deep learning has greatly advanced single-modality learning in both fields. Some visual question answering techniques encode the input image with an image encoder, encode the input question with a question encoder, merge the encoded image and question features by dot product, and then predict the probability of each candidate answer through a fully connected layer. However, these question answering technologies only consider single-modality information and lack semantic understanding capability, whereas real-world problems often involve multiple modalities. For example, a self-driving car should be able to handle human commands (language), traffic signals (vision) and road conditions (vision and sound).
As shown in Figure 1, an embodiment of the present disclosure provides a video question and answer method, including the following steps:
Step 101: extract a video feature vector from the input video, and extract a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splice the video feature vector with the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model, where the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector;
Step 102: divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into the modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector;
Step 103: input the fused feature vector into the decoding layer to predict the correct candidate answer.
In the video question and answer method of the embodiments of the present disclosure, the video feature vector and the text feature vector are spliced and then input into the first pre-training model; visual-language cross-modal information is used to improve the semantic representation ability of video question answering, and the mutual attention mechanism is used to perform deeper semantic interaction between the video and the text option pairs, which is expected to yield a deeper semantic understanding of the video content and thereby improve the reasoning performance of the model. The video question and answer method of the embodiments of the present disclosure can be applied to many fields such as video retrieval, intelligent question answering systems, assisted driving systems and autonomous driving.
In some exemplary embodiments, before step 101, the method further includes:
constructing the first pre-training model and initializing it;
pre-training the first pre-training model through multiple self-supervised tasks, where the multiple self-supervised tasks include a label classification task, a masked language model task and a masked frame model task; the label classification task is used to perform multi-label classification on the video, the masked language model task is used to randomly mask text and predict the masked words, and the masked frame model task is used to randomly mask video frames and predict the masked frames;
calculating the loss of the first pre-training model as the weighted sum of the losses of the self-supervised tasks.
Figure 2 is a schematic structural diagram of the visual language cross-modal model created by an embodiment of the present disclosure. As shown in Figure 2, the visual language cross-modal model includes a cascaded first pre-training model, a modal fusion model and a decoding layer.
In some exemplary embodiments, the first pre-training model may be a 24-layer cascaded deep Transformer encoder neural network with a hidden layer dimension of 1024 and 16 attention heads, and the first pre-training model is initialized with parameters pre-trained by BERT.
The Transformer model consists of an encoder and a decoder, each of which includes multiple network blocks. Each encoder network block consists of a multi-head attention sub-layer and a feed-forward neural network sub-layer. The structure of the decoder is similar to that of the encoder, except that each decoder network block contains an additional multi-head attention layer.
Bidirectional Encoder Representations from Transformers (BERT) is a successful application of the Transformer: it uses the Transformer encoder and introduces a bidirectional masking technique that allows each language token to attend bidirectionally to other tokens.
The first pre-training model of the embodiments of the present disclosure adopts a multi-layer cascaded Transformer encoder structure. For example, the first pre-training model may be a 24-layer cascaded deep Transformer encoder neural network with a hidden layer dimension of 1024 and 16 attention heads; in this case, the network structure of the first pre-training model is consistent with BERT Large (a natural language pre-training model proposed by Google), so the pre-training results of natural language pre-training models such as BERT can be used directly to initialize the first pre-training model of the embodiments of the present disclosure.
在一些示例性实施方式中,针对输入的视频提取视频特征向量,包括:In some exemplary implementations, extracting video feature vectors for input videos includes:
以预设速度对输入的视频进行抽帧,采用第二预训练模型对抽取出的帧提取视频特征向量。Frames are extracted from the input video at a preset speed, and a second pre-trained model is used to extract video feature vectors from the extracted frames.
视频问答的输入是视频片段、问题以及候选答案集合,与图像的处理不同,连续的视频流可以理解成一组快速播放的图片,其中每一幅图片定义为帧(Frame)。对于视频信息,我们以预设速度(示例性的,1帧每秒(fps)的速度)对输入的视频进行抽帧,每个视频最高抽取N帧(示例性的,N可以为30),利用计算机视觉领域(CV)的预训练模型EfficientNetB3提取视频特征向量(Visual Features)。可选的,本公开实施例也可以用MobileNet或者ResNet101等预训练模型对输入的视频提取视频特征向量。The input of video question and answer is a collection of video clips, questions and candidate answers. Different from image processing, the continuous video stream can be understood as a set of rapidly played pictures, where each picture is defined as a frame. For video information, we extract frames from the input video at a preset speed (exemplarily, 1 frame per second (fps)), and each video extracts up to N frames (exemplarily, N can be 30), Use the pre-trained model EfficientNetB3 in the field of computer vision (CV) to extract video feature vectors (Visual Features). Optionally, embodiments of the present disclosure may also use pre-trained models such as MobileNet or ResNet101 to extract video feature vectors from input videos.
In some exemplary embodiments, extracting the text feature vector for the question text and the candidate answer text includes:

generating a sequence string from the question text and the candidate answer text, where the sequence string includes multiple sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;

inputting the sequence string into the first pre-trained model to obtain the text feature vector.

In the embodiments of the present disclosure, the question text and the candidate answer text are first tokenized with a vocabulary of size n_vocab, converting each word or character into one or more sequences; all of the words and/or characters of the question text and the candidate answer text form a sequence string, where each sequence may be a numerical ID. Exemplarily, the numerical ID may range from 0 to n_vocab-1. For example, the text "当前是什么道路类型 小道 高速公路 马路 街道" ("What is the current road type: trail, expressway, road, street") is converted into a sequence string such as "2560 2013 1300 100 567 ...". The sequence string is input into the first pre-trained model (here, the first pre-trained model may be the initialized first pre-trained model) to obtain a text feature vector of size n_word*1024, where n_word is the number of sequences in the sequence string and 1024 is the dimension of each text feature vector.
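As a rough illustration of this tokenization step (the actual vocabulary and tokenizer are not specified by the embodiment; a BERT-style Chinese vocabulary is assumed here, and the printed IDs are only an example):

```python
from transformers import BertTokenizer

# a BERT-style vocabulary of size n_vocab ("bert-base-chinese" is an illustrative choice)
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

question = "当前是什么道路类型"
candidates = "小道 高速公路 马路 街道"

# each word/character becomes one or more integer IDs in [0, n_vocab - 1]
token_ids = tokenizer.encode(question + " " + candidates, add_special_tokens=False)
print(token_ids)   # a sequence string such as "2560 2013 1300 100 567 ..."
```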
In the embodiments of the present disclosure, when the video feature vector and the text feature vector are concatenated and input into the first pre-trained model, the input form of the concatenated feature vector may be [CLS]Video_frame[SEP]Question-Answer[SEP], where Video_frame is the video feature vector, Question-Answer is the text feature vector, the [CLS] token marks the beginning of the concatenated feature vector, and the [SEP] token is used to separate the video feature vector from the text feature vector. The same position embedding vectors and segment embedding vectors as in BERT are added to the input of the first pre-trained model, where the position embedding vector specifies the position in the sequence and the segment embedding vector specifies the frame segment to which an element belongs.
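A minimal sketch of how such a joint input could be assembled is given below. The learned [CLS]/[SEP] embeddings, the vocabulary size, and the rule assigning segment 0 to the video part and segment 1 to the text part are assumptions for illustration; the embodiment only states that BERT-style position and segment embeddings are added.

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Builds [CLS] Video_frame [SEP] Question-Answer [SEP] inputs (a sketch)."""
    def __init__(self, hidden=1024, vocab_size=21128, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_pos, hidden)   # position embeddings, as in BERT
        self.seg_emb = nn.Embedding(2, hidden)         # 0 = video segment, 1 = text segment
        self.cls = nn.Parameter(torch.zeros(1, hidden))  # learned [CLS] embedding
        self.sep = nn.Parameter(torch.zeros(1, hidden))  # learned [SEP] embedding

    def forward(self, frame_feats, token_ids):
        # frame_feats: (m, 1024) projected frame features; token_ids: (n_word,) integer IDs
        text = self.word_emb(token_ids)
        seq = torch.cat([self.cls, frame_feats, self.sep, text, self.sep], dim=0)
        pos = self.pos_emb(torch.arange(seq.size(0)))
        seg_ids = torch.cat([torch.zeros(frame_feats.size(0) + 2, dtype=torch.long),
                             torch.ones(text.size(0) + 1, dtype=torch.long)])
        return seq + pos + self.seg_emb(seg_ids)

emb = JointEmbedding()
out = emb(torch.randn(30, 1024), torch.randint(0, 21128, (20,)))
print(out.shape)   # torch.Size([52, 1024]): 1 + 30 + 1 + 20 + 1 positions
```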
In some exemplary embodiments, before the video feature vector and the text feature vector are concatenated, the method further includes: processing the video feature vector and/or the text feature vector so that, when the video feature vector and the text feature vector are concatenated, the dimension of the video feature vector is the same as the dimension of the text feature vector.

In the embodiments of the present disclosure, when the video feature vector extracted by the pre-trained model EfficientNetB3 is a 1536-dimensional feature vector, the 1536-dimensional feature vector is passed through a fully connected layer for dimension reduction, reducing the feature dimension to 1024, the same as the dimension of the text feature vector.
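For example, this dimension alignment can be a single fully connected layer applied to every extracted frame feature (a sketch; the random tensor only stands in for real EfficientNetB3 outputs):

```python
import torch
import torch.nn as nn

frame_proj = nn.Linear(1536, 1024)           # EfficientNetB3 feature -> text feature dimension
frame_feats = torch.randn(30, 1536)          # stand-in for 30 extracted frame features
frame_feats_1024 = frame_proj(frame_feats)   # (30, 1024), same dimension as the text features
```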
In the embodiments of the present disclosure, the training of the visual-language cross-modal model includes two stages: a pre-training stage for the first pre-trained model and a fine-tuning stage for the visual-language cross-modal model.

In the pre-training stage of the first pre-trained model, three tasks are used to pre-train the first pre-trained model: tag classification (TC), mask language model (MLM), and mask frame model (MFM). The tag classification task performs multi-label classification of the video, the mask language model task randomly masks the text and predicts the masked words, and the mask frame model task randomly masks video frames and predicts the masked frames.

Exemplarily, in the tag classification task, the 100 most frequently occurring tag classes may be used for the multi-label classification task, where the tags are manually annotated video labels, for example, whether an accident occurred, the cause of the accident, the location of the accident, the vehicle type, the weather conditions, the time information, and so on. The [CLS] vector of the last BERT layer is fed into a fully connected layer to obtain the predicted tag labels, and the binary cross entropy loss (BCE loss) is computed against the ground-truth labels.
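A sketch of this tag-classification head under the stated assumptions (100 tag classes, BCE loss on the last-layer [CLS] vector); the random tensors only stand in for real [CLS] outputs and annotated labels:

```python
import torch
import torch.nn as nn

num_tags = 100                                    # top-100 most frequent tags
tag_head = nn.Linear(1024, num_tags)              # fully connected layer on the last-layer [CLS] vector

cls_vector = torch.randn(8, 1024)                 # stand-in for the [CLS] outputs of a batch of 8
tag_targets = torch.randint(0, 2, (8, num_tags))  # multi-label targets in {0, 1}

logits = tag_head(cls_vector)
loss_tc = nn.BCEWithLogitsLoss()(logits, tag_targets.float())
```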
In the mask language model task, a random 15% of the text is masked and the masked text is predicted. In the multi-modal scenario, predicting the masked words with the help of the video information can effectively fuse the multi-modal information.

In the mask frame model task, a random 15% of the video frames are masked, and the masked video frames are filled with all-zero vectors. Because the video feature vectors are continuous real-valued vectors without tokens, it is difficult to formulate a classification task analogous to the mask language model task. Since the predicted frame at a masked position should be as similar as possible to the masked frame among all frames in the whole batch, the embodiments of the present disclosure compute the loss of the mask frame model task based on Noise Contrastive Estimation (NCE), maximizing the mutual information between the masked frame and the predicted frame.
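The following is a sketch of an NCE-style contrastive loss in this spirit, not the embodiment's exact formulation: each predicted masked frame is scored against the original frames passed in, and the correct pairing is treated as the positive (in the embodiment, all frames in the batch serve as negatives); the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def mfm_nce_loss(predicted, original, temperature=0.07):
    """Each predicted masked frame should be closest to its own original frame
    among the original frames provided (rows of `predicted` and `original` are aligned)."""
    predicted = F.normalize(predicted, dim=-1)   # (num_masked, d)
    original = F.normalize(original, dim=-1)     # (num_masked, d)
    logits = predicted @ original.t() / temperature
    labels = torch.arange(predicted.size(0))     # positive pair is on the diagonal
    return F.cross_entropy(logits, labels)

loss_mfm = mfm_nce_loss(torch.randn(5, 1024), torch.randn(5, 1024))
```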
Multi-task joint training is adopted, and the total pre-training loss is the weighted sum of the losses of the above three pre-training tasks: Loss = Loss(TC)*a + Loss(mlm)*b + Loss(mfm)*c, where a, b and c are the weights of the three pre-training task losses, respectively.

In the video question answering method of the embodiments of the present disclosure, the first pre-trained model is used to represent the information of the two modalities, video and text. The video feature vector corresponding to the video and the text feature vector corresponding to the question text and the candidate answer text are concatenated and input into the first pre-trained model to obtain the encoded second concatenated feature vector. According to the flag positions in the second concatenated feature vector, the second concatenated feature vector can be divided into the second video feature vector and the second text feature vector.
Assume that V = [v_1, v_2, ..., v_m], Q = [q_1, q_2, ..., q_n] and A = [a_1, a_2, ..., a_k] are the video, the question and the answer, respectively, where v_i, q_i and a_i are the corresponding sequences. Denoting the VLP model (i.e., the first pre-trained model) by VLP(·), the encoded representation E obtained by feeding [CLS]V[SEP]Q+A[SEP] into the VLP model can be expressed as:

E = VLP([CLS] V [SEP] Q+A [SEP]);

where E = [e_1, e_2, ..., e_(m+n+k)]. Dividing the encoded video-frame and question-option vectors into two parts yields the second video feature vector and the second text feature vector, E_V = [e_1, ..., e_m] and E_QA = [e_(m+1), ..., e_(m+n+k)].
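In code, this split is simply a slice of the encoder output (a sketch; how the [CLS]/[SEP] positions are dropped before slicing is an implementation detail not fixed by the text, and the random tensor stands in for the real encoder output):

```python
import torch

m, n, k = 30, 12, 8                      # numbers of video, question and answer positions
e = torch.randn(1, m + n + k, 1024)      # stand-in for E with the special-token positions removed
e_v = e[:, :m, :]                        # second video feature vector E_V  = [e_1, ..., e_m]
e_qa = e[:, m:, :]                       # second text feature vector  E_QA = [e_(m+1), ..., e_(m+n+k)]
```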
Next, the pre-trained first pre-trained model is connected to the modal fusion model and the decoding layer, so that the visual-language cross-modal model can be fine-tuned as a whole. The processing performed by the first pre-trained model is the same as described above and is not repeated here. The second video feature vector and the second text feature vector are input into the modal fusion model; through a mutual attention mechanism, the modal fusion model processes the second video feature vector and the second text feature vector to obtain a video representation and a text representation, and the video representation and the text representation are pooled separately and then fused to obtain the fused feature vector.

Figure 3 is a schematic structural diagram of the modal fusion model created in an embodiment of the present disclosure. As shown in Figure 3, in the modal fusion model, the query vector (Query) comes from one modality, while the key vector (Key) and the value vector (Value) come from the other modality. By introducing the mutual attention mechanism, the video question answering method of the embodiments of the present disclosure further aligns and fuses language and images at the semantic level.
In some exemplary embodiments, processing the second video feature vector and the second text feature vector through the mutual attention mechanism includes:

performing multi-head attention (Multi Head Attention) with the second video feature vector as the query vector and the second text feature vector as the key vector and the value vector;

performing multi-head attention with the second text feature vector as the query vector and the second video feature vector as the key vector and the value vector.

In the embodiments of the present disclosure, the modal fusion model performs multi-head attention with the second video feature vector as the Query and the second text feature vector as the Key and Value, and at the same time performs multi-head attention with the second text feature vector as the Query and the second video feature vector as the Key and Value. Such a pair of exchanged attentions forms one layer; in some exemplary embodiments, multiple layers may be stacked in series. Then the video representation, which represents the sequence of video frames, is pooled, the text representation, which represents the question-option sequence, is pooled, and the two pooled vectors are fused (Fuse) to obtain the fused feature vector. The fused feature vector is then input into the decoding layer to predict the probability that each candidate is the correct candidate answer.
The above process can be expressed by the following formulas:

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V;

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V);

MHA(E_V, E_QA, E_QA) = concat(head_1, ..., head_h);

MHA_1 = MHA(E_V, E_QA, E_QA);

MHA_2 = MHA(E_QA, E_V, E_V);

DUMA(E_V, E_QA) = Fuse(MHA_1, MHA_2);

where W_i^Q, W_i^K and W_i^V are parameter matrices, d_k is the dimension of the key vectors, MHA(·) is the multi-head attention mechanism, DUMA(·) is the bidirectional multi-head co-attention network, and Fuse(·) denotes average-pooling the outputs of DUMA and fusing them; exemplarily, the fusion may be performed by concatenation.
For each <V, Q, A_i> triplet, the output of the DUMA network is:

O_i = DUMA(E_V, E_QA);

The loss function of the network is:

L = -log( exp(W^T·O_c) / Σ_{j=1}^{s} exp(W^T·O_j) );

where O_c is the output corresponding to the correct candidate answer, W^T is the parameter matrix, and s is the number of candidate answers.
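A compact sketch of this bidirectional co-attention, pooling/fusion and candidate scoring is shown below. The use of nn.MultiheadAttention, mean pooling as the pooling step, concatenation as Fuse(·), and a linear layer playing the role of W^T are assumptions consistent with the description, not the embodiment's exact implementation.

```python
import torch
import torch.nn as nn

class DUMALayer(nn.Module):
    """One layer of bidirectional multi-head co-attention (a sketch)."""
    def __init__(self, hidden=1024, heads=16):
        super().__init__()
        self.v2t = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, e_v, e_qa):
        mha1, _ = self.v2t(e_v, e_qa, e_qa)   # MHA_1 = MHA(E_V, E_QA, E_QA)
        mha2, _ = self.t2v(e_qa, e_v, e_v)    # MHA_2 = MHA(E_QA, E_V, E_V)
        return mha1, mha2

def fuse(mha1, mha2):
    # Fuse(.): mean-pool each stream over its sequence, then concatenate the two pooled vectors
    return torch.cat([mha1.mean(dim=1), mha2.mean(dim=1)], dim=-1)

class AnswerScorer(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.duma = DUMALayer(hidden)
        self.w = nn.Linear(2 * hidden, 1)     # plays the role of W^T in the loss

    def forward(self, e_v, e_qa_per_candidate):
        # e_qa_per_candidate: one encoded (question, candidate answer) sequence per candidate
        scores = [self.w(fuse(*self.duma(e_v, e_qa))) for e_qa in e_qa_per_candidate]
        return torch.cat(scores, dim=-1)      # (batch, s); train with cross-entropy over the s candidates

scorer = AnswerScorer()
e_v = torch.randn(2, 30, 1024)
candidates = [torch.randn(2, 20, 1024) for _ in range(4)]
print(scorer(e_v, candidates).shape)          # torch.Size([2, 4])
```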
In some exemplary embodiments, the method further includes:

receiving a voice input of a user;

converting the voice input into the question text through speech recognition.

In some exemplary embodiments, the method further includes:

receiving a voice input of a user;

converting the voice input into a voice instruction and/or the question text through speech recognition.

Exemplarily, the voice instruction may be "Switch to the surveillance video of road XXXX for me", and so on; the system performs the corresponding response operation according to the received voice instruction, for example, switching the display screen to the surveillance video of road XXXX. Exemplarily, the question text may be "Is there an accident on the current road section?", "What is the cause of the accident?", and so on; the system generates the candidate answer text according to the received question text, and then predicts the correct candidate answer through the video question answering method of the embodiments of the present disclosure.
In some exemplary embodiments, the method further includes:

broadcasting the predicted correct candidate answer in voice form through speech synthesis.

In the embodiments of the present disclosure, the predicted correct candidate answer may be displayed directly on the display screen, or may be broadcast in voice form.
In some exemplary embodiments, the method further includes:

obtaining the question text;

generating, according to the question text, the candidate answer text corresponding to the question text.

In some exemplary embodiments, generating, according to the question text, the candidate answer text corresponding to the question text includes:

querying, through keyword matching or an attention mechanism model, triples matching the question text from a knowledge graph;

generating, according to the matched triples, the candidate answer text corresponding to the question text.
With the expansion of knowledge, the concept of the knowledge graph is applied more and more widely. A knowledge graph can mine the connections between data by using suitable knowledge representation methods at the relation layer, making it easier for knowledge to circulate among computers and to be processed collaboratively. A knowledge graph represents the relationships between entities with a graph data structure; compared with plain text information, the graph representation is easier to understand and accept. A knowledge graph is composed of relation edges of the form "entity-relation-entity" or "entity-attribute-attribute value", and focuses on representing the relationships between entities or between entities and attributes. Relation search based on knowledge graphs is widely used in entity matching and question answering systems in various industries; it can process and store known knowledge, and can quickly match knowledge and search for answers.

The basic component of a knowledge graph is the triple (S, P, O), where S and O are nodes in the knowledge graph and represent entities; S denotes the subject and O denotes the object. P is the edge connecting the two entities (S and O) in the knowledge graph and represents the relationship between them. For example, in a common-sense knowledge graph of traffic accidents, a triple may be (rear-end collision, accident type, traffic accident), and so on.

Exemplarily, the video question answering method of the embodiments of the present disclosure can be used in a video surveillance system. For example, in a smart-transportation multi-screen surveillance system, the large screens can be scheduled and interactively queried by voice control, and the surveillance feed in which an accident occurs can be switched to the main screen for accident analysis; the corresponding interactive instructions may be, for example, "Switch to the surveillance video of road XXXX for me", "Is there an accident on the current road section?", or "What is the cause of the accident?". A concrete implementation involves common-sense knowledge related to traffic accidents, which can be provided by introducing a knowledge base. For example, a common-sense knowledge graph of traffic accidents as shown in Figure 4 can be constructed in the knowledge base, and triples matching the question text are queried from this knowledge graph through keyword matching or an attention mechanism, serving as the common-sense knowledge (i.e., candidate answers) corresponding to the semantic information. For example, according to the traffic accident common-sense knowledge graph of Figure 4, for the user question "What is the cause of the accident?", the candidate answers are generated automatically: weather, road, driver, poor vehicle condition.
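A toy illustration of the keyword-matching variant of this query is given below; the triple contents are taken from the example above, while the storage format and the matching rule (mapping the keyword "原因", "cause", to the "事故原因" relation) are assumptions for illustration only.

```python
# (subject, predicate, object) triples from a traffic-accident common-sense knowledge graph
triples = [
    ("追尾事故", "事故类型", "交通事故"),   # (rear-end collision, accident type, traffic accident)
    ("天气", "事故原因", "交通事故"),       # (weather, accident cause, traffic accident)
    ("道路", "事故原因", "交通事故"),       # (road, accident cause, traffic accident)
    ("驾驶员", "事故原因", "交通事故"),     # (driver, accident cause, traffic accident)
    ("车况不佳", "事故原因", "交通事故"),   # (poor vehicle condition, accident cause, traffic accident)
]

def candidate_answers(question: str) -> list[str]:
    # crude keyword matching: "原因" (cause) in the question selects the "事故原因" relation
    if "原因" in question:
        return [s for s, p, o in triples if p == "事故原因"]
    return []

print(candidate_answers("事故发生的原因是什么?"))   # ['天气', '道路', '驾驶员', '车况不佳']
```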
An embodiment of the present disclosure further provides a video question answering apparatus, including a memory and a processor coupled to the memory, where the processor is configured to perform, based on instructions stored in the memory, the steps of the video question answering method according to any embodiment of the present disclosure.

As shown in Figure 5, in one example, the video question answering apparatus may include a processor 510, a memory 520 and a bus system 530, where the processor 510 and the memory 520 are connected through the bus system 530, the memory 520 is used to store instructions, and the processor 510 is used to execute the instructions stored in the memory 520 to: extract a video feature vector for an input video, and extract a text feature vector for a question text and a candidate answer text, where the question text is used to describe a question and the candidate answer text is used to provide multiple candidate answers; concatenate the video feature vector and the text feature vector to obtain a concatenated feature vector, and input the concatenated feature vector into a first pre-trained model, where the first pre-trained model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second concatenated feature vector; divide the second concatenated feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video representation and a text representation, and pools and then fuses the video representation and the text representation to obtain a fused feature vector; and input the fused feature vector into a decoding layer to predict the correct candidate answer.
It should be understood that the processor 510 may be a Central Processing Unit (CPU), and the processor 510 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The memory 520 may include a read-only memory and a random access memory, and provides instructions and data to the processor 510. A portion of the memory 520 may also include a non-volatile random access memory. For example, the memory 520 may also store device type information.

In addition to a data bus, the bus system 530 may also include a power bus, a control bus, a status signal bus, and the like. However, for clarity of description, the various buses are all labeled as the bus system 530 in Figure 5.

During implementation, the processing performed by the processing device may be completed by integrated logic circuits of hardware in the processor 510 or by instructions in the form of software. That is, the method steps of the embodiments of the present disclosure may be embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 520; the processor 510 reads the information in the memory 520 and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here again.
As shown in Figure 6, an embodiment of the present disclosure further provides a video question answering system, which includes the video question answering apparatus according to any embodiment of the present disclosure, and further includes a surveillance system, a speech recognition apparatus, a voice input apparatus and a knowledge base, where:

the surveillance system is configured to acquire one or more surveillance videos, process the surveillance videos according to an instruction text, and output the surveillance videos to the video question answering apparatus;

the voice input apparatus is configured to receive a voice input and output it to the speech recognition apparatus;

the speech recognition apparatus is configured to convert the voice input into the instruction text or a question text through speech recognition, input the instruction text into the surveillance system, and input the question text into the video question answering apparatus;

the knowledge base is configured to store a knowledge graph;

the video question answering apparatus is configured to receive the question text and the surveillance video, and generate a candidate answer text according to the question text, where the question text is used to describe a question and the candidate answer text is used to provide multiple candidate answers; and is further configured to extract a video feature vector from the received surveillance video, extract a text feature vector for the question text and the candidate answer text, concatenate the video feature vector and the text feature vector to obtain a concatenated feature vector, and input the concatenated feature vector into a first pre-trained model, where the first pre-trained model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second concatenated feature vector; divide the second concatenated feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector using a mutual attention mechanism to obtain a video representation and a text representation, and pools and then fuses the video representation and the text representation to obtain a fused feature vector; and input the fused feature vector into a decoding layer to predict the correct candidate answer.
In some exemplary embodiments, the video question answering system further includes a speech synthesis output apparatus, where:

the speech synthesis output apparatus is configured to broadcast the predicted correct candidate answer in voice form through speech synthesis.

The video question answering system of the embodiments of the present disclosure consists of the several modules shown in Figure 6: voice instructions or questions are given by the user through the voice input apparatus, the speech recognition apparatus converts the natural language into text, the speech synthesis output apparatus broadcasts the answers fed back by the system in voice form, and the video question answering apparatus, combining the surveillance system and the knowledge base, performs semantic understanding and multi-modal interactive reasoning on the questions or instructions passed in by the user to give the corresponding answers.
An embodiment of the present disclosure further provides a storage medium on which a computer program is stored, where the program, when executed by a processor, implements the video question answering method according to any embodiment of the present disclosure.

In some possible implementations, the various aspects of the video question answering method provided by the present application may also be implemented in the form of a program product, which includes program code; when the program product runs on a computer device, the program code is used to cause the computer device to execute the steps of the video question answering method according to the various exemplary implementations of the present application described above in this specification; for example, the computer device may execute the video question answering method recorded in the embodiments of the present application.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
The drawings of the present disclosure relate only to the structures involved in the present disclosure; other structures may refer to common designs. Where there is no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.

Those of ordinary skill in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and all such modifications and substitutions shall fall within the scope of the claims of the present disclosure.

Claims (14)

  1. A video question answering method, comprising:

    extracting a video feature vector for an input video, and extracting a text feature vector for a question text and a candidate answer text, wherein the question text is used to describe a question and the candidate answer text is used to provide a plurality of candidate answers; concatenating the video feature vector and the text feature vector to obtain a concatenated feature vector, and inputting the concatenated feature vector into a first pre-trained model, wherein the first pre-trained model learns cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second concatenated feature vector;

    dividing the second concatenated feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, wherein the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video representation and a text representation, and pools and then fuses the video representation and the text representation to obtain a fused feature vector;

    inputting the fused feature vector into a decoding layer to predict a correct candidate answer.
  2. The video question answering method according to claim 1, wherein extracting the video feature vector for the input video comprises:

    extracting frames from the input video at a preset rate, and extracting the video feature vector from the extracted frames using a second pre-trained model.

  3. The video question answering method according to claim 1, wherein extracting the text feature vector for the question text and the candidate answer text comprises:

    generating a sequence string from the question text and the candidate answer text, wherein the sequence string comprises a plurality of sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;

    inputting the sequence string into the first pre-trained model to obtain the text feature vector.
  4. The video question answering method according to claim 1, further comprising, before the method:

    constructing the first pre-trained model and initializing it;

    pre-training the first pre-trained model through a plurality of self-supervised tasks, wherein the plurality of self-supervised tasks comprise a tag classification task, a mask language model task and a mask frame model task, the tag classification task is used for multi-label classification of a video, the mask language model task is used to randomly mask text and predict the masked words, and the mask frame model task is used to randomly mask video frames and predict the masked frames;

    calculating the loss of the first pre-trained model as a weighted sum of the losses of the plurality of self-supervised tasks.

  5. The video question answering method according to claim 4, wherein the losses of the tag classification task and the mask language model task are calculated based on binary cross entropy, and the loss of the mask frame model task is calculated based on noise contrastive estimation.

  6. The video question answering method according to claim 1, wherein the first pre-trained model is a 24-layer cascaded deep Transformer encoder network with a hidden-layer dimension of 1024 and 16 attention heads, and the first pre-trained model is initialized with parameters pre-trained by Bidirectional Encoder Representations from Transformers (BERT).
  7. The video question answering method according to claim 1, wherein processing the second video feature vector and the second text feature vector through the mutual attention mechanism comprises:

    performing multi-head attention with the second video feature vector as a query vector and the second text feature vector as a key vector and a value vector;

    performing multi-head attention with the second text feature vector as a query vector and the second video feature vector as a key vector and a value vector.

  8. The video question answering method according to claim 1, further comprising, before the method:

    receiving a voice input of a user;

    converting the voice input into the question text through speech recognition.
  9. The video question answering method according to claim 1, further comprising, before the method:

    obtaining the question text;

    generating, according to the question text, the candidate answer text corresponding to the question text.

  10. The video question answering method according to claim 9, wherein generating, according to the question text, the candidate answer text corresponding to the question text comprises:

    querying, through keyword matching or an attention mechanism model, triples matching the question text from a common-sense knowledge graph;

    generating, according to the matched triples, the candidate answer text corresponding to the question text.
  11. The video question answering method according to claim 1, further comprising:

    processing the video feature vector and/or the text feature vector so that, when the video feature vector and the text feature vector are concatenated, the dimension of the video feature vector is the same as the dimension of the text feature vector.

  12. A video question answering apparatus, comprising a memory and a processor coupled to the memory, wherein the processor is configured to perform, based on instructions stored in the memory, the steps of the video question answering method according to any one of claims 1 to 11.
  13. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the video question answering method according to any one of claims 1 to 11.
  14. A video question answering system, comprising a video question answering apparatus, a surveillance system, a speech recognition apparatus, a voice input apparatus and a knowledge base, wherein:

    the surveillance system is configured to acquire one or more surveillance videos, process the surveillance videos according to an instruction text, and output the surveillance videos to the video question answering apparatus;

    the voice input apparatus is configured to receive a voice input and output it to the speech recognition apparatus;

    the speech recognition apparatus is configured to convert the voice input into the instruction text or a question text through speech recognition, input the instruction text into the surveillance system, and input the question text into the video question answering apparatus;

    the knowledge base is configured to store a common-sense knowledge graph;

    the video question answering apparatus is configured to receive the question text and the surveillance video, and generate a candidate answer text according to the question text, wherein the question text is used to describe a question and the candidate answer text is used to provide a plurality of candidate answers; and is further configured to extract a video feature vector from the received surveillance video, extract a text feature vector for the question text and the candidate answer text, concatenate the video feature vector and the text feature vector to obtain a concatenated feature vector, and input the concatenated feature vector into a first pre-trained model, wherein the first pre-trained model learns cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second concatenated feature vector; divide the second concatenated feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, wherein the modal fusion model processes the second video feature vector and the second text feature vector using a mutual attention mechanism to obtain a video representation and a text representation, and pools and then fuses the video representation and the text representation to obtain a fused feature vector; and input the fused feature vector into a decoding layer to predict a correct candidate answer.
PCT/CN2023/111455 2022-08-29 2023-08-07 Video question-answer method, device and system, and storage medium WO2024046038A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211043431.7A CN115391511A (en) 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium
CN202211043431.7 2022-08-29

Publications (1)

Publication Number Publication Date
WO2024046038A1 true WO2024046038A1 (en) 2024-03-07

Family

ID=84122684

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111455 WO2024046038A1 (en) 2022-08-29 2023-08-07 Video question-answer method, device and system, and storage medium

Country Status (2)

Country Link
CN (1) CN115391511A (en)
WO (1) WO2024046038A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391511A (en) * 2022-08-29 2022-11-25 京东方科技集团股份有限公司 Video question-answering method, device, system and storage medium
CN116257611B (en) * 2023-01-13 2023-11-10 北京百度网讯科技有限公司 Question-answering model training method, question-answering processing device and storage medium
CN116883886B (en) * 2023-05-25 2024-05-28 中国科学院信息工程研究所 Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
CN117972138B (en) * 2024-04-02 2024-07-23 腾讯科技(深圳)有限公司 Training method and device for pre-training model and computer equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220164548A1 (en) * 2020-11-24 2022-05-26 Openstream Inc. System and Method for Temporal Attention Behavioral Analysis of Multi-Modal Conversations in a Question and Answer System
CN113807222A (en) * 2021-09-07 2021-12-17 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling
CN114925703A (en) * 2022-06-14 2022-08-19 齐鲁工业大学 Visual question-answering method and system with multi-granularity text representation and image-text fusion
CN115391511A (en) * 2022-08-29 2022-11-25 京东方科技集团股份有限公司 Video question-answering method, device, system and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MA ZHIYANG, ZHENG WENFENG, CHEN XIAOBING, YIN LIRONG: "Joint embedding VQA model based on dynamic word vector", PEERJ COMPUTER SCIENCE, vol. 7, 3 March 2021 (2021-03-03), pages e353, XP093142989, ISSN: 2376-5992, DOI: 10.7717/peerj-cs.353 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN118035491A (en) * 2024-04-11 2024-05-14 北京搜狐新媒体信息技术有限公司 Training method and using method of video label labeling model and related products
CN118675092A (en) * 2024-08-21 2024-09-20 南方科技大学 Multi-mode video understanding method based on large language model
CN118674033A (en) * 2024-08-23 2024-09-20 成都携恩科技有限公司 Knowledge reasoning method based on multi-training global perception

Also Published As

Publication number Publication date
CN115391511A (en) 2022-11-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23859073

Country of ref document: EP

Kind code of ref document: A1