CN115391511A - Video question-answering method, device, system and storage medium - Google Patents

Video question-answering method, device, system and storage medium Download PDF

Info

Publication number
CN115391511A
CN115391511A (Application CN202211043431.7A)
Authority
CN
China
Prior art keywords
video
text
feature vector
question
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211043431.7A
Other languages
Chinese (zh)
Inventor
王炳乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202211043431.7A priority Critical patent/CN115391511A/en
Publication of CN115391511A publication Critical patent/CN115391511A/en
Priority to PCT/CN2023/111455 priority patent/WO2024046038A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video question-answering method, device, system and storage medium are provided. The method includes: extracting a video feature vector from an input video, and extracting text feature vectors from a question text and a candidate answer text; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, and learning cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism to obtain a second spliced feature vector; dividing the second spliced feature vector into a second video feature vector and a second text feature vector, inputting them into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and respectively pooling and fusing the video expression and the text expression to obtain a fused feature vector; and predicting the correct candidate answer according to the fused feature vector.

Description

Video question-answering method, device, system and storage medium
Technical Field
The disclosed embodiments relate to, but are not limited to, the field of natural language processing technologies, and in particular to a video question-answering method, device, system, and storage medium.
Background
In the current era of mobile internet and big data, video data on the network is growing explosively. As video becomes an increasingly rich information-bearing medium, understanding the semantics of videos is a foundational technology for many intelligent video applications and has important research significance and practical application value. Video question answering (Video QA) is the task of inferring the correct answer from a candidate set given a video clip and a question. With the progress of computer vision and natural language processing, the wide application of video question answering in video retrieval, intelligent question-answering systems, driver-assistance systems, automatic driving, and the like is receiving increasing attention.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the disclosure provides a video question-answering method, which comprises the following steps:
extracting video feature vectors for an input video, and extracting text feature vectors for a question text and a candidate answer text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, and learning cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism to obtain a coded second spliced feature vector;
dividing the second spliced feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and pooling and fusing the video expression and the text expression respectively to obtain a fused feature vector;
the fused feature vectors are input to a decoding layer to predict correct candidate answers.
In an exemplary embodiment, the extracting a video feature vector for an input video includes:
extracting frames from the input video at a preset rate, and extracting video feature vectors of the extracted frames by using a second pre-training model.
In an exemplary embodiment, the extracting text feature vectors for the question text and the candidate answer text includes:
generating a sequence string according to the question text and the candidate answer text, wherein the sequence string comprises a plurality of sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;
and inputting the sequence string into the first pre-training model to obtain a text feature vector.
In an exemplary embodiment, before the foregoing steps, the method further comprises:
constructing the first pre-training model and initializing;
pre-training the first pre-training model through a plurality of self-supervision tasks, wherein the plurality of self-supervision tasks comprise a label classification task, a mask language model task and a mask frame model task, the label classification task is used for performing multi-label classification on videos, the mask language model task is used for randomly masking words in the text and predicting the masked words, and the mask frame model task is used for randomly masking video frames and predicting the masked frames;
calculating a loss of the first pre-trained model by a weighted sum of losses of a plurality of the self-supervised tasks.
In an exemplary embodiment, the loss of the label classification task and the mask language model task is computed based on binary cross entropy, and the loss of the mask frame model task is computed based on noise contrastive estimation.
In an exemplary embodiment, the first pre-training model is a cascade neural network of 24 deep Transformer encoder layers, with a hidden layer dimension of 1024 and 16 attention heads, and is initialized with parameters pre-trained by the Bidirectional Encoder Representations from Transformers (BERT) model.
In an exemplary embodiment, the processing the second video feature vector and the second text feature vector through a mutual attention mechanism includes:
taking the second video feature vector as a query vector, taking the second text feature vector as a key vector and a value vector, and performing multi-head attention;
and taking the second text feature vector as a query vector, taking the second video feature vector as a key vector and a value vector, and performing multi-head attention.
In an exemplary embodiment, before the foregoing steps, the method further comprises:
receiving a voice input of a user;
converting the speech input into the question text by speech recognition.
In an exemplary embodiment, before the foregoing steps, the method further comprises:
acquiring the question text;
and generating the candidate answer text corresponding to the question text according to the question text.
In an exemplary embodiment, the generating the candidate answer text corresponding to the question text according to the question text includes:
searching the triples matched with the question text from the common sense knowledge graph through keyword matching or an attention mechanism model;
and generating the candidate answer text corresponding to the question text according to the matched triple.
In an exemplary embodiment, the method further comprises:
processing the video feature vector and/or the text feature vector such that a dimension of the video feature vector and a dimension of the text feature vector are the same when the video feature vector and the text feature vector are stitched.
The embodiment of the present disclosure further provides a video question answering device, which includes a memory; and a processor coupled to the memory, the processor configured to perform the steps of the video question answering method according to any one of the embodiments of the present disclosure based on instructions stored in the memory.
The embodiments of the present disclosure further provide a storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the video question answering method according to any one of the embodiments of the present disclosure.
The embodiment of the present disclosure further provides a video question-answering system, which includes a video question-answering device, a monitoring system, a voice recognition device, a voice input device, and a knowledge base, wherein:
the monitoring system is configured to acquire one or more monitoring videos, process the monitoring videos according to an instruction text and output the monitoring videos to the video question-answering device;
the voice input device is configured to receive voice input and output the voice input to the voice recognition device;
the voice recognition device is configured to convert voice input into an instruction text or a question text through voice recognition, input the instruction text into the monitoring system, and input the question text into the video question-answering device;
the knowledge base is configured to store a common sense knowledge graph;
the video question-answering device is configured to receive a question text and a monitoring video, and generate a candidate answer text according to the question text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; the video question-answering device extracts video feature vectors from the received monitoring video, extracts text feature vectors from the question text and the candidate answer text, splices the video feature vectors and the text feature vectors to obtain a spliced feature vector, and inputs the spliced feature vector into a first pre-training model, and the first pre-training model learns cross-modal information between the video feature vectors and the text feature vectors through a self-attention mechanism to obtain a coded second spliced feature vector; the second spliced feature vector is divided into a second video feature vector and a second text feature vector; the second video feature vector and the second text feature vector are input into a modal fusion model, the modal fusion model adopts a mutual attention mechanism to process the second video feature vector and the second text feature vector to obtain a video expression and a text expression, and the video expression and the text expression are respectively pooled and fused to obtain a fused feature vector; the fused feature vector is input to a decoding layer to predict the correct candidate answer.
Other aspects will become apparent upon reading the attached drawings and the detailed description.
Drawings
The accompanying drawings are included to provide an understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification; they illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure, not to limit it.
Fig. 1 is a schematic flow chart diagram of a video question answering method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a visual language cross-modal model created by an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a modality fusion model created by an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a traffic accident common sense knowledge graph according to an exemplary embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a video question answering device according to an exemplary embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a video question answering system according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. Embodiments may be embodied in many different forms. One of ordinary skill in the art can readily appreciate the fact that the manner and content may be altered into one or more forms without departing from the spirit and scope of the present disclosure. Therefore, the present disclosure should not be construed as being limited to the contents described in the following embodiments. The embodiments and features of the embodiments in the present disclosure may be arbitrarily combined with each other without conflict.
In the drawings, the size of one or more constituent elements, the thickness of layers, or regions may be exaggerated for clarity. Therefore, one embodiment of the present disclosure is not necessarily limited to the dimensions, and the shapes and sizes of a plurality of components in the drawings do not reflect a true scale. Further, the drawings schematically show ideal examples, and one embodiment of the present disclosure is not limited to the shapes, numerical values, and the like shown in the drawings.
The ordinal numbers such as "first", "second", "third", and the like in the present disclosure are provided to avoid confusion of the constituent elements, and are not limited in number. "plurality" in this disclosure means two or more than two.
Computer Vision (CV) and Natural Language Processing (NLP) are two major branches of artificial intelligence that focus on simulating human intelligence in vision and language. In the past decade, deep learning has greatly advanced single-modality learning in both areas. Some visual question-answering techniques encode an input image with an image encoder, encode an input question with a question encoder, combine the encoded image and question features through a dot product, and then predict the probability of candidate answers through a fully connected layer. However, these question-answering techniques only consider single-modality information on each side and lack semantic comprehension ability, while real-world problems tend to be multi-modal. For example, an autonomous automobile should be able to handle human commands (speech), traffic signals (vision), and road conditions (vision and sound).
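The following is a minimal sketch (PyTorch) of such a baseline; the class name, feature dimensions and number of answers are illustrative assumptions rather than values taken from the disclosure. It only shows the dot-product-style fusion of separately encoded image and question features followed by a fully connected classifier.

```python
import torch
import torch.nn as nn

class DotProductVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hidden=1024, num_answers=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # image encoder output -> common space
        self.q_proj = nn.Linear(q_dim, hidden)       # question encoder output -> common space
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, img_feat, q_feat):
        fused = self.img_proj(img_feat) * self.q_proj(q_feat)  # element-wise (dot-product style) fusion
        return self.classifier(fused)                           # logits over candidate answers

logits = DotProductVQA()(torch.randn(2, 2048), torch.randn(2, 1024))  # (2, 1000)
```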
As shown in fig. 1, an embodiment of the present disclosure provides a video question answering method, including the following steps:
step 101: extracting video feature vectors from an input video, and extracting text feature vectors from a question text and a candidate answer text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, learning cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism, and obtaining a coded second spliced feature vector;
step 102: dividing the second splicing feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and pooling and fusing the video expression and the text expression respectively to obtain a fusion feature vector;
step 103: the fused feature vectors are input to the decoding layer to predict the correct candidate answer.
According to the video question-answering method, the video feature vector and the text feature vector are spliced and then input into the first pre-training model, so that vision-language cross-modal information improves the semantic representation capability for video question answering; the mutual attention mechanism enables deeper semantic interaction between the video and the question-option pairs, so that the video content is understood more deeply at the semantic level and the reasoning effect of the model is improved. The video question-answering method provided by the embodiment of the disclosure can be applied to various application fields such as video retrieval, intelligent question-answering systems, driver-assistance systems, and automatic driving.
In some exemplary embodiments, before step 101, the method further comprises:
constructing a first pre-training model and initializing;
pre-training the first pre-training model through a plurality of self-supervision tasks, wherein the plurality of self-supervision tasks comprise a label classification task, a mask language model task and a mask frame model task, the label classification task is used for performing multi-label classification on videos, the mask language model task is used for randomly masking words in the text and predicting the masked words, and the mask frame model task is used for randomly masking video frames and predicting the masked frames;
the loss of the first pre-trained model is calculated by a weighted sum of the losses of the self-supervised tasks.
Fig. 2 is a schematic structural diagram of a visual language cross-modal model created by an embodiment of the present disclosure. As shown in fig. 2, the visual language cross-modal model includes a first pre-training model, a modal fusion model, and a decoding layer in cascade.
In some exemplary embodiments, the first pre-training model may be a cascade neural network of 24 deep Transformer encoder layers, with a hidden layer dimension of 1024 and 16 attention heads, and the first pre-training model is initialized with parameters pre-trained by BERT.
The Transformer model is composed of an encoder and a decoder, each of which comprises a plurality of network blocks; each network block of the encoder is composed of a multi-head attention sub-layer and a feedforward neural network sub-layer. The structure of the decoder is similar to that of the encoder, except that each network block of the decoder contains one additional attention sub-layer.
Bidirectional Encoder Representations from Transformers (BERT) is a successful application of the Transformer: it uses the Transformer encoder and introduces a bidirectional masking technique that allows each language token to attend bidirectionally to the other tokens.
The first pre-training model of the embodiments of the present disclosure employs a cascade structure of multiple Transformer encoder layers. Illustratively, the first pre-training model may be a cascade neural network of 24 deep Transformer encoder layers, with a hidden layer dimension of 1024 and 16 attention heads; in this case, the network structure of the first pre-training model is consistent with BERT Large (a natural language pre-training model proposed by Google), and therefore the first pre-training model of the embodiment of the present disclosure may be directly initialized with the pre-training results of BERT and other natural language pre-training models.
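As a minimal sketch of this initialization, assuming the HuggingFace transformers library and the publicly released "bert-large-uncased" checkpoint (the disclosure does not name a specific checkpoint), a 24-layer encoder with the stated dimensions can be loaded directly:

```python
from transformers import BertModel

# BERT-Large already matches the structure described above:
# 24 encoder layers, hidden size 1024, 16 attention heads.
encoder = BertModel.from_pretrained("bert-large-uncased")
assert encoder.config.num_hidden_layers == 24
assert encoder.config.hidden_size == 1024
assert encoder.config.num_attention_heads == 16
```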
In some exemplary embodiments, extracting a video feature vector for an input video comprises:
and (3) extracting frames of the input video at a preset speed, and extracting video characteristic vectors of the extracted frames by adopting a second pre-training model.
The input of the video question answering is a video clip, a question and a candidate answer set. Unlike the processing of images, a continuous video stream can be understood as a set of fast-playing pictures, where each picture is defined as a frame. For the video information, the input video is framed at a preset rate (illustratively, 1 frame per second (fps)), at most N frames are extracted per video (illustratively, N may be 30), and video feature vectors (visual features) are extracted using a Computer Vision (CV) pre-training model, EfficientNetB3. Optionally, in the embodiment of the present disclosure, video feature vectors may also be extracted from the input video by using a pre-training model such as MobileNet or ResNet101.
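A minimal sketch of this step is shown below; it assumes the decord library for video decoding and timm for the EfficientNet-B3 backbone, both of which are illustrative substitutes for whatever frame reader and CV pre-training model an implementation actually uses.

```python
import torch
import timm
from decord import VideoReader

def extract_video_features(path, fps=1, max_frames=30):
    vr = VideoReader(path)
    step = max(int(round(vr.get_avg_fps() / fps)), 1)       # keep ~1 frame per second
    idx = list(range(0, len(vr), step))[:max_frames]        # at most N = 30 frames
    frames = torch.from_numpy(vr.get_batch(idx).asnumpy())  # (n, H, W, 3) uint8
    frames = frames.permute(0, 3, 1, 2).float() / 255.0     # (n, 3, H, W)
    frames = torch.nn.functional.interpolate(frames, size=(300, 300))  # EfficientNet-B3 input size
    backbone = timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)
    backbone.eval()
    with torch.no_grad():
        return backbone(frames)                             # (n, 1536) visual features
```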
In some exemplary embodiments, extracting the text feature vector for the question text and the candidate answer text comprises:
generating a sequence string according to the question text and the candidate answer text, wherein the sequence string comprises a plurality of sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;
and inputting the sequence string into a first pre-training model to obtain a text feature vector.
In the embodiment of the present disclosure, for the question text and the candidate answer text, a word list with a size of n_vocab is first used to tokenize the question text and the candidate answer text: each word or character is converted into one or more sequences, and all the words and/or characters in the question text and the candidate answer text form a sequence string, where each sequence may be a numerical ID. Illustratively, the numerical ID may be between 0 and n_vocab-1. For example, the question "what type of road is it currently" together with the candidates "lane", "highway" and "street" is converted into a sequence string such as "2560 2013 1300 100 567 …". The sequence string is input into the first pre-training model (which may here be the initialized first pre-training model) to obtain n_word × 1024 text feature vectors, where n_word is the number of sequences in the sequence string and 1024 is the dimension of each text feature vector.
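A minimal sketch of this tokenization, assuming a BERT-style tokenizer from the HuggingFace transformers library (the actual word list of size n_vocab used by the disclosure is not specified, and the checkpoint name is illustrative), could look as follows:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
question = "what type of road is it currently"
candidates = "lane highway street"
# each word is mapped to one or more numerical IDs in [0, n_vocab - 1]
ids = tokenizer.encode(question, candidates, add_special_tokens=False)
print(ids)  # the sequence string of numerical IDs fed to the first pre-training model
```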
In the embodiment of the present disclosure, when the video feature vector and the text feature vector are spliced and input into the first pre-training model, the input form of the spliced feature vector may be [CLS] Video_frame [SEP] Question-Answer [SEP], where Video_frame is the video feature vector, Question-Answer is the text feature vector, the [CLS] token is the head token of the spliced feature vector, and the [SEP] tokens are used to separate the video feature vector and the text feature vector. The same position embedding vectors and segment embedding vectors as in BERT are added to the input of the first pre-training model, where the position embedding vector indicates a position in the sequence, and the segment embedding vector indicates the segment to which each position belongs.
In some exemplary embodiments, before stitching the video feature vector with the text feature vector, the method further comprises: and processing the video feature vector and/or the text feature vector so that the dimensionality of the video feature vector is the same as the dimensionality of the text feature vector when the video feature vector is spliced with the text feature vector.
In the embodiment of the disclosure, when the video feature vector extracted by the pre-training model EfficientNetB3 has 1536 dimensions, it is reduced through a fully connected layer to 1024 dimensions, the same as the dimension of the text feature vector.
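A minimal sketch of this input preparation (PyTorch) is given below; only the 1536 → 1024 projection and the [CLS]/[SEP] splicing order with position and segment embeddings come from the description above, while details such as the maximum position count are assumptions.

```python
import torch
import torch.nn as nn

hidden = 1024
frame_proj = nn.Linear(1536, hidden)          # dimension reduction for EfficientNetB3 features
cls_emb = nn.Parameter(torch.randn(1, hidden))
sep_emb = nn.Parameter(torch.randn(1, hidden))
pos_emb = nn.Embedding(512, hidden)           # position embedding (assumes m + n + 3 <= 512)
seg_emb = nn.Embedding(2, hidden)             # segment embedding: 0 = video part, 1 = text part

def splice(video_feat, text_feat):            # (m, 1536), (n, 1024)
    v = frame_proj(video_feat)
    x = torch.cat([cls_emb, v, sep_emb, text_feat, sep_emb], dim=0)   # [CLS] V [SEP] QA [SEP]
    seg = torch.tensor([0] * (v.size(0) + 2) + [1] * (text_feat.size(0) + 1))
    pos = torch.arange(x.size(0))
    return x + pos_emb(pos) + seg_emb(seg)    # spliced feature vector for the first pre-training model

spliced = splice(torch.randn(30, 1536), torch.randn(40, 1024))        # (73, 1024)
```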
In the embodiment of the present disclosure, training a visual language cross-modal model includes two stages: and performing a pre-training phase on the first pre-training model and performing a fine-tuning phase on the visual language cross-modal model.
In the pre-training stage of the first pre-training model, three tasks, namely label classification (TC), mask language model (MLM), and mask frame model (MFM), are used to pre-train the first pre-training model. The label classification task is used for performing multi-label classification on the video, the mask language model task is used for randomly masking words in the text and predicting the masked words, and the mask frame model task is used for randomly masking video frames and predicting the masked frames.
For example, in the label classification task, the 100 tag classes with the highest frequency of occurrence may be used for a multi-label classification task, where the tags are manually labeled video tags, for example, whether an accident occurs, the accident cause, the accident location, the vehicle type, the weather condition, the time information, and the like. The vector corresponding to [CLS] in the last layer of BERT is connected to a fully connected layer to obtain the predicted tags, and the binary cross entropy loss (BCE loss) with respect to the real tags is calculated.
In the mask language model task, a random 15% of the text tokens are masked, and the masked tokens are predicted. In a multi-modal scenario, the masked words are predicted by also using the information of the video, which effectively fuses multi-modal information.
In the mask frame model task, a random 15% of the video frames are masked, and all-zero vectors are used for the masked video frames. Because the video feature vectors are continuous real-valued vectors without tokenization, it is difficult to formulate a classification task similar to the mask language model task. Since the prediction for a masked frame should be as similar as possible to the masked frame itself among all frames within the entire batch, the embodiments of the present disclosure calculate the loss of the mask frame model based on Noise Contrastive Estimation (NCE), maximizing the mutual information between the masked frame and the predicted frame.
Multi-task joint training is adopted: the loss of the total pre-training task is the weighted sum of the losses of the three pre-training tasks, Loss = a·Loss(TC) + b·Loss(MLM) + c·Loss(MFM), where a, b and c are respectively the weights of the losses of the three pre-training tasks.
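A minimal sketch of this weighted multi-task loss (PyTorch) follows; the BCE/NCE split follows the description above, while the label formats, the NCE temperature and the default weights a = b = c = 1 are assumptions.

```python
import torch
import torch.nn.functional as F

def nce_loss(pred_frames, true_frames, temperature=0.07):
    # Contrast each predicted masked frame against all true frames in the batch,
    # maximizing the mutual information between the masked frame and its prediction.
    pred = F.normalize(pred_frames, dim=-1)            # (n_mask, d)
    true = F.normalize(true_frames, dim=-1)            # (n_mask, d), row i is the real frame i
    logits = pred @ true.t() / temperature             # (n_mask, n_mask)
    target = torch.arange(pred.size(0))                # positive pair is on the diagonal
    return F.cross_entropy(logits, target)

def pretrain_loss(tc_logits, tc_labels, mlm_logits, mlm_labels,
                  mfm_pred, mfm_true, a=1.0, b=1.0, c=1.0):
    # tc_labels / mlm_labels are assumed to be multi-hot float tensors (BCE targets).
    loss_tc = F.binary_cross_entropy_with_logits(tc_logits, tc_labels)
    loss_mlm = F.binary_cross_entropy_with_logits(mlm_logits, mlm_labels)
    loss_mfm = nce_loss(mfm_pred, mfm_true)
    return a * loss_tc + b * loss_mlm + c * loss_mfm   # Loss = a*Loss(TC) + b*Loss(MLM) + c*Loss(MFM)
```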
According to the video question-answering method of the embodiment of the present disclosure, information of the two modalities, video and text, is represented through the first pre-training model: the video feature vector corresponding to the video and the text feature vectors corresponding to the question text and the candidate answer text are spliced and input into the first pre-training model to obtain the coded second spliced feature vector. The second spliced feature vector may be divided into the second video feature vector and the second text feature vector according to the tokens in the second spliced feature vector.
Assume that V = [v1, v2, …, vm], Q = [q1, q2, …, qn] and A = [a1, a2, …, ak] are the video, the question and an answer respectively, where vi, qi and ai are the corresponding sequences, and let VLP(·) denote the VLP model (i.e. the first pre-training model). The encoded representation E obtained by inputting [CLS] V [SEP] Q+A [SEP] into the VLP model can be expressed as:

E = VLP([CLS] V [SEP] Q+A [SEP]);

where E = [e1, e2, …, e(m+n+k)]. The encoded video-frame part and question-option part are divided into two parts, obtaining the second video feature vector and the second text feature vector:

E_V = [e1, …, em] and E_QA = [e(m+1), …, e(m+n+k)].
Then, the pre-trained first pre-training model is connected with the modal fusion model and the decoding layer, and the whole visual language cross-modal model is fine-tuned. The processing procedure of the first pre-training model is the same as described above and is not repeated here. The second video feature vector and the second text feature vector are input into the modal fusion model, the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and the video expression and the text expression are respectively pooled and fused to obtain a fused feature vector.
Fig. 3 is a schematic structural diagram of a modality fusion model created according to an embodiment of the present disclosure. As shown in fig. 3, in the modality fusion model, the Query vector (Query) comes from one modality, and the Key vector (Key) and the Value vector (Value) come from another modality. The video question-answering method provided by the embodiment of the disclosure further aligns and fuses the language and the image semantically by introducing a mutual attention mechanism.
In some exemplary embodiments, processing the second video feature vector and the second text feature vector by a mutual attention mechanism includes:
taking the second video feature vector as a query vector, taking the second text feature vector as a key vector and a value vector, and carrying out Multi-Head Attention (Multi Head Attention);
and taking the second text feature vector as a query vector and the second video feature vector as a key vector and a value vector to carry out multi-head attention.
In the embodiment of the present disclosure, the modal fusion model takes the second video feature vector as Query and the second text feature vector as Key and Value to perform multi-head attention, and takes the second text feature vector as Query and the second video feature vector as Key and Value to perform multi-head attention. Such mutual attention constitutes one layer; in some exemplary embodiments, multiple such layers may be cascaded in series. The video expression representing the video frame sequence and the text expression representing the question-option sequence are each pooled, and the two pooled vectors are fused to obtain the fused feature vector. The fused feature vector is then input into the decoding layer to predict the probability that each candidate is the correct candidate answer.
The above process is formulated as:

Attention(q, k, v) = softmax(q·k^T / √d_k)·v;

head_i = Attention(E_V·W_i^Q, E_QA·W_i^K, E_QA·W_i^V);

MHA(E_V, E_QA, E_QA) = concat(head_1, …, head_h);

MHA_1 = MHA(E_V, E_QA, E_QA);

MHA_2 = MHA(E_QA, E_V, E_V);

DUMA(E_V, E_QA) = Fuse(MHA_1, MHA_2);
where W_i^Q, W_i^K and W_i^V are parameter matrices, MHA(·) is the multi-head attention mechanism, DUMA(·) is the bidirectional multi-head co-attention network, and Fuse(·) denotes average pooling and fusion of the DUMA outputs; illustratively, the fusion may be performed by concatenation.
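A minimal sketch of this mutual-attention fusion (PyTorch nn.MultiheadAttention) is shown below; it implements a single mutual-attention layer with mean pooling and concatenation as the Fuse step, and the hidden size and head count are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class MutualAttentionFusion(nn.Module):
    def __init__(self, hidden=1024, heads=16):
        super().__init__()
        self.v2t = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, e_v, e_qa):                      # (B, m, H) video, (B, n+k, H) text
        mha1, _ = self.v2t(e_v, e_qa, e_qa)            # video as Query, text as Key/Value
        mha2, _ = self.t2v(e_qa, e_v, e_v)             # text as Query, video as Key/Value
        video_expr = mha1.mean(dim=1)                  # pooled video expression
        text_expr = mha2.mean(dim=1)                   # pooled text expression
        return torch.cat([video_expr, text_expr], dim=-1)  # fused feature vector (Fuse by concat)

fused = MutualAttentionFusion()(torch.randn(2, 30, 1024), torch.randn(2, 40, 1024))  # (2, 2048)
```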
For each <V, Q, A_i> triplet, the output through the DUMA network is:

O_i = DUMA(E_V, E_QA_i);
the loss function of the network is:
Figure BDA0003821514210000115
wherein, W T Is a parameter matrix, and s is the number of candidate answers.
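A minimal sketch of this decoding and scoring step (PyTorch) follows; the fused dimension, the number of candidates and the label index are illustrative assumptions consistent with the loss above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

s, fuse_dim = 4, 2048                    # s candidate answers, size of the fused vector O_i
W = nn.Linear(fuse_dim, 1, bias=False)   # decoding layer: W^T O_i -> scalar score

O = torch.randn(s, fuse_dim)             # fused vectors for the s <V, Q, A_i> triplets
scores = W(O).squeeze(-1)                # (s,) one score per candidate answer
label = torch.tensor(2)                  # index of the correct candidate (illustrative)
loss = F.cross_entropy(scores.unsqueeze(0), label.unsqueeze(0))  # softmax over the s candidates
predicted = scores.argmax().item()       # predicted correct candidate answer
```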
In some exemplary embodiments, the method further comprises:
receiving a voice input of a user;
the speech input is converted to a question text by speech recognition.
In some exemplary embodiments, the method further comprises:
receiving a voice input of a user;
the voice input is converted into voice instructions and/or question text through voice recognition.
For example, the voice command may be "help me switch to the surveillance video of the XXXX road", and the like, and the system performs a corresponding response operation according to the received voice command, for example, switching the display screen to the surveillance video of the XXXX road. The question text may be, for example, "is there an accident on the current road segment" or "what is the cause of the accident?"; the system generates a candidate answer text according to the received question text, and then predicts the correct candidate answer through the video question-answering method of the embodiment of the disclosure.
In some exemplary embodiments, the method further comprises:
and broadcasting the predicted correct candidate answer in a voice form through voice synthesis.
In the embodiment of the disclosure, the predicted correct candidate answer can be directly displayed on the display screen, or the predicted correct candidate answer can be broadcasted in a voice form.
In some exemplary embodiments, the method further comprises:
acquiring a question text;
and generating candidate answer texts corresponding to the question texts according to the question texts.
In some exemplary embodiments, generating a candidate answer text corresponding to the question text from the question text includes:
searching triples matched with the question text from the knowledge graph through keyword matching or an attention mechanism model;
and generating candidate answer texts corresponding to the question texts according to the matched triples.
With the continuous expansion of knowledge, the use of knowledge graphs is becoming more and more common. A knowledge graph can use an appropriate knowledge representation method to mine the connections between data through a relation layer, so that knowledge can be more easily circulated and cooperatively processed between computers. The knowledge graph represents the relationships between entities with a graph data structure; compared with plain text information, the graph representation is easier to understand and accept. A knowledge graph is formed by connecting entity-relationship-entity or entity-attribute-value edges, representing the relationships between entities or between entities and attributes. Relation search based on knowledge graphs is widely applied to entity matching and question-answering systems in various industries; it can process and store known knowledge and quickly perform knowledge matching and answer search.
The basic unit of a knowledge graph is a triple (S, P, O), where S and O are nodes in the knowledge graph and represent entities: S represents the subject and O represents the object. P is an edge in the knowledge graph connecting the two entities (S and O), representing the relationship between them. For example, in a traffic accident common sense knowledge graph, a triple may be (rear-end accident, accident type, traffic accident), and so on.
For example, in an intelligent traffic multi-screen monitoring system, the large screen can be scheduled and interactive question answering can be carried out through voice control, and the monitoring of an accident can be switched to the main screen for accident analysis; a corresponding interactive instruction may be, for example, "help me switch to the monitoring video of the XXXX road", "is there an accident on the current road section", or "what is the cause of the accident?". In a specific implementation, knowledge related to traffic accidents may be introduced through a knowledge base; for example, a traffic accident common sense knowledge graph as shown in fig. 4 may be constructed in the knowledge base, and the triples matching the question text are queried from the traffic accident common sense knowledge graph through keyword matching or an attention mechanism as the knowledge (i.e., the candidate answers) corresponding to the semantic information of the question. For example, according to the traffic accident common sense knowledge graph shown in fig. 4, for the user question "what is the cause of the accident?", candidate answers are automatically generated: the weather, the road, the driver, or the vehicle condition is not good.
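A minimal sketch of the keyword-matching variant (plain Python) is given below; the triples and the scoring rule are illustrative assumptions, not the actual content of the knowledge graph in fig. 4.

```python
# Illustrative triples (subject, predicate, object) from a traffic accident
# common sense knowledge graph; objects of matched triples become candidate answers.
TRIPLES = [
    ("accident cause", "includes", "bad weather"),
    ("accident cause", "includes", "poor road conditions"),
    ("accident cause", "includes", "driver error"),
    ("accident cause", "includes", "poor vehicle condition"),
    ("rear-end accident", "accident type", "traffic accident"),
]

def candidate_answers(question_text):
    keywords = [w for w in question_text.lower().split() if len(w) > 3]
    # score each triple by how many question keywords its subject contains
    scored = [(sum(k in s for k in keywords), o) for s, _, o in TRIPLES]
    best = max(score for score, _ in scored)
    return [o for score, o in scored if score == best and best > 0]

print(candidate_answers("what is the cause of the accident"))
# -> ['bad weather', 'poor road conditions', 'driver error', 'poor vehicle condition']
```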
The embodiment of the present disclosure further provides a video question answering device, which includes a memory; and a processor coupled to the memory, the processor configured to perform the steps of the video question answering method according to any one of the embodiments of the present disclosure based on instructions stored in the memory.
As shown in fig. 5, in one example, the video question answering device may include: a processor 510, a memory 520 and a bus system 530, wherein the processor 510 and the memory 520 are connected via the bus system 530, the memory 520 is configured to store instructions, and the processor 510 is configured to execute the instructions stored in the memory 520 to extract video feature vectors for an input video and text feature vectors for a question text describing a question and a candidate answer text providing a plurality of candidate answers; splice the video feature vector and the text feature vector to obtain a spliced feature vector, input the spliced feature vector into a first pre-training model, and learn cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism to obtain a coded second spliced feature vector; divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, process the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and pool and fuse the video expression and the text expression respectively to obtain a fused feature vector; and input the fused feature vector to a decoding layer to predict the correct candidate answer.
It should be understood that the processor 510 may be a Central Processing Unit (CPU), and the processor 510 may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Memory 520 may include both read-only memory and random access memory and provides instructions and data to processor 510. A portion of memory 520 may also include non-volatile random access memory. For example, the memory 520 may also store device type information.
The bus system 530 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 530 in FIG. 5.
In implementation, the processing performed by the processing device may be performed by instructions in the form of hardware, integrated logic circuits, or software in the processor 510. That is, the method steps of the embodiments of the present disclosure may be implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor. The software module may be located in a storage medium such as a random access memory, a flash memory, a read only memory, a programmable read only memory or an electrically erasable programmable memory, a register, etc. The storage medium is located in the memory 520, and the processor 510 reads the information in the memory 520, and combines the hardware to complete the steps of the method. To avoid repetition, it is not described in detail here.
As shown in fig. 6, an embodiment of the present disclosure further provides a video question-answering system, which includes the video question-answering apparatus according to any embodiment of the present disclosure, and further includes a monitoring system, a voice recognition apparatus, a voice input apparatus, and a knowledge base, where:
the monitoring system is configured to acquire one or more monitoring videos, process the monitoring videos according to the instruction text and output the monitoring videos to the video question-answering device;
a voice input device configured to receive a voice input and output to the voice recognition device;
the voice recognition device is configured to convert voice input into instruction text or question text through voice recognition, input the instruction text into the monitoring system, and input the question text into the video question-answering device;
a knowledge base configured to store a knowledge graph;
the video question-answering device is configured to receive a question text and a monitoring video, and generate a candidate answer text according to the question text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; the video feature vector is extracted from the received surveillance video, the text feature vector is extracted according to the question text and the candidate answer text, the video feature vector and the text feature vector are spliced to obtain a spliced feature vector, the spliced feature vector is input into a first pre-training model, the first pre-training model learns cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism, and a coded second spliced feature vector is obtained; dividing the second splicing feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model by adopting a mutual attention mechanism to obtain a video expression and a text expression, and pooling and fusing the video expression and the text expression respectively to obtain a fusion feature vector; the fused feature vectors are input to the decoding layer to predict the correct candidate answer.
In some exemplary embodiments, the video question-answering system further comprises a speech synthesis output device, wherein:
and the voice synthesis output device is used for broadcasting the predicted correct candidate answer in a voice form through voice synthesis.
The video question-answering system of the embodiment of the disclosure is composed of the several modules shown in fig. 6: a voice instruction or question is given by the user through the voice input device, the natural language is converted into text through the voice recognition device, the answer fed back by the system is broadcast in voice form through the voice synthesis output device, and the video question-answering device, combining the monitoring system and the knowledge base, performs semantic understanding and multi-modal interactive reasoning on the question or instruction transmitted by the user and gives the corresponding answer.
The embodiments of the present disclosure further provide a storage medium, on which a computer program is stored, where the program, when executed by a processor, implements the video question answering method according to any one of the embodiments of the present disclosure.
In some possible embodiments, the aspects of the video question answering method provided in this application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps in the video question answering method according to various exemplary embodiments of this application described above in this specification when the program product runs on the computer device, for example, the computer device may perform the video question answering method described in this application.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to: an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The drawings in this disclosure relate only to the structures to which this disclosure relates and other structures may be referred to in the general design. Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
It will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the disclosed embodiments and it is intended to cover all modifications and equivalents included within the scope of the claims of the present disclosure.

Claims (14)

1. A video question answering method is characterized by comprising the following steps:
extracting video feature vectors for input videos, and extracting text feature vectors for question texts and candidate answer texts, wherein the question texts are used for describing questions, and the candidate answer texts are used for providing a plurality of candidate answers; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, and learning cross-modal information between the video feature vector and the text feature vector by the first pre-training model through a self-attention mechanism to obtain a coded second spliced feature vector;
dividing the second spliced feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, processing the second video feature vector and the second text feature vector by the modal fusion model through a mutual attention mechanism to obtain a video expression and a text expression, and pooling and fusing the video expression and the text expression respectively to obtain a fused feature vector;
the fused feature vectors are input to a decoding layer to predict correct candidate answers.
2. The video question-answering method according to claim 1, wherein the extracting video feature vectors for the input video comprises:
and extracting frames of the input video at a preset speed, and extracting video characteristic vectors of the extracted frames by adopting a second pre-training model.
3. The video question-answering method according to claim 1, wherein the extracting text feature vectors for the question text and the candidate answer text comprises:
generating a sequence string according to the question text and the candidate answer text, wherein the sequence string comprises a plurality of sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;
and inputting the sequence string into the first pre-training model to obtain a text feature vector.
4. The video question-answering method according to claim 1, characterized by further comprising before the method:
constructing the first pre-training model and initializing;
pre-training the first pre-training model through a plurality of self-supervision tasks, wherein the plurality of self-supervision tasks comprise a label classification task, a mask language model task and a mask frame model task, the label classification task is used for performing multi-label classification on videos, the mask language model task is used for randomly masking words in the text and predicting the masked words, and the mask frame model task is used for randomly masking video frames and predicting the masked frames;
calculating a loss of the first pre-trained model by a weighted sum of losses of a plurality of the self-supervised tasks.
5. The video question-answering method according to claim 4, wherein the loss of the label classification task and the mask language model task is calculated based on binary cross entropy, and the loss of the mask frame model task is calculated based on noise contrastive estimation.
6. The video question answering method according to claim 1, characterized in that the first pre-training model is a cascade neural network of 24 deep Transformer encoder layers, the hidden layer dimension is 1024, the number of attention heads is 16, and the first pre-training model is initialized with parameters pre-trained by the Bidirectional Encoder Representations from Transformers (BERT).
7. The video question-answering method according to claim 1, wherein the processing the second video feature vector and the second text feature vector through a mutual attention mechanism comprises:
taking the second video feature vector as a query vector, taking the second text feature vector as a key vector and a value vector, and performing multi-head attention;
and taking the second text feature vector as a query vector, taking the second video feature vector as a key vector and a value vector, and performing multi-head attention.
8. The video question-answering method according to claim 1, characterized in that the method further comprises, before:
receiving a voice input of a user;
converting the speech input into the question text by speech recognition.
9. The video question-answering method according to claim 1, characterized in that the method further comprises, before:
acquiring the question text;
and generating the candidate answer text corresponding to the question text according to the question text.
10. The video question-answering method according to claim 9, wherein the generating of the candidate answer text corresponding to the question text from the question text comprises:
searching the triples matched with the question text from the common sense knowledge graph through keyword matching or an attention mechanism model;
and generating the candidate answer text corresponding to the question text according to the matched triple.
11. The video question-answering method according to claim 1, characterized in that the method further comprises:
processing the video feature vector and/or the text feature vector such that a dimension of the video feature vector and a dimension of the text feature vector are the same when the video feature vector and the text feature vector are stitched.
12. A video question answering device, comprising a memory; and a processor coupled to the memory, the processor configured to perform the steps of the video question answering method according to any one of claims 1 to 11, based on instructions stored in the memory.
13. A storage medium, characterized in that a computer program is stored thereon, which when executed by a processor implements the video question-answering method according to any one of claims 1 to 11.
14. A video question-answering system is characterized by comprising a video question-answering device, a monitoring system, a voice recognition device, a voice input device and a knowledge base, wherein:
the monitoring system is configured to acquire one or more monitoring videos, process the monitoring videos according to an instruction text and output the monitoring videos to the video question-answering device;
the voice input device is configured to receive voice input and output the voice input to the voice recognition device;
the voice recognition device is configured to convert voice input into instruction text or question text through voice recognition, input the instruction text into the monitoring system, and input the question text into the video question-answering device;
the knowledge base is configured to store a common sense knowledge graph;
the video question-answering device is configured to receive a question text and a monitoring video, and generate a candidate answer text according to the question text, wherein the question text is used for describing a question, and the candidate answer text is used for providing a plurality of candidate answers; the video question-answering device extracts video feature vectors from the received monitoring video, extracts text feature vectors from the question text and the candidate answer text, splices the video feature vectors and the text feature vectors to obtain a spliced feature vector, inputs the spliced feature vector into a first pre-training model, and the first pre-training model learns cross-modal information between the video feature vectors and the text feature vectors through a self-attention mechanism to obtain a coded second spliced feature vector; the second spliced feature vector is divided into a second video feature vector and a second text feature vector; the second video feature vector and the second text feature vector are input into a modal fusion model, the modal fusion model adopts a mutual attention mechanism to process the second video feature vector and the second text feature vector to obtain a video expression and a text expression, and the video expression and the text expression are respectively pooled and fused to obtain a fused feature vector; the fused feature vector is input to a decoding layer to predict the correct candidate answer.
CN202211043431.7A 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium Pending CN115391511A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211043431.7A CN115391511A (en) 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium
PCT/CN2023/111455 WO2024046038A1 (en) 2022-08-29 2023-08-07 Video question-answer method, device and system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211043431.7A CN115391511A (en) 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN115391511A true CN115391511A (en) 2022-11-25

Family

ID=84122684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211043431.7A Pending CN115391511A (en) 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium

Country Status (2)

Country Link
CN (1) CN115391511A (en)
WO (1) WO2024046038A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257611A (en) * 2023-01-13 2023-06-13 北京百度网讯科技有限公司 Question-answering model training method, question-answering processing device and storage medium
CN116883886A (en) * 2023-05-25 2023-10-13 中国科学院信息工程研究所 Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
WO2024046038A1 (en) * 2022-08-29 2024-03-07 京东方科技集团股份有限公司 Video question-answer method, device and system, and storage medium
CN117972138A (en) * 2024-04-02 2024-05-03 腾讯科技(深圳)有限公司 Training method and device for pre-training model and computer equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11769018B2 (en) * 2020-11-24 2023-09-26 Openstream Inc. System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system
CN113807222B (en) * 2021-09-07 2023-06-27 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling
CN114925703A (en) * 2022-06-14 2022-08-19 齐鲁工业大学 Visual question-answering method and system with multi-granularity text representation and image-text fusion
CN115391511A (en) * 2022-08-29 2022-11-25 京东方科技集团股份有限公司 Video question-answering method, device, system and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024046038A1 (en) * 2022-08-29 2024-03-07 京东方科技集团股份有限公司 Video question-answer method, device and system, and storage medium
CN116257611A (en) * 2023-01-13 2023-06-13 北京百度网讯科技有限公司 Question-answering model training method, question-answering processing device and storage medium
CN116257611B (en) * 2023-01-13 2023-11-10 北京百度网讯科技有限公司 Question-answering model training method, question-answering processing device and storage medium
CN116883886A (en) * 2023-05-25 2023-10-13 中国科学院信息工程研究所 Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
CN116883886B (en) * 2023-05-25 2024-05-28 中国科学院信息工程研究所 Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
CN117972138A (en) * 2024-04-02 2024-05-03 腾讯科技(深圳)有限公司 Training method and device for pre-training model and computer equipment

Also Published As

Publication number Publication date
WO2024046038A1 (en) 2024-03-07

Similar Documents

Publication Publication Date Title
Zhang et al. Multimodal intelligence: Representation learning, information fusion, and applications
CN109711463B (en) Attention-based important object detection method
CN115391511A (en) Video question-answering method, device, system and storage medium
EP3399460B1 (en) Captioning a region of an image
Zeng et al. Combining background subtraction algorithms with convolutional neural network
CN111930992A (en) Neural network training method and device and electronic equipment
CN114339450B (en) Video comment generation method, system, device and storage medium
CN113128431B (en) Video clip retrieval method, device, medium and electronic equipment
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
CN113792177A (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN114495129A (en) Character detection model pre-training method and device
CN113392265A (en) Multimedia processing method, device and equipment
CN115223020A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
CN116543351A (en) Self-supervision group behavior identification method based on space-time serial-parallel relation coding
CN117609550B (en) Video title generation method and training method of video title generation model
Xue et al. LCSNet: End-to-end lipreading with channel-aware feature selection
CN115292439A (en) Data processing method and related equipment
CN117056474A (en) Session response method and device, electronic equipment and storage medium
CN116975347A (en) Image generation model training method and related device
CN116704433A (en) Self-supervision group behavior recognition method based on context-aware relationship predictive coding
CN116977701A (en) Video classification model training method, video classification method and device
Zhang et al. Vsam-based visual keyword generation for image caption
Shi et al. AdaFI-FCN: an adaptive feature integration fully convolutional network for predicting driver’s visual attention
Kong et al. Encoder–decoder with double spatial pyramid for semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination