WO2024046038A1 - Video question-answer method, device and system, and storage medium - Google Patents

Video question-answer method, device and system, and storage medium Download PDF

Info

Publication number
WO2024046038A1
WO2024046038A1 (PCT/CN2023/111455)
Authority
WO
WIPO (PCT)
Prior art keywords
text
video
feature vector
question
answer
Prior art date
Application number
PCT/CN2023/111455
Other languages
French (fr)
Chinese (zh)
Inventor
王炳乾
Original Assignee
BOE Technology Group Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co., Ltd.
Publication of WO2024046038A1 publication Critical patent/WO2024046038A1/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Definitions

  • Embodiments of the present disclosure relate to, but are not limited to, the technical field of natural language processing, and in particular, to a video question and answer method, device, system and storage medium.
  • Video question answering is the task of inferring the correct answer from a candidate set given a video clip and question.
  • Video question answering is widely used in video retrieval, intelligent question answering systems, assisted driving systems and autonomous driving, and its broad range of applications is attracting increasing attention.
  • the embodiment of the present disclosure provides a video question and answer method, including:
  • extracting a video feature vector from the input video, and extracting a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splicing the video feature vector with the text feature vector to obtain a spliced feature vector, and inputting the spliced feature vector into the first pre-training model;
  • the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector;
  • dividing the second spliced feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector;
  • inputting the fused feature vector into the decoding layer to predict the correct candidate answer.
  • Embodiments of the present disclosure also provide a video question and answer device, including a memory and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the steps of the video question and answer method described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure also provides a storage medium on which a computer program is stored.
  • when the program is executed by a processor, the video question and answer method as described in any embodiment of the present disclosure is implemented.
  • Embodiments of the present disclosure also provide a video question and answer system, including a video question and answer device, a monitoring system, a speech recognition device, a speech input device and a knowledge base, wherein:
  • the monitoring system is configured to acquire one or more surveillance videos, process the surveillance videos according to the instruction text, and output the surveillance videos to the video question and answer device;
  • the voice input device is configured to receive voice input and output it to the voice recognition device;
  • the speech recognition device is configured to convert speech input into instruction text or question text through speech recognition, input the instruction text into the monitoring system, and input the question text into the video question and answer device;
  • the knowledge base is configured to store a common sense knowledge graph
  • the video question and answer device is configured to receive the question text and the surveillance video, and to generate candidate answer text according to the question text, wherein the question text is used to describe a question and the candidate answer text is used to provide multiple candidate answers; it is also configured to extract a video feature vector from the received surveillance video, extract a text feature vector from the question text and the candidate answer text, splice the video feature vector and the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model;
  • the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector; the second spliced feature vector is divided into a second video feature vector and a second text feature vector, which are input into the modal fusion model;
  • the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and the video expression and the text expression are pooled and fused respectively to obtain a fused feature vector; the fused feature vector is input into the decoding layer to predict the correct candidate answer.
  • Figure 1 is a schematic flow chart of a video question and answer method according to an exemplary embodiment of the present disclosure
  • Figure 2 is a schematic structural diagram of the visual language cross-modal model created by an embodiment of the present disclosure
  • Figure 3 is a schematic structural diagram of a modal fusion model created according to an exemplary embodiment of the present disclosure
  • Figure 4 is a schematic structural diagram of a traffic accident common sense knowledge graph according to an exemplary embodiment of the present disclosure
  • Figure 5 is a schematic structural diagram of a video question and answer device according to an exemplary embodiment of the present disclosure
  • Figure 6 is a schematic structural diagram of a video question and answer system according to an exemplary embodiment of the present disclosure.
  • CV: computer vision
  • NLP: natural language processing
  • Some visual question answering techniques encode the input image through an image encoder, encode the input question through a question encoder, perform dot product merging of the encoded image and question features, and then predict the probability of candidate answers through a fully connected layer.
  • However, these video question and answer technologies only consider single-modality information and lack semantic understanding capability. Real-world problems often involve multiple modalities. For example, a self-driving car should be able to handle human commands (language), traffic signals (vision), and road conditions (vision and sound).
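For illustration, the following is a minimal sketch of the single-modality baseline described above (image encoder, question encoder, dot-product merging, fully connected classifier), assuming PyTorch; the encoder modules, names and dimensions are placeholders, not the implementation of any particular prior-art system.

```python
import torch
import torch.nn as nn

class BaselineVQA(nn.Module):
    """Dot-product fusion baseline: encode image and question separately,
    merge the two feature vectors element-wise, then classify over candidate answers."""
    def __init__(self, image_encoder, question_encoder, hidden_dim, num_answers):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a CNN returning [B, hidden_dim]
        self.question_encoder = question_encoder  # e.g. an RNN/Transformer returning [B, hidden_dim]
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question):
        img_feat = self.image_encoder(image)      # [B, hidden_dim]
        q_feat = self.question_encoder(question)  # [B, hidden_dim]
        fused = img_feat * q_feat                 # element-wise "dot-product" merging
        return self.classifier(fused).softmax(dim=-1)  # probability of each candidate answer
```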
  • an embodiment of the present disclosure provides a video question and answer method, which includes the following steps:
  • Step 101: Extract a video feature vector from the input video, and extract a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splice the video feature vector with the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model; the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector;
  • Step 102: Divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into the modal fusion model; the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector;
  • Step 103 Input the fused feature vector into the decoding layer to predict the correct candidate answer.
  • In the video question and answer method of the embodiments of the present disclosure, the video feature vector and the text feature vector are spliced and then input into the first pre-training model; visual-language cross-modal information is used to improve the semantic representation ability of video question answering, and the mutual attention mechanism is used to perform deeper semantic interaction between the video and the text option pairs, which is expected to yield a deeper semantic understanding of the video content and thereby improve the reasoning performance of the model.
  • the video question and answer method in the embodiment of the present disclosure can be applied to many application fields such as video retrieval, intelligent question and answer system, assisted driving system and automatic driving.
  • In some exemplary embodiments, before step 101, the method further includes: constructing the first pre-training model and initializing it;
  • the first pre-training model is pre-trained through multiple self-supervised tasks.
  • the multiple self-supervised tasks include a label classification task, a masked language model task and a masked frame model task.
  • the label classification task is used to perform multi-label classification on the video;
  • the masked language model task is used to randomly mask text and predict the masked words;
  • the masked frame model task is used to randomly mask video frames and predict the masked frames;
  • the loss of the first pre-trained model is calculated by the weighted sum of the losses of the self-supervised tasks.
  • Figure 2 is a schematic structural diagram of the visual language cross-modal model created by an embodiment of the present disclosure.
  • the visual language cross-modal model includes a cascaded first pre-training model, a modal fusion model and a decoding layer.
  • the first pre-trained model may be a 24-layer deep Transformer encoder cascade neural network, with a hidden layer dimension of 1024 and an attention head number of 16.
  • the first pre-training model is initialized with parameters pre-trained by BERT.
  • the Transformer model consists of an encoder and a decoder.
  • the encoder and decoder each include multiple network blocks.
  • Each encoder network block consists of a multi-head attention (Attention) sub-layer and a feed-forward neural network sub-layer.
  • the structure of the decoder is similar to that of the encoder, except that there is an additional multi-head attention layer in each network block of the decoder.
  • Bidirectional Encoder Representations from Transformers (BERT) is a successful application of the Transformer: it uses the Transformer encoder and introduces a bidirectional masking technique that allows each language token to attend bidirectionally to other tokens.
  • the first pre-training model in the embodiment of the present disclosure adopts a multi-layer Transformer encoder cascade structure.
  • the first pre-trained model can be a 24-layer deep Transformer encoder cascade neural network, the hidden layer dimension is 1024, and the number of attention heads is 16.
  • the network structure of the first pre-training model is consistent with BERT Large (a natural language pre-training model proposed by Google), so the pre-training results of natural language pre-training models such as BERT can be used directly to initialize the first pre-training model of the disclosed embodiments.
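As an illustration, the following is a minimal sketch of constructing such an encoder and initializing it from BERT Large, assuming the HuggingFace transformers library (the original filing names the architecture and BERT Large, but not a specific implementation; the English checkpoint below is an assumption).

```python
from transformers import BertConfig, BertModel

# The text states the first pre-training model is a 24-layer Transformer encoder
# with hidden size 1024 and 16 attention heads, matching BERT Large.
config = BertConfig(hidden_size=1024, num_hidden_layers=24,
                    num_attention_heads=16, intermediate_size=4096)
encoder = BertModel(config)

# Alternatively, initialize directly from BERT Large pre-trained weights
# (an English checkpoint is shown here; the original work may use a different one).
encoder = BertModel.from_pretrained("bert-large-uncased")
```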
  • extracting video feature vectors for input videos includes:
  • Frames are extracted from the input video at a preset speed, and a second pre-trained model is used to extract video feature vectors from the extracted frames.
  • the input of video question and answer is a collection of video clips, questions and candidate answers.
  • the continuous video stream can be understood as a set of rapidly played pictures, where each picture is defined as a frame.
  • For the video information, frames are extracted from the input video at a preset speed (for example, 1 frame per second (fps)), with at most N frames extracted per video (for example, N can be 30), and the pre-trained computer vision (CV) model EfficientNetB3 is used to extract the video feature vectors (visual features).
  • Optionally, embodiments of the present disclosure may also use pre-trained models such as MobileNet or ResNet101 to extract video feature vectors from the input video.
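A minimal sketch of this frame sampling and feature extraction step follows, assuming OpenCV and the timm library (not named in the original filing); preprocessing is deliberately simplified (no ImageNet normalization), so this is an illustration rather than a faithful reproduction.

```python
import cv2
import torch
import timm

def extract_video_features(video_path, fps=1, max_frames=30):
    """Sample frames at roughly `fps` frames per second (up to max_frames)
    and encode each frame with EfficientNetB3 (1536-d pooled features)."""
    model = timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)
    model.eval()
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(round(native_fps / fps)))
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.cvtColor(cv2.resize(frame, (300, 300)), cv2.COLOR_BGR2RGB)
            frames.append(torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0)
        idx += 1
    cap.release()
    with torch.no_grad():
        feats = model(torch.stack(frames))  # [n_frames, 1536]
    return feats
```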
  • extracting text feature vectors for question text and candidate answer text includes:
  • generating a sequence string from the question text and the candidate answer text, where the sequence string includes multiple sequences and each word or character in the question text and the candidate answer text corresponds to one or more sequences; inputting the sequence string into the first pre-training model to obtain the text feature vector.
  • In the embodiments of the present disclosure, the question text and the candidate answer text are first tokenized with a vocabulary of size n_vocab, converting each word or character into one or more sequences; all words and/or characters in the question text and the candidate answer text form a sequence string, where each sequence can be a numerical ID, for example between 0 and n_vocab-1. For example, "What type of road is it currently: a trail, a highway, a road, or a street?" is converted into a sequence string such as "2560 2013 1300 100 567 ...".
  • The sequence string is input into the first pre-training model (here, the initialized first pre-training model) to obtain a text feature vector of size n_word*1024, where n_word is the number of sequences in the sequence string and 1024 is the dimension of each text feature vector.
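A toy illustration of the tokenization step follows; the vocabulary entries and ID values merely echo the example sequence above and are not the actual vocabulary of the filing.

```python
# Hypothetical vocabulary of size n_vocab mapping words/characters to IDs in [0, n_vocab-1].
vocab = {"[PAD]": 0, "[UNK]": 1, "current": 2560, "road": 2013,
         "type": 1300, "trail": 100, "highway": 567}

def to_sequence_string(tokens):
    """Convert question + candidate answer tokens into a list of numeric IDs (the sequence string)."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

print(to_sequence_string(["current", "road", "type", "trail", "highway"]))
# -> [2560, 2013, 1300, 100, 567]
```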
  • In the embodiments of the present disclosure, when the video feature vector and the text feature vector are spliced and input into the first pre-training model, the input form of the spliced feature vector can be [CLS]Video_frame[SEP]Question-Answer[SEP], where Video_frame is the video feature vector, Question-Answer is the text feature vector, the [CLS] token marks the start of the spliced feature vector, and the [SEP] token is used to separate the video feature vector from the text feature vector; the same position embedding vectors and segment embedding vectors as in BERT are added to the input of the first pre-training model, where the position embedding vector specifies the position in the sequence and the segment embedding vector specifies the frame segment to which the input belongs.
  • In some exemplary embodiments, before splicing the video feature vector and the text feature vector, the method further includes: processing the video feature vector and/or the text feature vector so that, when the video feature vector and the text feature vector are spliced, their dimensions are the same.
  • In the embodiments of the present disclosure, when the video feature vector extracted by the pre-trained EfficientNetB3 model is a 1536-dimensional feature vector, that 1536-dimensional vector is passed through a fully connected layer to reduce its dimension to 1024, the same dimension as the text feature vector.
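A minimal sketch combining the dimension reduction and the [CLS]/[SEP] splicing with BERT-style position and segment embeddings, assuming PyTorch; module and parameter names are illustrative, not taken from the filing.

```python
import torch
import torch.nn as nn

class CrossModalInput(nn.Module):
    """Project 1536-d frame features to 1024-d and build the
    [CLS] Video_frame [SEP] Question-Answer [SEP] input sequence."""
    def __init__(self, hidden=1024, max_pos=512):
        super().__init__()
        self.video_proj = nn.Linear(1536, hidden)        # dimension reduction 1536 -> 1024
        self.pos_emb = nn.Embedding(max_pos, hidden)     # position embedding
        self.seg_emb = nn.Embedding(2, hidden)           # segment embedding: 0 = video, 1 = text
        self.cls = nn.Parameter(torch.randn(1, hidden))  # [CLS] token embedding
        self.sep = nn.Parameter(torch.randn(1, hidden))  # [SEP] token embedding

    def forward(self, video_feats, text_feats):
        # video_feats: [n_frames, 1536], text_feats: [n_word, 1024]
        v = self.video_proj(video_feats)
        tokens = torch.cat([self.cls, v, self.sep, text_feats, self.sep], dim=0)
        seg = torch.tensor([0] * (v.size(0) + 2) + [1] * (text_feats.size(0) + 1))
        pos = torch.arange(tokens.size(0))
        return tokens + self.pos_emb(pos) + self.seg_emb(seg)
```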
  • the training of the visual language cross-modal model includes two stages: a pre-training stage for the first pre-training model and a fine-tuning stage for the visual language cross-modal model.
  • the first pre-trained model is pre-trained.
  • the label classification task is used to classify videos with multiple labels;
  • the masked language model task is used to randomly mask text and predict the masked words;
  • the masked frame model task is used to randomly mask video frames and predict the masked frames.
  • In the embodiments of the present disclosure, the 100 most frequently occurring tags can be used for the multi-label classification task, where the tags are manually annotated video tags, for example whether an accident occurred, the cause of the accident, the location of the accident, the vehicle type, weather conditions, time information, etc. The vector corresponding to [CLS] in the last BERT layer is passed through a fully connected layer to obtain the predicted tag labels, and the binary cross entropy loss (BCE loss) is calculated against the ground-truth labels.
  • BCE loss: Binary Cross Entropy loss
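A minimal sketch of the tag classification head and its BCE loss, assuming PyTorch; the head and variable names are illustrative.

```python
import torch
import torch.nn as nn

num_tags = 100                       # top-100 most frequent manually annotated tags
tag_head = nn.Linear(1024, num_tags)
bce = nn.BCEWithLogitsLoss()

def tag_classification_loss(last_hidden_state, tag_labels):
    """last_hidden_state: [B, seq_len, 1024] output of the last encoder layer;
    tag_labels: [B, num_tags] multi-hot ground-truth tags.
    The [CLS] vector (position 0) is fed into a fully connected layer and
    compared with the real labels using binary cross entropy."""
    cls_vec = last_hidden_state[:, 0]   # [CLS] vector of the last layer
    logits = tag_head(cls_vec)
    return bce(logits, tag_labels.float())
```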
  • The embodiments of the present disclosure calculate the loss of the masked frame model task based on Noise Contrastive Estimation (NCE), so as to maximize the mutual information between the masked frame and the predicted frame.
  • NCE: Noise Contrastive Estimation
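The exact NCE formulation is not reproduced in the text; the sketch below assumes an InfoNCE-style contrastive loss between predicted and original frame vectors, together with the weighted sum of the three self-supervised task losses mentioned earlier (the weights are unspecified hyperparameters).

```python
import torch
import torch.nn.functional as F

def masked_frame_nce_loss(predicted, original, temperature=0.07):
    """InfoNCE-style contrastive loss: each predicted (masked-position) frame vector
    should be most similar to its own original frame among all frames in the batch,
    which maximizes the mutual information between masked and predicted frames."""
    pred = F.normalize(predicted, dim=-1)   # [n_masked, d]
    orig = F.normalize(original, dim=-1)    # [n_masked, d]
    logits = pred @ orig.t() / temperature  # pairwise similarities
    targets = torch.arange(pred.size(0))
    return F.cross_entropy(logits, targets)

def pretraining_loss(loss_tag, loss_mlm, loss_mfm, w1=1.0, w2=1.0, w3=1.0):
    """Pre-training loss as a weighted sum of the three self-supervised task losses."""
    return w1 * loss_tag + w2 * loss_mlm + w3 * loss_mfm
```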
  • The video question and answer method of the embodiments of the present disclosure uses the first pre-training model to jointly represent the information of the two modalities, video and text: the video feature vector corresponding to the video, the text feature vector corresponding to the question text and the text feature vector corresponding to the candidate answer text are spliced and input into the first pre-training model, which outputs the encoded second spliced feature vector.
  • the second splicing feature vector can be divided into a second video feature vector and a second text feature vector.
  • Let V = [v1, v2, ..., vm], Q = [q1, q2, ..., qn] and A = [a1, a2, ..., ak] denote the video, the question and the answer respectively, where vi, qi and ai are the corresponding sequences, and let VLP(·) denote the VLP model (i.e. the first pre-training model); then the encoded representation E obtained by inputting [CLS]V[SEP]Q+A[SEP] into the VLP model can be expressed as E = VLP([CLS], V, [SEP], Q+A, [SEP]).
  • The second video feature vector and the second text feature vector can be obtained by dividing the encoded representation into two parts, the video frame part and the question-option part.
  • the pre-trained first pre-training model is connected to the modal fusion model and the decoding layer to fine-tune the overall visual language cross-modal model.
  • the processing process of the first pre-trained model is the same as the aforementioned process and will not be described again here.
  • The second video feature vector and the second text feature vector are input into the modal fusion model; the modal fusion model processes them through a mutual attention mechanism to obtain a video expression and a text expression, then pools and fuses the video expression and the text expression respectively to obtain the fused feature vector.
  • Figure 3 is a schematic structural diagram of a modal fusion model created according to an embodiment of the present disclosure.
  • the query vector (Query) comes from one modality
  • the key vector (Key) and value vector (Value) come from another modality.
  • the video question and answer method in the embodiment of the present disclosure further semantically aligns and integrates language and images by introducing a mutual attention mechanism.
  • the second video feature vector and the second text feature vector are processed through a mutual attention mechanism, including:
  • the second text feature vector is used as a query vector
  • the second video feature vector is used as a key vector and a value vector to perform multi-head attention.
  • the modal fusion model uses the second video feature vector as Query and the second text feature vector as Key and Value to perform multi-head attention.
  • at the same time, the second text feature vector is used as Query and the second video feature vector as Key and Value to perform multi-head attention; this pair of exchanged attention operations forms one layer, and multiple such layers can be stacked.
  • The video expression representing the video frame sequence and the text expression representing the question-option sequence are then each pooled, and the two pooled vectors are fused to obtain the fused feature vector, which is fed into the decoding layer to predict the probability that each candidate is the correct answer.
  • MHA(·) is the multi-head attention mechanism;
  • DUMA(·) is the bidirectional multi-head co-attention network;
  • Fuse(·) represents the average pooling and fusion of the results output by DUMA.
  • the fusion can be performed by splicing.
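A minimal sketch of one bidirectional co-attention (DUMA-style) layer with average pooling and fusion by splicing, assuming PyTorch's nn.MultiheadAttention; class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class DUMALayer(nn.Module):
    """One layer of bidirectional multi-head co-attention: text attends to video
    (text as Query, video as Key/Value) and video attends to text (video as Query,
    text as Key/Value); both outputs are average-pooled and fused by concatenation."""
    def __init__(self, hidden=1024, heads=16):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, video_enc, text_enc):
        # video_enc: [B, n_frames, hidden], text_enc: [B, n_word, hidden]
        text_expr, _ = self.text_to_video(query=text_enc, key=video_enc, value=video_enc)
        video_expr, _ = self.video_to_text(query=video_enc, key=text_enc, value=text_enc)
        # average pooling over the sequence dimension, then fusion by splicing
        fused = torch.cat([video_expr.mean(dim=1), text_expr.mean(dim=1)], dim=-1)
        return fused  # [B, 2 * hidden]
```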
  • the loss function of the network is:
  • W^T is the parameter matrix;
  • s is the number of candidate answers.
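The loss formula itself did not survive extraction. A plausible reconstruction, assuming a standard multiple-choice softmax cross-entropy in which the parameter matrix W scores the fused feature vector F_i of each of the s candidate answers (the exact form in the original filing may differ), is:

```latex
P(a_i \mid V, Q) = \frac{\exp\!\left(W^{T} F_i\right)}{\sum_{j=1}^{s} \exp\!\left(W^{T} F_j\right)},
\qquad
\mathcal{L} = -\log P\!\left(a_{y} \mid V, Q\right)
```

where F_i is the fused feature vector of candidate i and y indexes the correct candidate answer.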
  • In some exemplary embodiments, the method further includes:
  • Speech recognition converts voice input into voice commands and/or question text.
  • the voice command can be "Help me switch to the surveillance video of XXXX Road” and so on.
  • the system performs corresponding response operations according to the received voice command, for example, switching the display screen to the surveillance video of XXXX Road.
  • the question text can be "Is there an accident on the current road section?”, "What is the cause of the accident?", etc.
  • The system generates candidate answer texts based on the received question text, and then predicts the correct candidate answer using the video question and answer method according to the embodiments of the present disclosure.
  • the method further includes:
  • the predicted correct candidate answers are broadcast in voice form.
  • the predicted correct candidate answer can be directly displayed on the display screen, or the predicted correct candidate answer can be broadcast in voice form.
  • In some exemplary embodiments, the method further includes: generating candidate answer text corresponding to the question text.
  • In some exemplary embodiments, generating the candidate answer text corresponding to the question text includes: querying a knowledge graph according to the question text, and generating the candidate answer text corresponding to the question text from the query results.
  • Knowledge graphs can use appropriate knowledge representation methods through the relationship layer to mine connections between data, making it easier for knowledge to circulate and be collaboratively processed between computers.
  • A knowledge graph uses a graph data structure to represent the relationships between entities; compared with plain text, the graph representation is easier to understand and accept.
  • the knowledge graph is composed of the relationship edges of "entity-relationship-entity" or "entity-attribute-attribute value”, focusing on expressing the relationship between entities or between entities and attributes.
  • Relational search based on knowledge graphs is widely used in entity matching and question answering systems in various industries. It can process and store known knowledge, and can quickly perform knowledge matching and answer search.
  • the basic composition of the knowledge graph is a triplet (S, P, O).
  • S and O are nodes in the knowledge graph, representing entities.
  • P is the edge connecting two entities (S and O) in the knowledge graph, indicating the relationship between the two entities.
  • a triplet can be (rear-end accident, accident type, traffic accident), etc.
  • the video question and answer method of the embodiment of the present disclosure can be used in a video surveillance system.
  • the large screen can be scheduled and interacted with through voice-controlled question answering, and the monitoring feed of an accident can be switched to the main screen for accident analysis.
  • the corresponding interactive instructions can be "Help me switch to the surveillance video of XXXX road”; "Is there an accident on the current road section?”; "What is the cause of the accident?".
  • The specific implementation process will involve traffic-accident-related common sense knowledge, which can be provided by introducing a knowledge base.
  • a traffic accident common sense knowledge graph as shown in Figure 4 can be constructed in the knowledge base, and questions can be queried from the traffic accident common sense knowledge graph through keyword matching or attention mechanisms.
  • the triplets matched against the text are used as the common sense knowledge (i.e. candidate answers) corresponding to the semantic information expressed by the question.
  • In this way, candidate answers are automatically generated: weather, road, driver, poor vehicle condition.
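A toy illustration of querying such a common sense knowledge graph by keyword matching to produce candidate answers; the triplets and keywords below are hypothetical and much smaller than the graph of Figure 4.

```python
# Hypothetical miniature common sense knowledge graph stored as (S, P, O) triplets.
triples = [
    ("rear-end accident", "accident type", "traffic accident"),
    ("traffic accident", "possible cause", "weather"),
    ("traffic accident", "possible cause", "road"),
    ("traffic accident", "possible cause", "driver"),
    ("traffic accident", "possible cause", "poor vehicle condition"),
]

def candidate_answers(question, keywords=("accident", "cause")):
    """Keyword-match the question against the graph and return the matched
    objects as candidate answer texts."""
    if not all(k in question.lower() for k in keywords):
        return []
    return [o for s, p, o in triples if p == "possible cause"]

print(candidate_answers("What is the cause of the accident?"))
# -> ['weather', 'road', 'driver', 'poor vehicle condition']
```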
  • An embodiment of the present disclosure also provides a video question and answer device, including a memory and a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the steps of the video question and answer method according to any embodiment of the present disclosure.
  • the video question and answer device may include: a processor 510, a memory 520, and a bus system 530.
  • the processor 510 and the memory 520 are connected through the bus system 530, and the memory 520 is used to store instructions.
  • the processor 510 is configured to execute instructions stored in the memory 520 to: extract a video feature vector from the input video, and extract a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splice the video feature vector and the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model, where the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector; divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector; and input the fused feature vector into the decoding layer to predict the correct candidate answer.
  • the processor 510 can be a central processing unit (Central Processing Unit, CPU).
  • the processor 510 can also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • Memory 520 may include read-only memory and random access memory and provides instructions and data to processor 510 .
  • a portion of memory 520 may also include non-volatile random access memory.
  • memory 520 may also store device type information.
  • bus system 530 may also include a power bus, a control bus, a status signal bus, etc.
  • the various buses are labeled as bus system 530 in FIG. 5 .
  • the processing performed by the processing device may be completed by hardware integrated logic circuits or by instructions in the form of software in the processor 510; that is, the method steps of the embodiments of the present disclosure can be embodied as being executed and completed by a hardware processor, or by a combination of hardware and software modules in the processor.
  • Software modules can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media.
  • the storage medium is located in the memory 520.
  • the processor 510 reads the information in the memory 520 and completes the steps of the above method in combination with its hardware. To avoid repetition, it will not be described in detail here.
  • an embodiment of the present disclosure also provides a video question and answer system, which includes the video question and answer device as described in any embodiment of the present disclosure, and also includes a monitoring system, a speech recognition device, a voice input device and a knowledge base, wherein:
  • a surveillance system configured to acquire one or more surveillance videos, process the surveillance videos according to the instruction text, and output the surveillance videos to the video question and answer device;
  • a voice input device configured to receive voice input and output it to the voice recognition device
  • a speech recognition device configured to convert the speech input into instruction text or question text through speech recognition, input the instruction text into the monitoring system, and input the question text into the video question and answer device;
  • Knowledge base configured to store knowledge graphs
  • a video question and answer device configured to receive question text and surveillance video and to generate candidate answer text based on the question text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; it is also configured to extract a video feature vector from the received surveillance video, extract a text feature vector from the question text and the candidate answer text, splice the video feature vector and the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model; the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector; the second spliced feature vector is divided into a second video feature vector and a second text feature vector, which are input into the modal fusion model; the modal fusion model uses a mutual attention mechanism to process the second video feature vector and the second text feature vector to obtain a video expression and a text expression, pools and fuses the video expression and the text expression respectively to obtain a fused feature vector, and inputs the fused feature vector into the decoding layer to predict the correct candidate answer.
  • the video question and answer system further includes a speech synthesis output device, wherein:
  • a speech synthesis output device is used to broadcast the predicted correct candidate answer in speech form through speech synthesis.
  • the video question and answer system of the embodiments of the present disclosure consists of the modules shown in Figure 6: voice instructions or questions are given by the user through the voice input device, the voice recognition device converts the natural language into text, the speech synthesis output device broadcasts the answers fed back by the system in voice form, and the video question and answer device, in combination with the monitoring system and the knowledge base, performs semantic understanding and multi-modal interactive reasoning to give corresponding answers based on the questions or instructions passed in by the user.
  • An embodiment of the present disclosure also provides a storage medium on which a computer program is stored.
  • when the program is executed by a processor, the video question and answer method as described in any embodiment of the present disclosure is implemented.
  • various aspects of the video question and answer method provided by this application can also be implemented in the form of a program product, which includes program code.
  • when the program product is run on a computer device, the program code is used to cause the computer device to execute the steps in the video question and answer method according to the various exemplary embodiments of the present application described above in this specification.
  • the computer device can execute the video question and answer method recorded in the embodiment of the present application.
  • the program product may take the form of any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to: an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video question-answer method, device and system, and a storage medium. The method comprises: extracting a video feature vector for an inputted video, and extracting a text feature vector for a question text and a candidate answer text; splicing the video feature vector and the text feature vector to obtain a spliced feature vector, inputting the spliced feature vector into a first pre-training model, and the first pre-training model learning cross-modal information between the video feature vector and the text feature vector by means of a self-attention mechanism to obtain a second spliced feature vector; dividing the second spliced feature vector into a second video feature vector and a second text feature vector, inputting the second video feature vector and the second text feature vector into a modal fusion model, the modal fusion model processing the second video feature vector and the second text feature vector by means of a mutual attention mechanism to obtain a video expression and a text expression, and respectively pooling and fusing the video expression and the text expression to obtain a fused feature vector; predicting a correct candidate answer according to the fusion feature vector.

Description

Video question and answer method, device, system and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on August 29, 2022, with application number 202211043431.7 and the invention title "Video question and answer method, device, system and storage medium", the content of which is to be understood as incorporated into this application by reference.
Technical field
Embodiments of the present disclosure relate to, but are not limited to, the technical field of natural language processing, and in particular to a video question and answer method, device, system and storage medium.
Background art
In the current era of the mobile Internet and big data, video data on the Internet is growing explosively. As an increasingly rich information-carrying medium, understanding the semantics of video is a key technology for many intelligent video applications and has important research significance and practical application value. Video question answering (Video QA) is the task of inferring the correct answer from a candidate set given a video clip and a question. With the advancement of computer vision and natural language processing, video question answering is widely applied in video retrieval, intelligent question answering systems, assisted driving systems and autonomous driving, and is attracting increasing attention.
Summary of the invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the protection scope of the claims.
An embodiment of the present disclosure provides a video question and answer method, including:
extracting a video feature vector from the input video, and extracting a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splicing the video feature vector with the text feature vector to obtain a spliced feature vector, and inputting the spliced feature vector into a first pre-training model, where the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector;
dividing the second spliced feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector;
inputting the fused feature vector into a decoding layer to predict the correct candidate answer.
An embodiment of the present disclosure also provides a video question and answer device, including a memory and a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the steps of the video question and answer method according to any embodiment of the present disclosure.
An embodiment of the present disclosure also provides a storage medium on which a computer program is stored; when the program is executed by a processor, the video question and answer method according to any embodiment of the present disclosure is implemented.
An embodiment of the present disclosure also provides a video question and answer system, including a video question and answer device, a monitoring system, a speech recognition device, a voice input device and a knowledge base, wherein:
the monitoring system is configured to acquire one or more surveillance videos, process the surveillance videos according to instruction text, and output the surveillance videos to the video question and answer device;
the voice input device is configured to receive voice input and output it to the speech recognition device;
the speech recognition device is configured to convert the voice input into instruction text or question text through speech recognition, input the instruction text into the monitoring system, and input the question text into the video question and answer device;
the knowledge base is configured to store a common sense knowledge graph;
the video question and answer device is configured to receive the question text and the surveillance video and to generate candidate answer text according to the question text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; it is also configured to extract a video feature vector from the received surveillance video, extract a text feature vector from the question text and the candidate answer text, splice the video feature vector and the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model, where the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector; divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into the modal fusion model, where the modal fusion model uses a mutual attention mechanism to process the second video feature vector and the second text feature vector to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector; and input the fused feature vector into the decoding layer to predict the correct candidate answer.
Other aspects will become apparent upon reading and understanding the drawings and the detailed description.
Description of the drawings
The drawings are used to provide an understanding of the technical solution of the present disclosure and constitute a part of the specification. Together with the embodiments of the present disclosure, they are used to explain the technical solution of the present disclosure and do not constitute a limitation of the technical solution of the present disclosure.
Figure 1 is a schematic flow chart of a video question and answer method according to an exemplary embodiment of the present disclosure;
Figure 2 is a schematic structural diagram of the visual language cross-modal model created by an embodiment of the present disclosure;
Figure 3 is a schematic structural diagram of a modal fusion model created according to an exemplary embodiment of the present disclosure;
Figure 4 is a schematic structural diagram of a traffic accident common sense knowledge graph according to an exemplary embodiment of the present disclosure;
Figure 5 is a schematic structural diagram of a video question and answer device according to an exemplary embodiment of the present disclosure;
Figure 6 is a schematic structural diagram of a video question and answer system according to an exemplary embodiment of the present disclosure.
Detailed description
The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. The embodiments may be implemented in many different forms. Those of ordinary skill in the art can readily appreciate that the manner and content may be transformed into various forms without departing from the spirit and scope of the present disclosure. Therefore, the present disclosure should not be construed as being limited to the contents described in the following embodiments. The embodiments of the present disclosure and the features in the embodiments may be combined with each other arbitrarily provided there is no conflict.
In the drawings, the size of one or more constituent elements, or the thickness of a layer or region, is sometimes exaggerated for clarity. Therefore, one aspect of the present disclosure is not necessarily limited to such dimensions, and the shapes and sizes of the components in the drawings do not reflect true proportions. In addition, the drawings schematically show ideal examples, and one aspect of the present disclosure is not limited to the shapes, numerical values, etc. shown in the drawings.
Ordinal numbers such as "first", "second" and "third" in this disclosure are used to avoid confusion between constituent elements and are not intended to impose any limitation in terms of quantity. "A plurality" in this disclosure means a quantity of two or more.
Computer vision (CV) and natural language processing (NLP) are two major branches of artificial intelligence that focus on simulating human intelligence in vision and language, respectively. Over the past decade, deep learning has greatly advanced single-modality learning in both fields. Some visual question answering techniques encode the input image with an image encoder, encode the input question with a question encoder, merge the encoded image and question features by dot product, and then predict the probability of each candidate answer through a fully connected layer. However, these question answering technologies only consider single-modality information and lack semantic understanding capability, whereas real-world problems often involve multiple modalities. For example, a self-driving car should be able to handle human commands (language), traffic signals (vision) and road conditions (vision and sound).
As shown in Figure 1, an embodiment of the present disclosure provides a video question and answer method, including the following steps:
Step 101: extract a video feature vector from the input video, and extract a text feature vector from the question text and the candidate answer text, where the question text is used to describe the question and the candidate answer text is used to provide multiple candidate answers; splice the video feature vector with the text feature vector to obtain a spliced feature vector, and input the spliced feature vector into the first pre-training model, where the first pre-training model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second spliced feature vector;
Step 102: divide the second spliced feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into the modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video expression and a text expression, and pools and fuses the video expression and the text expression respectively to obtain a fused feature vector;
Step 103: input the fused feature vector into the decoding layer to predict the correct candidate answer.
In the video question and answer method of the embodiments of the present disclosure, the video feature vector and the text feature vector are spliced and then input into the first pre-training model; visual-language cross-modal information is used to improve the semantic representation ability of video question answering, and the mutual attention mechanism is used to perform deeper semantic interaction between the video and the text option pairs, which is expected to yield a deeper semantic understanding of the video content and thereby improve the reasoning performance of the model. The video question and answer method of the embodiments of the present disclosure can be applied to many fields such as video retrieval, intelligent question answering systems, assisted driving systems and autonomous driving.
In some exemplary embodiments, before step 101, the method further includes:
constructing the first pre-training model and initializing it;
pre-training the first pre-training model through multiple self-supervised tasks, where the multiple self-supervised tasks include a label classification task, a masked language model task and a masked frame model task; the label classification task is used to perform multi-label classification on the video, the masked language model task is used to randomly mask text and predict the masked words, and the masked frame model task is used to randomly mask video frames and predict the masked frames;
calculating the loss of the first pre-training model as the weighted sum of the losses of the self-supervised tasks.
Figure 2 is a schematic structural diagram of the visual language cross-modal model created by an embodiment of the present disclosure. As shown in Figure 2, the visual language cross-modal model includes a cascaded first pre-training model, a modal fusion model and a decoding layer.
In some exemplary embodiments, the first pre-training model may be a 24-layer cascaded deep Transformer encoder neural network with a hidden layer dimension of 1024 and 16 attention heads, and the first pre-training model is initialized with parameters pre-trained by BERT.
The Transformer model consists of an encoder and a decoder, each of which includes multiple network blocks. Each encoder network block consists of a multi-head attention sub-layer and a feed-forward neural network sub-layer. The structure of the decoder is similar to that of the encoder, except that each decoder network block contains an additional multi-head attention layer.
Bidirectional Encoder Representations from Transformers (BERT) is a successful application of the Transformer: it uses the Transformer encoder and introduces a bidirectional masking technique that allows each language token to attend bidirectionally to other tokens.
The first pre-training model of the embodiments of the present disclosure adopts a multi-layer cascaded Transformer encoder structure. For example, the first pre-training model may be a 24-layer cascaded deep Transformer encoder neural network with a hidden layer dimension of 1024 and 16 attention heads; in this case, the network structure of the first pre-training model is consistent with BERT Large (a natural language pre-training model proposed by Google), so the pre-training results of natural language pre-training models such as BERT can be used directly to initialize the first pre-training model of the embodiments of the present disclosure.
在一些示例性实施方式中,针对输入的视频提取视频特征向量,包括:In some exemplary implementations, extracting video feature vectors for input videos includes:
以预设速度对输入的视频进行抽帧,采用第二预训练模型对抽取出的帧提取视频特征向量。Frames are extracted from the input video at a preset speed, and a second pre-trained model is used to extract video feature vectors from the extracted frames.
视频问答的输入是视频片段、问题以及候选答案集合,与图像的处理不同,连续的视频流可以理解成一组快速播放的图片,其中每一幅图片定义为帧(Frame)。对于视频信息,我们以预设速度(示例性的,1帧每秒(fps)的速度)对输入的视频进行抽帧,每个视频最高抽取N帧(示例性的,N可以为30),利用计算机视觉领域(CV)的预训练模型EfficientNetB3提取视频特征向量(Visual Features)。可选的,本公开实施例也可以用MobileNet或者ResNet101等预训练模型对输入的视频提取视频特征向量。The input of video question and answer is a collection of video clips, questions and candidate answers. Different from image processing, the continuous video stream can be understood as a set of rapidly played pictures, where each picture is defined as a frame. For video information, we extract frames from the input video at a preset speed (exemplarily, 1 frame per second (fps)), and each video extracts up to N frames (exemplarily, N can be 30), Use the pre-trained model EfficientNetB3 in the field of computer vision (CV) to extract video feature vectors (Visual Features). Optionally, embodiments of the present disclosure may also use pre-trained models such as MobileNet or ResNet101 to extract video feature vectors from input videos.
In some exemplary embodiments, extracting the text feature vector for the question text and the candidate answer text includes:

generating a sequence string from the question text and the candidate answer text, where the sequence string includes multiple sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;

inputting the sequence string into the first pre-trained model to obtain the text feature vector.

In the embodiments of the present disclosure, the question text and the candidate answer text are first tokenized with a vocabulary of size n_vocab, converting each word or character into one or more sequences; all of the words and/or characters of the question text and the candidate answer text form a sequence string, where each sequence may be a numerical ID. Exemplarily, the numerical ID may range from 0 to n_vocab-1. For example, the text "当前是什么道路类型 小道 高速公路 马路 街道" ("What is the current road type: trail, expressway, road, street") is converted into a sequence string such as "2560 2013 1300 100 567 ...". The sequence string is input into the first pre-trained model (here, the first pre-trained model may be the initialized first pre-trained model) to obtain a text feature vector of size n_word*1024, where n_word is the number of sequences in the sequence string and 1024 is the dimension of each text feature vector.
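As a rough illustration of this tokenization step (the actual vocabulary and tokenizer are not specified by the embodiment; a BERT-style Chinese vocabulary is assumed here, and the printed IDs are only an example):

```python
from transformers import BertTokenizer

# a BERT-style vocabulary of size n_vocab ("bert-base-chinese" is an illustrative choice)
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

question = "当前是什么道路类型"
candidates = "小道 高速公路 马路 街道"

# each word/character becomes one or more integer IDs in [0, n_vocab - 1]
token_ids = tokenizer.encode(question + " " + candidates, add_special_tokens=False)
print(token_ids)   # a sequence string such as "2560 2013 1300 100 567 ..."
```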
In the embodiments of the present disclosure, when the video feature vector and the text feature vector are concatenated and input into the first pre-trained model, the input form of the concatenated feature vector may be [CLS]Video_frame[SEP]Question-Answer[SEP], where Video_frame is the video feature vector, Question-Answer is the text feature vector, the [CLS] token marks the beginning of the concatenated feature vector, and the [SEP] token is used to separate the video feature vector from the text feature vector. The same position embedding vectors and segment embedding vectors as in BERT are added to the input of the first pre-trained model, where the position embedding vector specifies the position in the sequence and the segment embedding vector specifies the frame segment to which an element belongs.
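A minimal sketch of how such a joint input could be assembled is given below. The learned [CLS]/[SEP] embeddings, the vocabulary size, and the rule assigning segment 0 to the video part and segment 1 to the text part are assumptions for illustration; the embodiment only states that BERT-style position and segment embeddings are added.

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Builds [CLS] Video_frame [SEP] Question-Answer [SEP] inputs (a sketch)."""
    def __init__(self, hidden=1024, vocab_size=21128, max_pos=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_pos, hidden)   # position embeddings, as in BERT
        self.seg_emb = nn.Embedding(2, hidden)         # 0 = video segment, 1 = text segment
        self.cls = nn.Parameter(torch.zeros(1, hidden))  # learned [CLS] embedding
        self.sep = nn.Parameter(torch.zeros(1, hidden))  # learned [SEP] embedding

    def forward(self, frame_feats, token_ids):
        # frame_feats: (m, 1024) projected frame features; token_ids: (n_word,) integer IDs
        text = self.word_emb(token_ids)
        seq = torch.cat([self.cls, frame_feats, self.sep, text, self.sep], dim=0)
        pos = self.pos_emb(torch.arange(seq.size(0)))
        seg_ids = torch.cat([torch.zeros(frame_feats.size(0) + 2, dtype=torch.long),
                             torch.ones(text.size(0) + 1, dtype=torch.long)])
        return seq + pos + self.seg_emb(seg_ids)

emb = JointEmbedding()
out = emb(torch.randn(30, 1024), torch.randint(0, 21128, (20,)))
print(out.shape)   # torch.Size([52, 1024]): 1 + 30 + 1 + 20 + 1 positions
```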
In some exemplary embodiments, before the video feature vector and the text feature vector are concatenated, the method further includes: processing the video feature vector and/or the text feature vector so that, when the video feature vector and the text feature vector are concatenated, the dimension of the video feature vector is the same as the dimension of the text feature vector.

In the embodiments of the present disclosure, when the video feature vector extracted by the pre-trained model EfficientNetB3 is a 1536-dimensional feature vector, the 1536-dimensional feature vector is passed through a fully connected layer for dimension reduction, reducing the feature dimension to 1024, the same as the dimension of the text feature vector.
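For example, this dimension alignment can be a single fully connected layer applied to every extracted frame feature (a sketch; the random tensor only stands in for real EfficientNetB3 outputs):

```python
import torch
import torch.nn as nn

frame_proj = nn.Linear(1536, 1024)           # EfficientNetB3 feature -> text feature dimension
frame_feats = torch.randn(30, 1536)          # stand-in for 30 extracted frame features
frame_feats_1024 = frame_proj(frame_feats)   # (30, 1024), same dimension as the text features
```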
In the embodiments of the present disclosure, the training of the visual-language cross-modal model includes two stages: a pre-training stage for the first pre-trained model and a fine-tuning stage for the visual-language cross-modal model.

In the pre-training stage of the first pre-trained model, three tasks are used to pre-train the first pre-trained model: tag classification (TC), mask language model (MLM), and mask frame model (MFM). The tag classification task performs multi-label classification of the video, the mask language model task randomly masks the text and predicts the masked words, and the mask frame model task randomly masks video frames and predicts the masked frames.

Exemplarily, in the tag classification task, the 100 most frequently occurring tag classes may be used for the multi-label classification task, where the tags are manually annotated video labels, for example, whether an accident occurred, the cause of the accident, the location of the accident, the vehicle type, the weather conditions, the time information, and so on. The [CLS] vector of the last BERT layer is fed into a fully connected layer to obtain the predicted tag labels, and the binary cross entropy loss (BCE loss) is computed against the ground-truth labels.
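A sketch of this tag-classification head under the stated assumptions (100 tag classes, BCE loss on the last-layer [CLS] vector); the random tensors only stand in for real [CLS] outputs and annotated labels:

```python
import torch
import torch.nn as nn

num_tags = 100                                    # top-100 most frequent tags
tag_head = nn.Linear(1024, num_tags)              # fully connected layer on the last-layer [CLS] vector

cls_vector = torch.randn(8, 1024)                 # stand-in for the [CLS] outputs of a batch of 8
tag_targets = torch.randint(0, 2, (8, num_tags))  # multi-label targets in {0, 1}

logits = tag_head(cls_vector)
loss_tc = nn.BCEWithLogitsLoss()(logits, tag_targets.float())
```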
In the mask language model task, a random 15% of the text is masked and the masked text is predicted. In the multi-modal scenario, predicting the masked words with the help of the video information can effectively fuse the multi-modal information.

In the mask frame model task, a random 15% of the video frames are masked, and the masked video frames are filled with all-zero vectors. Because the video feature vectors are continuous real-valued vectors without tokens, it is difficult to formulate a classification task analogous to the mask language model task. Since the predicted frame at a masked position should be as similar as possible to the masked frame among all frames in the whole batch, the embodiments of the present disclosure compute the loss of the mask frame model task based on Noise Contrastive Estimation (NCE), maximizing the mutual information between the masked frame and the predicted frame.
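The following is a sketch of an NCE-style contrastive loss in this spirit, not the embodiment's exact formulation: each predicted masked frame is scored against the original frames passed in, and the correct pairing is treated as the positive (in the embodiment, all frames in the batch serve as negatives); the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def mfm_nce_loss(predicted, original, temperature=0.07):
    """Each predicted masked frame should be closest to its own original frame
    among the original frames provided (rows of `predicted` and `original` are aligned)."""
    predicted = F.normalize(predicted, dim=-1)   # (num_masked, d)
    original = F.normalize(original, dim=-1)     # (num_masked, d)
    logits = predicted @ original.t() / temperature
    labels = torch.arange(predicted.size(0))     # positive pair is on the diagonal
    return F.cross_entropy(logits, labels)

loss_mfm = mfm_nce_loss(torch.randn(5, 1024), torch.randn(5, 1024))
```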
Multi-task joint training is adopted, and the total pre-training loss is the weighted sum of the losses of the above three pre-training tasks: Loss = Loss(TC)*a + Loss(mlm)*b + Loss(mfm)*c, where a, b and c are the weights of the three pre-training task losses, respectively.

In the video question answering method of the embodiments of the present disclosure, the first pre-trained model is used to represent the information of the two modalities, video and text. The video feature vector corresponding to the video and the text feature vector corresponding to the question text and the candidate answer text are concatenated and input into the first pre-trained model to obtain the encoded second concatenated feature vector. According to the flag positions in the second concatenated feature vector, the second concatenated feature vector can be divided into the second video feature vector and the second text feature vector.
Assume that V = [v_1, v_2, ..., v_m], Q = [q_1, q_2, ..., q_n] and A = [a_1, a_2, ..., a_k] are the video, the question and the answer, respectively, where v_i, q_i and a_i are the corresponding sequences. Denoting the VLP model (i.e., the first pre-trained model) by VLP(·), the encoded representation E obtained by feeding [CLS]V[SEP]Q+A[SEP] into the VLP model can be expressed as:

E = VLP([CLS] V [SEP] Q+A [SEP]);

where E = [e_1, e_2, ..., e_(m+n+k)]. Dividing the encoded video-frame and question-option vectors into two parts yields the second video feature vector and the second text feature vector, E_V = [e_1, ..., e_m] and E_QA = [e_(m+1), ..., e_(m+n+k)].
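In code, this split is simply a slice of the encoder output (a sketch; how the [CLS]/[SEP] positions are dropped before slicing is an implementation detail not fixed by the text, and the random tensor stands in for the real encoder output):

```python
import torch

m, n, k = 30, 12, 8                      # numbers of video, question and answer positions
e = torch.randn(1, m + n + k, 1024)      # stand-in for E with the special-token positions removed
e_v = e[:, :m, :]                        # second video feature vector E_V  = [e_1, ..., e_m]
e_qa = e[:, m:, :]                       # second text feature vector  E_QA = [e_(m+1), ..., e_(m+n+k)]
```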
Next, the pre-trained first pre-trained model is connected to the modal fusion model and the decoding layer, so that the visual-language cross-modal model can be fine-tuned as a whole. The processing performed by the first pre-trained model is the same as described above and is not repeated here. The second video feature vector and the second text feature vector are input into the modal fusion model; through a mutual attention mechanism, the modal fusion model processes the second video feature vector and the second text feature vector to obtain a video representation and a text representation, and the video representation and the text representation are pooled separately and then fused to obtain the fused feature vector.

Figure 3 is a schematic structural diagram of the modal fusion model created in an embodiment of the present disclosure. As shown in Figure 3, in the modal fusion model, the query vector (Query) comes from one modality, while the key vector (Key) and the value vector (Value) come from the other modality. By introducing the mutual attention mechanism, the video question answering method of the embodiments of the present disclosure further aligns and fuses language and images at the semantic level.
In some exemplary embodiments, processing the second video feature vector and the second text feature vector through the mutual attention mechanism includes:

performing multi-head attention (Multi Head Attention) with the second video feature vector as the query vector and the second text feature vector as the key vector and the value vector;

performing multi-head attention with the second text feature vector as the query vector and the second video feature vector as the key vector and the value vector.

In the embodiments of the present disclosure, the modal fusion model performs multi-head attention with the second video feature vector as the Query and the second text feature vector as the Key and Value, and at the same time performs multi-head attention with the second text feature vector as the Query and the second video feature vector as the Key and Value. Such a pair of exchanged attentions forms one layer; in some exemplary embodiments, multiple layers may be stacked in series. Then the video representation, which represents the sequence of video frames, is pooled, the text representation, which represents the question-option sequence, is pooled, and the two pooled vectors are fused (Fuse) to obtain the fused feature vector. The fused feature vector is then input into the decoding layer to predict the probability that each candidate is the correct candidate answer.
The above process can be expressed by the following formulas:

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V;

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V);

MHA(E_V, E_QA, E_QA) = concat(head_1, ..., head_h);

MHA_1 = MHA(E_V, E_QA, E_QA);

MHA_2 = MHA(E_QA, E_V, E_V);

DUMA(E_V, E_QA) = Fuse(MHA_1, MHA_2);

where W_i^Q, W_i^K and W_i^V are parameter matrices, d_k is the dimension of the key vectors, MHA(·) is the multi-head attention mechanism, DUMA(·) is the bidirectional multi-head co-attention network, and Fuse(·) denotes average-pooling the outputs of DUMA and fusing them; exemplarily, the fusion may be performed by concatenation.
For each <V, Q, A_i> triplet, the output of the DUMA network is:

O_i = DUMA(E_V, E_QA);

The loss function of the network is:

L = -log( exp(W^T·O_c) / Σ_{j=1}^{s} exp(W^T·O_j) );

where O_c is the output corresponding to the correct candidate answer, W^T is the parameter matrix, and s is the number of candidate answers.
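A compact sketch of this bidirectional co-attention, pooling/fusion and candidate scoring is shown below. The use of nn.MultiheadAttention, mean pooling as the pooling step, concatenation as Fuse(·), and a linear layer playing the role of W^T are assumptions consistent with the description, not the embodiment's exact implementation.

```python
import torch
import torch.nn as nn

class DUMALayer(nn.Module):
    """One layer of bidirectional multi-head co-attention (a sketch)."""
    def __init__(self, hidden=1024, heads=16):
        super().__init__()
        self.v2t = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, e_v, e_qa):
        mha1, _ = self.v2t(e_v, e_qa, e_qa)   # MHA_1 = MHA(E_V, E_QA, E_QA)
        mha2, _ = self.t2v(e_qa, e_v, e_v)    # MHA_2 = MHA(E_QA, E_V, E_V)
        return mha1, mha2

def fuse(mha1, mha2):
    # Fuse(.): mean-pool each stream over its sequence, then concatenate the two pooled vectors
    return torch.cat([mha1.mean(dim=1), mha2.mean(dim=1)], dim=-1)

class AnswerScorer(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.duma = DUMALayer(hidden)
        self.w = nn.Linear(2 * hidden, 1)     # plays the role of W^T in the loss

    def forward(self, e_v, e_qa_per_candidate):
        # e_qa_per_candidate: one encoded (question, candidate answer) sequence per candidate
        scores = [self.w(fuse(*self.duma(e_v, e_qa))) for e_qa in e_qa_per_candidate]
        return torch.cat(scores, dim=-1)      # (batch, s); train with cross-entropy over the s candidates

scorer = AnswerScorer()
e_v = torch.randn(2, 30, 1024)
candidates = [torch.randn(2, 20, 1024) for _ in range(4)]
print(scorer(e_v, candidates).shape)          # torch.Size([2, 4])
```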
In some exemplary embodiments, the method further includes:

receiving a voice input of a user;

converting the voice input into the question text through speech recognition.

In some exemplary embodiments, the method further includes:

receiving a voice input of a user;

converting the voice input into a voice instruction and/or the question text through speech recognition.

Exemplarily, the voice instruction may be "Switch to the surveillance video of road XXXX for me", and so on; the system performs the corresponding response operation according to the received voice instruction, for example, switching the display screen to the surveillance video of road XXXX. Exemplarily, the question text may be "Is there an accident on the current road section?", "What is the cause of the accident?", and so on; the system generates the candidate answer text according to the received question text, and then predicts the correct candidate answer through the video question answering method of the embodiments of the present disclosure.
In some exemplary embodiments, the method further includes:

broadcasting the predicted correct candidate answer in voice form through speech synthesis.

In the embodiments of the present disclosure, the predicted correct candidate answer may be displayed directly on the display screen, or may be broadcast in voice form.
In some exemplary embodiments, the method further includes:

obtaining the question text;

generating, according to the question text, the candidate answer text corresponding to the question text.

In some exemplary embodiments, generating, according to the question text, the candidate answer text corresponding to the question text includes:

querying, through keyword matching or an attention mechanism model, triples matching the question text from a knowledge graph;

generating, according to the matched triples, the candidate answer text corresponding to the question text.
With the expansion of knowledge, the concept of the knowledge graph is applied more and more widely. A knowledge graph can mine the connections between data by using suitable knowledge representation methods at the relation layer, making it easier for knowledge to circulate among computers and to be processed collaboratively. A knowledge graph represents the relationships between entities with a graph data structure; compared with plain text information, the graph representation is easier to understand and accept. A knowledge graph is composed of relation edges of the form "entity-relation-entity" or "entity-attribute-attribute value", and focuses on representing the relationships between entities or between entities and attributes. Relation search based on knowledge graphs is widely used in entity matching and question answering systems in various industries; it can process and store known knowledge, and can quickly match knowledge and search for answers.

The basic component of a knowledge graph is the triple (S, P, O), where S and O are nodes in the knowledge graph and represent entities; S denotes the subject and O denotes the object. P is the edge connecting the two entities (S and O) in the knowledge graph and represents the relationship between them. For example, in a common-sense knowledge graph of traffic accidents, a triple may be (rear-end collision, accident type, traffic accident), and so on.

Exemplarily, the video question answering method of the embodiments of the present disclosure can be used in a video surveillance system. For example, in a smart-transportation multi-screen surveillance system, the large screens can be scheduled and interactively queried by voice control, and the surveillance feed in which an accident occurs can be switched to the main screen for accident analysis; the corresponding interactive instructions may be, for example, "Switch to the surveillance video of road XXXX for me", "Is there an accident on the current road section?", or "What is the cause of the accident?". A concrete implementation involves common-sense knowledge related to traffic accidents, which can be provided by introducing a knowledge base. For example, a common-sense knowledge graph of traffic accidents as shown in Figure 4 can be constructed in the knowledge base, and triples matching the question text are queried from this knowledge graph through keyword matching or an attention mechanism, serving as the common-sense knowledge (i.e., candidate answers) corresponding to the semantic information. For example, according to the traffic accident common-sense knowledge graph of Figure 4, for the user question "What is the cause of the accident?", the candidate answers are generated automatically: weather, road, driver, poor vehicle condition.
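A toy illustration of the keyword-matching variant of this query is given below; the triple contents are taken from the example above, while the storage format and the matching rule (mapping the keyword "原因", "cause", to the "事故原因" relation) are assumptions for illustration only.

```python
# (subject, predicate, object) triples from a traffic-accident common-sense knowledge graph
triples = [
    ("追尾事故", "事故类型", "交通事故"),   # (rear-end collision, accident type, traffic accident)
    ("天气", "事故原因", "交通事故"),       # (weather, accident cause, traffic accident)
    ("道路", "事故原因", "交通事故"),       # (road, accident cause, traffic accident)
    ("驾驶员", "事故原因", "交通事故"),     # (driver, accident cause, traffic accident)
    ("车况不佳", "事故原因", "交通事故"),   # (poor vehicle condition, accident cause, traffic accident)
]

def candidate_answers(question: str) -> list[str]:
    # crude keyword matching: "原因" (cause) in the question selects the "事故原因" relation
    if "原因" in question:
        return [s for s, p, o in triples if p == "事故原因"]
    return []

print(candidate_answers("事故发生的原因是什么?"))   # ['天气', '道路', '驾驶员', '车况不佳']
```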
An embodiment of the present disclosure further provides a video question answering apparatus, including a memory and a processor coupled to the memory, where the processor is configured to perform, based on instructions stored in the memory, the steps of the video question answering method according to any embodiment of the present disclosure.

As shown in Figure 5, in one example, the video question answering apparatus may include a processor 510, a memory 520 and a bus system 530, where the processor 510 and the memory 520 are connected through the bus system 530, the memory 520 is used to store instructions, and the processor 510 is used to execute the instructions stored in the memory 520 to: extract a video feature vector for an input video, and extract a text feature vector for a question text and a candidate answer text, where the question text is used to describe a question and the candidate answer text is used to provide multiple candidate answers; concatenate the video feature vector and the text feature vector to obtain a concatenated feature vector, and input the concatenated feature vector into a first pre-trained model, where the first pre-trained model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second concatenated feature vector; divide the second concatenated feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video representation and a text representation, and pools and then fuses the video representation and the text representation to obtain a fused feature vector; and input the fused feature vector into a decoding layer to predict the correct candidate answer.
It should be understood that the processor 510 may be a Central Processing Unit (CPU), and the processor 510 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The memory 520 may include a read-only memory and a random access memory, and provides instructions and data to the processor 510. A portion of the memory 520 may also include a non-volatile random access memory. For example, the memory 520 may also store device type information.

In addition to a data bus, the bus system 530 may also include a power bus, a control bus, a status signal bus, and the like. However, for clarity of description, the various buses are all labeled as the bus system 530 in Figure 5.

During implementation, the processing performed by the processing device may be completed by integrated logic circuits of hardware in the processor 510 or by instructions in the form of software. That is, the method steps of the embodiments of the present disclosure may be embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 520; the processor 510 reads the information in the memory 520 and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here again.
As shown in Figure 6, an embodiment of the present disclosure further provides a video question answering system, which includes the video question answering apparatus according to any embodiment of the present disclosure, and further includes a surveillance system, a speech recognition apparatus, a voice input apparatus and a knowledge base, where:

the surveillance system is configured to acquire one or more surveillance videos, process the surveillance videos according to an instruction text, and output the surveillance videos to the video question answering apparatus;

the voice input apparatus is configured to receive a voice input and output it to the speech recognition apparatus;

the speech recognition apparatus is configured to convert the voice input into the instruction text or a question text through speech recognition, input the instruction text into the surveillance system, and input the question text into the video question answering apparatus;

the knowledge base is configured to store a knowledge graph;

the video question answering apparatus is configured to receive the question text and the surveillance video, and generate a candidate answer text according to the question text, where the question text is used to describe a question and the candidate answer text is used to provide multiple candidate answers; and is further configured to extract a video feature vector from the received surveillance video, extract a text feature vector for the question text and the candidate answer text, concatenate the video feature vector and the text feature vector to obtain a concatenated feature vector, and input the concatenated feature vector into a first pre-trained model, where the first pre-trained model learns the cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second concatenated feature vector; divide the second concatenated feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, where the modal fusion model processes the second video feature vector and the second text feature vector using a mutual attention mechanism to obtain a video representation and a text representation, and pools and then fuses the video representation and the text representation to obtain a fused feature vector; and input the fused feature vector into a decoding layer to predict the correct candidate answer.
In some exemplary embodiments, the video question answering system further includes a speech synthesis output apparatus, where:

the speech synthesis output apparatus is configured to broadcast the predicted correct candidate answer in voice form through speech synthesis.

The video question answering system of the embodiments of the present disclosure consists of the several modules shown in Figure 6: voice instructions or questions are given by the user through the voice input apparatus, the speech recognition apparatus converts the natural language into text, the speech synthesis output apparatus broadcasts the answers fed back by the system in voice form, and the video question answering apparatus, combining the surveillance system and the knowledge base, performs semantic understanding and multi-modal interactive reasoning on the questions or instructions passed in by the user to give the corresponding answers.
An embodiment of the present disclosure further provides a storage medium on which a computer program is stored, where the program, when executed by a processor, implements the video question answering method according to any embodiment of the present disclosure.

In some possible implementations, the various aspects of the video question answering method provided by the present application may also be implemented in the form of a program product, which includes program code; when the program product runs on a computer device, the program code is used to cause the computer device to execute the steps of the video question answering method according to the various exemplary implementations of the present application described above in this specification; for example, the computer device may execute the video question answering method recorded in the embodiments of the present application.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
The drawings of the present disclosure relate only to the structures involved in the present disclosure; other structures may refer to common designs. Where there is no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.

Those of ordinary skill in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and all such modifications and substitutions shall fall within the scope of the claims of the present disclosure.

Claims (14)

  1. A video question answering method, comprising:

    extracting a video feature vector for an input video, and extracting a text feature vector for a question text and a candidate answer text, wherein the question text is used to describe a question and the candidate answer text is used to provide a plurality of candidate answers; concatenating the video feature vector and the text feature vector to obtain a concatenated feature vector, and inputting the concatenated feature vector into a first pre-trained model, wherein the first pre-trained model learns cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second concatenated feature vector;

    dividing the second concatenated feature vector into a second video feature vector and a second text feature vector; inputting the second video feature vector and the second text feature vector into a modal fusion model, wherein the modal fusion model processes the second video feature vector and the second text feature vector through a mutual attention mechanism to obtain a video representation and a text representation, and pools and then fuses the video representation and the text representation to obtain a fused feature vector;

    inputting the fused feature vector into a decoding layer to predict a correct candidate answer.
  2. The video question answering method according to claim 1, wherein extracting the video feature vector for the input video comprises:

    extracting frames from the input video at a preset rate, and extracting the video feature vector from the extracted frames using a second pre-trained model.

  3. The video question answering method according to claim 1, wherein extracting the text feature vector for the question text and the candidate answer text comprises:

    generating a sequence string from the question text and the candidate answer text, wherein the sequence string comprises a plurality of sequences, and each word or character in the question text and the candidate answer text corresponds to one or more sequences;

    inputting the sequence string into the first pre-trained model to obtain the text feature vector.
  4. The video question answering method according to claim 1, further comprising, before the method:

    constructing the first pre-trained model and initializing it;

    pre-training the first pre-trained model through a plurality of self-supervised tasks, wherein the plurality of self-supervised tasks comprise a tag classification task, a mask language model task and a mask frame model task, the tag classification task is used for multi-label classification of a video, the mask language model task is used to randomly mask text and predict the masked words, and the mask frame model task is used to randomly mask video frames and predict the masked frames;

    calculating the loss of the first pre-trained model as a weighted sum of the losses of the plurality of self-supervised tasks.

  5. The video question answering method according to claim 4, wherein the losses of the tag classification task and the mask language model task are calculated based on binary cross entropy, and the loss of the mask frame model task is calculated based on noise contrastive estimation.

  6. The video question answering method according to claim 1, wherein the first pre-trained model is a 24-layer cascaded deep Transformer encoder network with a hidden-layer dimension of 1024 and 16 attention heads, and the first pre-trained model is initialized with parameters pre-trained by Bidirectional Encoder Representations from Transformers (BERT).
  7. The video question answering method according to claim 1, wherein processing the second video feature vector and the second text feature vector through the mutual attention mechanism comprises:

    performing multi-head attention with the second video feature vector as a query vector and the second text feature vector as a key vector and a value vector;

    performing multi-head attention with the second text feature vector as a query vector and the second video feature vector as a key vector and a value vector.

  8. The video question answering method according to claim 1, further comprising, before the method:

    receiving a voice input of a user;

    converting the voice input into the question text through speech recognition.
  9. The video question answering method according to claim 1, further comprising, before the method:

    obtaining the question text;

    generating, according to the question text, the candidate answer text corresponding to the question text.

  10. The video question answering method according to claim 9, wherein generating, according to the question text, the candidate answer text corresponding to the question text comprises:

    querying, through keyword matching or an attention mechanism model, triples matching the question text from a common-sense knowledge graph;

    generating, according to the matched triples, the candidate answer text corresponding to the question text.
  11. The video question answering method according to claim 1, further comprising:

    processing the video feature vector and/or the text feature vector so that, when the video feature vector and the text feature vector are concatenated, the dimension of the video feature vector is the same as the dimension of the text feature vector.

  12. A video question answering apparatus, comprising a memory and a processor coupled to the memory, wherein the processor is configured to perform, based on instructions stored in the memory, the steps of the video question answering method according to any one of claims 1 to 11.
  13. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the video question answering method according to any one of claims 1 to 11.
  14. A video question answering system, comprising a video question answering apparatus, a surveillance system, a speech recognition apparatus, a voice input apparatus and a knowledge base, wherein:

    the surveillance system is configured to acquire one or more surveillance videos, process the surveillance videos according to an instruction text, and output the surveillance videos to the video question answering apparatus;

    the voice input apparatus is configured to receive a voice input and output it to the speech recognition apparatus;

    the speech recognition apparatus is configured to convert the voice input into the instruction text or a question text through speech recognition, input the instruction text into the surveillance system, and input the question text into the video question answering apparatus;

    the knowledge base is configured to store a common-sense knowledge graph;

    the video question answering apparatus is configured to receive the question text and the surveillance video, and generate a candidate answer text according to the question text, wherein the question text is used to describe a question and the candidate answer text is used to provide a plurality of candidate answers; and is further configured to extract a video feature vector from the received surveillance video, extract a text feature vector for the question text and the candidate answer text, concatenate the video feature vector and the text feature vector to obtain a concatenated feature vector, and input the concatenated feature vector into a first pre-trained model, wherein the first pre-trained model learns cross-modal information between the video feature vector and the text feature vector through a self-attention mechanism to obtain an encoded second concatenated feature vector; divide the second concatenated feature vector into a second video feature vector and a second text feature vector; input the second video feature vector and the second text feature vector into a modal fusion model, wherein the modal fusion model processes the second video feature vector and the second text feature vector using a mutual attention mechanism to obtain a video representation and a text representation, and pools and then fuses the video representation and the text representation to obtain a fused feature vector; and input the fused feature vector into a decoding layer to predict a correct candidate answer.
PCT/CN2023/111455 2022-08-29 2023-08-07 Video question-answer method, device and system, and storage medium WO2024046038A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211043431.7A CN115391511A (en) 2022-08-29 2022-08-29 Video question-answering method, device, system and storage medium
CN202211043431.7 2022-08-29

Publications (1)

Publication Number Publication Date
WO2024046038A1 true WO2024046038A1 (en) 2024-03-07

Family

ID=84122684

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111455 WO2024046038A1 (en) 2022-08-29 2023-08-07 Video question-answer method, device and system, and storage medium

Country Status (2)

Country Link
CN (1) CN115391511A (en)
WO (1) WO2024046038A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391511A (en) * 2022-08-29 2022-11-25 京东方科技集团股份有限公司 Video question-answering method, device, system and storage medium
CN116257611B (en) * 2023-01-13 2023-11-10 北京百度网讯科技有限公司 Question-answering model training method, question-answering processing device and storage medium
CN116883886B (en) * 2023-05-25 2024-05-28 中国科学院信息工程研究所 Weak supervision time sequence language positioning method and device based on two-stage comparison learning and noise robustness
CN117972138B (en) * 2024-04-02 2024-07-23 腾讯科技(深圳)有限公司 Training method and device for pre-training model and computer equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220164548A1 (en) * 2020-11-24 2022-05-26 Openstream Inc. System and Method for Temporal Attention Behavioral Analysis of Multi-Modal Conversations in a Question and Answer System
CN113807222A (en) * 2021-09-07 2021-12-17 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling
CN114925703A (en) * 2022-06-14 2022-08-19 齐鲁工业大学 Visual question-answering method and system with multi-granularity text representation and image-text fusion
CN115391511A (en) * 2022-08-29 2022-11-25 京东方科技集团股份有限公司 Video question-answering method, device, system and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MA ZHIYANG, ZHENG WENFENG, CHEN XIAOBING, YIN LIRONG: "Joint embedding VQA model based on dynamic word vector", PEERJ COMPUTER SCIENCE, vol. 7, 3 March 2021 (2021-03-03), pages e353, XP093142989, ISSN: 2376-5992, DOI: 10.7717/peerj-cs.353 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN118035491A (en) * 2024-04-11 2024-05-14 北京搜狐新媒体信息技术有限公司 Training method and using method of video label labeling model and related products
CN118675092A (en) * 2024-08-21 2024-09-20 南方科技大学 Multi-mode video understanding method based on large language model
CN118674033A (en) * 2024-08-23 2024-09-20 成都携恩科技有限公司 Knowledge reasoning method based on multi-training global perception

Also Published As

Publication number Publication date
CN115391511A (en) 2022-11-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23859073

Country of ref document: EP

Kind code of ref document: A1