CN114218439A - Video question-answering method based on self-driven twin sampling and reasoning - Google Patents
- Publication number
- CN114218439A (application CN202111553547.0A)
- Authority
- CN
- China
- Prior art keywords
- video
- twin
- sampling
- segment
- video segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Abstract
The invention discloses a video question-answering method based on self-driven twin sampling and reasoning, which comprises video segment sampling, feature extraction and an inference strategy. The video segment sampling obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; the inference strategy generates refined knowledge labels for the video segments using a twin knowledge generation module, and propagates the labels to all twin samples and fuses them using a twin knowledge inference module. The invention has the advantage of providing a framework based on self-driven twin sampling and reasoning for extracting context semantic information from different video segments of the same video and enhancing the learning effect of the network.
Description
Technical Field
The invention relates to the technical field of computer vision and pattern recognition, and in particular to a video question-answering method based on self-driven twin sampling and reasoning.
Background
Video question-answering is a vision-language reasoning task with broad application prospects, and it has therefore attracted the attention of more and more researchers. The video question-answering task needs to acquire and use the temporal and spatial characteristics of the visual signals in a video, according to the combined semantics of the language cues, in order to generate answers.
Some existing work extracts general visual information and motion features from video to represent the video content, and designs different attention mechanisms to integrate these features. These methods focus on better understanding the overall content of the video, but easily ignore details within individual video segments. Other researchers have explored how to align visual and linguistic information at the semantic level. These efforts, however, ignore the association between contexts in the same video.
Recently, the paradigm of fine-tuning pre-trained models on target datasets has achieved great benefit in multi-modal tasks such as text-video retrieval, visual question answering, image captioning, video captioning, and video question answering. Building on the results of Transformer-based language pre-training models, video-text pre-training models for video-language tasks have made considerable progress, especially for the video question-answering task. This approach relies on a backbone network pre-trained on a large-scale video-text dataset, but none of the networks pre-trained in this way further mine the context information between video segments of the same video.
In summary, these existing multi-modal learning paradigms all share a significant drawback: the correlation between video segment-text pairs (clip-text pairs) in the same video is ignored, and each video segment-text pair is treated as independent during training. The prior art therefore fails to make good use of the rich context information within the same video. The inventor believes that this associated in-video context information can be used to enhance the learning effect of the network.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the invention provides a video question-answering method based on self-driven twin sampling and reasoning, solving the problem that existing methods cannot make good use of the context information within the same video.
In order to achieve the purpose, the invention adopts the following technical scheme:
A video question-answering method based on self-driven twin sampling and reasoning comprises video segment sampling, feature extraction and an inference strategy, wherein the video segment sampling obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; the inference strategy generates refined knowledge labels for the video segments using the twin knowledge generation module, and propagates the labels to all twin samples and fuses them using the twin knowledge inference module.
It should be noted that in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
It should be noted that the twin sampling comprises obtaining a video segment sample with a length of B frames by sparsely sampling a video sample F, where the length of F is greater than B; determining the start frame index of a reference video segment at random, then performing twin sampling by selecting several start indexes adjacent to the reference video segment within the same video sample and selecting video segments of the same length of B frames; after twin sampling is finished, a reference video segment and twin video segments with similar global video semantics are obtained, and the video segments are then encoded in sequence using a video encoder to obtain visual features.
It should be noted that the start index of the reference segment is restricted to the first third of the frames of the whole video; the index of an adjacent twin video segment should be adjacent to the index of the reference segment, and the twin segments are obtained by contiguous sampling before or after the reference segment.
It should be noted that the knowledge generation module includes:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the similarity matrix A ∈ R^(N×N) of the samples is first calculated pair by pair from the dot products of the encoded sample representations,

A(i,j) = σ(F(f_i))^T σ(F(f_j)),

where i and j are sample indexes, σ denotes the L2 normalization function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), where I is the identity matrix and ∘ denotes the Hadamard product, which prevents the inference from collapsing into a self-loop on an individual sample;

each row of the matrix A is then normalized so that Σ_j Ã(i,j) = 1 holds for all i while the diagonal elements remain 0; the normalization can be implemented by applying the softmax function to each row of the matrix, i.e.

Ã(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),

where Ã(i,j) expresses the normalized associated knowledge of video segments i and j in the same video; Ã is then the overall associated knowledge of all video segments, i.e. the twin knowledge.
It should be noted that the twin knowledge inference module includes:
in the training process, a classifier is trained simultaneously; its input is the video segment-text feature representation output by the multi-modal Transformer, and its output is a soft label prediction for that video segment-text sample; the prediction probabilities are recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K), satisfying Σ_k p_i(k) = 1, where K is the total number of classes;

the soft label predictions of the other samples are propagated and combined so as to obtain a better soft label based on the in-video context, the formula being

q_i = Σ_{j≠i} Ã(i,j) p_j,

where q_i is the probability vector propagated to the i-th sample from the other samples, which can also be regarded as its refined soft label; the more similar sample j is to sample i, the greater the weight with which p_j is propagated to q_i; the propagation is then performed in parallel over all sample pairs, the formula being

Q̃ = Ã P.
It should be noted that, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used is a weighted fusion of the initial probability matrix P and the propagated probability matrix Q̃, the formula being

Q = ω P W_1 + (1 − ω) Q̃ W_2,

where ω ∈ [0, 1] is a weight factor and W_1, W_2 are learnable parameters, with the fusion satisfying Σ_k Q(i,k) = 1 for all i.
The invention has the advantages that the self-driven twin sampling and reasoning-based framework is provided and used for extracting context semantic information in different video segments of the same video and enhancing the learning effect of the network.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture framework of the present invention;
FIG. 2 is a schematic diagram comparing dense sampling, sparse sampling and twinned sampling;
fig. 3 is a comparison diagram of a multi-modal learning process with different sampling modes.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, and it should be noted that the following examples are provided to illustrate the detailed embodiments and specific operations based on the technical solutions of the present invention, but the scope of the present invention is not limited to the examples.
The invention relates to a video question-answering method based on self-driven twin sampling and reasoning, which comprises video segment sampling, feature extraction and an inference strategy. The video segment sampling obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; the inference strategy generates refined knowledge labels for the video segments using the twin knowledge generation module, and propagates the labels to all twin samples and fuses them using the twin knowledge inference module.
It should be noted that in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
It should be noted that the twin sampling comprises obtaining a video segment sample with a length of B frames by sparsely sampling a video sample F, where the length of F is greater than B; determining the start frame index of a reference video segment at random, then performing twin sampling by selecting several start indexes adjacent to the reference video segment within the same video sample and selecting video segments of the same length of B frames; after twin sampling is finished, a reference video segment and twin video segments with similar global video semantics are obtained, and the video segments are then encoded in sequence using a video encoder to obtain visual features.
It should be noted that the start index of the reference segment is restricted to the first third of the frames of the whole video; the index of an adjacent twin video segment should be adjacent to the index of the reference segment, and the twin segments are obtained by contiguous sampling before or after the reference segment.
It should be noted that the knowledge generation module includes:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the similarity matrix A ∈ R^(N×N) of the samples is first calculated pair by pair from the dot products of the encoded sample representations,

A(i,j) = σ(F(f_i))^T σ(F(f_j)),

where i and j are sample indexes, σ denotes the L2 normalization function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), where I is the identity matrix and ∘ denotes the Hadamard product, which prevents the inference from collapsing into a self-loop on an individual sample;

each row of the matrix A is then normalized so that Σ_j Ã(i,j) = 1 holds for all i while the diagonal elements remain 0; the normalization can be implemented by applying the softmax function to each row of the matrix, i.e.

Ã(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),

where Ã(i,j) expresses the normalized associated knowledge of video segments i and j in the same video; Ã is then the overall associated knowledge of all video segments, i.e. the twin knowledge.
It should be noted that the twin knowledge inference module includes:
in the training process, a classifier is trained simultaneously; its input is the video segment-text feature representation output by the multi-modal Transformer, and its output is a soft label prediction for that video segment-text sample; the prediction probabilities are recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K), satisfying Σ_k p_i(k) = 1, where K is the total number of classes;

the soft label predictions of the other samples are propagated and combined so as to obtain a better soft label based on the in-video context, the formula being

q_i = Σ_{j≠i} Ã(i,j) p_j,

where q_i is the probability vector propagated to the i-th sample from the other samples, which can also be regarded as its refined soft label; the more similar sample j is to sample i, the greater the weight with which p_j is propagated to q_i; the propagation is then performed in parallel over all sample pairs, the formula being

Q̃ = Ã P.
It should be noted that, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used is a weighted fusion of the initial probability matrix P and the propagated probability matrix Q̃, the formula being

Q = ω P W_1 + (1 − ω) Q̃ W_2,

where ω ∈ [0, 1] is a weight factor and W_1, W_2 are learnable parameters, with the fusion satisfying Σ_k Q(i,k) = 1 for all i.
Examples
Figure 1 shows the overall framework of the invention. For a video F, the invention constructs, by sampling, a reference video segment c_anchor and the corresponding twin video segments {c_i}_{i=1}^{N−1}, where N denotes the total number of video segments. Each video segment c is B frames long, and B feature maps are obtained through the video encoder. The encoded features of the reference segment and of the twin segments are denoted v_anchor and {v_i}_{i=1}^{N−1}, respectively.

The text representation produced by the text encoder and the video segment representation from the video encoder are concatenated as input, and a multi-modal Transformer is used to generate the video segment-text features. The reference video segment-text feature is denoted f_anchor and the corresponding twin video segment-text features are denoted {f_i}_{i=1}^{N−1}.

Similarly, the soft label predictions derived by inference from these features are denoted p_anchor and {p_i}_{i=1}^{N−1}, where p is a vector of dimension K determined by the number of classification classes.
Twin sampling. Fig. 2 illustrates the difference between twin sampling and conventional dense and sparse sampling. Firstly, the invention obtains the video segment sample with the length of B frame by carrying out sparse sampling on the video sample F. The length of F is much greater than B. The invention determines the index of the starting frame number of the reference video segment in a random mode, then carries out twin sampling, selects a plurality of starting indexes adjacent to the reference video segment in the same video sample, and selects the video segments of B frames with the same length.
The present invention imposes certain restrictions on the selection of the start index of the reference segment. The start index of the reference segment is restricted to the first third of the frames of the entire video to ensure balanced sampling. "Adjacent" above means that the index of a twin video segment should be adjacent to the index of the reference segment. The invention performs contiguous sampling before or after the reference segment to obtain the twin segments.

After twin sampling is completed, the method obtains a reference video segment and twin video segments with similar global video semantics. The invention then uses a video encoder to encode the video segments in sequence according to the method shown in fig. 3, obtaining the visual features. The video encoder and text encoder are very similar to the encoders in ClipBert.
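As a minimal illustration of the sampling rule described above (the function and parameter names are hypothetical, not taken from the patent), the start indexes of a reference segment and its contiguous twin segments could be drawn as follows:

```python
import random

def siamese_sample_starts(num_frames, clip_len, num_twins):
    """Sketch of twin sampling: a random reference start drawn from the
    first third of the video, plus twin clips sampled contiguously after it."""
    # The video must be long enough to hold the reference and twin clips
    # after any start index drawn from the first third of the frames.
    assert num_frames // 3 + (num_twins + 1) * clip_len <= num_frames
    anchor_start = random.randrange(num_frames // 3)  # first third only
    # Twin clips are taken contiguously after the reference clip;
    # sampling before it would be symmetrical.
    twin_starts = [anchor_start + (k + 1) * clip_len for k in range(num_twins)]
    return anchor_start, twin_starts
```

For example, with a 120-frame video, 16-frame clips and two twins, the reference start falls in [0, 40) and the two twin clips occupy the 32 frames immediately after it.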
And (4) reasoning strategies. Unlike previous researchers that directly use the video segment-text features obtained from the multimodal Transformer to predict the final result, the invention explores how to fuse the information of different video segments, and proposes an innovative inference strategy, named as self-driven multimodal learning, which includes a twin knowledge generation module and a twin knowledge inference module.
The twin knowledge generation module and how twin knowledge is generated will be described first.
Video segments that are visually highly similar should have more similar class probability predictions when each video segment is classified. In the proposed solution, information from similar video segments is systematically aggregated and fused, providing more accurate soft label predictions. The method propagates and fuses twin knowledge within the same group of reference and twin segment samples according to the similarity of the features extracted by the multi-modal Transformer. Given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the invention first computes the similarity matrix A ∈ R^(N×N) of the samples pair by pair from the dot products of the encoded sample representations, where

A(i,j) = σ(F(f_i))^T σ(F(f_j)),

i and j are sample indexes, σ denotes the L2 normalization function, and N is the number of video segments. To prevent the inference from collapsing into a self-loop on an individual sample, the invention uses the formula A = A ∘ (1 − I) to set the elements on the diagonal of the association matrix A to 0, where I is the identity matrix and ∘ denotes the Hadamard product.

The invention then normalizes each row of the matrix A so that Σ_j Ã(i,j) = 1 holds for all i while the diagonal elements remain 0. The normalization can be implemented by applying the softmax function to each row of the matrix, i.e.

Ã(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),

where Ã(i,j) denotes the normalized associated knowledge of video segments i and j in the same video. Ã, the overall associated knowledge of all video segments, is also referred to in the present invention as twin knowledge.
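The twin knowledge computation above can be sketched in a few lines of NumPy (a sketch under the stated definitions; the function name is illustrative):

```python
import numpy as np

def siamese_knowledge(feats):
    """Compute twin knowledge from clip-text features of shape (N, D):
    L2-normalise, take pairwise dot products, zero the diagonal, and
    apply a row-wise softmax over the off-diagonal entries only."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # sigma(F(f_i))
    a = f @ f.T                              # A(i, j) = sigma(f_i)^T sigma(f_j)
    off_diag = ~np.eye(len(a), dtype=bool)   # mask implementing A = A о (1 - I)
    e = np.where(off_diag, np.exp(a), 0.0)   # exclude k = i from the softmax
    return e / e.sum(axis=1, keepdims=True)  # rows sum to 1, diagonal stays 0
```

Each row of the result is a probability distribution over the other segments of the same video, which is exactly the propagation weight used in the next module.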
The twin knowledge inference module and how twin knowledge is used is described below.
In the training process, the invention simultaneously trains a classifier whose input is the video segment-text feature representation output by the multi-modal Transformer and whose output is a soft label prediction for that video segment-text sample. The prediction probabilities are recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K), satisfying Σ_k p_i(k) = 1, where K is the total number of classes.

The invention propagates and combines the soft label predictions of the other samples, thereby obtaining a better soft label based on the in-video context, with the formula

q_i = Σ_{j≠i} Ã(i,j) p_j,

where q_i is the probability vector propagated to the i-th sample from the other samples, which can also be regarded as its refined soft label. The more similar sample j is to sample i, the greater the weight with which p_j is propagated to q_i. Similarly, the invention performs the propagation in parallel over all sample pairs, which is formulated as

Q̃ = Ã P.
In order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used by the invention is a weighted fusion of the initial probability matrix P and the propagated probability matrix Q̃, given by

Q = ω P W_1 + (1 − ω) Q̃ W_2,

where ω ∈ [0, 1] is a weight factor and W_1, W_2 are learnable parameters, with the fusion satisfying Σ_k Q(i,k) = 1 for all i.
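Continuing the NumPy sketch, the propagation and fusion steps can be written as below. The exact placement of the learnable parameters W_1, W_2 is an assumption reconstructed from the text; here they are taken as per-class linear maps, with a final row renormalisation to keep each fused soft label a distribution:

```python
import numpy as np

def propagate_and_fuse(P, A_tilde, W1, W2, omega=0.5):
    """Propagate soft labels through the twin knowledge and fuse with the
    initial predictions: Q = omega*P@W1 + (1-omega)*(A~ P)@W2, renormalised."""
    Q_prop = A_tilde @ P                 # q_i = sum_{j != i} A~(i, j) p_j
    Q = omega * (P @ W1) + (1.0 - omega) * (Q_prop @ W2)
    # Renormalise each row so the fused soft label stays a distribution.
    return Q / Q.sum(axis=1, keepdims=True)
```

With W1 = W2 = I and any ω, the output rows remain valid probability vectors whenever P and Ã are row-stochastic.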
With the above formulas, the present invention can propagate and fuse the knowledge of the correlations between samples within the same video in one iteration. This knowledge is compared with the previously generated sample class probabilities p_anchor to guide the training process of the network, so that video segments that are visually highly similar obtain more similar class probability predictions.
An objective function. The objective function L of the framework proposed by the invention has two terms. The first term, L_siamese, is the loss between the computed soft label Q and the sample prediction p_anchor; the invention computes L_siamese using a cross-entropy loss function. The other term is the loss between the model's prediction and the ground truth during training; the specific content and formula of this term depend on the specific task, and it is denoted uniformly as L_gt.

Finally, the objective function of the model of the invention is

L = α L_siamese + L_gt,

where α is a hyperparameter that adjusts the ratio between the two terms.
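As a sketch of this objective (the names are illustrative, and the patent does not fix the exact form of L_gt, which is passed in precomputed here), the twin loss and the combined objective can be evaluated as:

```python
import numpy as np

def siamese_objective(Q, P_anchor, l_gt, alpha=1.0):
    """L = alpha * L_siamese + L_gt, where L_siamese is the cross entropy
    between the fused soft labels Q and the anchor predictions P_anchor,
    averaged over the samples in the batch."""
    eps = 1e-12  # numerical floor to keep log() finite
    l_siamese = -np.mean(np.sum(Q * np.log(P_anchor + eps), axis=1))
    return alpha * l_siamese + l_gt
```

When Q and P_anchor agree and are uniform over K classes, L_siamese reduces to the entropy log K, which gives a quick sanity check on the implementation.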
Details of the frame. The invention uses the model obtained by pre-training ClipBert on COCO Captions and Visual Genome Captions data sets as the initialization weight of the SimSamrea framework of the invention. Based on this initialization weight, the present invention continues video segment-text pre-training on the TGIF-QA dataset and performs optimization using the aforementioned objective function. Since the present invention aims to model long-term context information using twin sampling and reasoning, the present invention uses only samples in the data set where the total number of video frames is greater than 40 frames. The invention determines the length of the video segment selected in the video according to the length of the video.
During pre-training and formal training, a reference video segment is selected on each GPU through sparse sampling, and an AdamW optimizer is used for end-to-end training of the whole model. The invention gradually increases the learning rate in the early stage of training and then decreases it linearly; training stops when the learning rate decays to a set minimum value.
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.
Claims (7)
1. A video question-answering method based on self-driven twin sampling and reasoning, characterized by comprising a video segment sampling strategy, a feature extraction strategy and an inference strategy, wherein the video segment sampling strategy obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; the inference strategy generates refined knowledge labels for the video segments using the twin knowledge generation module, and propagates the labels to all twin samples and fuses them using the twin knowledge inference module.
2. The video question-answering method based on self-driven twin sampling and reasoning according to claim 1, wherein in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
3. The video question-answering method based on self-driven twin sampling and reasoning according to claim 1 or 2, wherein the twin sampling comprises obtaining video segment samples with the length of B frames by sparsely sampling video samples F; wherein the length of F is greater than B; determining the index of the starting frame number of a reference video segment in a random mode, then carrying out twin sampling, selecting a plurality of starting indexes adjacent to the reference video segment in the same video sample, and selecting the video segments of B frames with the same length; and after twin sampling is finished, obtaining a reference video segment and a twin video segment which have similar global video semantics, and then sequentially coding the video segments by using a video coder to obtain visual characteristics.
4. The self-driven twin sampling and reasoning-based video question-answering method according to claim 3, wherein the start index of the reference segment is confined to the first third of the frames of the whole video; the indexes of the adjacent twin video segments are adjacent to that of the reference segment, each twin segment being obtained by contiguous sampling immediately before or after the reference segment.
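The index selection described in claims 3-4 can be sketched as follows; this is a minimal illustration, not the patented implementation, and the function name `siamese_sample` and the choice of candidate offsets are our assumptions:

```python
import random

def siamese_sample(num_frames, clip_len, num_twins=2):
    """Pick the start indices of one reference clip and nearby twin clips.

    num_frames: total length of the video sample F (must exceed clip_len).
    clip_len:   length B of every sampled segment, in frames.
    num_twins:  how many twin segments to draw around the reference.
    """
    assert num_frames > clip_len
    max_start = num_frames - clip_len
    # The reference clip starts within the first third of the video (claim 4).
    ref_start = random.randint(0, min(num_frames // 3, max_start))
    # Twin clips share the length B and lie contiguously before or
    # after the reference (claim 4); offsets here are illustrative.
    candidates = [ref_start - clip_len, ref_start + clip_len,
                  ref_start - 2 * clip_len, ref_start + 2 * clip_len]
    twin_starts = [s for s in candidates if 0 <= s <= max_start][:num_twins]
    return ref_start, twin_starts
```

Because the reference and its twins come from the same video, their clips overlap in global semantics, which is what makes the later label propagation between them meaningful.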
5. The self-driven twin sampling and reasoning-based video question-answering method according to claim 1, wherein the twin knowledge generation module comprises:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the pairwise similarity matrix A ∈ R^(N×N) of the samples is first calculated from the dot products of their encoded representations:

A_ij = F(x_i) · F(x_j) / (σ(F(x_i)) σ(F(x_j)))

wherein i and j are sample indexes, σ denotes the L2 norm function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), so that the reasoning does not collapse into a self-loop on individual samples, wherein I is the identity matrix and ∘ denotes the Hadamard product;

each row of matrix A is then normalized so that Σ_j A_ij = 1 holds for all i while the diagonal elements remain 0; the normalization can be realized by applying the softmax function to the off-diagonal elements of each row, i.e. A_ij = exp(A_ij) / Σ_{k≠i} exp(A_ik).
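A minimal NumPy sketch of the association-matrix construction in claim 5, assuming cosine similarity (dot product divided by the L2 norms) followed by zeroing the diagonal and an off-diagonal row softmax; the function name `context_matrix` is ours:

```python
import numpy as np

def context_matrix(features):
    """Build the row-stochastic, zero-diagonal association matrix A.

    features: (N, D) encoded representations of the N video segment-text pairs.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)  # sigma: L2 norms
    a = (features @ features.T) / (norms * norms.T)          # pairwise cosine similarity
    mask = 1.0 - np.eye(len(features))
    a = a * mask                                             # A = A o (1 - I)
    # Row-wise softmax over the off-diagonal entries, keeping the diagonal at 0.
    e = np.exp(a) * mask
    return e / e.sum(axis=1, keepdims=True)
```

The zeroed diagonal keeps each sample from reinforcing its own prediction during the later propagation step, while the row normalization makes every row a valid distribution over the other samples.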
6. The self-driven twin sampling and reasoning-based video question-answering method according to claim 1, wherein the twin knowledge reasoning module comprises:
in the training process, a classifier is trained simultaneously; the classifier takes as input the video segment-text feature representation output by the multi-modal Transformer and produces a soft-label prediction for the video segment-text sample, the prediction probability being denoted p_i ∈ R^K and satisfying Σ_k p_ik = 1, wherein K is the total number of categories;

the soft-label predictions of the other samples are propagated and combined to obtain better soft labels based on the contextual relations inside the video, the formula being

q_i = Σ_j A_ij p_j

wherein q_i is the probability vector propagated to the ith sample from the other samples and can also be regarded as its refined soft label; the larger A_ij is, the greater the weight with which sample j's prediction is propagated to sample i; the propagation is then performed over all sample pairs in parallel, the matrix form of the formula being Q = AP.
7. The self-driven twin sampling and reasoning-based video question-answering method according to claim 6, wherein, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft labels finally used are a weighted combination of the initial probability matrix P and the propagated probability matrix Q, the formula being P̃ = λP + (1 − λ)Q, wherein λ ∈ [0, 1] is a balance coefficient.
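The propagation and fusion steps of claims 6-7 can be sketched together; this is an illustration under our assumptions (the function name `refine_labels`, the parameter name `lam`, and its default value are not from the patent):

```python
import numpy as np

def refine_labels(A, P, lam=0.5):
    """Propagate soft labels through the association matrix and fuse them.

    A:   (N, N) row-normalized, zero-diagonal association matrix.
    P:   (N, K) classifier soft-label predictions, each row summing to 1.
    lam: balance between the initial and the propagated labels.
    """
    Q = A @ P                          # q_i = sum_j A_ij p_j, all pairs in parallel
    return lam * P + (1.0 - lam) * Q   # weighted fusion of P and Q
```

Since each row of A sums to 1 and each row of P sums to 1, every row of the fused result is still a valid probability distribution over the K answer categories.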
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111553547.0A CN114218439A (en) | 2021-12-17 | 2021-12-17 | Video question-answering method based on self-driven twin sampling and reasoning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114218439A true CN114218439A (en) | 2022-03-22 |
Family
ID=80703709
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111553547.0A Pending CN114218439A (en) | 2021-12-17 | 2021-12-17 | Video question-answering method based on self-driven twin sampling and reasoning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114218439A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113807222A (en) * | 2021-09-07 | 2021-12-17 | 中山大学 | Video question-answering method and system for end-to-end training based on sparse sampling |
Non-Patent Citations (1)
Title |
---|
WEIJIANG YU et al.: "Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering" * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115278110A (en) * | 2022-07-12 | 2022-11-01 | 时空穿越(深圳)科技有限公司 | Information processing method, device and system based on digital twin cabin and readable storage medium |
CN115278110B (en) * | 2022-07-12 | 2023-08-25 | 时空穿越(深圳)科技有限公司 | Information processing method, device and system based on digital twin cabin and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110083705B (en) | Multi-hop attention depth model, method, storage medium and terminal for target emotion classification | |
CN109947912A (en) | A kind of model method based on paragraph internal reasoning and combined problem answer matches | |
CN109800434B (en) | Method for generating abstract text title based on eye movement attention | |
CN110866542B (en) | Depth representation learning method based on feature controllable fusion | |
CN110418210A (en) | A kind of video presentation generation method exported based on bidirectional circulating neural network and depth | |
CN114020862A (en) | Retrieval type intelligent question-answering system and method for coal mine safety regulations | |
CN113297364B (en) | Natural language understanding method and device in dialogue-oriented system | |
CN111897944B (en) | Knowledge graph question-answering system based on semantic space sharing | |
CN113297370B (en) | End-to-end multi-modal question-answering method and system based on multi-interaction attention | |
CN112527993B (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
CN113204633B (en) | Semantic matching distillation method and device | |
CN110852089B (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
CN112001166A (en) | Intelligent question-answer sentence-to-semantic matching method and device for government affair consultation service | |
CN112000770A (en) | Intelligent question and answer oriented sentence-to-sentence matching method based on semantic feature map | |
CN114387537A (en) | Video question-answering method based on description text | |
CN113988079A (en) | Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method | |
CN115393933A (en) | Video face emotion recognition method based on frame attention mechanism | |
CN115238691A (en) | Knowledge fusion based embedded multi-intention recognition and slot filling model | |
CN114218439A (en) | Video question-answering method based on self-driven twin sampling and reasoning | |
CN112651225B (en) | Multi-item selection machine reading understanding method based on multi-stage maximum attention | |
CN116543289B (en) | Image description method based on encoder-decoder and Bi-LSTM attention model | |
CN111858879B (en) | Question and answer method and system based on machine reading understanding, storage medium and computer equipment | |
CN114239575B (en) | Statement analysis model construction method, statement analysis method, device, medium and computing equipment | |
CN113554040B (en) | Image description method and device based on condition generation countermeasure network | |
CN115659242A (en) | Multimode emotion classification method based on mode enhanced convolution graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||