CN114218439A - Video question-answering method based on self-driven twin sampling and reasoning - Google Patents

Video question-answering method based on self-driven twin sampling and reasoning

Info

Publication number
CN114218439A
Authority
CN
China
Prior art keywords
video
twin
sampling
segment
video segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111553547.0A
Other languages
Chinese (zh)
Inventor
余伟江
卢宇彤
李孟非
陈志广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111553547.0A priority Critical patent/CN114218439A/en
Publication of CN114218439A publication Critical patent/CN114218439A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video question-answering method based on self-driven twin sampling and reasoning, which comprises a video segment sampling strategy, a feature extraction strategy and an inference strategy. The video segment sampling obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; the inference strategy generates refined knowledge labels for the video segments by using a twin knowledge generation module, and propagates the labels to all twin samples and fuses them by using a twin knowledge inference module. The advantage of the invention is that it provides a self-driven framework based on twin sampling and reasoning for extracting contextual semantic information from different video segments of the same video, thereby enhancing the learning effect of the network.

Description

Video question-answering method based on self-driven twin sampling and reasoning
Technical Field
The invention relates to the technical field of computer vision and pattern recognition, and in particular to a video question-answering method based on self-driven twin sampling and reasoning.
Background
Video question-answering is a vision-language reasoning task with great application prospects, and it has attracted the attention of more and more researchers. The task needs to acquire and use the temporal and spatial characteristics of the visual signal in the video, guided by the combined semantics of the language clues, in order to generate answers.
Some existing work extracts general appearance and motion features from the video to represent its content and designs different attention mechanisms to integrate these features. These methods focus on better understanding the overall content of the video, but easily ignore the details within individual video segments. Other researchers have explored how to align visual and linguistic information at the semantic level. However, these efforts ignore the association between contexts in the same video.
Recently, the paradigm of fine-tuning pre-trained models on target datasets has brought great benefits to multi-modal tasks such as text-video retrieval, visual question answering, image captioning, video captioning, and video question answering. Building on the results of Transformer-based language pre-training, video-text pre-training for video-language tasks has made considerable progress, especially for the video question-answering task. This approach relies on a backbone network pre-trained on a large-scale video-text dataset, but none of the networks pre-trained in this way further mine the contextual information between video segments of the same video.
In summary, these existing multi-modal learning paradigms all share a significant drawback: the correlation between video segment-text pairs (clip-text pairs) in the same video is ignored, and each video segment-text pair is treated as independent during training. Thus, the prior art does not make good use of the rich contextual information within the same video. The inventors believe that this associated context information can be used to enhance the learning effect of the network.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a video question-answering method based on self-driven twin sampling and reasoning, which solves the problem that existing methods cannot make good use of the contextual information within the same video.
In order to achieve the purpose, the invention adopts the following technical scheme:
a video question-answering method based on self-driven twin sampling and reasoning comprises video segment sampling, feature extraction and reasoning strategies, wherein the video segment sampling obtains a reference video segment through sparse sampling and obtains a twin video segment through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modality; the inference strategy generates a refined knowledge label for the video segment by using the twin knowledge generation module, and propagates the label to all twin samples by using the twin knowledge inference module and fuses the labels.
It should be noted that in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
It should be noted that the twin sampling includes obtaining a video segment sample with a length of B frames by performing sparse sampling on a video sample F; wherein the length of F is greater than B; determining the index of the starting frame number of a reference video segment in a random mode, then carrying out twin sampling, selecting a plurality of starting indexes adjacent to the reference video segment in the same video sample, and selecting the video segments of B frames with the same length; and after twin sampling is finished, obtaining a reference video segment and a twin video segment which have similar global video semantics, and then sequentially coding the video segments by using a video coder to obtain visual characteristics.
It should be noted that the start index of the reference segment is limited to the first third of the frames of the whole video; the index of an adjacent twin video segment is adjacent to the index of the reference segment, and the twin segments are obtained by contiguous sampling before or after the reference segment.
It should be noted that the twin knowledge generation module includes:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the similarity matrix A ∈ R^(N×N) of the samples is first calculated pair by pair according to the dot product of the encoded representations of the samples:
A(i,j) = σ(F(f_i))^T σ(F(f_j)),
wherein i and j are the indexes of the samples, σ denotes the L2 normalization function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), where I is the identity matrix and ∘ denotes the Hadamard product, so that the inference is prevented from degenerating into trivial self-propagation;
each row of the matrix A is then normalized so that Σ_j Â(i,j) = 1 holds for all i while the diagonal elements are kept at 0; the normalization can be realized by applying the softmax function to each row of the matrix, i.e.
Â(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),
wherein Â(i,j) expresses the normalized associated knowledge of video segments i and j in the same video; Â ∈ R^(N×N) is the overall associated knowledge of all video segments, i.e. the twin knowledge.
It should be noted that the twin knowledge inference module includes:
in the training process, a classifier is trained simultaneously; its input is the video segment-text feature representation output by the multi-modal Transformer, and its output is a soft label prediction for each video segment-text sample, the prediction probabilities being recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K) and satisfying Σ_{k=1}^{K} p_i(k) = 1 for every i, wherein K is the total number of categories;
the soft label predictions of the other samples are propagated and combined, so as to obtain a better soft label based on the context relation within the video, the formula being
q̂_i = Σ_j Â(i,j) p_j,
wherein q̂_i is the probability vector propagated from the other samples to the i-th sample and can also be regarded as a refined soft label; if the i-th and j-th samples are more similar, their similarity Â(i,j) is larger, so p_j is propagated to q̂_i with a greater weight; a parallel, sample-pair-by-sample-pair propagation operation is then performed on all samples, the formula being
Q̂ = W(ÂP),
wherein W ∈ R^(1×N) is a learnable matrix and Q̂ is the calculated soft label.
It should be noted that, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used is a fusion of the initial probability matrix P and the propagated probability matrix Q̂, given by the formula
Q = ω W_1 P + (1 − ω) W_2 Q̂,
wherein ω ∈ [0, 1] is a weight factor, W_1 and W_2 are learnable parameters, and the fused soft label satisfies Σ_k Q(k) = 1.
The advantage of the invention is that it provides a self-driven framework based on twin sampling and reasoning for extracting contextual semantic information from different video segments of the same video, thereby enhancing the learning effect of the network.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture framework of the present invention;
FIG. 2 is a schematic diagram comparing dense sampling, sparse sampling and twinned sampling;
FIG. 3 is a comparison diagram of the multi-modal learning process under different sampling modes.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, and it should be noted that the following examples are provided to illustrate the detailed embodiments and specific operations based on the technical solutions of the present invention, but the scope of the present invention is not limited to the examples.
The invention relates to a video question-answering method based on self-driven twin sampling and reasoning, which comprises a video segment sampling strategy, a feature extraction strategy and an inference strategy, wherein the video segment sampling obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; the inference strategy generates refined knowledge labels for the video segments by using the twin knowledge generation module, and propagates the labels to all twin samples and fuses them by using the twin knowledge inference module.
It should be noted that in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
It should be noted that the twin sampling includes obtaining a video segment sample with a length of B frames by performing sparse sampling on a video sample F; wherein the length of F is greater than B; determining the index of the starting frame number of a reference video segment in a random mode, then carrying out twin sampling, selecting a plurality of starting indexes adjacent to the reference video segment in the same video sample, and selecting the video segments of B frames with the same length; and after twin sampling is finished, obtaining a reference video segment and a twin video segment which have similar global video semantics, and then sequentially coding the video segments by using a video coder to obtain visual characteristics.
It should be noted that the start index of the reference segment is limited to the first third of the frames of the whole video; the index of an adjacent twin video segment is adjacent to the index of the reference segment, and the twin segments are obtained by contiguous sampling before or after the reference segment.
It should be noted that the twin knowledge generation module includes:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the similarity matrix A ∈ R^(N×N) of the samples is first calculated pair by pair according to the dot product of the encoded representations of the samples:
A(i,j) = σ(F(f_i))^T σ(F(f_j)),
wherein i and j are the indexes of the samples, σ denotes the L2 normalization function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), where I is the identity matrix and ∘ denotes the Hadamard product, so that the inference is prevented from degenerating into trivial self-propagation;
each row of the matrix A is then normalized so that Σ_j Â(i,j) = 1 holds for all i while the diagonal elements are kept at 0; the normalization can be realized by applying the softmax function to each row of the matrix, i.e.
Â(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),
wherein Â(i,j) expresses the normalized associated knowledge of video segments i and j in the same video; Â ∈ R^(N×N) is the overall associated knowledge of all video segments, i.e. the twin knowledge.
It should be noted that the twin knowledge inference module includes:
in the training process, a classifier is trained simultaneously; its input is the video segment-text feature representation output by the multi-modal Transformer, and its output is a soft label prediction for each video segment-text sample, the prediction probabilities being recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K) and satisfying Σ_{k=1}^{K} p_i(k) = 1 for every i, wherein K is the total number of categories;
the soft label predictions of the other samples are propagated and combined, so as to obtain a better soft label based on the context relation within the video, the formula being
q̂_i = Σ_j Â(i,j) p_j,
wherein q̂_i is the probability vector propagated from the other samples to the i-th sample and can also be regarded as a refined soft label; if the i-th and j-th samples are more similar, their similarity Â(i,j) is larger, so p_j is propagated to q̂_i with a greater weight; a parallel, sample-pair-by-sample-pair propagation operation is then performed on all samples, the formula being
Q̂ = W(ÂP),
wherein W ∈ R^(1×N) is a learnable matrix and Q̂ is the calculated soft label.
It should be noted that, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used is a fusion of the initial probability matrix P and the propagated probability matrix Q̂, given by the formula
Q = ω W_1 P + (1 − ω) W_2 Q̂,
wherein ω ∈ [0, 1] is a weight factor, W_1 and W_2 are learnable parameters, and the fused soft label satisfies Σ_k Q(k) = 1.
Examples
FIG. 1 shows the overall framework of the invention. For a video F, the invention constructs a reference video segment c_anchor and the corresponding twin video segments {c_i^twin, i = 1, ..., N−1} by sampling, where N denotes the total number of video segments. Each video segment c is B frames in length, and B feature maps are obtained through the video encoder. The encoded features of the reference segment and of the twin segments are denoted v_anchor and {v_i^twin}, respectively. The text representation produced by the text encoder and the video segment representation produced by the video encoder are spliced together as input, and a multi-modal Transformer is used to generate the video segment-text features. The reference video segment-text feature is denoted f_anchor, and the corresponding twin video segment-text features are denoted {f_i^twin}. Similarly, the soft label predictions derived by inference from these features are denoted p_anchor and {p_i^twin}, where p is a vector of dimension K, K being the number of classification classes.
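As a minimal illustrative sketch (not the patented implementation), the splicing of text tokens and video-segment tokens into a multi-modal Transformer can be pictured as follows; the module name MultiModalFusion, the feature dimension, the number of layers, and the use of a generic PyTorch TransformerEncoder are all assumptions:

import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Concatenate text tokens and video-segment tokens, fuse them with a Transformer,
    and predict a K-way soft label from the pooled representation (illustrative only)."""
    def __init__(self, dim=768, num_classes=171, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_tokens, video_tokens):
        # text_tokens: (batch, T_text, dim); video_tokens: (batch, B_frames, dim)
        x = torch.cat([text_tokens, video_tokens], dim=1)   # splice the two modalities
        x = self.fusion(x)                                  # multi-modal Transformer
        f = x.mean(dim=1)                                   # pooled video segment-text feature
        p = self.classifier(f).softmax(dim=-1)              # soft label prediction
        return f, p

# usage on random features standing in for the text/video encoder outputs
fusion = MultiModalFusion()
f_anchor, p_anchor = fusion(torch.randn(1, 20, 768), torch.randn(1, 16, 768))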
Twin sampling. FIG. 2 illustrates the difference between twin sampling and conventional dense and sparse sampling. First, the invention obtains a video segment sample with a length of B frames by sparsely sampling the video sample F, where the length of F is much greater than B. The invention determines the start frame index of the reference video segment at random and then performs twin sampling: several start indexes adjacent to the reference video segment are selected in the same video sample, and video segments of the same length of B frames are taken.
The invention imposes certain restrictions on the selection of the start index of the reference segment. The start index of the reference segment is limited to the first third of the frames of the entire video to keep the sampling balanced. "Adjacent" above means that the index of a twin video segment should be adjacent to the index of the reference segment; the invention performs contiguous sampling before or after the reference segment to obtain the twin segments.
After twin sampling is completed, the method obtains a reference video segment and twin video segments that share similar global video semantics. Then, the invention uses a video encoder to encode the video segments in sequence according to the method shown in FIG. 3, so as to obtain the visual features. The video encoder and text encoder are very similar to the encoders in ClipBert.
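A minimal sketch of the sampling step just described, assuming a helper named twin_sample and purely illustrative frame counts (the only constraints taken from the text are "anchor start within the first third of the video" and "adjacent, contiguous segments of length B"):

import random

def twin_sample(num_frames, B, num_twins):
    """Return the anchor start index and adjacent twin start indexes (illustrative sketch)."""
    assert num_frames > B
    # anchor start index restricted to the first third of the video for balance
    anchor_start = random.randint(0, max(0, num_frames // 3 - 1))
    twin_starts = []
    step = 1
    while len(twin_starts) < num_twins and step * B <= num_frames:
        for s in (anchor_start - step * B, anchor_start + step * B):
            if 0 <= s <= num_frames - B and len(twin_starts) < num_twins:
                twin_starts.append(s)   # contiguous segments before/after the anchor
        step += 1
    return anchor_start, twin_starts

anchor, twins = twin_sample(num_frames=120, B=16, num_twins=3)
segments = [(s, s + 16) for s in [anchor] + twins]   # (start, end) frame ranges, each B frames long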
Inference strategy. Unlike previous work that directly uses the video segment-text features obtained from the multi-modal Transformer to predict the final result, the invention explores how to fuse the information of different video segments and proposes an innovative inference strategy, named self-driven multi-modal learning, which includes a twin knowledge generation module and a twin knowledge inference module.
The twin knowledge generation module and how twin knowledge is generated will be described first.
Video segments that are visually highly similar should have more similar class probability predictions when each video segment is classified. In the proposed solution, the information of similar video segments is systematically aggregated and fused together, providing more accurate soft label prediction. The method propagates and fuses twin knowledge within the same group of reference segment and twin segment samples according to the similarity of the features extracted by the multi-modal Transformer. Given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the invention first computes the similarity matrix A ∈ R^(N×N) of the samples pair by pair according to the dot product of the encoded representations of the samples:
A(i,j) = σ(F(f_i))^T σ(F(f_j)),
where i and j are the indexes of the samples, σ denotes the L2 normalization function, and N is the number of video segments. To prevent the inference from degenerating into trivial self-propagation, the invention uses the formula A = A ∘ (1 − I) to set the elements on the diagonal of the association matrix A to 0, where I is the identity matrix and ∘ denotes the Hadamard product.
The invention then normalizes each row of the matrix A so that Σ_j Â(i,j) = 1 holds for all i while the diagonal elements are kept at 0. The normalization can be realized by applying the softmax function to each row of the matrix, i.e.
Â(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),
where Â(i,j) denotes the normalized associated knowledge of video segments i and j in the same video. Â ∈ R^(N×N), the overall associated knowledge of all video segments, is referred to as the twin knowledge in the present invention.
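The twin knowledge generation described above can be restated compactly as follows; the tensor shapes and the masked-softmax implementation are assumptions, but the steps mirror the formulas (L2 normalization, dot-product affinity, zeroed diagonal, row-wise softmax):

import torch
import torch.nn.functional as F

def twin_knowledge(features):
    """features: (N, D) video segment-text features from the multi-modal Transformer.
    Returns the normalized association matrix A_hat (the twin knowledge)."""
    z = F.normalize(features, p=2, dim=1)            # sigma: L2 normalization of each feature
    A = z @ z.t()                                    # A(i, j) = sigma(F(f_i))^T sigma(F(f_j))
    eye = torch.eye(A.size(0), dtype=torch.bool)
    A = A.masked_fill(eye, float('-inf'))            # remove self-similarity (A = A o (1 - I))
    A_hat = A.softmax(dim=1)                         # row-wise softmax; rows sum to 1, diagonal is 0
    return A_hat

A_hat = twin_knowledge(torch.randn(4, 768))          # e.g. 4 segments sampled from one video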
The twin knowledge inference module and how the twin knowledge is used are described below.
In the training process, the invention simultaneously trains a classifier whose input is the video segment-text feature representation output by the multi-modal Transformer and whose output is a soft label prediction for the video segment-text sample. The prediction probabilities are recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K) and satisfy Σ_{k=1}^{K} p_i(k) = 1 for every i, where K is the total number of classes.
The invention can propagate and combine the soft label predictions of the other samples, thereby obtaining a better soft label based on the context relation within the video, with the formula
q̂_i = Σ_j Â(i,j) p_j,
where q̂_i is the probability vector propagated from the other samples to the i-th sample and can also be regarded as a refined soft label. If the i-th and j-th samples are more similar, their similarity Â(i,j) is larger, so p_j is propagated to q̂_i with a greater weight. Similarly, the invention performs a parallel, sample-pair-by-sample-pair propagation operation on all samples, formulated as
Q̂ = W(ÂP),
where W ∈ R^(1×N) is a learnable matrix and Q̂ is the calculated soft label.
In order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used by the invention is a fusion of the initial probability matrix P and the propagated probability matrix Q̂, given by the formula
Q = ω W_1 P + (1 − ω) W_2 Q̂,
where ω ∈ [0, 1] is a weight factor, W_1 and W_2 are learnable parameters, and the fused soft label satisfies Σ_k Q(k) = 1.
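A compact sketch of the propagation and fusion steps under the reconstruction given above; treating W, W_1 and ω as fixed tensors/scalars (rather than trained parameters) and the final renormalization are illustrative assumptions:

import torch

N, K = 4, 171                                       # segments per video, answer classes (example sizes)
A_hat = torch.rand(N, N)
A_hat.fill_diagonal_(0.0)
A_hat = A_hat / A_hat.sum(dim=1, keepdim=True)      # stand-in for the twin knowledge, rows sum to 1
P = torch.rand(N, K).softmax(dim=1)                 # classifier soft labels, one row per segment

Q_prop = A_hat @ P                                  # q_i = sum_j A_hat(i, j) * p_j
W = torch.rand(1, N).softmax(dim=1)                 # learnable W in R^(1 x N) (here a fixed sketch)
q_hat = W @ Q_prop                                  # propagated soft label Q_hat, shape (1, K)

omega = 0.5                                         # weight factor in [0, 1]
W1 = torch.rand(1, N).softmax(dim=1)                # learnable weights on the initial predictions
Q = omega * (W1 @ P) + (1.0 - omega) * q_hat        # fused refined soft label, shape (1, K)
Q = Q / Q.sum()                                     # keep Q a valid probability distribution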
With the above formulas, the invention can propagate and fuse the knowledge of the correlations between samples within the same video in one iteration. This knowledge is compared with the previously generated sample class probability p_anchor to guide the training process of the network, so that video segments that are visually highly similar obtain more similar class probability predictions.
Objective function. The objective function L of the framework proposed by the invention has two terms. The first term, L_siamese, is the loss between the calculated soft label Q and the sample prediction p_anchor; the invention computes L_siamese with a cross-entropy loss function. The other term is the loss between the model prediction and the ground truth during training; its specific form depends on the task and is collectively denoted here as L_gt.
Finally, the objective function of the model of the invention is
L = α L_siamese + L_gt,
where α is a hyperparameter that adjusts the ratio of the two terms.
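A sketch of the two-term objective, assuming the cross-entropy against the refined soft label Q is computed with soft targets and that L_gt is a standard classification loss over answer categories (the exact task loss depends on the dataset):

import torch
import torch.nn.functional as F

def siamese_loss(p_anchor_logits, Q):
    """Cross-entropy between the refined soft label Q (soft target) and the anchor prediction."""
    log_p = F.log_softmax(p_anchor_logits, dim=-1)
    return -(Q * log_p).sum(dim=-1).mean()

def total_loss(p_anchor_logits, Q, answer_logits, answer_labels, alpha=0.5):
    L_siamese = siamese_loss(p_anchor_logits, Q)
    L_gt = F.cross_entropy(answer_logits, answer_labels)   # task loss against ground-truth answers
    return alpha * L_siamese + L_gt                        # L = alpha * L_siamese + L_gt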
Details of the framework. The invention uses the model obtained by pre-training ClipBert on the COCO Captions and Visual Genome Captions datasets as the initialization weights of the SiaSamRea framework of the invention. Starting from these weights, the invention continues video segment-text pre-training on the TGIF-QA dataset and performs optimization with the aforementioned objective function. Since the invention aims to model long-term context information through twin sampling and reasoning, only samples whose total number of video frames exceeds 40 are used. The length of the video segments selected from a video is determined according to the length of that video.
During pre-training and formal training, a reference video segment is selected on each GPU through sparse sampling, and an AdamW optimizer is used for end-to-end training of the whole model. The invention gradually increases the learning rate in the early stage of training and then decreases it linearly; training stops when the learning rate decays to a set minimum value.
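The optimization recipe in the preceding paragraph (AdamW with learning-rate warm-up followed by linear decay to a set minimum) could be sketched as follows; all step counts and learning-rate values are assumed, not taken from the patent:

import torch

model = torch.nn.Linear(768, 171)                     # placeholder for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-3)

warmup_steps, total_steps, min_ratio = 1000, 20000, 0.05

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # gradually raise the learning rate
    decay = 1.0 - (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(min_ratio, decay)                      # linear decay, clipped at the set minimum

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# inside the training loop: optimizer.step(); scheduler.step()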
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims (7)

1. A video question-answering method based on self-driven twin sampling and reasoning, characterized by comprising a video segment sampling strategy, a feature extraction strategy and an inference strategy, wherein the video segment sampling strategy obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction strategy encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; and the inference strategy generates refined knowledge labels for the video segments by using a twin knowledge generation module, and propagates the labels to all twin samples and fuses them by using a twin knowledge inference module.
2. The video question-answering method based on self-driven twin sampling and reasoning according to claim 1, wherein in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
3. The video question-answering method based on self-driven twin sampling and reasoning according to claim 1 or 2, wherein the twin sampling comprises obtaining video segment samples with the length of B frames by sparsely sampling video samples F; wherein the length of F is greater than B; determining the index of the starting frame number of a reference video segment in a random mode, then carrying out twin sampling, selecting a plurality of starting indexes adjacent to the reference video segment in the same video sample, and selecting the video segments of B frames with the same length; and after twin sampling is finished, obtaining a reference video segment and a twin video segment which have similar global video semantics, and then sequentially coding the video segments by using a video coder to obtain visual characteristics.
4. The self-driven twin sampling and reasoning-based video question answering method according to claim 3, wherein the start index of the reference segment is limited to the first third of the frames of the whole video; the index of an adjacent twin video segment is adjacent to the index of the reference segment, and the twin segment is obtained by contiguous sampling before or after the reference segment.
5. The self-driven twin sampling and reasoning based video question answering method according to claim 1, wherein the twin knowledge generation module includes:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the similarity matrix A ∈ R^(N×N) of the samples is first calculated pair by pair according to the dot product of the encoded representations of the samples:
A(i,j) = σ(F(f_i))^T σ(F(f_j)),
wherein i and j are the indexes of the samples, σ denotes the L2 normalization function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), where I is the identity matrix and ∘ denotes the Hadamard product, so that the inference is prevented from degenerating into trivial self-propagation;
each row of the matrix A is normalized so that Σ_j Â(i,j) = 1 holds for all i while the diagonal elements are kept at 0; the normalization is realized by applying the softmax function to each row of the matrix, i.e.
Â(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),
wherein Â(i,j) expresses the normalized associated knowledge of video segments i and j in the same video, and Â ∈ R^(N×N) is the overall associated knowledge of all video segments, i.e. the twin knowledge.
6. The self-driven twin sampling and reasoning based video question answering method according to claim 1, wherein the twin knowledge inference module includes:
in the training process, a classifier is trained simultaneously; its input is the video segment-text feature representation output by the multi-modal Transformer, and its output is a soft label prediction for each video segment-text sample, the prediction probabilities being recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K) and satisfying Σ_{k=1}^{K} p_i(k) = 1 for every i, wherein K is the total number of categories;
the soft label predictions of the other samples are propagated and combined, so as to obtain a better soft label based on the context relation within the video, the formula being
q̂_i = Σ_j Â(i,j) p_j,
wherein q̂_i is the probability vector propagated from the other samples to the i-th sample and can also be regarded as a refined soft label; if the i-th and j-th samples are more similar, their similarity Â(i,j) is larger, so p_j is propagated to q̂_i with a greater weight; a parallel, sample-pair-by-sample-pair propagation operation is then performed on all samples, the formula being
Q̂ = W(ÂP),
wherein W ∈ R^(1×N) is a learnable matrix and Q̂ is the calculated soft label.
7. The self-driven twin sampling and reasoning-based video question-answering method according to claim 6, wherein, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used is a fusion of the initial probability matrix P and the propagated probability matrix Q̂, given by the formula
Q = ω W_1 P + (1 − ω) W_2 Q̂,
wherein ω ∈ [0, 1] is a weight factor, W_1 and W_2 are learnable parameters, and the fused soft label satisfies Σ_k Q(k) = 1.
CN202111553547.0A 2021-12-17 2021-12-17 Video question-answering method based on self-driven twin sampling and reasoning Pending CN114218439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111553547.0A CN114218439A (en) 2021-12-17 2021-12-17 Video question-answering method based on self-driven twin sampling and reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111553547.0A CN114218439A (en) 2021-12-17 2021-12-17 Video question-answering method based on self-driven twin sampling and reasoning

Publications (1)

Publication Number Publication Date
CN114218439A true CN114218439A (en) 2022-03-22

Family

ID=80703709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111553547.0A Pending CN114218439A (en) 2021-12-17 2021-12-17 Video question-answering method based on self-driven twin sampling and reasoning

Country Status (1)

Country Link
CN (1) CN114218439A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278110A (en) * 2022-07-12 2022-11-01 时空穿越(深圳)科技有限公司 Information processing method, device and system based on digital twin cabin and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807222A (en) * 2021-09-07 2021-12-17 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807222A (en) * 2021-09-07 2021-12-17 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEIJIANG YU et al.: "Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278110A (en) * 2022-07-12 2022-11-01 时空穿越(深圳)科技有限公司 Information processing method, device and system based on digital twin cabin and readable storage medium
CN115278110B (en) * 2022-07-12 2023-08-25 时空穿越(深圳)科技有限公司 Information processing method, device and system based on digital twin cabin and readable storage medium

Similar Documents

Publication Publication Date Title
CN110083705B (en) Multi-hop attention depth model, method, storage medium and terminal for target emotion classification
CN109947912A (en) A kind of model method based on paragraph internal reasoning and combined problem answer matches
CN109800434B (en) Method for generating abstract text title based on eye movement attention
CN110866542B (en) Depth representation learning method based on feature controllable fusion
CN110418210A (en) A kind of video presentation generation method exported based on bidirectional circulating neural network and depth
CN114020862A (en) Retrieval type intelligent question-answering system and method for coal mine safety regulations
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN111897944B (en) Knowledge graph question-answering system based on semantic space sharing
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN113204633B (en) Semantic matching distillation method and device
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN112001166A (en) Intelligent question-answer sentence-to-semantic matching method and device for government affair consultation service
CN112000770A (en) Intelligent question and answer oriented sentence-to-sentence matching method based on semantic feature map
CN114387537A (en) Video question-answering method based on description text
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN115238691A (en) Knowledge fusion based embedded multi-intention recognition and slot filling model
CN114218439A (en) Video question-answering method based on self-driven twin sampling and reasoning
CN112651225B (en) Multi-item selection machine reading understanding method based on multi-stage maximum attention
CN116543289B (en) Image description method based on encoder-decoder and Bi-LSTM attention model
CN111858879B (en) Question and answer method and system based on machine reading understanding, storage medium and computer equipment
CN114239575B (en) Statement analysis model construction method, statement analysis method, device, medium and computing equipment
CN113554040B (en) Image description method and device based on condition generation countermeasure network
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination