CN114218439A - Video question-answering method based on self-driven twin sampling and reasoning - Google Patents

Video question-answering method based on self-driven twin sampling and reasoning

Info

Publication number
CN114218439A
Authority
CN
China
Prior art keywords
video
twin
sampling
segment
video segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111553547.0A
Other languages
Chinese (zh)
Inventor
余伟江
卢宇彤
李孟非
陈志广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111553547.0A priority Critical patent/CN114218439A/en
Publication of CN114218439A publication Critical patent/CN114218439A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video question-answering method based on self-driven twin sampling and reasoning, which comprises a video segment sampling strategy, a feature extraction strategy and an inference strategy. The video segment sampling obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; the inference strategy generates refined knowledge labels for the video segments by using a twin knowledge generation module, and propagates the labels to all twin samples and fuses them by using a twin knowledge inference module. The advantage of the invention is that it provides a self-driven framework based on twin sampling and reasoning for extracting contextual semantic information from different video segments of the same video, thereby enhancing the learning effect of the network.

Description

Video question-answering method based on self-driven twin sampling and reasoning
Technical Field
The invention relates to the technical field of computer vision and pattern recognition, and in particular to a video question-answering method based on self-driven twin sampling and reasoning.
Background
Video question-answering is a vision-language reasoning task with great application prospects, and it has attracted the attention of more and more researchers. The task needs to acquire and use the temporal and spatial characteristics of the visual signal in the video, guided by the combined semantics of the language clues, in order to generate answers.
Some existing work extracts general appearance and motion features from the video to represent its content and designs different attention mechanisms to integrate these features. These methods focus on better understanding the overall content of the video, but easily ignore the details within individual video segments. Other researchers have explored how to align visual and linguistic information at the semantic level. However, these efforts ignore the association between contexts in the same video.
Recently, the paradigm of fine-tuning pre-trained models on target datasets has brought great benefits to multi-modal tasks such as text-video retrieval, visual question answering, image captioning, video captioning, and video question answering. Building on the results of Transformer-based language pre-training, video-text pre-training for video-language tasks has made considerable progress, especially for the video question-answering task. This approach relies on a backbone network pre-trained on a large-scale video-text dataset, but none of the networks pre-trained in this way further mine the contextual information between video segments of the same video.
In summary, these existing multi-modal learning paradigms all share a significant drawback: the correlation between video segment-text pairs (clip-text pairs) in the same video is ignored, and each video segment-text pair is treated as independent during training. Thus, the prior art does not make good use of the rich contextual information within the same video. The inventors believe that this associated context information can be used to enhance the learning effect of the network.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a video question-answering method based on self-driven twin sampling and reasoning, which solves the problem that existing methods cannot make good use of the contextual information within the same video.
In order to achieve the purpose, the invention adopts the following technical scheme:
a video question-answering method based on self-driven twin sampling and reasoning comprises video segment sampling, feature extraction and reasoning strategies, wherein the video segment sampling obtains a reference video segment through sparse sampling and obtains a twin video segment through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modality; the inference strategy generates a refined knowledge label for the video segment by using the twin knowledge generation module, and propagates the label to all twin samples by using the twin knowledge inference module and fuses the labels.
It should be noted that in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
It should be noted that the twin sampling includes obtaining a video segment sample with a length of B frames by performing sparse sampling on a video sample F; wherein the length of F is greater than B; determining the index of the starting frame number of a reference video segment in a random mode, then carrying out twin sampling, selecting a plurality of starting indexes adjacent to the reference video segment in the same video sample, and selecting the video segments of B frames with the same length; and after twin sampling is finished, obtaining a reference video segment and a twin video segment which have similar global video semantics, and then sequentially coding the video segments by using a video coder to obtain visual characteristics.
It should be noted that the start index of the reference segment is limited to the first third of the frames of the whole video; the index of an adjacent twin video segment is adjacent to the index of the reference segment, and the twin segments are obtained by contiguous sampling before or after the reference segment.
It should be noted that the twin knowledge generation module includes:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the similarity matrix A ∈ R^(N×N) of the samples is first calculated pair by pair according to the dot product of the encoded representations of the samples:
A(i,j) = σ(F(f_i))^T σ(F(f_j)),
wherein i and j are the indexes of the samples, σ denotes the L2 normalization function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), where I is the identity matrix and ∘ denotes the Hadamard product, so that the inference is prevented from degenerating into trivial self-propagation;
each row of the matrix A is then normalized so that Σ_j Â(i,j) = 1 holds for all i while the diagonal elements are kept at 0; the normalization can be realized by applying the softmax function to each row of the matrix, i.e.
Â(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),
wherein Â(i,j) expresses the normalized associated knowledge of video segments i and j in the same video; Â ∈ R^(N×N) is the overall associated knowledge of all video segments, i.e. the twin knowledge.
It should be noted that the twin knowledge inference module includes:
in the training process, a classifier is trained simultaneously; its input is the video segment-text feature representation output by the multi-modal Transformer, and its output is a soft label prediction for each video segment-text sample, the prediction probabilities being recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K) and satisfying Σ_{k=1}^{K} p_i(k) = 1 for every i, wherein K is the total number of categories;
the soft label predictions of the other samples are propagated and combined, so as to obtain a better soft label based on the context relation within the video, the formula being
q̂_i = Σ_j Â(i,j) p_j,
wherein q̂_i is the probability vector propagated from the other samples to the i-th sample and can also be regarded as a refined soft label; if the i-th and j-th samples are more similar, their similarity Â(i,j) is larger, so p_j is propagated to q̂_i with a greater weight; a parallel, sample-pair-by-sample-pair propagation operation is then performed on all samples, the formula being
Q̂ = W(ÂP),
wherein W ∈ R^(1×N) is a learnable matrix and Q̂ is the calculated soft label.
It should be noted that, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used is a fusion of the initial probability matrix P and the propagated probability matrix Q̂, given by the formula
Q = ω W_1 P + (1 − ω) W_2 Q̂,
wherein ω ∈ [0, 1] is a weight factor, W_1 and W_2 are learnable parameters, and the fused soft label satisfies Σ_k Q(k) = 1.
The advantage of the invention is that it provides a self-driven framework based on twin sampling and reasoning for extracting contextual semantic information from different video segments of the same video, thereby enhancing the learning effect of the network.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture framework of the present invention;
FIG. 2 is a schematic diagram comparing dense sampling, sparse sampling and twinned sampling;
FIG. 3 is a comparison diagram of the multi-modal learning process under different sampling modes.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, and it should be noted that the following examples are provided to illustrate the detailed embodiments and specific operations based on the technical solutions of the present invention, but the scope of the present invention is not limited to the examples.
The invention relates to a video question-answering method based on self-driven twin sampling and reasoning, which comprises a video segment sampling strategy, a feature extraction strategy and an inference strategy, wherein the video segment sampling obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; the inference strategy generates refined knowledge labels for the video segments by using the twin knowledge generation module, and propagates the labels to all twin samples and fuses them by using the twin knowledge inference module.
It should be noted that in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
It should be noted that the twin sampling includes obtaining a video segment sample with a length of B frames by performing sparse sampling on a video sample F; wherein the length of F is greater than B; determining the index of the starting frame number of a reference video segment in a random mode, then carrying out twin sampling, selecting a plurality of starting indexes adjacent to the reference video segment in the same video sample, and selecting the video segments of B frames with the same length; and after twin sampling is finished, obtaining a reference video segment and a twin video segment which have similar global video semantics, and then sequentially coding the video segments by using a video coder to obtain visual characteristics.
It should be noted that the start index of the reference segment is limited to the first third of the frames of the whole video; the index of an adjacent twin video segment is adjacent to the index of the reference segment, and the twin segments are obtained by contiguous sampling before or after the reference segment.
It should be noted that the twin knowledge generation module includes:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the similarity matrix A ∈ R^(N×N) of the samples is first calculated pair by pair according to the dot product of the encoded representations of the samples:
A(i,j) = σ(F(f_i))^T σ(F(f_j)),
wherein i and j are the indexes of the samples, σ denotes the L2 normalization function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), where I is the identity matrix and ∘ denotes the Hadamard product, so that the inference is prevented from degenerating into trivial self-propagation;
each row of the matrix A is then normalized so that Σ_j Â(i,j) = 1 holds for all i while the diagonal elements are kept at 0; the normalization can be realized by applying the softmax function to each row of the matrix, i.e.
Â(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),
wherein Â(i,j) expresses the normalized associated knowledge of video segments i and j in the same video; Â ∈ R^(N×N) is the overall associated knowledge of all video segments, i.e. the twin knowledge.
It should be noted that the twin knowledge inference module includes:
in the training process, a classifier is trained simultaneously; its input is the video segment-text feature representation output by the multi-modal Transformer, and its output is a soft label prediction for each video segment-text sample, the prediction probabilities being recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K) and satisfying Σ_{k=1}^{K} p_i(k) = 1 for every i, wherein K is the total number of categories;
the soft label predictions of the other samples are propagated and combined, so as to obtain a better soft label based on the context relation within the video, the formula being
q̂_i = Σ_j Â(i,j) p_j,
wherein q̂_i is the probability vector propagated from the other samples to the i-th sample and can also be regarded as a refined soft label; if the i-th and j-th samples are more similar, their similarity Â(i,j) is larger, so p_j is propagated to q̂_i with a greater weight; a parallel, sample-pair-by-sample-pair propagation operation is then performed on all samples, the formula being
Q̂ = W(ÂP),
wherein W ∈ R^(1×N) is a learnable matrix and Q̂ is the calculated soft label.
It should be noted that, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used is a fusion of the initial probability matrix P and the propagated probability matrix Q̂, given by the formula
Q = ω W_1 P + (1 − ω) W_2 Q̂,
wherein ω ∈ [0, 1] is a weight factor, W_1 and W_2 are learnable parameters, and the fused soft label satisfies Σ_k Q(k) = 1.
Examples
FIG. 1 shows the overall framework of the invention. For a video F, the invention constructs a reference video segment c_anchor and the corresponding twin video segments {c_i^twin, i = 1, ..., N−1} by sampling, where N denotes the total number of video segments. Each video segment c is B frames in length, and B feature maps are obtained through the video encoder. The encoded features of the reference segment and of the twin segments are denoted v_anchor and {v_i^twin}, respectively. The text representation produced by the text encoder and the video segment representation produced by the video encoder are spliced together as input, and a multi-modal Transformer is used to generate the video segment-text features. The reference video segment-text feature is denoted f_anchor, and the corresponding twin video segment-text features are denoted {f_i^twin}. Similarly, the soft label predictions derived by inference from these features are denoted p_anchor and {p_i^twin}, where p is a vector of dimension K, K being the number of classification classes.
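As a minimal illustrative sketch (not the patented implementation), the splicing of text tokens and video-segment tokens into a multi-modal Transformer can be pictured as follows; the module name MultiModalFusion, the feature dimension, the number of layers, and the use of a generic PyTorch TransformerEncoder are all assumptions:

import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Concatenate text tokens and video-segment tokens, fuse them with a Transformer,
    and predict a K-way soft label from the pooled representation (illustrative only)."""
    def __init__(self, dim=768, num_classes=171, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_tokens, video_tokens):
        # text_tokens: (batch, T_text, dim); video_tokens: (batch, B_frames, dim)
        x = torch.cat([text_tokens, video_tokens], dim=1)   # splice the two modalities
        x = self.fusion(x)                                  # multi-modal Transformer
        f = x.mean(dim=1)                                   # pooled video segment-text feature
        p = self.classifier(f).softmax(dim=-1)              # soft label prediction
        return f, p

# usage on random features standing in for the text/video encoder outputs
fusion = MultiModalFusion()
f_anchor, p_anchor = fusion(torch.randn(1, 20, 768), torch.randn(1, 16, 768))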
Twin sampling. FIG. 2 illustrates the difference between twin sampling and conventional dense and sparse sampling. First, the invention obtains a video segment sample with a length of B frames by sparsely sampling the video sample F, where the length of F is much greater than B. The invention determines the start frame index of the reference video segment at random and then performs twin sampling: several start indexes adjacent to the reference video segment are selected in the same video sample, and video segments of the same length of B frames are taken.
The invention imposes certain restrictions on the selection of the start index of the reference segment. The start index of the reference segment is limited to the first third of the frames of the entire video to keep the sampling balanced. "Adjacent" above means that the index of a twin video segment should be adjacent to the index of the reference segment; the invention performs contiguous sampling before or after the reference segment to obtain the twin segments.
After twin sampling is completed, the method obtains a reference video segment and twin video segments that share similar global video semantics. Then, the invention uses a video encoder to encode the video segments in sequence according to the method shown in FIG. 3, so as to obtain the visual features. The video encoder and text encoder are very similar to the encoders in ClipBert.
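A minimal sketch of the sampling step just described, assuming a helper named twin_sample and purely illustrative frame counts (the only constraints taken from the text are "anchor start within the first third of the video" and "adjacent, contiguous segments of length B"):

import random

def twin_sample(num_frames, B, num_twins):
    """Return the anchor start index and adjacent twin start indexes (illustrative sketch)."""
    assert num_frames > B
    # anchor start index restricted to the first third of the video for balance
    anchor_start = random.randint(0, max(0, num_frames // 3 - 1))
    twin_starts = []
    step = 1
    while len(twin_starts) < num_twins and step * B <= num_frames:
        for s in (anchor_start - step * B, anchor_start + step * B):
            if 0 <= s <= num_frames - B and len(twin_starts) < num_twins:
                twin_starts.append(s)   # contiguous segments before/after the anchor
        step += 1
    return anchor_start, twin_starts

anchor, twins = twin_sample(num_frames=120, B=16, num_twins=3)
segments = [(s, s + 16) for s in [anchor] + twins]   # (start, end) frame ranges, each B frames long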
Inference strategy. Unlike previous work that directly uses the video segment-text features obtained from the multi-modal Transformer to predict the final result, the invention explores how to fuse the information of different video segments and proposes an innovative inference strategy, named self-driven multi-modal learning, which includes a twin knowledge generation module and a twin knowledge inference module.
The twin knowledge generation module and how twin knowledge is generated will be described first.
Video segments that are visually highly similar should have more similar class probability predictions when each video segment is classified. In the proposed solution, the information of similar video segments is systematically aggregated and fused together, providing more accurate soft label prediction. The method propagates and fuses twin knowledge within the same group of reference segment and twin segment samples according to the similarity of the features extracted by the multi-modal Transformer. Given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the invention first computes the similarity matrix A ∈ R^(N×N) of the samples pair by pair according to the dot product of the encoded representations of the samples:
A(i,j) = σ(F(f_i))^T σ(F(f_j)),
where i and j are the indexes of the samples, σ denotes the L2 normalization function, and N is the number of video segments. To prevent the inference from degenerating into trivial self-propagation, the invention uses the formula A = A ∘ (1 − I) to set the elements on the diagonal of the association matrix A to 0, where I is the identity matrix and ∘ denotes the Hadamard product.
The invention then normalizes each row of the matrix A so that Σ_j Â(i,j) = 1 holds for all i while the diagonal elements are kept at 0. The normalization can be realized by applying the softmax function to each row of the matrix, i.e.
Â(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),
where Â(i,j) denotes the normalized associated knowledge of video segments i and j in the same video. Â ∈ R^(N×N), the overall associated knowledge of all video segments, is referred to as the twin knowledge in the present invention.
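The twin knowledge generation described above can be restated compactly as follows; the tensor shapes and the masked-softmax implementation are assumptions, but the steps mirror the formulas (L2 normalization, dot-product affinity, zeroed diagonal, row-wise softmax):

import torch
import torch.nn.functional as F

def twin_knowledge(features):
    """features: (N, D) video segment-text features from the multi-modal Transformer.
    Returns the normalized association matrix A_hat (the twin knowledge)."""
    z = F.normalize(features, p=2, dim=1)            # sigma: L2 normalization of each feature
    A = z @ z.t()                                    # A(i, j) = sigma(F(f_i))^T sigma(F(f_j))
    eye = torch.eye(A.size(0), dtype=torch.bool)
    A = A.masked_fill(eye, float('-inf'))            # remove self-similarity (A = A o (1 - I))
    A_hat = A.softmax(dim=1)                         # row-wise softmax; rows sum to 1, diagonal is 0
    return A_hat

A_hat = twin_knowledge(torch.randn(4, 768))          # e.g. 4 segments sampled from one video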
The twin knowledge inference module and how the twin knowledge is used are described below.
In the training process, the invention simultaneously trains a classifier whose input is the video segment-text feature representation output by the multi-modal Transformer and whose output is a soft label prediction for the video segment-text sample. The prediction probabilities are recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K) and satisfy Σ_{k=1}^{K} p_i(k) = 1 for every i, where K is the total number of classes.
The invention can propagate and combine the soft label predictions of the other samples, thereby obtaining a better soft label based on the context relation within the video, with the formula
q̂_i = Σ_j Â(i,j) p_j,
where q̂_i is the probability vector propagated from the other samples to the i-th sample and can also be regarded as a refined soft label. If the i-th and j-th samples are more similar, their similarity Â(i,j) is larger, so p_j is propagated to q̂_i with a greater weight. Similarly, the invention performs a parallel, sample-pair-by-sample-pair propagation operation on all samples, formulated as
Q̂ = W(ÂP),
where W ∈ R^(1×N) is a learnable matrix and Q̂ is the calculated soft label.
In order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used by the invention is a fusion of the initial probability matrix P and the propagated probability matrix Q̂, given by the formula
Q = ω W_1 P + (1 − ω) W_2 Q̂,
where ω ∈ [0, 1] is a weight factor, W_1 and W_2 are learnable parameters, and the fused soft label satisfies Σ_k Q(k) = 1.
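A compact sketch of the propagation and fusion steps under the reconstruction given above; treating W, W_1 and ω as fixed tensors/scalars (rather than trained parameters) and the final renormalization are illustrative assumptions:

import torch

N, K = 4, 171                                       # segments per video, answer classes (example sizes)
A_hat = torch.rand(N, N)
A_hat.fill_diagonal_(0.0)
A_hat = A_hat / A_hat.sum(dim=1, keepdim=True)      # stand-in for the twin knowledge, rows sum to 1
P = torch.rand(N, K).softmax(dim=1)                 # classifier soft labels, one row per segment

Q_prop = A_hat @ P                                  # q_i = sum_j A_hat(i, j) * p_j
W = torch.rand(1, N).softmax(dim=1)                 # learnable W in R^(1 x N) (here a fixed sketch)
q_hat = W @ Q_prop                                  # propagated soft label Q_hat, shape (1, K)

omega = 0.5                                         # weight factor in [0, 1]
W1 = torch.rand(1, N).softmax(dim=1)                # learnable weights on the initial predictions
Q = omega * (W1 @ P) + (1.0 - omega) * q_hat        # fused refined soft label, shape (1, K)
Q = Q / Q.sum()                                     # keep Q a valid probability distribution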
With the above formulas, the invention can propagate and fuse the knowledge of the correlations between samples within the same video in one iteration. This knowledge is compared with the previously generated sample class probability p_anchor to guide the training process of the network, so that video segments that are visually highly similar obtain more similar class probability predictions.
Objective function. The objective function L of the framework proposed by the invention has two terms. The first term, L_siamese, is the loss between the calculated soft label Q and the sample prediction p_anchor; the invention computes L_siamese with a cross-entropy loss function. The other term is the loss between the model prediction and the ground truth during training; its specific form depends on the task and is collectively denoted here as L_gt.
Finally, the objective function of the model of the invention is
L = α L_siamese + L_gt,
where α is a hyperparameter that adjusts the ratio of the two terms.
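A sketch of the two-term objective, assuming the cross-entropy against the refined soft label Q is computed with soft targets and that L_gt is a standard classification loss over answer categories (the exact task loss depends on the dataset):

import torch
import torch.nn.functional as F

def siamese_loss(p_anchor_logits, Q):
    """Cross-entropy between the refined soft label Q (soft target) and the anchor prediction."""
    log_p = F.log_softmax(p_anchor_logits, dim=-1)
    return -(Q * log_p).sum(dim=-1).mean()

def total_loss(p_anchor_logits, Q, answer_logits, answer_labels, alpha=0.5):
    L_siamese = siamese_loss(p_anchor_logits, Q)
    L_gt = F.cross_entropy(answer_logits, answer_labels)   # task loss against ground-truth answers
    return alpha * L_siamese + L_gt                        # L = alpha * L_siamese + L_gt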
Details of the framework. The invention uses the model obtained by pre-training ClipBert on the COCO Captions and Visual Genome Captions datasets as the initialization weights of the SiaSamRea framework of the invention. Starting from these weights, the invention continues video segment-text pre-training on the TGIF-QA dataset and performs optimization with the aforementioned objective function. Since the invention aims to model long-term context information through twin sampling and reasoning, only samples whose total number of video frames exceeds 40 are used. The length of the video segments selected from a video is determined according to the length of that video.
During pre-training and formal training, a reference video segment is selected on each GPU through sparse sampling, and an AdamW optimizer is used for end-to-end training of the whole model. The invention gradually increases the learning rate in the early stage of training and then decreases it linearly; training stops when the learning rate decays to a set minimum value.
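The optimization recipe in the preceding paragraph (AdamW with learning-rate warm-up followed by linear decay to a set minimum) could be sketched as follows; all step counts and learning-rate values are assumed, not taken from the patent:

import torch

model = torch.nn.Linear(768, 171)                     # placeholder for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-3)

warmup_steps, total_steps, min_ratio = 1000, 20000, 0.05

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # gradually raise the learning rate
    decay = 1.0 - (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(min_ratio, decay)                      # linear decay, clipped at the set minimum

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# inside the training loop: optimizer.step(); scheduler.step()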
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims (7)

1. A video question-answering method based on self-driven twin sampling and reasoning, characterized by comprising a video segment sampling strategy, a feature extraction strategy and an inference strategy, wherein the video segment sampling strategy obtains a reference video segment through sparse sampling and obtains twin video segments through twin sampling; the feature extraction strategy encodes a plurality of video segment-text pairs into corresponding semantic feature representations through a video encoder, a text encoder and a multi-modal Transformer; and the inference strategy generates refined knowledge labels for the video segments by using a twin knowledge generation module, and propagates the labels to all twin samples and fuses them by using a twin knowledge inference module.
2. The video question-answering method based on self-driven twin sampling and reasoning according to claim 1, wherein in the video segment sampling, a reference segment and a twin segment are obtained by respectively using sparse sampling and twin sampling, and the features of the video segments are respectively extracted by using a feature extraction model; in the feature extraction, a twin knowledge generation module is used for calculating the context feature inside the video according to the features of the reference segment and the twin segment; and in the inference strategy, a twin knowledge inference module is used for adaptively generating refined soft labels for the video segments according to the context characteristics.
3. The video question-answering method based on self-driven twin sampling and reasoning according to claim 1 or 2, wherein the twin sampling comprises obtaining video segment samples with the length of B frames by sparsely sampling video samples F; wherein the length of F is greater than B; determining the index of the starting frame number of a reference video segment in a random mode, then carrying out twin sampling, selecting a plurality of starting indexes adjacent to the reference video segment in the same video sample, and selecting the video segments of B frames with the same length; and after twin sampling is finished, obtaining a reference video segment and a twin video segment which have similar global video semantics, and then sequentially coding the video segments by using a video coder to obtain visual characteristics.
4. The self-driven twin sampling and reasoning-based video question answering method according to claim 3, wherein the start index of the reference segment is limited to the first third of the frames of the whole video; the index of an adjacent twin video segment is adjacent to the index of the reference segment, and the twin segment is obtained by contiguous sampling before or after the reference segment.
5. The self-driven twin sampling and reasoning based video question answering method according to claim 1, wherein the twin knowledge generation module includes:
given a set of N video segment-text pairs in a video and a feature extractor F to be trained, the similarity matrix A ∈ R^(N×N) of the samples is first calculated pair by pair according to the dot product of the encoded representations of the samples:
A(i,j) = σ(F(f_i))^T σ(F(f_j)),
wherein i and j are the indexes of the samples, σ denotes the L2 normalization function, and N is the number of video segments; the elements on the diagonal of the association matrix A are set to 0 by the formula A = A ∘ (1 − I), where I is the identity matrix and ∘ denotes the Hadamard product, so that the inference is prevented from degenerating into trivial self-propagation;
each row of the matrix A is normalized so that Σ_j Â(i,j) = 1 holds for all i while the diagonal elements are kept at 0; the normalization is realized by applying the softmax function to each row of the matrix, i.e.
Â(i,j) = exp(A(i,j)) / Σ_{k≠i} exp(A(i,k)),
wherein Â(i,j) expresses the normalized associated knowledge of video segments i and j in the same video, and Â ∈ R^(N×N) is the overall associated knowledge of all video segments, i.e. the twin knowledge.
6. The self-driven twin sampling and reasoning based video question answering method according to claim 1, wherein the twin knowledge inference module includes:
in the training process, a classifier is trained simultaneously; its input is the video segment-text feature representation output by the multi-modal Transformer, and its output is a soft label prediction for each video segment-text sample, the prediction probabilities being recorded as P = [p_1, ..., p_N]^T ∈ R^(N×K) and satisfying Σ_{k=1}^{K} p_i(k) = 1 for every i, wherein K is the total number of categories;
the soft label predictions of the other samples are propagated and combined, so as to obtain a better soft label based on the context relation within the video, the formula being
q̂_i = Σ_j Â(i,j) p_j,
wherein q̂_i is the probability vector propagated from the other samples to the i-th sample and can also be regarded as a refined soft label; if the i-th and j-th samples are more similar, their similarity Â(i,j) is larger, so p_j is propagated to q̂_i with a greater weight; a parallel, sample-pair-by-sample-pair propagation operation is then performed on all samples, the formula being
Q̂ = W(ÂP),
wherein W ∈ R^(1×N) is a learnable matrix and Q̂ is the calculated soft label.
7. The self-driven twin sampling and reasoning-based video question-answering method according to claim 6, wherein, in order to avoid propagating and fusing excessive noise and producing unexpected prediction results, the soft label finally used is a fusion of the initial probability matrix P and the propagated probability matrix Q̂, given by the formula
Q = ω W_1 P + (1 − ω) W_2 Q̂,
wherein ω ∈ [0, 1] is a weight factor, W_1 and W_2 are learnable parameters, and the fused soft label satisfies Σ_k Q(k) = 1.
CN202111553547.0A 2021-12-17 2021-12-17 Video question-answering method based on self-driven twin sampling and reasoning Pending CN114218439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111553547.0A CN114218439A (en) 2021-12-17 2021-12-17 Video question-answering method based on self-driven twin sampling and reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111553547.0A CN114218439A (en) 2021-12-17 2021-12-17 Video question-answering method based on self-driven twin sampling and reasoning

Publications (1)

Publication Number Publication Date
CN114218439A true CN114218439A (en) 2022-03-22

Family

ID=80703709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111553547.0A Pending CN114218439A (en) 2021-12-17 2021-12-17 Video question-answering method based on self-driven twin sampling and reasoning

Country Status (1)

Country Link
CN (1) CN114218439A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278110A (en) * 2022-07-12 2022-11-01 时空穿越(深圳)科技有限公司 Information processing method, device and system based on digital twin cabin and readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807222A (en) * 2021-09-07 2021-12-17 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807222A (en) * 2021-09-07 2021-12-17 中山大学 Video question-answering method and system for end-to-end training based on sparse sampling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEIJIANG YU et al.: "Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115278110A (en) * 2022-07-12 2022-11-01 时空穿越(深圳)科技有限公司 Information processing method, device and system based on digital twin cabin and readable storage medium
CN115278110B (en) * 2022-07-12 2023-08-25 时空穿越(深圳)科技有限公司 Information processing method, device and system based on digital twin cabin and readable storage medium

Similar Documents

Publication Publication Date Title
CN110083705B (en) Multi-hop attention depth model, method, storage medium and terminal for target emotion classification
CN109947912A (en) A kind of model method based on paragraph internal reasoning and combined problem answer matches
CN109800434B (en) Method for generating abstract text title based on eye movement attention
CN110866542B (en) Depth representation learning method based on feature controllable fusion
CN110418210A (en) A kind of video presentation generation method exported based on bidirectional circulating neural network and depth
CN114020862A (en) Retrieval type intelligent question-answering system and method for coal mine safety regulations
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN111897944B (en) Knowledge graph question-answering system based on semantic space sharing
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN113204633B (en) Semantic matching distillation method and device
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN112001166A (en) Intelligent question-answer sentence-to-semantic matching method and device for government affair consultation service
CN112000770A (en) Intelligent question and answer oriented sentence-to-sentence matching method based on semantic feature map
CN114387537A (en) Video question-answering method based on description text
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
CN115238691A (en) Knowledge fusion based embedded multi-intention recognition and slot filling model
CN114218439A (en) Video question-answering method based on self-driven twin sampling and reasoning
CN112651225B (en) Multi-item selection machine reading understanding method based on multi-stage maximum attention
CN116543289B (en) Image description method based on encoder-decoder and Bi-LSTM attention model
CN111858879B (en) Question and answer method and system based on machine reading understanding, storage medium and computer equipment
CN114239575B (en) Statement analysis model construction method, statement analysis method, device, medium and computing equipment
CN113554040B (en) Image description method and device based on condition generation countermeasure network
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination