CN109840506B - Method for solving video question-answering task by utilizing video converter combined with relational interaction - Google Patents

Method for solving video question-answering task by utilizing video converter combined with relational interaction

Info

Publication number
CN109840506B
Authority
CN
China
Prior art keywords
video
question
output
answering task
interaction
Prior art date
Legal status
Active
Application number
CN201910112159.5A
Other languages
Chinese (zh)
Other versions
CN109840506A (en
Inventor
Zhao Zhou (赵洲)
Current Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co., Ltd.
Priority to CN201910112159.5A
Publication of CN109840506A
Application granted
Publication of CN109840506B

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for solving the video question-answering task by using a video converter combined with relation interaction, which mainly comprises the following steps: 1) designing a video converter model combined with relation interaction to obtain the answers of the video question-answering task; 2) training the model to obtain the final video converter, and using it to generate the answers of the video question-answering task. Compared with general solutions to the video question-answering task, the invention exploits relation-interaction information to complete the task better, and achieves better results than traditional methods.

Description

Method for solving video question-answering task by utilizing video converter combined with relational interaction
Technical Field
The invention relates to the video question-answering task, and in particular to a method for solving the video question-answering task by utilizing a video converter combined with relation interaction.
Background
The video question-answering task is very challenging and has attracted wide attention. It requires a system to give a correct answer to a question about a given video. Video question answering is still a relatively new task, and research on it is not yet mature. Such research can be applied to related fields such as computer vision and natural language processing.
Existing solutions to the video question-answering task generally follow traditional image question-answering methods: a convolutional neural network encodes the image, a recurrent neural network encodes the question, the two encodings are combined into a feature encoding that mixes the image and question information, and a decoder uses this mixed feature encoding to produce the final answer.
Because such methods do not analyse the temporal information contained in the video, the answers they generate for the video question-answering task are inaccurate. To solve this problem, the invention addresses the video question-answering task with a video converter combined with relation interaction, which improves the accuracy of the generated answers.
Disclosure of Invention
The invention aims to solve the problem that existing approaches to the video question-answering task cannot provide sufficiently accurate answers, and provides a method for solving the video question-answering task by using a video converter combined with relation interaction. The specific technical scheme adopted by the invention is as follows:
the method for solving the video question-answering task by utilizing the video converter combined with the relationship interaction comprises the following steps:
1. Design a video object relation acquisition method, and use it to obtain the spatio-temporal relation matrix of the video objects.
2. Design a multi-interaction attention unit, and use it together with the spatio-temporal relation matrix of the video objects obtained in step 1 to produce an attention output containing the comprehensive information of the input sequence.
3. Using the multi-interaction attention unit designed in step 2, design a video converter comprising an encoder and a decoder, train it, and use the trained video converter to obtain the answers corresponding to the video question-answering task.
The above steps can be realized in the following way:
For the video frames of the video question-answering task, a trained video object recognition network is used to extract the appearance feature and the position feature of each object in the video, where N denotes the number of objects contained in the video. The appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of object n is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n): the first four dimensions (x_n, y_n, w_n, h_n) describe the bounding box of object n (its center-point coordinates together with its width and height), and the fifth dimension t_n is the index of the frame in which object n appears.
From the position feature of object m and the position feature of object n, a 5-dimensional relative relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is calculated; its five defining formulas are published as images in the original document. The relative relation vector is then mapped into high-dimensional representations by positional encodings built from sine and cosine functions of different frequencies, and the mapped representations are concatenated to obtain the relative relation feature of the object pair.
The spatio-temporal relation weight between object m and object n is computed from this relative relation feature by means of a trainable weight vector W_r. The spatio-temporal relation weights obtained between all pairs of objects in the video together form the spatio-temporal relation matrix W_R of the video objects.
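Because the formula images are not reproduced in the text, the exact definitions cannot be quoted here. Since the patent cites Relation Networks for Object Detection (Hu et al., CVPR 2018), a plausible reconstruction of the relative relation vector and the relation weight, with a temporal term added and with f^R_mn, w^R_mn and PE(·) introduced purely as illustrative symbols, is the following sketch:

```latex
% Hypothetical reconstruction of the lost formula images, modeled on the
% relative-geometry features of Hu et al. (CVPR 2018); not quoted from
% the patent itself.
\begin{aligned}
X_{mn} &= \log\frac{\lvert x_m - x_n\rvert}{w_m}, \qquad
Y_{mn} = \log\frac{\lvert y_m - y_n\rvert}{h_m},\\
W_{mn} &= \log\frac{w_n}{w_m}, \qquad
H_{mn} = \log\frac{h_n}{h_m}, \qquad
T_{mn} = t_n - t_m,\\
f^{R}_{mn} &= \left[\mathrm{PE}(X_{mn});\,\mathrm{PE}(Y_{mn});\,\mathrm{PE}(W_{mn});\,\mathrm{PE}(H_{mn});\,\mathrm{PE}(T_{mn})\right],\\
w^{R}_{mn} &= \mathrm{ReLU}\!\left(W_r \cdot f^{R}_{mn}\right).
\end{aligned}
```

Here PE(·) stands for the sinusoidal positional-encoding map of the original Transformer, and the weights w^R_mn over all object pairs assemble into the spatio-temporal relation matrix W_R.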
A multi-interaction attention unit is designed. For an input matrix Q = (q_1, q_2, ..., q_{l_q}) and an input matrix V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are calculated as

K_ij = q_i ∘ v_j,

where q_i denotes the i-th column of Q, v_j denotes the j-th column of V, and ∘ denotes element-level multiplication. Combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) yields the three-dimensional tensor K, which is divided into several sub-tensors K'. For each sub-tensor K', a weighted-sum vector p = Σ_{i,j} w_ij K'_ij + b_1 is calculated, where the w_ij are trainable weight scalars and b_1 is a trainable bias value. The obtained weighted-sum vectors p are copied s times to form a new three-dimensional tensor M.
Summation compression over the last dimension of the tensor K and of the new tensor M yields the element-level weight matrix W_E and the segment-level weight matrix W_S, respectively. Using W_E, W_S and the input matrix V = (v_1, v_2, ..., v_{l_v}), the output O of the multi-interaction attention unit, containing the comprehensive information of the input sequence, is obtained by an element-level multiplication of the two weight matrices followed by a softmax normalisation and a weighted combination of V; the exact formula is published as an image in the original document.
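Because the combining formula is not reproduced in the text, the following NumPy sketch fixes one plausible reading, O = softmax(W_E ⊙ W_S) V. This combination, the weighted-sum form of p, the tiling of p, the optional relation bias R (a guess at how the spatio-temporal relation matrix W_R enters the unit), and all shapes and initialisations are assumptions of the sketch, not specifics from the patent:

```python
# Minimal NumPy sketch of the multi-interaction attention unit; the steps
# marked "assumed" reconstruct formulas whose originals are published
# only as images.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_interaction_attention(Q, V, s=4, R=None, seed=0):
    """Q: (d, l_q) columns q_i; V: (d, l_v) columns v_j; s: number of
    sub-tensors K is split into (requires d % s == 0); R: optional
    (l_q, l_v) relation bias."""
    rng = np.random.default_rng(seed)
    d, lq = Q.shape
    _, lv = V.shape

    # K_ij = q_i o v_j (element-level product): tensor K of shape (lq, lv, d).
    K = Q.T[:, None, :] * V.T[None, :, :]

    # "Trainable" parameters, randomly initialised for this sketch.
    w = rng.standard_normal((lq, lv))     # weight scalars w_ij
    b1 = 0.0                              # bias value b1

    # Split K into s sub-tensors K'; for each, compute the weighted-sum
    # vector p = sum_ij w_ij * K'_ij + b1 (assumed form), tile p over the
    # sub-tensor's shape, and concatenate to form the new tensor M.
    parts = []
    for Kp in np.split(K, s, axis=-1):                   # Kp: (lq, lv, d//s)
        p = (w[:, :, None] * Kp).sum(axis=(0, 1)) + b1   # p: (d//s,)
        parts.append(np.broadcast_to(p, Kp.shape))
    M = np.concatenate(parts, axis=-1)                   # M: (lq, lv, d)

    # Summation compression over the last dimension: element-level and
    # segment-level weight matrices.
    W_E = K.sum(axis=-1)                                 # (lq, lv)
    W_S = M.sum(axis=-1)                                 # (lq, lv)

    # Assumed combination: element-level product, optional relation bias,
    # softmax over the positions of V, then a weighted sum of V's columns.
    logits = W_E * W_S + (R if R is not None else 0.0)
    return softmax(logits, axis=-1) @ V.T                # O: (lq, d)
```

With Q of shape (64, 10) and V of shape (64, 20), for example, the unit returns a (10, 64) array whose i-th row summarises V from the viewpoint of q_i.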
The video converter designed by the invention consists of an encoder and a decoder, where the encoder comprises three parts: a question text encoding part, a video object encoding part and a video frame encoding part. The question text encoding part works as follows: for the question text input to the video question-answering task, the embeddings of the words contained in the question text serve as the input sequence, and the position information features of the question text are obtained with the positional encoding technique of the original Transformer. The question word embeddings and the question word position features are input into the designed multi-interaction attention unit, and the output of the attention unit is passed, after a concatenation operation and a linear mapping operation, into a feed-forward unit. The output of the feed-forward unit then passes through two linear mapping units with ReLU as the activation function, giving the output of the question text encoding part.
The video frame encoding part of the encoder works as follows: for the video frame sequence input to the video question-answering task, video frame features extracted with ResNet serve as the input sequence, and the position information features of the video frames are obtained with the positional encoding technique of the original Transformer. The video frame features and the video frame position features are input into the designed multi-interaction attention unit; the output of this attention unit, combined with the output of the question text encoding part, is passed through a concatenation operation and a linear mapping operation into another multi-interaction attention unit, whose output is in turn passed through a concatenation operation and a linear mapping operation into a feed-forward unit. The output of the feed-forward unit passes through two linear mapping units with ReLU as the activation function, giving the output of the video frame encoding part. This output is fed back into the video frame encoding part, and after T such cycles the final output of the video frame encoding part is obtained.
The video object encoding part of the encoder works as follows: the acquired object appearance features and object position features of the video serve as the input sequence and are input into the designed multi-interaction attention unit; the output of this attention unit, combined with the output of the question text encoding part, is passed through a concatenation operation and a linear mapping operation into another multi-interaction attention unit, whose output is in turn passed through a concatenation operation and a linear mapping operation into a feed-forward unit. The output of the feed-forward unit passes through two linear mapping units with ReLU as the activation function, giving the output of the video object encoding part. This output is fed back into the video object encoding part, and after T such cycles the final output of the video object encoding part is obtained.
The output of the video frame encoding part and the output of the video object encoding part are concatenated and input into a linear mapping unit, giving the encoder output of the video converter.
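Purely as a structural illustration, the sketch below wires the three encoding parts together, reusing numpy as np and multi_interaction_attention from the previous sketch. The linear-mapping dimensions, the use of a second attention unit to combine the question output, the pooling of the final encoder output into the vector F_vo, and the cycle count T are all assumptions:

```python
# Structural sketch of the encoder (see assumptions above); relies on
# numpy as np and multi_interaction_attention from the previous sketch.
def linear(x, W, b):                 # x: (l, d) -> (l, d)
    return x @ W + b

def relu(x):
    return np.maximum(0.0, x)

def two_relu_linear_units(x, p1, p2):
    # Two linear mapping units with ReLU as the activation function.
    return relu(linear(relu(linear(x, *p1)), *p2))

def encoder(question_emb, frame_feats, obj_feats, params, T=2, W_R=None):
    """question_emb: (d, l_q) word embeddings plus positional encodings;
    frame_feats: (d, l_f); obj_feats: (d, l_o); params: dict of (W, b)
    pairs with W of shape (d, d) and b of shape (d,); W_R: spatio-temporal
    relation matrix, used as the relation bias of the object branch."""
    # Question text encoding part (computed once).
    q = multi_interaction_attention(question_emb, question_emb).T
    q = two_relu_linear_units(q.T, params["q1"], params["q2"]).T   # (d, l_q)

    # Video frame encoding part, cycled T times; the second attention unit
    # combines the output of the question text encoding part.
    f = frame_feats
    for _ in range(T):
        f = multi_interaction_attention(f, f).T
        f = multi_interaction_attention(f, q).T
        f = two_relu_linear_units(f.T, params["f1"], params["f2"]).T

    # Video object encoding part, cycled T times, with the relation bias.
    o = obj_feats
    for _ in range(T):
        o = multi_interaction_attention(o, o, R=W_R).T
        o = multi_interaction_attention(o, q).T
        o = two_relu_linear_units(o.T, params["o1"], params["o2"]).T

    # Concatenate the frame and object outputs, apply a linear mapping,
    # and pool (an assumption) to obtain the encoder output F_vo.
    fo = np.concatenate([f, o], axis=1)                  # (d, l_f + l_o)
    return linear(fo.T, *params["out"]).mean(axis=0)     # F_vo: (d,)
```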
The video converter has three decoders, for the multiple-choice video question-answering task, the open-ended number video question-answering task and the open-ended text video question-answering task respectively:
for the multi-choice video question-answering task, an evaluation score s for each candidate answer is calculated by the following formula,
Figure BDA0001968590950000063
wherein the content of the first and second substances,
Figure BDA0001968590950000064
transpose representing trainable weight matrices, FvoRepresenting the obtained encoder output of the video converter.
For the open-ended number video question-answering task, the number answer n is calculated as n = Round(W^T F_vo + b_2), where W^T denotes the transpose of a trainable weight matrix, b_2 a trainable bias, F_vo the obtained encoder output of the video converter, and Round() the rounding operation.
For the open-ended text video question-answering task, the answer word probability distribution o is calculated as o = softmax(W^T F_vo + b_3), where W^T denotes the transpose of a trainable weight matrix, b_3 a trainable bias, F_vo the obtained encoder output of the video converter, and softmax() the softmax function. The word with the maximum probability in the obtained distribution o is taken as the answer to the open-ended text video question-answering task.
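The three formulas above translate directly into three small heads over F_vo. In this sketch the weight shapes and the vocabulary are illustrative, and how a candidate answer is fed into the multiple-choice head is not specified in the text and therefore left outside the functions:

```python
# Sketch of the three decoder heads (relies on numpy as np and softmax
# from the earlier sketch); F_vo is the encoder output, shape (d,).
def multiple_choice_score(F_vo, W_s):
    """Evaluation score s = W_s^T F_vo of one candidate answer."""
    return float(W_s @ F_vo)                 # W_s: (d,)

def open_number_answer(F_vo, W_n, b2):
    """Number answer n = Round(W_n^T F_vo + b2)."""
    return int(np.round(W_n @ F_vo + b2))    # W_n: (d,), b2: scalar

def open_text_answer(F_vo, W_o, b3, vocabulary):
    """Word distribution o = softmax(W_o^T F_vo + b3); answer = argmax."""
    o = softmax(W_o.T @ F_vo + b3, axis=-1)  # W_o: (d, |vocab|), b3: (|vocab|,)
    return vocabulary[int(np.argmax(o))]
```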
After training, the trained video converter is applied to new video question-answering tasks to obtain the corresponding answers.
Drawings
Fig. 1 is an overall schematic diagram of the video converter combined with relation interaction for solving the video question-answering task, according to an embodiment of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in Fig. 1, the method of the present invention for solving the video question-answering task by using a video converter combined with relation interaction comprises the following steps:
1) designing a video object relation acquisition method, and using it to acquire the spatio-temporal relation matrix of the video objects;
2) designing a multi-interaction attention unit, and using it together with the spatio-temporal relation matrix of the video objects acquired in step 1) to acquire an attention output containing the comprehensive information of the input sequence;
3) using the multi-interaction attention unit designed in step 2), designing a video converter comprising an encoder and a decoder, training it, and using the trained video converter to acquire the answers corresponding to the video question-answering task.
The step 1) comprises the following specific steps:
For the video frames of the video question-answering task, a trained video object recognition network is used to extract the appearance feature and the position feature of each object in the video, where N denotes the number of objects contained in the video. The appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of object n is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n): the first four dimensions (x_n, y_n, w_n, h_n) describe the bounding box of object n (its center-point coordinates together with its width and height), and the fifth dimension t_n is the index of the frame in which object n appears.
From the position feature of object m and the position feature of object n, a 5-dimensional relative relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is calculated; its five defining formulas are published as images in the original document. The relative relation vector is then mapped into high-dimensional representations by positional encodings built from sine and cosine functions of different frequencies, and the mapped representations are concatenated to obtain the relative relation feature of the object pair.
The spatio-temporal relation weight between object m and object n is computed from this relative relation feature by means of a trainable weight vector W_r. The spatio-temporal relation weights obtained between all pairs of objects in the video together form the spatio-temporal relation matrix W_R of the video objects.
The step 2) comprises the following specific steps:
A multi-interaction attention unit is designed. For an input matrix Q = (q_1, q_2, ..., q_{l_q}) and an input matrix V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are calculated as

K_ij = q_i ∘ v_j,

where q_i denotes the i-th column of Q, v_j denotes the j-th column of V, and ∘ denotes element-level multiplication. Combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) yields the three-dimensional tensor K, which is divided into several sub-tensors K'. For each sub-tensor K', a weighted-sum vector p = Σ_{i,j} w_ij K'_ij + b_1 is calculated, where the w_ij are trainable weight scalars and b_1 is a trainable bias value. The obtained weighted-sum vectors p are copied s times to form a new three-dimensional tensor M.
Summation compression over the last dimension of the tensor K and of the new tensor M yields the element-level weight matrix W_E and the segment-level weight matrix W_S, respectively. Using W_E, W_S and the input matrix V = (v_1, v_2, ..., v_{l_v}), the output O of the multi-interaction attention unit, containing the comprehensive information of the input sequence, is obtained by an element-level multiplication of the two weight matrices followed by a softmax normalisation and a weighted combination of V; the exact formula is published as an image in the original document.
The step 3) comprises the following specific steps:
the video converter in the step 3) consists of an encoder and a decoder, wherein the encoder of the video converter comprises three parts: a question text encoding part, a video object encoding part and a video frame encoding part. The problem text coding part mechanism is as follows: for the question text input by the video question-answering task, the mapping of words contained in the question text is used as an input sequence, the position information characteristic of the question text is obtained by combining the position coding technology in an original converter, the question word mapping and the question word position information characteristic are input into a designed multi-interaction attention mechanism unit, and the output of the multi-interaction attention mechanism unit is input into a forward conveying unit after connection operation and linear mapping operation. And (3) the output of the forward conveying unit passes through two linear mapping units which take ReLU as an activation function, and then the output corresponding to the problem text encoding part is obtained.
The video frame encoding part of the encoder works as follows: for the video frame sequence input to the video question-answering task, video frame features extracted with ResNet serve as the input sequence, and the position information features of the video frames are obtained with the positional encoding technique of the original Transformer. The video frame features and the video frame position features are input into the designed multi-interaction attention unit; the output of this attention unit, combined with the output of the question text encoding part, is passed through a concatenation operation and a linear mapping operation into another multi-interaction attention unit, whose output is in turn passed through a concatenation operation and a linear mapping operation into a feed-forward unit. The output of the feed-forward unit passes through two linear mapping units with ReLU as the activation function, giving the output of the video frame encoding part. This output is fed back into the video frame encoding part, and after T such cycles the final output of the video frame encoding part is obtained.
The video object encoding part of the encoder works as follows: the acquired object appearance features and object position features of the video serve as the input sequence and are input into the designed multi-interaction attention unit; the output of this attention unit, combined with the output of the question text encoding part, is passed through a concatenation operation and a linear mapping operation into another multi-interaction attention unit, whose output is in turn passed through a concatenation operation and a linear mapping operation into a feed-forward unit. The output of the feed-forward unit passes through two linear mapping units with ReLU as the activation function, giving the output of the video object encoding part. This output is fed back into the video object encoding part, and after T such cycles the final output of the video object encoding part is obtained.
The output of the video frame encoding part and the output of the video object encoding part are concatenated and input into a linear mapping unit, giving the encoder output of the video converter.
The video converter has three decoders, for the multiple-choice video question-answering task, the open-ended number video question-answering task and the open-ended text video question-answering task respectively:
for the multi-choice video question-answering task, an evaluation score s for each candidate answer is calculated by the following formula,
Figure BDA0001968590950000121
wherein the content of the first and second substances,
Figure BDA0001968590950000122
transpose representing trainable weight matrices, FvoRepresenting the obtained encoder output of the video converter.
For the open-ended number video question-answering task, the number answer n is calculated as n = Round(W^T F_vo + b_2), where W^T denotes the transpose of a trainable weight matrix, b_2 a trainable bias, F_vo the obtained encoder output of the video converter, and Round() the rounding operation.
For the open-ended text video question-answering task, the answer word probability distribution o is calculated as o = softmax(W^T F_vo + b_3), where W^T denotes the transpose of a trainable weight matrix, b_3 a trainable bias, F_vo the obtained encoder output of the video converter, and softmax() the softmax function. The word with the maximum probability in the obtained distribution o is taken as the answer to the open-ended text video question-answering task.
After training, the trained video converter is applied to new video question-answering tasks to obtain the corresponding answers.
The above method is applied in the following embodiment to demonstrate the technical effects of the present invention; the detailed steps of the method are not repeated in the embodiment.
Examples
The invention was evaluated on the TGIF-QA experimental data set, which contains four video question-answering tasks: finding the action with a given number of repetitions in the video (Action), determining the state change of an action in the video (Trans), finding the frame most relevant to the question (Frame), and counting the repetitions of a given action in the video (Count). To evaluate the algorithm objectively, on the selected test set the accuracy (ACC) criterion is used for the Action, Trans and Frame tasks, and the mean squared error (MSE) criterion is used for the Count task. The experimental results obtained by following the steps of the detailed description are shown in Table 1, where the method is denoted VideoTransform (multi); the result table itself is published as an image in the original document.

Table 1: test results of the invention on the TGIF-QA data set.
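For reference, the two evaluation criteria are straightforward to compute; a minimal sketch (relying on numpy as np from the earlier sketches) is:

```python
# Evaluation criteria used above: accuracy (ACC) for the Action, Trans
# and Frame tasks, and mean squared error (MSE) for the Count task.
def accuracy(predicted, truth):
    """Fraction of exactly matching answers."""
    return float(np.mean(np.asarray(predicted) == np.asarray(truth)))

def mse(predicted, truth):
    """Mean squared error between predicted and true repetition counts."""
    p, t = np.asarray(predicted, float), np.asarray(truth, float)
    return float(np.mean((p - t) ** 2))
```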

Claims (4)

1. A method for solving a video question-answering task by using a video converter combined with relational interaction, characterized by comprising the following steps:
1) designing a video object relation acquisition method, and using it to acquire the spatio-temporal relation matrix of the video objects;
2) designing a multi-interaction attention unit, and using it together with the spatio-temporal relation matrix of the video objects acquired in step 1) to acquire an attention output containing the comprehensive information of the input sequence;
wherein the multi-interaction attention unit calculates, for each pair of column vectors of its two input matrices, a column vector of a three-dimensional tensor; combines these column vectors into the three-dimensional tensor; divides the tensor into several sub-tensors and calculates a weighted-sum vector for each sub-tensor; obtains an element-level weight matrix and a segment-level weight matrix by summation compression over the last dimension of the obtained tensor and of the new tensor formed from the weighted-sum vectors; and uses the element-level weight matrix, the segment-level weight matrix and an input matrix to obtain the output of the multi-interaction attention unit containing the comprehensive information of the input sequence;
3) using the multi-interaction attention unit designed in step 2), designing a video converter comprising an encoder and a decoder, training it, and using the trained video converter to acquire the answers corresponding to the video question-answering task.
2. The method for solving the video question-answering task by using the video converter combined with the relational interaction as claimed in claim 1, wherein the step 1) is specifically as follows:
for the video frames of the video question-answering task, a trained video object recognition network is used to extract the appearance feature and the position feature of each object in the video, where N denotes the number of objects contained in the video; the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of object n is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n): the first four dimensions (x_n, y_n, w_n, h_n) describe the bounding box of object n (its center-point coordinates together with its width and height), and the fifth dimension t_n is the index of the frame in which object n appears;
location features for object m
Figure FDA0002636154250000026
Position characteristics of object n
Figure FDA0002636154250000027
Calculating to obtain a 5-dimensional relative relation vector (X) according to the following formulamn,Ymn,Wmn,Hmn,Tmn),
Figure FDA0002636154250000028
Figure FDA0002636154250000029
Figure FDA00026361542500000210
Figure FDA00026361542500000211
Figure FDA00026361542500000212
Thereafter, the obtained 5-dimensional relative relationship vector (X)mn,Ymn,Wmn,Hmn,Tmn) Mapping position codes of sine and cosine functions with different frequencies into high-dimensional expressions, and connecting the high-dimensional expressions obtained by mapping to obtain relative relation characteristics
Figure FDA00026361542500000213
the spatio-temporal relation weight between object m and object n is computed from this relative relation feature by means of a trainable weight vector W_r; the spatio-temporal relation weights obtained between all pairs of objects in the video together form the spatio-temporal relation matrix W_R of the video objects.
3. The method for solving the video question-answering task by using the video converter combined with the relational interaction as claimed in claim 2, wherein the step 2) is specifically as follows:
designing a multi-interaction attention unit: for an input matrix Q = (q_1, q_2, ..., q_{l_q}) and an input matrix V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are calculated as K_ij = q_i ∘ v_j, where q_i denotes the i-th column of Q, v_j denotes the j-th column of V, and ∘ denotes element-level multiplication; combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) yields the three-dimensional tensor K, which is divided into several sub-tensors K'; for each sub-tensor K', a weighted-sum vector p = Σ_{i,j} w_ij K'_ij + b_1 is calculated, where the w_ij are trainable weight scalars and b_1 is a trainable bias value; the obtained weighted-sum vectors p are copied s times to form a new three-dimensional tensor M;
summation compression over the last dimension of the tensor K and of the new tensor M yields the element-level weight matrix W_E and the segment-level weight matrix W_S, respectively; using W_E, W_S and the input matrix V = (v_1, v_2, ..., v_{l_v}), the output O of the multi-interaction attention unit, containing the comprehensive information of the input sequence, is obtained by an element-level multiplication of the two weight matrices followed by a softmax normalisation and a weighted combination of V; the exact formula is published as an image in the original document.
4. The method for solving the video question-answering task by using the video converter combined with the relational interaction as claimed in claim 3, wherein the step 3) is specifically as follows:
the video converter in step 3) consists of an encoder and a decoder, wherein the encoder comprises three parts: a question text encoding part, a video object encoding part and a video frame encoding part; the question text encoding part works as follows: for the question text input to the video question-answering task, the embeddings of the words contained in the question text serve as the input sequence, and the position information features of the question text are obtained with the positional encoding technique of the original Transformer; the question word embeddings and the question word position features are input into the designed multi-interaction attention unit, and the output of the attention unit is passed, after a concatenation operation and a linear mapping operation, into a feed-forward unit; the output of the feed-forward unit then passes through two linear mapping units with ReLU as the activation function, giving the output of the question text encoding part;
the video frame encoding part of the encoder works as follows: for the video frame sequence input to the video question-answering task, video frame features extracted with ResNet serve as the input sequence, and the position information features of the video frames are obtained with the positional encoding technique of the original Transformer; the video frame features and the video frame position features are input into the designed multi-interaction attention unit; the output of this attention unit, combined with the output of the question text encoding part, is passed through a concatenation operation and a linear mapping operation into another multi-interaction attention unit, whose output is in turn passed through a concatenation operation and a linear mapping operation into a feed-forward unit; the output of the feed-forward unit passes through two linear mapping units with ReLU as the activation function, giving the output of the video frame encoding part; this output is fed back into the video frame encoding part, and after T such cycles the final output of the video frame encoding part is obtained;
the video object encoding part of the encoder works as follows: the acquired object appearance features and object position features of the video serve as the input sequence and are input into the designed multi-interaction attention unit; the output of this attention unit, combined with the output of the question text encoding part, is passed through a concatenation operation and a linear mapping operation into another multi-interaction attention unit, whose output is in turn passed through a concatenation operation and a linear mapping operation into a feed-forward unit; the output of the feed-forward unit passes through two linear mapping units with ReLU as the activation function, giving the output of the video object encoding part; this output is fed back into the video object encoding part, and after T such cycles the final output of the video object encoding part is obtained;
the output of the video frame encoding part and the output of the video object encoding part are concatenated and input into a linear mapping unit, giving the encoder output of the video converter;
the video converter has three decoders, for the multiple-choice video question-answering task, the open-ended number video question-answering task and the open-ended text video question-answering task respectively:
for the multiple-choice video question-answering task, the evaluation score s of each candidate answer is calculated as s = W^T F_vo, where W^T denotes the transpose of a trainable weight matrix and F_vo denotes the obtained encoder output of the video converter;
for the open-ended number video question-answering task, the number answer n is calculated as n = Round(W^T F_vo + b_2), where W^T denotes the transpose of a trainable weight matrix, b_2 a trainable bias, F_vo the obtained encoder output of the video converter, and Round() the rounding operation;
for the open-ended text video question-answering task, the answer word probability distribution o is calculated as o = softmax(W^T F_vo + b_3), where W^T denotes the transpose of a trainable weight matrix, b_3 a trainable bias, F_vo the obtained encoder output of the video converter, and softmax() the softmax function; the word with the maximum probability in the obtained distribution o is taken as the answer to the open-ended text video question-answering task;
after training, the trained video converter is applied to new video question-answering tasks to obtain the corresponding answers.
CN201910112159.5A 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction Active CN109840506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910112159.5A CN109840506B (en) 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910112159.5A CN109840506B (en) 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction

Publications (2)

Publication Number Publication Date
CN109840506A CN109840506A (en) 2019-06-04
CN109840506B true CN109840506B (en) 2020-11-20

Family

ID=66884667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910112159.5A Active CN109840506B (en) 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction

Country Status (1)

Country Link
CN (1) CN109840506B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348462B * 2019-07-09 2022-03-04 Beijing Kingsoft Digital Entertainment Technology Co., Ltd. Image feature determination and visual question answering method, apparatus, device and medium
CN110378269A * 2019-07-10 2019-10-25 Zhejiang University Method for localizing non-previewed activities in video through image query
CN110727824B * 2019-10-11 2022-04-01 Zhejiang University Method for solving the question-answering task on object relations in video by using a multiple-interaction attention mechanism

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463609A * 2017-06-27 2017-12-12 Zhejiang University Method for solving video question answering by using a hierarchical spatio-temporal attention encoder-decoder network mechanism
CN107818306A * 2017-10-31 2018-03-20 Tianjin University Video question-answering method based on an attention model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463609A * 2017-06-27 2017-12-12 Zhejiang University Method for solving video question answering by using a hierarchical spatio-temporal attention encoder-decoder network mechanism
CN107818306A * 2017-10-31 2018-03-20 Tianjin University Video question-answering method based on an attention model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Attention Is All You Need; Ashish Vaswani et al.; 31st Conference on Neural Information Processing Systems (NIPS 2017); 2017-12-31; full text *
Relation Networks for Object Detection; Han Hu et al.; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018; 2018-06-14; full text *
Video Question Answering Based on Spatio-Temporal Attention Networks; Yang Qifan; China Masters' Theses Full-text Database, Information Science and Technology; 2018-01-15 (No. 12, 2018); full text *

Also Published As

Publication number Publication date
CN109840506A (en) 2019-06-04

Similar Documents

Publication Publication Date Title
CN109544524B (en) Attention mechanism-based multi-attribute image aesthetic evaluation system
CN107229757B (en) Video retrieval method based on deep learning and Hash coding
CN105740909B (en) Text recognition method under a kind of natural scene based on spatial alternation
CN109840506B (en) Method for solving video question-answering task by utilizing video converter combined with relational interaction
CN112035672A (en) Knowledge graph complementing method, device, equipment and storage medium
CN110727824B (en) Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism
CN109213853B (en) CCA algorithm-based Chinese community question-answer cross-modal retrieval method
CN110377711B (en) Method for solving open type long video question-answering task by utilizing layered convolution self-attention network
CN109902164B (en) Method for solving question-answering of open long format video by using convolution bidirectional self-attention network
CN113204633B (en) Semantic matching distillation method and device
CN109145083B (en) Candidate answer selecting method based on deep learning
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN113704437A (en) Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding
CN106503659A (en) Action identification method based on sparse coding tensor resolution
CN114428866A (en) Video question-answering method based on object-oriented double-flow attention network
CN115080801A (en) Cross-modal retrieval method and system based on federal learning and data binary representation
CN114528928A (en) Two-training image classification algorithm based on Transformer
CN111008302B (en) Method for solving video question-answer problem by using graph theory-based multiple interaction network mechanism
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN114861754A (en) Knowledge tracking method and system based on external attention mechanism
CN114154016A (en) Video description method based on target space semantic alignment
CN115062070A (en) Question and answer based text table data query method
CN104866905B (en) A kind of learning method of the sparse tensor dictionary of nonparametric based on beta processes
CN115965789A (en) Scene perception attention-based remote sensing image semantic segmentation method
CN112132075B (en) Method and medium for processing image-text content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant