CN109840506B - Method for solving video question-answering task by utilizing video converter combined with relational interaction - Google Patents

Method for solving video question-answering task by utilizing video converter combined with relational interaction Download PDF

Info

Publication number
CN109840506B
CN109840506B CN201910112159.5A CN201910112159A CN109840506B CN 109840506 B CN109840506 B CN 109840506B CN 201910112159 A CN201910112159 A CN 201910112159A CN 109840506 B CN109840506 B CN 109840506B
Authority
CN
China
Prior art keywords
video
question
output
answering task
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910112159.5A
Other languages
Chinese (zh)
Other versions
CN109840506A (en
Inventor
璧垫床
赵洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN201910112159.5A priority Critical patent/CN109840506B/en
Publication of CN109840506A publication Critical patent/CN109840506A/en
Application granted granted Critical
Publication of CN109840506B publication Critical patent/CN109840506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for solving a video question-answering task by utilizing a video converter combined with relationship interaction, which mainly comprises the following steps: 1) and designing a video converter model combined with relational interaction to complete answer acquisition of the video question-answering task. 2) And training to obtain a final video converter model, and generating answers of the video question-answering task by using the model. Compared with a general video question and answer task solution, the video question and answer task can be better completed by utilizing the relationship interaction information. Compared with the traditional method, the effect of the invention in the video question-answering task is better.

Description

Method for solving video question-answering task by utilizing video converter combined with relational interaction
Technical Field
The invention relates to a video question and answer task, in particular to a method for solving the video question and answer task by utilizing a video converter combined with relationship interaction.
Background
The video question-answering task is a very challenging task and attracts the attention of many people at present. In this task it is required that the system can give a corresponding answer to a question for a particular video. At present, the video question-answering task is still a novel task, and the research on the video question-answering task is not mature. The research on the video question-answering task can be applied to the related fields of computer vision, natural language processing and the like.
The existing video question-answer task solution generally utilizes the traditional image question-answer correlation method. The method comprises the steps of obtaining codes of images by using a convolutional neural network, obtaining codes of questions by using a cyclic neural network, combining the codes of the images and the questions to generate feature codes mixing the images and the question information, and obtaining final image question-answer answers by using the feature codes mixing the images and the question information by a decoder.
Due to the lack of analysis on the time sequence information contained in the video, the method has inaccurate answer generation on the video question-answering task. In order to solve the problems, the video question-answer positioning task is solved by using a video converter combined with relationship interaction, and the accuracy of forming a video question-answer by the video question-answer task is improved.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a method for solving a video question-answering task by using a video converter with interactive combination relation, in order to solve the problem that the video question-answering task in the prior art cannot provide more accurate video question-answering answers. The invention adopts the specific technical scheme that:
the method for solving the video question-answering task by utilizing the video converter combined with the relationship interaction comprises the following steps:
1. a video object relation obtaining method is designed, and a space-time relation matrix of a video object is obtained by the video object relation obtaining method.
2. And designing a multi-interaction attention mechanism unit, and acquiring multi-interaction attention mechanism output containing comprehensive information contained in an input sequence by utilizing the multi-interaction attention mechanism unit in combination with the space-time relation matrix of the video object acquired in the step 1.
3. And (3) designing a video converter comprising an encoder and a decoder by using the multi-interaction attention mechanism unit designed in the step (2), training, and acquiring answers corresponding to the video question-answering tasks by using the trained video converter.
The above steps can be realized in the following way:
for the video frames of the video question-answering task, the trained video object recognition network is used for acquiring the appearance characteristics of the objects in the video
Figure BDA0001968590950000021
And object position characteristics
Figure BDA0001968590950000022
Wherein N represents the number of objects contained in the video, and the outside of the object NWatch features
Figure BDA0001968590950000031
For high-dimensional vectors obtained using a trained model, the position characteristics of each object
Figure BDA0001968590950000032
Is a 5-dimensional vector (x)n,yn,wn,hn,tn),
Figure BDA0001968590950000033
Front four dimensions (x)n,yn,wn,hn) The object bounding box center point coordinates representing object n,
Figure BDA0001968590950000034
of the fifth dimension tnRepresenting the number of frames in which the object n is located.
Location features for object m
Figure BDA0001968590950000035
Position characteristics of object n
Figure BDA0001968590950000036
Calculating to obtain a 5-dimensional relative relation vector (X) according to the following formulamn,Ymn,Wmn,Hmn,Tmn),
Figure BDA0001968590950000037
Figure BDA0001968590950000038
Figure BDA0001968590950000039
Figure BDA00019685909500000310
Figure BDA00019685909500000311
Thereafter, the obtained 5-dimensional relative relationship vector (X)mn,Ymn,Wmn,Hmn,Tmn) Mapping position codes of sine and cosine functions with different frequencies into high-dimensional expressions, and connecting the high-dimensional expressions obtained by mapping to obtain relative relation characteristics
Figure BDA00019685909500000312
The weight of the space-time relation between the object m and the object n is calculated according to the following formula
Figure BDA00019685909500000313
Figure BDA00019685909500000314
Wherein, WrIs a trainable weight vector.
Obtaining a space-time relation matrix W of video objects by using the obtained space-time relation weights among all the objects in the videoR
Designing a multi-interaction attention mechanism unit, and aiming at an input matrix Q ═ Q1,q2,...,qlq) And matrix V ═ V1,v2,...,vlv) Calculating a column vector K in the three-dimensional tensor K according to the following formulaij
Kij=qiοvj
Wherein q isiA column vector, v, representing the ith column of the input matrix QjA column vector, representing the jth column of the input matrix V, represents a multiplication operation at the element level. All column vectors K to be obtainedij(i∈[1,2,...,lq],j∈[1,2,...,lv]) In combination, a three-dimensional tensor K is obtained. Dividing K into several sub-sheetsMeasurement of
Figure BDA0001968590950000041
For the sub-tensor K', a weight sum vector p is calculated according to the following calculation formula,
Figure BDA0001968590950000042
wherein, wijFor trainable weight scalars, b1Is a trainable offset value. And copying the obtained weights and vectors p for s times to form a new three-dimensional tensor M.
The obtained three-dimensional tensor K and the last bit of the new three-dimensional tensor M are subjected to summation compression to obtain a weight matrix W at an element levelEWeight matrix W with segment levelSUsing the resulting element-level weight matrix WESegment-level weight matrix WSAnd input matrix V ═ V (V)1,v2,...,vlv) Obtaining the output O of the multi-interaction attention machine unit containing the comprehensive information contained in the input sequence according to the following calculation formula,
Figure BDA0001968590950000043
wherein,
Figure BDA0001968590950000044
representing element-level multiplication operations, softmax () represents softmax function computation operations.
The video converter designed by the invention consists of an encoder and a decoder, wherein the encoder of the video converter comprises three parts: a question text encoding part, a video object encoding part and a video frame encoding part. The problem text coding part mechanism is as follows: for the question text input by the video question-answering task, the mapping of words contained in the question text is used as an input sequence, the position information characteristic of the question text is obtained by combining the position coding technology in an original converter, the question word mapping and the question word position information characteristic are input into a designed multi-interaction attention mechanism unit, and the output of the multi-interaction attention mechanism unit is input into a forward conveying unit after connection operation and linear mapping operation. And (3) the output of the forward conveying unit passes through two linear mapping units which take ReLU as an activation function, and then the output corresponding to the problem text encoding part is obtained.
The mechanism of the video frame coding part of the coder is as follows: for a video frame sequence input by a video question-answering task, ResNet is utilized to obtain video frame characteristics as an input sequence, the position information characteristics of the video frame are obtained by combining the position coding technology in an original converter, the video frame characteristics and the video frame position information characteristics are input into a designed multi-interaction attention mechanism unit, the output of the multi-interaction attention mechanism unit is input into another multi-interaction attention mechanism unit through connection operation and linear mapping operation by combining the output corresponding to a question text coding part, and the output of the multi-interaction attention mechanism unit is input into a forward conveying unit through the connection operation and the linear mapping operation. And (3) the output of the forward conveying unit passes through two linear mapping units with ReLU as an activation function, and then the corresponding output of the video frame coding part is obtained. And re-inputting the output corresponding to the video frame coding part into the video frame coding part, and performing T-time circulation to obtain the final output corresponding to the video frame coding part.
The mechanism of the video object coding part of the coder is as follows: utilizing object appearance features in acquired video
Figure BDA0001968590950000061
And object position characteristics
Figure BDA0001968590950000062
As an input sequence, inputting the appearance characteristics and the position characteristics of an object in a video into a designed multi-interaction attention machine unit, inputting the output of the multi-interaction attention machine unit into another multi-interaction attention machine unit through connection operation and linear mapping operation in combination with the corresponding output of a problem text encoding part, and inputting the output of the multi-interaction attention machine unitThe output is input to the forward conveying unit through a connecting operation and a linear mapping operation. And after the output of the forward conveying unit passes through two linear mapping units with the ReLU as an activation function, the output corresponding to the video object coding part is obtained. And re-inputting the output corresponding to the video object coding part into the video object coding part, and performing T-time circulation to obtain the final output corresponding to the video object coding part.
And connecting the output corresponding to the video frame coding part with the output corresponding to the video object coding part, and inputting the output to a linear mapping unit to obtain the encoder output of the video converter.
The video converter has three decoders, which respectively aim at a multinomial selection type video question-answering task, an open digital type video question-answering task and an open text type video question-answering task:
for the multi-choice video question-answering task, an evaluation score s for each candidate answer is calculated by the following formula,
Figure BDA0001968590950000063
wherein,
Figure BDA0001968590950000064
transpose representing trainable weight matrices, FvoRepresenting the obtained encoder output of the video converter.
For the open digital video question-answering task, the digital answer n of the open digital video question-answering task is calculated by the following formula,
Figure BDA0001968590950000071
wherein,
Figure BDA0001968590950000072
representing transposes of trainable weight matrices, b2Representing trainable offsets, FvoRepresenting the obtained encoder output of the video converter, Round () represents the Round function calculation operation.
For the open text type video question-answering task, the answer word probability distribution o of the open text type video question-answering task is calculated by the following formula,
Figure BDA0001968590950000073
wherein,
Figure BDA0001968590950000074
representing transposes of trainable weight matrices, b3Representing trainable offsets, FvoRepresenting the encoder output of the video converter obtained, softmax () represents the softmax function calculation operation. And taking the word corresponding to the maximum probability value in the obtained answer word probability distribution o as the answer of the open text type video question-answering task.
After training, the trained video converter is used for the new video question-answering task, and answers corresponding to the video question-answering task can be obtained.
Drawings
Fig. 1 is an overall schematic diagram of a video converter for solving associative relational interaction of a video question-answering task according to an embodiment of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the method for solving the video question-answering task by using the video converter combined with the relationship interaction of the present invention comprises the following steps:
1) designing a video object relation acquisition method, and acquiring a time-space relation matrix of a video object by using the video object relation acquisition method;
2) designing a multi-interaction attention machine unit, and acquiring multi-interaction attention machine output containing comprehensive information contained in an input sequence by utilizing the multi-interaction attention machine unit in combination with the space-time relation matrix of the video object acquired in the step 1);
3) designing a video converter comprising an encoder and a decoder by using the multi-interaction attention mechanism unit designed in the step 2), training, and acquiring answers corresponding to the video question-answering tasks by using the trained video converter.
The step 1) comprises the following specific steps:
for the video frames of the video question-answering task, the trained video object recognition network is used for acquiring the appearance characteristics of the objects in the video
Figure BDA0001968590950000081
And object position characteristics
Figure BDA0001968590950000082
Wherein N represents the number of objects contained in the video and the appearance characteristics of the object N
Figure BDA0001968590950000083
For high-dimensional vectors obtained using a trained model, the position characteristics of each object
Figure BDA0001968590950000084
Is a 5-dimensional vector (x)n,yn,wn,hn,tn),
Figure BDA0001968590950000085
Front four dimensions (x)n,yn,wn,hn) The object bounding box center point coordinates representing object n,
Figure BDA0001968590950000086
of the fifth dimension tnRepresenting the number of frames in which the object n is located.
Location features for object m
Figure BDA0001968590950000087
Position characteristics of object n
Figure BDA0001968590950000088
Calculating to obtain a 5-dimensional relative relation vector (X) according to the following formulamn,Ymn,Wmn,Hmn,Tmn),
Figure BDA0001968590950000089
Figure BDA00019685909500000810
Figure BDA0001968590950000091
Figure BDA0001968590950000092
Figure BDA0001968590950000093
Thereafter, the obtained 5-dimensional relative relationship vector (X)mn,Ymn,Wmn,Hmn,Tmn) Mapping position codes of sine and cosine functions with different frequencies into high-dimensional expressions, and connecting the high-dimensional expressions obtained by mapping to obtain relative relation characteristics
Figure BDA0001968590950000094
The weight of the space-time relation between the object m and the object n is calculated according to the following formula
Figure BDA0001968590950000095
Figure BDA0001968590950000096
Wherein, WrIs a trainable weight vector.
By use ofObtaining the spatio-temporal relation weight among all the objects in the video to obtain the spatio-temporal relation matrix W of the video objectsR
The step 2) comprises the following specific steps:
designing a multi-interaction attention mechanism unit for an input matrix
Figure BDA0001968590950000097
And matrix
Figure BDA0001968590950000098
Calculating a column vector K in the three-dimensional tensor K according to the following formulaij
Kij=qiοvj
Wherein q isiA column vector, v, representing the ith column of the input matrix QjA column vector, representing the jth column of the input matrix V, represents a multiplication operation at the element level. All column vectors K to be obtainedij(i∈[1,2,...,lq],j∈[1,2,...,lv]) In combination, a three-dimensional tensor K is obtained. Divide K into several sub-tensors
Figure BDA0001968590950000099
For the sub-tensor K', a weight sum vector p is calculated according to the following calculation formula,
Figure BDA0001968590950000101
wherein, wijFor trainable weight scalars, b1Is a trainable offset value. And copying the obtained weights and vectors p for s times to form a new three-dimensional tensor M.
The obtained three-dimensional tensor K and the last bit of the new three-dimensional tensor M are subjected to summation compression to obtain a weight matrix W at an element levelEWeight matrix W with segment levelSUsing the resulting element-level weight matrix WESegment-level weight matrix WSAnd input matrix
Figure BDA0001968590950000102
The output O of the multi-interaction attention mechanism unit containing the comprehensive information contained in the input sequence is obtained according to the following calculation formula,
Figure BDA0001968590950000103
wherein,
Figure BDA0001968590950000104
representing element-level multiplication operations, softmax () represents softmax function computation operations.
The step 3) comprises the following specific steps:
the video converter in the step 3) consists of an encoder and a decoder, wherein the encoder of the video converter comprises three parts: a question text encoding part, a video object encoding part and a video frame encoding part. The problem text coding part mechanism is as follows: for the question text input by the video question-answering task, the mapping of words contained in the question text is used as an input sequence, the position information characteristic of the question text is obtained by combining the position coding technology in an original converter, the question word mapping and the question word position information characteristic are input into a designed multi-interaction attention mechanism unit, and the output of the multi-interaction attention mechanism unit is input into a forward conveying unit after connection operation and linear mapping operation. And (3) the output of the forward conveying unit passes through two linear mapping units which take ReLU as an activation function, and then the output corresponding to the problem text encoding part is obtained.
The mechanism of the video frame coding part of the coder is as follows: for a video frame sequence input by a video question-answering task, ResNet is utilized to obtain video frame characteristics as an input sequence, the position information characteristics of the video frame are obtained by combining the position coding technology in an original converter, the video frame characteristics and the video frame position information characteristics are input into a designed multi-interaction attention mechanism unit, the output of the multi-interaction attention mechanism unit is input into another multi-interaction attention mechanism unit through connection operation and linear mapping operation by combining the output corresponding to a question text coding part, and the output of the multi-interaction attention mechanism unit is input into a forward conveying unit through the connection operation and the linear mapping operation. And (3) the output of the forward conveying unit passes through two linear mapping units with ReLU as an activation function, and then the corresponding output of the video frame coding part is obtained. And re-inputting the output corresponding to the video frame coding part into the video frame coding part, and performing T-time circulation to obtain the final output corresponding to the video frame coding part.
The mechanism of the video object coding part of the coder is as follows: utilizing object appearance features in acquired video
Figure BDA0001968590950000111
And object position characteristics
Figure BDA0001968590950000112
And as an input sequence, inputting the appearance characteristics and the position characteristics of the object in the video into a designed multi-interaction attention mechanism unit, inputting the output of the multi-interaction attention mechanism unit into another multi-interaction attention mechanism unit by combining the output corresponding to the problem text coding part through connection operation and linear mapping operation, and inputting the output of the multi-interaction attention mechanism unit into a forward conveying unit through connection operation and linear mapping operation. And after the output of the forward conveying unit passes through two linear mapping units with the ReLU as an activation function, the output corresponding to the video object coding part is obtained. And re-inputting the output corresponding to the video object coding part into the video object coding part, and performing T-time circulation to obtain the final output corresponding to the video object coding part.
And connecting the output corresponding to the video frame coding part with the output corresponding to the video object coding part, and inputting the output to a linear mapping unit to obtain the encoder output of the video converter.
The video converter has three decoders, which respectively aim at a multinomial selection type video question-answering task, an open digital type video question-answering task and an open text type video question-answering task:
for the multi-choice video question-answering task, an evaluation score s for each candidate answer is calculated by the following formula,
Figure BDA0001968590950000121
wherein,
Figure BDA0001968590950000122
transpose representing trainable weight matrices, FvoRepresenting the obtained encoder output of the video converter.
For the open digital video question-answering task, the digital answer n of the open digital video question-answering task is calculated by the following formula,
Figure BDA0001968590950000123
wherein,
Figure BDA0001968590950000124
representing transposes of trainable weight matrices, b2Representing trainable offsets, FvoRepresenting the obtained encoder output of the video converter, Round () represents the Round function calculation operation.
For the open text type video question-answering task, the answer word probability distribution o of the open text type video question-answering task is calculated by the following formula,
Figure BDA0001968590950000125
wherein,
Figure BDA0001968590950000131
representing transposes of trainable weight matrices, b3Representing trainable offsets, FvoRepresenting the encoder output of the video converter obtained, softmax () represents the softmax function calculation operation. To be obtainedAnd taking the word corresponding to the maximum probability value in the answer word probability distribution o as an answer of the open text type video question-answering task.
After training, the trained video converter is used for the new video question-answering task, and answers corresponding to the video question-answering task can be obtained.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The invention performs experiments on the TGIF-QA experimental data set. The TGIF-QA experimental data set contains four video question-and-answer tasks: the method comprises the following steps of finding an Action task (Action) with given repetition times in the video, solving an Action state change task (Trans) in the video, solving a Frame task (Frame) most relevant to a video question and answer task question in the video, and solving a repetition time task (Count) of the given Action in the video. In order to objectively evaluate the performance of the algorithm of the invention, in the selected test set, Accuracy (ACC) evaluation criteria are used for an Action task (Action) for finding a given repetition number in a video, a task (Trans) for evaluating the Action state change in the video and a task (Frame) for evaluating the Frame most relevant to a video question and answer task problem in the video to evaluate the effect of the invention, and an average square error evaluation criterion (MSE) is used for a task (Count) for evaluating the repetition number of the given Action in the video to evaluate the effect of the invention. The experimental results obtained following the procedure described in the specific embodiment are shown in table 1, and the method is expressed as videotransform (multi):
Figure BDA0001968590950000141
table 1 test results of the present invention for the TGIF-QA dataset.

Claims (4)

1. A method for solving a video question-answering task by utilizing a video converter combined with relationship interaction is used for solving the video question-answering task and is characterized by comprising the following steps of:
1) designing a video object relation acquisition method, and acquiring a time-space relation matrix of a video object by using the video object relation acquisition method;
2) designing a multi-interaction attention machine unit, and acquiring multi-interaction attention machine output containing comprehensive information contained in an input sequence by utilizing the multi-interaction attention machine unit in combination with the space-time relation matrix of the video object acquired in the step 1);
the multi-interaction attention mechanism unit calculates a column vector of a three-dimensional tensor for each column vector in two input matrixes, combines the column vectors to obtain the three-dimensional tensor, divides the three-dimensional tensor into a plurality of sub-tensors, calculates a weight and a vector for each sub-tensor, sums and compresses the obtained three-dimensional tensor and the last bit of a new three-dimensional tensor to obtain a weight matrix at an element level and a weight matrix at a segment level, and calculates by using the weight matrix at the element level, the weight matrix at the segment level and an input matrix to obtain the output of the multi-interaction attention mechanism unit containing comprehensive information contained in an input sequence;
3) designing a video converter comprising an encoder and a decoder by using the multi-interaction attention mechanism unit designed in the step 2), training, and acquiring answers corresponding to the video question-answering tasks by using the trained video converter.
2. The method for solving the video question-answering task by using the video converter combined with the relational interaction as claimed in claim 1, wherein the step 1) is specifically as follows:
for the video frames of the video question-answering task, the trained video object recognition network is used for acquiring the appearance characteristics of the objects in the video
Figure FDA0002636154250000011
And object position characteristics
Figure FDA0002636154250000021
Wherein N represents the number of objects contained in the video and the appearance characteristics of the object N
Figure FDA0002636154250000022
For high-dimensional vectors obtained using a trained model, the position characteristics of each object
Figure FDA0002636154250000023
Is a 5-dimensional vector (x)n,yn,wn,hn,tn),
Figure FDA0002636154250000024
Front four dimensions (x)n,yn,wn,hn) The object bounding box center point coordinates representing object n,
Figure FDA0002636154250000025
of the fifth dimension tnRepresenting the number of frame sequence where the object n is located;
location features for object m
Figure FDA0002636154250000026
Position characteristics of object n
Figure FDA0002636154250000027
Calculating to obtain a 5-dimensional relative relation vector (X) according to the following formulamn,Ymn,Wmn,Hmn,Tmn),
Figure FDA0002636154250000028
Figure FDA0002636154250000029
Figure FDA00026361542500000210
Figure FDA00026361542500000211
Figure FDA00026361542500000212
Thereafter, the obtained 5-dimensional relative relationship vector (X)mn,Ymn,Wmn,Hmn,Tmn) Mapping position codes of sine and cosine functions with different frequencies into high-dimensional expressions, and connecting the high-dimensional expressions obtained by mapping to obtain relative relation characteristics
Figure FDA00026361542500000213
The weight of the space-time relation between the object m and the object n is calculated according to the following formula
Figure FDA00026361542500000214
Figure FDA00026361542500000215
Wherein, WrIs a trainable weight vector;
obtaining a space-time relation matrix W of video objects by using the obtained space-time relation weights among all the objects in the videoR
3. The method for solving the video question-answering task by using the video converter combined with the relational interaction as claimed in claim 2, wherein the step 2) is specifically as follows:
designing a multi-interaction attention mechanism unit for an input matrix
Figure FDA0002636154250000031
And matrix
Figure FDA0002636154250000032
Calculating a column vector K in the three-dimensional tensor K according to the following formulaij
Figure FDA0002636154250000033
Wherein q isiA column vector, v, representing the ith column of the input matrix QjA column vector representing the jth column of the input matrix V,
Figure FDA0002636154250000034
a multiplication operation at the representative element level; all column vectors K to be obtainedij(i∈[1,2,...,lq],j∈[1,2,...,lv]) Combining to obtain a three-dimensional tensor K; divide K into several sub-tensors
Figure FDA0002636154250000035
For the sub-tensor K', a weight sum vector p is calculated according to the following calculation formula,
Figure FDA0002636154250000036
wherein, wijFor trainable weight scalars, b1Is a trainable bias value; copying the obtained weight and vector p for s times to form a new three-dimensional tensor M;
the obtained three-dimensional tensor K and the last bit of the new three-dimensional tensor M are subjected to summation compression to obtain a weight matrix W at an element levelEWeight matrix W with segment levelSUsing the resulting element-level weight matrix WESegment-level weight matrix WSAnd input matrix
Figure FDA0002636154250000037
The output O of the multi-interaction attention mechanism unit containing the comprehensive information contained in the input sequence is obtained according to the following calculation formula,
Figure FDA0002636154250000038
wherein,
Figure FDA0002636154250000041
representing element-level multiplication operations, softmax () represents softmax function computation operations.
4. The method for solving the video question-answering task by using the video converter combined with the relational interaction as claimed in claim 3, wherein the step 3) is specifically as follows:
the video converter in the step 3) consists of an encoder and a decoder, wherein the encoder of the video converter comprises three parts: a question text encoding part, a video object encoding part and a video frame encoding part; the problem text coding part mechanism is as follows: for a question text input by a video question-answering task, mapping words contained in the question text is used as an input sequence, a position coding technology in an original converter is combined to obtain a question text position information characteristic, the question word mapping and the question word position information characteristic are input into a designed multi-interaction attention mechanism unit, and the output of the multi-interaction attention mechanism unit is subjected to connecting operation and linear mapping operation and then is input into a forward conveying unit; after the output of the forward conveying unit passes through two linear mapping units which take ReLU as an activation function, the output corresponding to the problem text coding part is obtained;
the mechanism of the video frame coding part of the coder is as follows: for a video frame sequence input by a video question-answering task, acquiring video frame characteristics by using ResNet as an input sequence, acquiring video frame position information characteristics by combining a position coding technology in an original converter, inputting the video frame characteristics and the video frame position information characteristics into a designed multi-interaction attention mechanism unit, inputting the output of the multi-interaction attention mechanism unit into another multi-interaction attention mechanism unit through connection operation and linear mapping operation by combining the output corresponding to a question text coding part, and inputting the output of the multi-interaction attention mechanism unit into a forward conveying unit through connection operation and linear mapping operation; after the output of the forward conveying unit passes through two linear mapping units which take ReLU as an activation function, the output corresponding to the video frame coding part is obtained; re-inputting the output corresponding to the video frame coding part into the video frame coding part, and performing T-time circulation to obtain the final output corresponding to the video frame coding part;
the mechanism of the video object coding part of the coder is as follows: utilizing object appearance features in acquired video
Figure FDA0002636154250000051
And object position characteristics
Figure FDA0002636154250000052
As an input sequence, inputting the appearance characteristics and the position characteristics of an object in a video into a designed multi-interaction attention mechanism unit, inputting the output of the multi-interaction attention mechanism unit into another multi-interaction attention mechanism unit through connection operation and linear mapping operation in combination with the corresponding output of a problem text encoding part, and inputting the output of the multi-interaction attention mechanism unit into a forward conveying unit through connection operation and linear mapping operation; after the output of the forward conveying unit passes through two linear mapping units which take ReLU as an activation function, the output corresponding to the video object coding part is obtained; re-inputting the output corresponding to the video object coding part into the video object coding part, and performing T-time circulation to obtain the final output corresponding to the video object coding part;
connecting the output corresponding to the video frame coding part with the output corresponding to the video object coding part, and inputting the output to a linear mapping unit to obtain the encoder output of the video converter;
the video converter has three decoders, which respectively aim at a multinomial selection type video question-answering task, an open digital type video question-answering task and an open text type video question-answering task:
for the multi-choice video question-answering task, an evaluation score s for each candidate answer is calculated by the following formula,
Figure FDA0002636154250000061
wherein,
Figure FDA0002636154250000062
transpose representing trainable weight matrices, FvoRepresenting the obtained encoder output of the video converter;
for the open digital video question-answering task, the digital answer n of the open digital video question-answering task is calculated by the following formula,
Figure FDA0002636154250000063
wherein,
Figure FDA0002636154250000064
representing transposes of trainable weight matrices, b2Representing trainable offsets, FvoRepresents the obtained encoder output of the video converter, Round () represents the Round function calculation operation;
for the open text type video question-answering task, the answer word probability distribution o of the open text type video question-answering task is calculated by the following formula,
Figure FDA0002636154250000065
wherein,
Figure FDA0002636154250000066
representing transposes of trainable weight matrices, b3Representing trainable offsets, FvoRepresenting the obtained encoder output of the video converter, softmax () represents a softmax function computation operation; taking the word corresponding to the maximum probability value in the obtained answer word probability distribution o as an answer of the open text type video question-answering task;
after training, the trained video converter is used for the new video question-answering task, and answers corresponding to the video question-answering task can be obtained.
CN201910112159.5A 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction Active CN109840506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910112159.5A CN109840506B (en) 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910112159.5A CN109840506B (en) 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction

Publications (2)

Publication Number Publication Date
CN109840506A CN109840506A (en) 2019-06-04
CN109840506B true CN109840506B (en) 2020-11-20

Family

ID=66884667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910112159.5A Active CN109840506B (en) 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction

Country Status (1)

Country Link
CN (1) CN109840506B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348462B (en) * 2019-07-09 2022-03-04 北京金山数字娱乐科技有限公司 Image feature determination and visual question and answer method, device, equipment and medium
CN110378269A (en) * 2019-07-10 2019-10-25 浙江大学 Pass through the movable method not previewed in image query positioning video
CN110727824B (en) * 2019-10-11 2022-04-01 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463609A (en) * 2017-06-27 2017-12-12 浙江大学 It is a kind of to solve the method for video question and answer using Layered Space-Time notice codec network mechanism
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463609A (en) * 2017-06-27 2017-12-12 浙江大学 It is a kind of to solve the method for video question and answer using Layered Space-Time notice codec network mechanism
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Attention Is All You Need;Ashish Vaswani等;《31st Conference on Neural Information Processing Systems (NIPS 2017)》;20171231;全文 *
Relation Networks for Object Detection;Han Hu等;《The IEEE Conference on Computer Vision and Pattern Recognition(CVPR),2018》;20180614;全文 *
基于时空注意力网络的视频问答;杨启凡;《中国优秀硕士学位论文全文数据库 信息科技辑》;20180115(第2018年第12期);全文 *

Also Published As

Publication number Publication date
CN109840506A (en) 2019-06-04

Similar Documents

Publication Publication Date Title
CN113610126B (en) Label-free knowledge distillation method based on multi-target detection model and storage medium
CN109544524B (en) Attention mechanism-based multi-attribute image aesthetic evaluation system
CN107229757B (en) Video retrieval method based on deep learning and Hash coding
CN105740909B (en) Text recognition method under a kind of natural scene based on spatial alternation
CN109840506B (en) Method for solving video question-answering task by utilizing video converter combined with relational interaction
CN110727824B (en) Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN110377711B (en) Method for solving open type long video question-answering task by utilizing layered convolution self-attention network
CN109902164B (en) Method for solving question-answering of open long format video by using convolution bidirectional self-attention network
CN113204633B (en) Semantic matching distillation method and device
CN109145083B (en) Candidate answer selecting method based on deep learning
CN113704437A (en) Knowledge base question-answering method integrating multi-head attention mechanism and relative position coding
CN114428866A (en) Video question-answering method based on object-oriented double-flow attention network
CN115080801A (en) Cross-modal retrieval method and system based on federal learning and data binary representation
CN113761153A (en) Question and answer processing method and device based on picture, readable medium and electronic equipment
CN114528928A (en) Two-training image classification algorithm based on Transformer
CN111008302B (en) Method for solving video question-answer problem by using graph theory-based multiple interaction network mechanism
WO2024114321A1 (en) Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
CN114861754A (en) Knowledge tracking method and system based on external attention mechanism
CN115965789A (en) Scene perception attention-based remote sensing image semantic segmentation method
CN116109978A (en) Self-constrained dynamic text feature-based unsupervised video description method
CN118072020A (en) DINO optimization-based weak supervision remote sensing image semantic segmentation method
CN112231455A (en) Machine reading understanding method and system
CN109815927B (en) Method for solving video time text positioning task by using countermeasure bidirectional interactive network
CN117173715A (en) Attention visual question-answering method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant