CN109840506A - Method for solving video question-answering tasks using a video Transformer with combined relational interactions - Google Patents

Method for solving video question-answering tasks using a video Transformer with combined relational interactions

Info

Publication number
CN109840506A
CN109840506A (application CN201910112159.5A)
Authority
CN
China
Prior art keywords
video
question
answering task
output
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910112159.5A
Other languages
Chinese (zh)
Other versions
CN109840506B (en)
Inventor
Zhao Zhou (赵洲)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yi Zhi Intelligent Technology Co Ltd
Original Assignee
Hangzhou Yi Zhi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yi Zhi Intelligent Technology Co Ltd
Priority to CN201910112159.5A
Publication of CN109840506A
Application granted
Publication of CN109840506B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for solving video question-answering tasks using a video Transformer with combined relational interactions. The method mainly comprises the following steps: 1) design a video Transformer model that uses combined relational interactions to produce answers for video question-answering tasks; 2) train the model to obtain the final video Transformer, and use it to generate answers to video question-answering tasks. Compared with general solutions for video question answering, the invention exploits interactive relation information and can therefore complete video question-answering tasks better, achieving better results on such tasks than traditional methods.

Description

Method for solving video question-answering tasks using a video Transformer with combined relational interactions
Technical field
The present invention relates to video question-answering tasks, and more particularly to a method for solving video question-answering tasks using a video Transformer with combined relational interactions.
Background technique
Video question answering is a very challenging task that has attracted wide attention. In this task, the system must produce an answer to a question about a particular video. Video question answering is still a relatively new task, and research on it remains immature. Research on video question answering can be applied to related fields such as computer vision and natural language processing.
Existing solutions for video question answering usually borrow from traditional image question-answering approaches: a convolutional neural network encodes the image, a recurrent neural network encodes the question, the two encodings are combined into a feature encoding that mixes image and question information, and a decoder uses this mixed encoding to produce the final answer.
Because such methods do not analyze the temporal information contained in the video, the answers they generate for video question-answering tasks are inaccurate. To solve this problem, the present invention uses a video Transformer with combined relational interactions to solve video question-answering tasks, improving the accuracy of the generated answers.
Summary of the invention
The object of the invention is to solve the problems of the prior art. To overcome the inability of the prior art to provide accurate answers for video question-answering tasks, the present invention provides a method for solving video question-answering tasks using a video Transformer with combined relational interactions. The specific technical solution of the present invention is:
A method for solving video question-answering tasks using a video Transformer with combined relational interactions, comprising the following steps:
1. Design a video-object relation acquisition method and use it to obtain the spatio-temporal relation matrix of the video objects.
2. Design a multi-interaction attention unit; combine it with the spatio-temporal relation matrix obtained in step 1 to produce a multi-interaction attention output that carries the combined information of the input sequences.
3. Using the multi-interaction attention units designed in step 2, design and train a video Transformer containing an encoder and a decoder, and use the trained video Transformer to obtain the answer to the corresponding video question-answering task.
The above steps may be implemented as follows:
For the video frames of a video question-answering task, a trained object-detection network extracts an appearance feature and a position feature for each object in the video. Here N denotes the number of objects in the video; the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n), whose first four dimensions (x_n, y_n, w_n, h_n) describe object n's bounding box and whose fifth dimension t_n is the index of the frame in which object n appears.
For the position feature of object m and the position feature of object n, a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is computed according to the following formula:
The 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is then mapped to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, and the mapped representations are concatenated to obtain the relative-relation feature. The spatio-temporal relation weight between object m and object n is computed according to the following formula:
where W_r is a trainable weight vector.
Using the spatio-temporal relation weights between all pairs of objects in the video, the spatio-temporal relation matrix W_R of the video objects is obtained.
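The relation-matrix construction above can be sketched in a few lines. The patent's formulas for the relative-relation vector and the relation weight are given only as images, so the log-ratio geometry terms and the ReLU below are assumptions borrowed from common object-relation practice; the sinusoidal encoding follows the description directly.

```python
import numpy as np

def relative_geometry(obj_m, obj_n):
    """5-dim relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn).
    The exact formula is not reproduced in the text; the log-ratio form
    here is an assumption, plus a frame-index offset for the temporal term."""
    xm, ym, wm, hm, tm = obj_m
    xn, yn, wn, hn, tn = obj_n
    return np.array([
        np.log(abs(xm - xn) / wm + 1e-6),   # X_mn
        np.log(abs(ym - yn) / hm + 1e-6),   # Y_mn
        np.log(wn / wm),                    # W_mn
        np.log(hn / hm),                    # H_mn
        float(tn - tm),                     # T_mn: frame offset
    ])

def sinusoid_encode(vec, dim=16):
    """Map each scalar to sine/cosine features of different frequencies
    and concatenate, as in the Transformer positional encoding."""
    freqs = 1.0 / (1000.0 ** (np.arange(dim // 2) / (dim // 2)))
    ang = np.outer(vec, freqs)                           # (5, dim/2)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1).ravel()

def relation_matrix(objs, w_r):
    """Spatio-temporal relation matrix W_R over all object pairs:
    W_R[m, n] = ReLU(w_r . E_mn); the ReLU is an assumption."""
    N = len(objs)
    W_R = np.zeros((N, N))
    for m in range(N):
        for n in range(N):
            if m != n:
                e = sinusoid_encode(relative_geometry(objs[m], objs[n]))
                W_R[m, n] = max(0.0, float(w_r @ e))
    return W_R
```

With `dim=16`, each 5-dim vector maps to an 80-dim encoding, so `w_r` has length 80; `W_R` is N×N with zero diagonal.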
A multi-interaction attention unit is designed as follows. For input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are computed according to the following formula,
K_ij = q_i ∘ v_j
where q_i is the i-th column of Q, v_j is the j-th column of V, and ∘ denotes element-wise multiplication. Combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) yields the three-dimensional tensor K, which is then divided into several sub-tensors K'. For each sub-tensor K', a weighted-sum vector p is computed according to the following formula,
where the w_ij are trainable weight scalars and b_1 is a trainable bias. The resulting weighted-sum vector p is replicated s*s times to form a new three-dimensional tensor M.
The last axes of the tensor K and the new tensor M are sum-compressed to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S. Using W_E, W_S, and the input matrix V = (v_1, v_2, ..., v_{l_v}), the multi-interaction attention output O, which carries the combined information of the input sequences, is computed according to the following formula,
where ⊙ denotes element-wise multiplication and softmax(·) denotes the softmax function.
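The multi-interaction attention unit can be sketched as follows. Since the per-sub-tensor weights w_ij and the final combination of W_E, W_S, and V are given only as formula images, this sketch substitutes uniform averaging for the learned weighted sum and a softmax-weighted sum over V's columns for the final combination; both substitutions are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_interaction_attention(Q, V, s=2):
    """Sketch of the multi-interaction attention unit.
    Q is (d, l_q) and V is (d, l_v), stored column-wise as in the text."""
    d, lq = Q.shape
    _, lv = V.shape
    assert lq % s == 0 and lv % s == 0, "s must divide both sequence lengths"
    # Interaction tensor with columns K_ij = q_i * v_j (element-wise)
    K = Q[:, :, None] * V[:, None, :]                        # (d, lq, lv)
    # Element-level weights: sum-compress K along the feature axis
    W_E = K.sum(axis=0)                                      # (lq, lv)
    # Segment level: pool each of the s*s sub-tensors to a vector p
    # (uniform mean stands in for the learned weights w_ij), then
    # replicate each p back over its block to form the tensor M
    blocks = K.reshape(d, s, lq // s, s, lv // s).mean(axis=(2, 4))  # (d, s, s)
    M = np.repeat(np.repeat(blocks, lq // s, axis=1), lv // s, axis=2)
    W_S = M.sum(axis=0)                                      # (lq, lv)
    # Combine: attend over V's columns for every column of Q
    A = softmax(W_E + W_S, axis=1)                           # (lq, lv)
    return V @ A.T                                           # (d, lq)
```

The output O has one attended column of V-information per column of Q, matching the description that O carries the combined information of both input sequences.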
The video Transformer designed in this invention consists of two parts, an encoder and a decoder. The encoder contains three parts: a question-text encoding part, a video-object encoding part, and a video-frame encoding part. The question-text encoding part works as follows: for the question text input to the video question-answering task, the embeddings of the words it contains serve as the input sequence, and the positional-encoding technique of the original Transformer produces question-word position features. The word embeddings and the question-word position features are fed into the designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and is then fed into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the question-text encoding part is obtained.
The video-frame encoding part works as follows: for the video-frame sequence input to the video question-answering task, video-frame features obtained with ResNet serve as the input sequence, and the positional-encoding technique of the original Transformer produces video-frame position features. The frame features and position features are fed into a multi-interaction attention unit; its output, after a concatenation operation and a linear mapping, is combined with the corresponding output of the question-text encoding part and fed into a second multi-interaction attention unit, whose output again passes through a concatenation operation and a linear mapping into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the video-frame encoding part is obtained. This output is fed back into the same video-frame encoding part for T iterations, yielding the final output of the video-frame encoding part.
The video-object encoding part of the encoder works as follows: the appearance features and position features of the objects extracted from the video serve as the input sequence and are fed into a multi-interaction attention unit; its output, after a concatenation operation and a linear mapping, is combined with the corresponding output of the question-text encoding part and fed into a second multi-interaction attention unit, whose output again passes through a concatenation operation and a linear mapping into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the video-object encoding part is obtained. This output is fed back into the same video-object encoding part for T iterations, yielding the final output of the video-object encoding part.
The outputs of the video-frame encoding part and the video-object encoding part are concatenated and passed through a linear layer, giving the encoder output of the video Transformer.
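The encoder wiring described above can be sketched as one reusable branch plus the final fusion. Plain dot-product attention stands in for the multi-interaction attention unit, and all weight shapes are placeholders; both are assumptions, since the patent specifies the units only at the level of the prose above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, V):
    """Stand-in for the multi-interaction attention unit: dot-product
    attention of Q's columns over V's columns (an assumption)."""
    A = softmax(Q.T @ V, axis=1)          # (l_q, l_v)
    return V @ A.T                        # (d, l_q)

def encoder_branch(X, Q_text, W1, W2, T=2):
    """One encoder branch (frame or object part): attention over the
    branch input, interaction with the question-text encoding, then a
    feed-forward unit of two ReLU linear layers, repeated T times."""
    out = X
    for _ in range(T):
        h = attend(out, out)              # self interaction
        h = attend(h, Q_text)             # condition on the question encoding
        out = relu(W2 @ relu(W1 @ h))     # two ReLU linear layers
    return out

def encoder_output(F_frame, F_obj, W_fuse):
    """Concatenate the frame and object branch outputs and apply one
    linear map, giving the encoder output F_vo."""
    return W_fuse @ np.concatenate([F_frame, F_obj], axis=0)
```

Running the question branch once and each video branch for T iterations, then fusing, reproduces the overall data flow of the encoder.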
The video Transformer has three decoders, for multiple-choice video question answering, open numeric video question answering, and open text video question answering, respectively:
For multiple-choice video question answering, the evaluation score s of each candidate answer is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix and F_vo is the encoder output of the video Transformer.
For open numeric video question answering, the numeric answer n is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix, b_2 is a trainable bias, F_vo is the encoder output of the video Transformer, and Round(·) denotes the rounding function.
For open text video question answering, the answer-word probability distribution o is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix, b_3 is a trainable bias, F_vo is the encoder output of the video Transformer, and softmax(·) denotes the softmax function. The word with the highest probability in the answer-word distribution o is taken as the answer to the open text video question-answering task.
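The three decoder heads are simple linear readouts of the encoder output F_vo, and can be sketched directly from the prose (the weight shapes are placeholders; the weight-matrix symbols themselves appear only as images in the original):

```python
import numpy as np

def choice_score(W_s, F_vo):
    """Multiple-choice decoder: evaluation score s = W_s^T F_vo
    for one candidate answer (candidate encoding folded into W_s)."""
    return float(W_s @ F_vo)

def numeric_answer(W_n, b2, F_vo):
    """Open numeric decoder: n = Round(W_n^T F_vo + b2)."""
    return int(round(float(W_n @ F_vo + b2)))

def word_distribution(W_o, b3, F_vo):
    """Open text decoder: o = softmax(W_o^T F_vo + b3); the argmax
    word of o is taken as the answer."""
    z = W_o @ F_vo + b3
    e = np.exp(z - z.max())
    o = e / e.sum()
    return o, int(np.argmax(o))
```

In use, `choice_score` is evaluated once per candidate and the highest score wins; `word_distribution` returns both the full distribution o and the index of the most probable answer word.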
After training, applying the trained video Transformer to a new video question-answering task yields the answer to that task.
Detailed description of the invention
Fig. 1 is an overall schematic of the video Transformer with combined relational interactions for solving video question-answering tasks, according to an embodiment of the present invention.
Specific embodiment
The present invention is further elaborated and illustrated below with reference to the accompanying drawing and a specific embodiment.
As shown in Fig. 1, the method of the present invention for solving video question-answering tasks using a video Transformer with combined relational interactions includes the following steps:
1) design a video-object relation acquisition method and use it to obtain the spatio-temporal relation matrix of the video objects;
2) design a multi-interaction attention unit; combine it with the spatio-temporal relation matrix obtained in step 1) to produce a multi-interaction attention output that carries the combined information of the input sequences;
3) using the multi-interaction attention units designed in step 2), design and train a video Transformer containing an encoder and a decoder, and use the trained video Transformer to obtain the answer to the corresponding video question-answering task.
Step 1) specifically comprises:
For the video frames of a video question-answering task, a trained object-detection network extracts an appearance feature and a position feature for each object in the video. Here N denotes the number of objects in the video; the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n), whose first four dimensions (x_n, y_n, w_n, h_n) describe object n's bounding box and whose fifth dimension t_n is the index of the frame in which object n appears.
For the position feature of object m and the position feature of object n, a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is computed according to the following formula:
The 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is then mapped to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, and the mapped representations are concatenated to obtain the relative-relation feature. The spatio-temporal relation weight between object m and object n is computed according to the following formula:
where W_r is a trainable weight vector.
Using the spatio-temporal relation weights between all pairs of objects in the video, the spatio-temporal relation matrix W_R of the video objects is obtained.
Step 2) specifically comprises:
A multi-interaction attention unit is designed as follows. For input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are computed according to the following formula,
K_ij = q_i ∘ v_j
where q_i is the i-th column of Q, v_j is the j-th column of V, and ∘ denotes element-wise multiplication. Combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) yields the three-dimensional tensor K, which is then divided into several sub-tensors K'. For each sub-tensor K', a weighted-sum vector p is computed according to the following formula,
where the w_ij are trainable weight scalars and b_1 is a trainable bias. The resulting weighted-sum vector p is replicated s*s times to form a new three-dimensional tensor M.
The last axes of the tensor K and the new tensor M are sum-compressed to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S. Using W_E, W_S, and the input matrix V, the multi-interaction attention output O, which carries the combined information of the input sequences, is computed according to the following formula,
where ⊙ denotes element-wise multiplication and softmax(·) denotes the softmax function.
Step 3) specifically comprises:
The video Transformer in step 3) consists of two parts, an encoder and a decoder. The encoder contains three parts: a question-text encoding part, a video-object encoding part, and a video-frame encoding part. The question-text encoding part works as follows: for the question text input to the video question-answering task, the embeddings of the words it contains serve as the input sequence, and the positional-encoding technique of the original Transformer produces question-word position features. The word embeddings and the question-word position features are fed into the designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and is then fed into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the question-text encoding part is obtained.
The video-frame encoding part works as follows: for the video-frame sequence input to the video question-answering task, video-frame features obtained with ResNet serve as the input sequence, and the positional-encoding technique of the original Transformer produces video-frame position features. The frame features and position features are fed into a multi-interaction attention unit; its output, after a concatenation operation and a linear mapping, is combined with the corresponding output of the question-text encoding part and fed into a second multi-interaction attention unit, whose output again passes through a concatenation operation and a linear mapping into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the video-frame encoding part is obtained. This output is fed back into the same video-frame encoding part for T iterations, yielding the final output of the video-frame encoding part.
The video-object encoding part of the encoder works as follows: the appearance features and position features of the objects extracted from the video serve as the input sequence and are fed into a multi-interaction attention unit; its output, after a concatenation operation and a linear mapping, is combined with the corresponding output of the question-text encoding part and fed into a second multi-interaction attention unit, whose output again passes through a concatenation operation and a linear mapping into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the video-object encoding part is obtained. This output is fed back into the same video-object encoding part for T iterations, yielding the final output of the video-object encoding part.
The outputs of the video-frame encoding part and the video-object encoding part are concatenated and passed through a linear layer, giving the encoder output of the video Transformer.
The video Transformer has three decoders, for multiple-choice video question answering, open numeric video question answering, and open text video question answering, respectively:
For multiple-choice video question answering, the evaluation score s of each candidate answer is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix and F_vo is the encoder output of the video Transformer.
For open numeric video question answering, the numeric answer n is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix, b_2 is a trainable bias, F_vo is the encoder output of the video Transformer, and Round(·) denotes the rounding function.
For open text video question answering, the answer-word probability distribution o is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix, b_3 is a trainable bias, F_vo is the encoder output of the video Transformer, and softmax(·) denotes the softmax function. The word with the highest probability in the answer-word distribution o is taken as the answer to the open text video question-answering task.
After training, applying the trained video Transformer to a new video question-answering task yields the answer to that task.
The above method is applied in the following embodiment to demonstrate the technical effect of the invention; the detailed steps described above are not repeated.
Embodiment
The present invention is evaluated on the TGIF-QA dataset, which contains four video question-answering tasks: identifying an action repeated a given number of times in the video (Action), identifying an action-state transition in the video (Trans), answering a question about the frame most relevant to the question (Frame), and counting the repetitions of a given action in the video (Count). To objectively evaluate the algorithm of the invention, accuracy (ACC) is used as the evaluation criterion on the selected test set for the Action, Trans, and Frame tasks, and mean squared error (MSE) is used for the Count task. Following the steps described in the specific embodiment, the experimental results are shown in Table 1, where this method is denoted VideoTransformer (multi):
Table 1: Test results of the present invention on the TGIF-QA dataset.

Claims (4)

1. A method for solving video question-answering tasks using a video Transformer with combined relational interactions, characterized in that it comprises the following steps:
1) design a video-object relation acquisition method and use it to obtain the spatio-temporal relation matrix of the video objects;
2) design a multi-interaction attention unit; combine it with the spatio-temporal relation matrix obtained in step 1) to produce a multi-interaction attention output that carries the combined information of the input sequences;
3) using the multi-interaction attention units designed in step 2), design and train a video Transformer containing an encoder and a decoder, and use the trained video Transformer to obtain the answer to the corresponding video question-answering task.
2. The method for solving video question-answering tasks using a video Transformer with combined relational interactions according to claim 1, characterized in that step 1) specifically comprises:
for the video frames of a video question-answering task, extracting with a trained object-detection network an appearance feature and a position feature for each object in the video, where N denotes the number of objects in the video, the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n), whose first four dimensions (x_n, y_n, w_n, h_n) describe object n's bounding box and whose fifth dimension t_n is the index of the frame in which object n appears;
for the position feature of object m and the position feature of object n, computing a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) according to the following formula:
then mapping the 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, concatenating the mapped representations to obtain the relative-relation feature, and computing the spatio-temporal relation weight between object m and object n according to the following formula:
where W_r is a trainable weight vector;
using the spatio-temporal relation weights between all pairs of objects in the video to obtain the spatio-temporal relation matrix W_R of the video objects.
3. The method for solving video question-answering tasks using a video Transformer with combined relational interactions according to claim 2, characterized in that step 2) specifically comprises:
designing a multi-interaction attention unit as follows: for input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), computing the column vectors K_ij of a three-dimensional tensor K according to the following formula,
K_ij = q_i ∘ v_j
where q_i is the i-th column of Q, v_j is the j-th column of V, and ∘ denotes element-wise multiplication; combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) to obtain the three-dimensional tensor K and dividing K into several sub-tensors K'; for each sub-tensor K', computing a weighted-sum vector p according to the following formula,
where the w_ij are trainable weight scalars and b_1 is a trainable bias; replicating the resulting weighted-sum vector p s*s times to form a new three-dimensional tensor M;
sum-compressing the last axes of the tensor K and the new tensor M to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S; using W_E, W_S, and the input matrix V, computing the multi-interaction attention output O, which carries the combined information of the input sequences, according to the following formula,
where ⊙ denotes element-wise multiplication and softmax(·) denotes the softmax function.
4. the method that the video converter according to claim 3 using marriage relation interaction solves video question-answering task, It is characterized in that, the step 3) specifically:
Video converter in step 3) is made of encoder and decoder two parts, there are three the encoder of video converter contains Part: question text coded portion, object-oriented video coding part, coding video frames part;Wherein question text coding unit extension set It is made as: the problem of being inputted for video question-answering task text, using the mapping of the word wherein contained as list entries, in conjunction with Question text location information feature is obtained using the position encoded technology in original conversion device, problem word is mapped and questionnaire Word location information feature is input in more interaction attention mechanism units of design, by the output of more interaction attention mechanism units It is operated by attended operation and Linear Mapping, to supply unit before being input to later;The preceding output to supply unit is passed through two After a Linear Mapping unit using ReLU as activation primitive, the corresponding output of question text coded portion is obtained;
Encoded video frame coded portion mechanism are as follows: for the sequence of frames of video of video question-answering task input, obtained using ResNet Video frame feature is obtained as list entries, the position encoded technology being used in combination in original conversion device obtains video frame location information Video frame feature is input to design with video frame location information feature more interacted in attention mechanism unit by feature, will be more The output for interacting attention mechanism unit is operated by attended operation and Linear Mapping, corresponding in conjunction with question text coded portion It is input in another more interaction attention mechanism units, by the output of more interaction attention mechanism units by connection behaviour Make to operate with Linear Mapping, to supply unit before being input to;By the preceding output to supply unit by two using ReLU as sharp After the Linear Mapping unit of function living, the corresponding output in coding video frames part is obtained;Coding video frames part is corresponding defeated It is re-entered into above-mentioned coding video frames part out, carries out T circulation, it is corresponding defeated to obtain final coding video frames part Out;
The video-object encoding part of the encoder operates as follows: the object appearance features and object position features obtained from the video serve as the input sequence and are fed into the designed multi-interaction attention unit. The output of the multi-interaction attention unit is passed through a concatenation operation and a linear mapping and then, combined with the corresponding output of the question-text encoding part, fed into another multi-interaction attention unit. The output of that multi-interaction attention unit is passed through a concatenation operation and a linear mapping and fed into a feed-forward unit. After the output of the feed-forward unit passes through two linear mapping units that use ReLU as the activation function, the corresponding output of the video-object encoding part is obtained. This output is fed back into the video-object encoding part described above; after T such cycles, the final corresponding output of the video-object encoding part is obtained.
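This excerpt does not spell out how the object appearance and position features are combined before entering the attention unit; one simple assumption is an additive projection into a common space, sketched below with hypothetical weight names `w_app` and `w_pos`.

```python
import numpy as np

def object_input_sequence(appearance, position, w_app, w_pos):
    # Project per-object appearance and position features into a common
    # space and sum them to form the input sequence of the video-object
    # encoding part. (An assumption; the combination is not detailed here.)
    return appearance @ w_app + position @ w_pos
```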
The corresponding output of the video-frame encoding part and the corresponding output of the video-object encoding part are concatenated and fed into a linear mapping unit, yielding the encoder output of the video converter.
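The fusion step can be sketched as follows, assuming the two encoding parts have each been pooled to a single vector before concatenation (a simplifying assumption; the exact shapes are left implicit here). `w_fuse` is a hypothetical weight name.

```python
import numpy as np

def fuse_encoder_outputs(frame_out, obj_out, w_fuse):
    # Concatenate the frame-level and object-level encodings and apply one
    # linear mapping to obtain the encoder output F_vo of the video converter.
    return w_fuse.T @ np.concatenate([frame_out, obj_out])
```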
The video converter has three decoders, addressing multiple-choice video question-answering tasks, open-ended numeric video question-answering tasks, and open-ended text video question-answering tasks respectively:
For a multiple-choice video question-answering task, the assessment score S of each candidate answer is calculated with the following formula:

S = W1^T Fvo

where W1^T represents the transpose of a trainable weight matrix and Fvo represents the encoder output of the video converter obtained above.
For an open-ended numeric video question-answering task, the numeric answer n is calculated with the following formula:

n = Round(W2^T Fvo + b2)

where W2^T represents the transpose of a trainable weight matrix, b2 represents a trainable bias, Fvo represents the encoder output of the video converter obtained above, and Round(·) represents the rounding operation.
For an open-ended text video question-answering task, the answer-word probability distribution o is calculated with the following formula:

o = softmax(W3^T Fvo + b3)

where W3^T represents the transpose of a trainable weight matrix, b3 represents a trainable bias, Fvo represents the encoder output of the video converter obtained above, and softmax(·) represents the softmax operation. The word with the highest probability in the obtained answer-word distribution o is taken as the answer of the open-ended text video question-answering task.
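The three decoder heads reduce to simple linear read-outs of the encoder output Fvo. A minimal sketch follows; W1, W2, and W3 are assumed labels for the trainable weight matrices (the original formula images are unavailable), matching the trainable biases b2 and b3 defined above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multiple_choice_scores(w1, f_vo):
    # S = W1^T Fvo : one assessment score per candidate answer
    return w1.T @ f_vo

def open_number_answer(w2, b2, f_vo):
    # n = Round(W2^T Fvo + b2)
    return int(np.round(w2.T @ f_vo + b2))

def open_word_distribution(w3, b3, f_vo):
    # o = softmax(W3^T Fvo + b3); the highest-probability word is the answer
    return softmax(w3.T @ f_vo + b3)
```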
After training, the trained video converter is applied to new video question-answering tasks to obtain the corresponding answers.
CN201910112159.5A 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction Active CN109840506B (en)


Publications (2)

Publication Number Publication Date
CN109840506A true CN109840506A (en) 2019-06-04
CN109840506B CN109840506B (en) 2020-11-20

Family

ID=66884667


Country Status (1)

Country Link
CN (1) CN109840506B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463609A (en) * 2017-06-27 2017-12-12 浙江大学 It is a kind of to solve the method for video question and answer using Layered Space-Time notice codec network mechanism
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017) *
HAN HU et al.: "Relation Networks for Object Detection", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 *
YANG QIFAN: "Video Question Answering Based on Spatio-Temporal Attention Networks", China Master's Theses Full-Text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348462A (en) * 2019-07-09 2019-10-18 北京金山数字娱乐科技有限公司 A kind of characteristics of image determination, vision answering method, device, equipment and medium
CN110348462B (en) * 2019-07-09 2022-03-04 北京金山数字娱乐科技有限公司 Image feature determination and visual question and answer method, device, equipment and medium
CN110378269A (en) * 2019-07-10 2019-10-25 浙江大学 Pass through the movable method not previewed in image query positioning video
CN110727824A (en) * 2019-10-11 2020-01-24 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism
CN110727824B (en) * 2019-10-11 2022-04-01 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism


Similar Documents

Publication Publication Date Title
CN109241424B A recommendation method
CN108804715A Multi-task collaborative recognition method and system fusing audio-visual perception
CN107766447A Method for solving video question answering using a multi-layer attention network mechanism
CN111046275B User label determining method and device based on artificial intelligence and storage medium
CN110196928B Fully parallelized end-to-end multi-turn dialogue system and method with domain extensibility
CN109840506A Method for solving video question-answering tasks using a video converter combining relational interaction
CN108228674B DKT-based information processing method and device
CN108491514A Method and device for asking questions in a dialogue system, electronic device, and computer-readable medium
CN111680147A Data processing method, device, equipment and readable storage medium
CN110209789A Multi-modal dialogue system and method guided by user attention
CN109670576A Multi-scale visual attention image description method
CN110059220A Film recommendation method based on deep learning and Bayesian probability matrix factorization
CN109448703A Audio scene recognition method and system combining a deep neural network and a topic model
CN106503659A Action recognition method based on sparse coding tensor decomposition
CN109902164A Method for solving open-ended long-form video question answering using convolutional bidirectional self-attention networks
CN110046271A Remote sensing image description method based on voice guidance
CN113888399B Face age synthesis method based on style fusion and domain selection structure
CN111666385A Customer service question-answering system based on deep learning and implementation method
CN115080707A Training method and device for dialogue generating model, electronic equipment and storage medium
Song An Evaluation Method of English Teaching Ability Based on Deep Learning
CN112231455A Machine reading comprehension method and system
CN116109978A Self-constrained dynamic text feature-based unsupervised video description method
CN115132181A Speech recognition method, speech recognition apparatus, electronic device, storage medium, and program product
Panesar et al. Improving visual question answering by leveraging depth and adapting explainability
CN113569867A Image processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant