CN109840506A - Method for solving video question-answering tasks using a video Transformer with combined relational interactions - Google Patents

Method for solving video question-answering tasks using a video Transformer with combined relational interactions

Info

Publication number
CN109840506A
CN109840506A (application CN201910112159.5A)
Authority
CN
China
Prior art keywords
video
question
answering task
output
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910112159.5A
Other languages
Chinese (zh)
Other versions
CN109840506B (en)
Inventor
Zhao Zhou (赵洲)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yi Zhi Intelligent Technology Co Ltd
Original Assignee
Hangzhou Yi Zhi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yi Zhi Intelligent Technology Co Ltd
Priority to CN201910112159.5A
Publication of CN109840506A
Application granted
Publication of CN109840506B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method for solving video question-answering tasks using a video Transformer with combined relational interactions. The method mainly comprises the following steps: 1) design a video Transformer model that uses combined relational interactions to produce answers for video question-answering tasks; 2) train the model to obtain the final video Transformer, and use it to generate answers to video question-answering tasks. Compared with general solutions for video question answering, the invention exploits interactive relation information and can therefore complete video question-answering tasks better, achieving better results on such tasks than traditional methods.

Description

Method for solving video question-answering tasks using a video Transformer with combined relational interactions
Technical field
The present invention relates to video question-answering tasks, and more particularly to a method for solving video question-answering tasks using a video Transformer with combined relational interactions.
Background technique
Video question answering is a very challenging task that has attracted wide attention. In this task, the system must produce an answer to a question about a particular video. Video question answering is still a relatively new task, and research on it remains immature. Research on video question answering can be applied to related fields such as computer vision and natural language processing.
Existing solutions for video question answering usually borrow from traditional image question-answering approaches: a convolutional neural network encodes the image, a recurrent neural network encodes the question, the two encodings are combined into a feature encoding that mixes image and question information, and a decoder uses this mixed encoding to produce the final answer.
Because such methods do not analyze the temporal information contained in the video, the answers they generate for video question-answering tasks are inaccurate. To solve this problem, the present invention uses a video Transformer with combined relational interactions to solve video question-answering tasks, improving the accuracy of the generated answers.
Summary of the invention
The object of the invention is to solve the problems of the prior art. To overcome the inability of the prior art to provide accurate answers for video question-answering tasks, the present invention provides a method for solving video question-answering tasks using a video Transformer with combined relational interactions. The specific technical solution of the present invention is:
A method for solving video question-answering tasks using a video Transformer with combined relational interactions, comprising the following steps:
1. Design a video-object relation acquisition method and use it to obtain the spatio-temporal relation matrix of the video objects.
2. Design a multi-interaction attention unit; combine it with the spatio-temporal relation matrix obtained in step 1 to produce a multi-interaction attention output that carries the combined information of the input sequences.
3. Using the multi-interaction attention units designed in step 2, design and train a video Transformer containing an encoder and a decoder, and use the trained video Transformer to obtain the answer to the corresponding video question-answering task.
The above steps may be implemented as follows:
For the video frames of a video question-answering task, a trained object-detection network extracts an appearance feature and a position feature for each object in the video. Here N denotes the number of objects in the video; the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n), whose first four dimensions (x_n, y_n, w_n, h_n) describe object n's bounding box and whose fifth dimension t_n is the index of the frame in which object n appears.
For the position feature of object m and the position feature of object n, a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is computed according to the following formula:
The 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is then mapped to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, and the mapped representations are concatenated to obtain the relative-relation feature. The spatio-temporal relation weight between object m and object n is computed according to the following formula:
where W_r is a trainable weight vector.
Using the spatio-temporal relation weights between all pairs of objects in the video, the spatio-temporal relation matrix W_R of the video objects is obtained.
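The relation-matrix construction above can be sketched in a few lines. The patent's formulas for the relative-relation vector and the relation weight are given only as images, so the log-ratio geometry terms and the ReLU below are assumptions borrowed from common object-relation practice; the sinusoidal encoding follows the description directly.

```python
import numpy as np

def relative_geometry(obj_m, obj_n):
    """5-dim relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn).
    The exact formula is not reproduced in the text; the log-ratio form
    here is an assumption, plus a frame-index offset for the temporal term."""
    xm, ym, wm, hm, tm = obj_m
    xn, yn, wn, hn, tn = obj_n
    return np.array([
        np.log(abs(xm - xn) / wm + 1e-6),   # X_mn
        np.log(abs(ym - yn) / hm + 1e-6),   # Y_mn
        np.log(wn / wm),                    # W_mn
        np.log(hn / hm),                    # H_mn
        float(tn - tm),                     # T_mn: frame offset
    ])

def sinusoid_encode(vec, dim=16):
    """Map each scalar to sine/cosine features of different frequencies
    and concatenate, as in the Transformer positional encoding."""
    freqs = 1.0 / (1000.0 ** (np.arange(dim // 2) / (dim // 2)))
    ang = np.outer(vec, freqs)                           # (5, dim/2)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1).ravel()

def relation_matrix(objs, w_r):
    """Spatio-temporal relation matrix W_R over all object pairs:
    W_R[m, n] = ReLU(w_r . E_mn); the ReLU is an assumption."""
    N = len(objs)
    W_R = np.zeros((N, N))
    for m in range(N):
        for n in range(N):
            if m != n:
                e = sinusoid_encode(relative_geometry(objs[m], objs[n]))
                W_R[m, n] = max(0.0, float(w_r @ e))
    return W_R
```

With `dim=16`, each 5-dim vector maps to an 80-dim encoding, so `w_r` has length 80; `W_R` is N×N with zero diagonal.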
A multi-interaction attention unit is designed as follows. For input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are computed according to the following formula,
K_ij = q_i ∘ v_j
where q_i is the i-th column of Q, v_j is the j-th column of V, and ∘ denotes element-wise multiplication. Combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) yields the three-dimensional tensor K, which is then divided into several sub-tensors K'. For each sub-tensor K', a weighted-sum vector p is computed according to the following formula,
where the w_ij are trainable weight scalars and b_1 is a trainable bias. The resulting weighted-sum vector p is replicated s*s times to form a new three-dimensional tensor M.
The last axes of the tensor K and the new tensor M are sum-compressed to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S. Using W_E, W_S, and the input matrix V = (v_1, v_2, ..., v_{l_v}), the multi-interaction attention output O, which carries the combined information of the input sequences, is computed according to the following formula,
where ⊙ denotes element-wise multiplication and softmax(·) denotes the softmax function.
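The multi-interaction attention unit can be sketched as follows. Since the per-sub-tensor weights w_ij and the final combination of W_E, W_S, and V are given only as formula images, this sketch substitutes uniform averaging for the learned weighted sum and a softmax-weighted sum over V's columns for the final combination; both substitutions are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_interaction_attention(Q, V, s=2):
    """Sketch of the multi-interaction attention unit.
    Q is (d, l_q) and V is (d, l_v), stored column-wise as in the text."""
    d, lq = Q.shape
    _, lv = V.shape
    assert lq % s == 0 and lv % s == 0, "s must divide both sequence lengths"
    # Interaction tensor with columns K_ij = q_i * v_j (element-wise)
    K = Q[:, :, None] * V[:, None, :]                        # (d, lq, lv)
    # Element-level weights: sum-compress K along the feature axis
    W_E = K.sum(axis=0)                                      # (lq, lv)
    # Segment level: pool each of the s*s sub-tensors to a vector p
    # (uniform mean stands in for the learned weights w_ij), then
    # replicate each p back over its block to form the tensor M
    blocks = K.reshape(d, s, lq // s, s, lv // s).mean(axis=(2, 4))  # (d, s, s)
    M = np.repeat(np.repeat(blocks, lq // s, axis=1), lv // s, axis=2)
    W_S = M.sum(axis=0)                                      # (lq, lv)
    # Combine: attend over V's columns for every column of Q
    A = softmax(W_E + W_S, axis=1)                           # (lq, lv)
    return V @ A.T                                           # (d, lq)
```

The output O has one attended column of V-information per column of Q, matching the description that O carries the combined information of both input sequences.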
The video Transformer designed in this invention consists of two parts, an encoder and a decoder. The encoder contains three parts: a question-text encoding part, a video-object encoding part, and a video-frame encoding part. The question-text encoding part works as follows: for the question text input to the video question-answering task, the embeddings of the words it contains serve as the input sequence, and the positional-encoding technique of the original Transformer produces question-word position features. The word embeddings and the question-word position features are fed into the designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and is then fed into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the question-text encoding part is obtained.
The video-frame encoding part works as follows: for the video-frame sequence input to the video question-answering task, video-frame features obtained with ResNet serve as the input sequence, and the positional-encoding technique of the original Transformer produces video-frame position features. The frame features and position features are fed into a multi-interaction attention unit; its output, after a concatenation operation and a linear mapping, is combined with the corresponding output of the question-text encoding part and fed into a second multi-interaction attention unit, whose output again passes through a concatenation operation and a linear mapping into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the video-frame encoding part is obtained. This output is fed back into the same video-frame encoding part for T iterations, yielding the final output of the video-frame encoding part.
The video-object encoding part of the encoder works as follows: the appearance features and position features of the objects extracted from the video serve as the input sequence and are fed into a multi-interaction attention unit; its output, after a concatenation operation and a linear mapping, is combined with the corresponding output of the question-text encoding part and fed into a second multi-interaction attention unit, whose output again passes through a concatenation operation and a linear mapping into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the video-object encoding part is obtained. This output is fed back into the same video-object encoding part for T iterations, yielding the final output of the video-object encoding part.
The outputs of the video-frame encoding part and the video-object encoding part are concatenated and passed through a linear layer, giving the encoder output of the video Transformer.
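The encoder wiring described above can be sketched as one reusable branch plus the final fusion. Plain dot-product attention stands in for the multi-interaction attention unit, and all weight shapes are placeholders; both are assumptions, since the patent specifies the units only at the level of the prose above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, V):
    """Stand-in for the multi-interaction attention unit: dot-product
    attention of Q's columns over V's columns (an assumption)."""
    A = softmax(Q.T @ V, axis=1)          # (l_q, l_v)
    return V @ A.T                        # (d, l_q)

def encoder_branch(X, Q_text, W1, W2, T=2):
    """One encoder branch (frame or object part): attention over the
    branch input, interaction with the question-text encoding, then a
    feed-forward unit of two ReLU linear layers, repeated T times."""
    out = X
    for _ in range(T):
        h = attend(out, out)              # self interaction
        h = attend(h, Q_text)             # condition on the question encoding
        out = relu(W2 @ relu(W1 @ h))     # two ReLU linear layers
    return out

def encoder_output(F_frame, F_obj, W_fuse):
    """Concatenate the frame and object branch outputs and apply one
    linear map, giving the encoder output F_vo."""
    return W_fuse @ np.concatenate([F_frame, F_obj], axis=0)
```

Running the question branch once and each video branch for T iterations, then fusing, reproduces the overall data flow of the encoder.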
The video Transformer has three decoders, for multiple-choice video question answering, open numeric video question answering, and open text video question answering, respectively:
For multiple-choice video question answering, the evaluation score s of each candidate answer is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix and F_vo is the encoder output of the video Transformer.
For open numeric video question answering, the numeric answer n is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix, b_2 is a trainable bias, F_vo is the encoder output of the video Transformer, and Round(·) denotes the rounding function.
For open text video question answering, the answer-word probability distribution o is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix, b_3 is a trainable bias, F_vo is the encoder output of the video Transformer, and softmax(·) denotes the softmax function. The word with the highest probability in the answer-word distribution o is taken as the answer to the open text video question-answering task.
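The three decoder heads are simple linear readouts of the encoder output F_vo, and can be sketched directly from the prose (the weight shapes are placeholders; the weight-matrix symbols themselves appear only as images in the original):

```python
import numpy as np

def choice_score(W_s, F_vo):
    """Multiple-choice decoder: evaluation score s = W_s^T F_vo
    for one candidate answer (candidate encoding folded into W_s)."""
    return float(W_s @ F_vo)

def numeric_answer(W_n, b2, F_vo):
    """Open numeric decoder: n = Round(W_n^T F_vo + b2)."""
    return int(round(float(W_n @ F_vo + b2)))

def word_distribution(W_o, b3, F_vo):
    """Open text decoder: o = softmax(W_o^T F_vo + b3); the argmax
    word of o is taken as the answer."""
    z = W_o @ F_vo + b3
    e = np.exp(z - z.max())
    o = e / e.sum()
    return o, int(np.argmax(o))
```

In use, `choice_score` is evaluated once per candidate and the highest score wins; `word_distribution` returns both the full distribution o and the index of the most probable answer word.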
After training, applying the trained video Transformer to a new video question-answering task yields the answer to that task.
Detailed description of the invention
Fig. 1 is an overall schematic of the video Transformer with combined relational interactions for solving video question-answering tasks, according to an embodiment of the present invention.
Specific embodiment
The present invention is further elaborated and illustrated below with reference to the accompanying drawing and a specific embodiment.
As shown in Fig. 1, the method of the present invention for solving video question-answering tasks using a video Transformer with combined relational interactions includes the following steps:
1) design a video-object relation acquisition method and use it to obtain the spatio-temporal relation matrix of the video objects;
2) design a multi-interaction attention unit; combine it with the spatio-temporal relation matrix obtained in step 1) to produce a multi-interaction attention output that carries the combined information of the input sequences;
3) using the multi-interaction attention units designed in step 2), design and train a video Transformer containing an encoder and a decoder, and use the trained video Transformer to obtain the answer to the corresponding video question-answering task.
Step 1) specifically comprises:
For the video frames of a video question-answering task, a trained object-detection network extracts an appearance feature and a position feature for each object in the video. Here N denotes the number of objects in the video; the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n), whose first four dimensions (x_n, y_n, w_n, h_n) describe object n's bounding box and whose fifth dimension t_n is the index of the frame in which object n appears.
For the position feature of object m and the position feature of object n, a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is computed according to the following formula:
The 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) is then mapped to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, and the mapped representations are concatenated to obtain the relative-relation feature. The spatio-temporal relation weight between object m and object n is computed according to the following formula:
where W_r is a trainable weight vector.
Using the spatio-temporal relation weights between all pairs of objects in the video, the spatio-temporal relation matrix W_R of the video objects is obtained.
Step 2) specifically comprises:
A multi-interaction attention unit is designed as follows. For input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), the column vectors K_ij of a three-dimensional tensor K are computed according to the following formula,
K_ij = q_i ∘ v_j
where q_i is the i-th column of Q, v_j is the j-th column of V, and ∘ denotes element-wise multiplication. Combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) yields the three-dimensional tensor K, which is then divided into several sub-tensors K'. For each sub-tensor K', a weighted-sum vector p is computed according to the following formula,
where the w_ij are trainable weight scalars and b_1 is a trainable bias. The resulting weighted-sum vector p is replicated s*s times to form a new three-dimensional tensor M.
The last axes of the tensor K and the new tensor M are sum-compressed to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S. Using W_E, W_S, and the input matrix V, the multi-interaction attention output O, which carries the combined information of the input sequences, is computed according to the following formula,
where ⊙ denotes element-wise multiplication and softmax(·) denotes the softmax function.
Step 3) specifically comprises:
The video Transformer in step 3) consists of two parts, an encoder and a decoder. The encoder contains three parts: a question-text encoding part, a video-object encoding part, and a video-frame encoding part. The question-text encoding part works as follows: for the question text input to the video question-answering task, the embeddings of the words it contains serve as the input sequence, and the positional-encoding technique of the original Transformer produces question-word position features. The word embeddings and the question-word position features are fed into the designed multi-interaction attention unit; its output passes through a concatenation operation and a linear mapping and is then fed into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the question-text encoding part is obtained.
The video-frame encoding part works as follows: for the video-frame sequence input to the video question-answering task, video-frame features obtained with ResNet serve as the input sequence, and the positional-encoding technique of the original Transformer produces video-frame position features. The frame features and position features are fed into a multi-interaction attention unit; its output, after a concatenation operation and a linear mapping, is combined with the corresponding output of the question-text encoding part and fed into a second multi-interaction attention unit, whose output again passes through a concatenation operation and a linear mapping into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the video-frame encoding part is obtained. This output is fed back into the same video-frame encoding part for T iterations, yielding the final output of the video-frame encoding part.
The video-object encoding part of the encoder works as follows: the appearance features and position features of the objects extracted from the video serve as the input sequence and are fed into a multi-interaction attention unit; its output, after a concatenation operation and a linear mapping, is combined with the corresponding output of the question-text encoding part and fed into a second multi-interaction attention unit, whose output again passes through a concatenation operation and a linear mapping into a feed-forward unit. After the feed-forward output passes through two linear layers with ReLU activations, the output of the video-object encoding part is obtained. This output is fed back into the same video-object encoding part for T iterations, yielding the final output of the video-object encoding part.
The outputs of the video-frame encoding part and the video-object encoding part are concatenated and passed through a linear layer, giving the encoder output of the video Transformer.
The video Transformer has three decoders, for multiple-choice video question answering, open numeric video question answering, and open text video question answering, respectively:
For multiple-choice video question answering, the evaluation score s of each candidate answer is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix and F_vo is the encoder output of the video Transformer.
For open numeric video question answering, the numeric answer n is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix, b_2 is a trainable bias, F_vo is the encoder output of the video Transformer, and Round(·) denotes the rounding function.
For open text video question answering, the answer-word probability distribution o is computed according to the following formula,
where the first factor is the transpose of a trainable weight matrix, b_3 is a trainable bias, F_vo is the encoder output of the video Transformer, and softmax(·) denotes the softmax function. The word with the highest probability in the answer-word distribution o is taken as the answer to the open text video question-answering task.
After training, applying the trained video Transformer to a new video question-answering task yields the answer to that task.
The above method is applied in the following embodiment to demonstrate the technical effect of the invention; the detailed steps described above are not repeated.
Embodiment
The present invention is evaluated on the TGIF-QA dataset, which contains four video question-answering tasks: identifying an action repeated a given number of times in the video (Action), identifying an action-state transition in the video (Trans), answering a question about the frame most relevant to the question (Frame), and counting the repetitions of a given action in the video (Count). To objectively evaluate the algorithm of the invention, accuracy (ACC) is used as the evaluation criterion on the selected test set for the Action, Trans, and Frame tasks, and mean squared error (MSE) is used for the Count task. Following the steps described in the specific embodiment, the experimental results are shown in Table 1, where this method is denoted VideoTransformer (multi):
Table 1: Test results of the present invention on the TGIF-QA dataset.

Claims (4)

1. A method for solving video question-answering tasks using a video Transformer with combined relational interactions, characterized in that it comprises the following steps:
1) design a video-object relation acquisition method and use it to obtain the spatio-temporal relation matrix of the video objects;
2) design a multi-interaction attention unit; combine it with the spatio-temporal relation matrix obtained in step 1) to produce a multi-interaction attention output that carries the combined information of the input sequences;
3) using the multi-interaction attention units designed in step 2), design and train a video Transformer containing an encoder and a decoder, and use the trained video Transformer to obtain the answer to the corresponding video question-answering task.
2. The method for solving video question-answering tasks using a video Transformer with combined relational interactions according to claim 1, characterized in that step 1) specifically comprises:
for the video frames of a video question-answering task, extracting with a trained object-detection network an appearance feature and a position feature for each object in the video, where N denotes the number of objects in the video, the appearance feature of object n is a high-dimensional vector produced by the trained model, and the position feature of each object is a 5-dimensional vector (x_n, y_n, w_n, h_n, t_n), whose first four dimensions (x_n, y_n, w_n, h_n) describe object n's bounding box and whose fifth dimension t_n is the index of the frame in which object n appears;
for the position feature of object m and the position feature of object n, computing a 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) according to the following formula:
then mapping the 5-dimensional relative-relation vector (X_mn, Y_mn, W_mn, H_mn, T_mn) to a high-dimensional representation by a positional encoding composed of sine and cosine functions of different frequencies, concatenating the mapped representations to obtain the relative-relation feature, and computing the spatio-temporal relation weight between object m and object n according to the following formula:
where W_r is a trainable weight vector;
using the spatio-temporal relation weights between all pairs of objects in the video to obtain the spatio-temporal relation matrix W_R of the video objects.
3. The method for solving video question-answering tasks using a video Transformer with combined relational interactions according to claim 2, characterized in that step 2) specifically comprises:
designing a multi-interaction attention unit as follows: for input matrices Q = (q_1, q_2, ..., q_{l_q}) and V = (v_1, v_2, ..., v_{l_v}), computing the column vectors K_ij of a three-dimensional tensor K according to the following formula,
K_ij = q_i ∘ v_j
where q_i is the i-th column of Q, v_j is the j-th column of V, and ∘ denotes element-wise multiplication; combining all column vectors K_ij (i ∈ [1, ..., l_q], j ∈ [1, ..., l_v]) to obtain the three-dimensional tensor K and dividing K into several sub-tensors K'; for each sub-tensor K', computing a weighted-sum vector p according to the following formula,
where the w_ij are trainable weight scalars and b_1 is a trainable bias; replicating the resulting weighted-sum vector p s*s times to form a new three-dimensional tensor M;
sum-compressing the last axes of the tensor K and the new tensor M to obtain the element-level weight matrix W_E and the segment-level weight matrix W_S; using W_E, W_S, and the input matrix V, computing the multi-interaction attention output O, which carries the combined information of the input sequences, according to the following formula,
where ⊙ denotes element-wise multiplication and softmax(·) denotes the softmax function.
4. the method that the video converter according to claim 3 using marriage relation interaction solves video question-answering task, It is characterized in that, the step 3) specifically:
Video converter in step 3) is made of encoder and decoder two parts, there are three the encoder of video converter contains Part: question text coded portion, object-oriented video coding part, coding video frames part;Wherein question text coding unit extension set It is made as: the problem of being inputted for video question-answering task text, using the mapping of the word wherein contained as list entries, in conjunction with Question text location information feature is obtained using the position encoded technology in original conversion device, problem word is mapped and questionnaire Word location information feature is input in more interaction attention mechanism units of design, by the output of more interaction attention mechanism units It is operated by attended operation and Linear Mapping, to supply unit before being input to later;The preceding output to supply unit is passed through two After a Linear Mapping unit using ReLU as activation primitive, the corresponding output of question text coded portion is obtained;
Encoded video frame coded portion mechanism are as follows: for the sequence of frames of video of video question-answering task input, obtained using ResNet Video frame feature is obtained as list entries, the position encoded technology being used in combination in original conversion device obtains video frame location information Video frame feature is input to design with video frame location information feature more interacted in attention mechanism unit by feature, will be more The output for interacting attention mechanism unit is operated by attended operation and Linear Mapping, corresponding in conjunction with question text coded portion It is input in another more interaction attention mechanism units, by the output of more interaction attention mechanism units by connection behaviour Make to operate with Linear Mapping, to supply unit before being input to;By the preceding output to supply unit by two using ReLU as sharp After the Linear Mapping unit of function living, the corresponding output in coding video frames part is obtained;Coding video frames part is corresponding defeated It is re-entered into above-mentioned coding video frames part out, carries out T circulation, it is corresponding defeated to obtain final coding video frames part Out;
The video-object encoding part of the encoder operates as follows: the object appearance features and object position features obtained from the video serve as the input sequence and are fed into the designed multi-interaction attention unit. The output of the multi-interaction attention unit is passed through a concatenation operation and a linear mapping and then, combined with the corresponding output of the question-text encoding part, fed into another multi-interaction attention unit. The output of that multi-interaction attention unit is passed through a concatenation operation and a linear mapping and fed into a feed-forward unit. After the output of the feed-forward unit passes through two linear mapping units that use ReLU as the activation function, the corresponding output of the video-object encoding part is obtained. This output is fed back into the video-object encoding part described above; after T such cycles, the final corresponding output of the video-object encoding part is obtained.
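This excerpt does not spell out how the object appearance and position features are combined before entering the attention unit; one simple assumption is an additive projection into a common space, sketched below with hypothetical weight names `w_app` and `w_pos`.

```python
import numpy as np

def object_input_sequence(appearance, position, w_app, w_pos):
    # Project per-object appearance and position features into a common
    # space and sum them to form the input sequence of the video-object
    # encoding part. (An assumption; the combination is not detailed here.)
    return appearance @ w_app + position @ w_pos
```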
The corresponding output of the video-frame encoding part and the corresponding output of the video-object encoding part are concatenated and fed into a linear mapping unit, yielding the encoder output of the video converter.
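The fusion step can be sketched as follows, assuming the two encoding parts have each been pooled to a single vector before concatenation (a simplifying assumption; the exact shapes are left implicit here). `w_fuse` is a hypothetical weight name.

```python
import numpy as np

def fuse_encoder_outputs(frame_out, obj_out, w_fuse):
    # Concatenate the frame-level and object-level encodings and apply one
    # linear mapping to obtain the encoder output F_vo of the video converter.
    return w_fuse.T @ np.concatenate([frame_out, obj_out])
```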
The video converter has three decoders, addressing multiple-choice video question-answering tasks, open-ended numeric video question-answering tasks, and open-ended text video question-answering tasks respectively:
For a multiple-choice video question-answering task, the assessment score S of each candidate answer is calculated with the following formula:

S = W1^T Fvo

where W1^T represents the transpose of a trainable weight matrix and Fvo represents the encoder output of the video converter obtained above.
For an open-ended numeric video question-answering task, the numeric answer n is calculated with the following formula:

n = Round(W2^T Fvo + b2)

where W2^T represents the transpose of a trainable weight matrix, b2 represents a trainable bias, Fvo represents the encoder output of the video converter obtained above, and Round(·) represents the rounding operation.
For an open-ended text video question-answering task, the answer-word probability distribution o is calculated with the following formula:

o = softmax(W3^T Fvo + b3)

where W3^T represents the transpose of a trainable weight matrix, b3 represents a trainable bias, Fvo represents the encoder output of the video converter obtained above, and softmax(·) represents the softmax operation. The word with the highest probability in the obtained answer-word distribution o is taken as the answer of the open-ended text video question-answering task.
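The three decoder heads reduce to simple linear read-outs of the encoder output Fvo. A minimal sketch follows; W1, W2, and W3 are assumed labels for the trainable weight matrices (the original formula images are unavailable), matching the trainable biases b2 and b3 defined above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multiple_choice_scores(w1, f_vo):
    # S = W1^T Fvo : one assessment score per candidate answer
    return w1.T @ f_vo

def open_number_answer(w2, b2, f_vo):
    # n = Round(W2^T Fvo + b2)
    return int(np.round(w2.T @ f_vo + b2))

def open_word_distribution(w3, b3, f_vo):
    # o = softmax(W3^T Fvo + b3); the highest-probability word is the answer
    return softmax(w3.T @ f_vo + b3)
```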
After training, the trained video converter is applied to new video question-answering tasks to obtain the corresponding answers.
CN201910112159.5A 2019-02-13 2019-02-13 Method for solving video question-answering task by utilizing video converter combined with relational interaction Active CN109840506B (en)


Publications (2)

Publication Number Publication Date
CN109840506A true CN109840506A (en) 2019-06-04
CN109840506B CN109840506B (en) 2020-11-20

Family

ID=66884667


Country Status (1)

Country Link
CN (1) CN109840506B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463609A (en) * 2017-06-27 2017-12-12 浙江大学 It is a kind of to solve the method for video question and answer using Layered Space-Time notice codec network mechanism
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI et al.: "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS 2017) *
HAN HU et al.: "Relation Networks for Object Detection", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 *
YANG QIFAN: "Video Question Answering Based on Spatio-Temporal Attention Networks", China Master's Theses Full-Text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348462A (en) * 2019-07-09 2019-10-18 北京金山数字娱乐科技有限公司 A kind of characteristics of image determination, vision answering method, device, equipment and medium
CN110348462B (en) * 2019-07-09 2022-03-04 北京金山数字娱乐科技有限公司 Image feature determination and visual question and answer method, device, equipment and medium
CN110378269A (en) * 2019-07-10 2019-10-25 浙江大学 Pass through the movable method not previewed in image query positioning video
CN110727824A (en) * 2019-10-11 2020-01-24 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism
CN110727824B (en) * 2019-10-11 2022-04-01 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism


Similar Documents

Publication Publication Date Title
CN109241424B A recommendation method
CN108804715A Multi-task collaborative recognition method and system fusing audio-visual perception
CN107766447A Method for solving video question answering using a multi-layer attention network mechanism
CN111046275B User label determining method and device based on artificial intelligence and storage medium
CN110196928B Fully parallelized end-to-end multi-turn dialogue system and method with domain extensibility
CN109840506A Method for solving video question-answering tasks using a video converter combining relational interaction
CN108228674B DKT-based information processing method and device
CN108491514A Method and device for asking questions in a dialogue system, electronic device, and computer-readable medium
CN111680147A Data processing method, device, equipment and readable storage medium
CN110209789A Multi-modal dialogue system and method guided by user attention
CN109670576A Multi-scale visual attention image description method
CN110059220A Film recommendation method based on deep learning and Bayesian probability matrix factorization
CN109448703A Audio scene recognition method and system combining a deep neural network and a topic model
CN106503659A Action recognition method based on sparse coding tensor decomposition
CN109902164A Method for solving open-ended long-form video question answering using convolutional bidirectional self-attention networks
CN110046271A Remote sensing image description method based on voice guidance
CN113888399B Face age synthesis method based on style fusion and domain selection structure
CN111666385A Customer service question-answering system based on deep learning and implementation method
CN115080707A Training method and device for dialogue generating model, electronic equipment and storage medium
Song An Evaluation Method of English Teaching Ability Based on Deep Learning
CN112231455A Machine reading comprehension method and system
CN116109978A Self-constrained dynamic text feature-based unsupervised video description method
CN115132181A Speech recognition method, speech recognition apparatus, electronic device, storage medium, and program product
Panesar et al. Improving visual question answering by leveraging depth and adapting explainability
CN113569867A Image processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant