CN113869324A - Video common-sense knowledge reasoning implementation method based on multi-mode fusion - Google Patents
- Publication number
- CN113869324A CN113869324A CN202110954600.1A CN202110954600A CN113869324A CN 113869324 A CN113869324 A CN 113869324A CN 202110954600 A CN202110954600 A CN 202110954600A CN 113869324 A CN113869324 A CN 113869324A
- Authority
- CN
- China
- Prior art keywords
- feature
- video
- cap
- attention
- decoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/241: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415: classification based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06F18/253: fusion techniques of extracted features
- G06N3/045: combinations of networks
- G06N3/047: probabilistic or stochastic networks
- G06N3/048: activation functions
- G06N3/08: learning methods
- G06N5/04: inference or reasoning models
Abstract
The invention discloses a video common-sense knowledge reasoning implementation method based on multi-modal fusion, which comprises the following steps: 1) extracting intra-frame spatial features V_i, inter-frame temporal features V_t and sound features V_s from the input video; 2) fusing V_i, V_t and V_s to obtain the multi-modal video feature V_E of the input video; 3) extracting features from the descriptive text of the input video to obtain the language feature C_cap, and fusing the video feature V_E with C_cap to obtain the context feature [V_E, C_cap]; 4) feeding the context feature [V_E, C_cap] into a common-sense inference decoder to obtain a probability distribution over answers, and then predicting the common-sense knowledge text sequence of the input video from that distribution. The results obtained by the method have high prediction accuracy and interpretability.
Description
Technical Field
The invention relates to the technical fields of computer vision and natural language processing, and in particular to a method that fuses multi-modal video information and uses a multi-head attention mechanism to perform common-sense knowledge reasoning at both the word level and the semantic level.
Background
Video understanding is a cross-disciplinary technology combining computer vision and natural language processing: a computer represents an input sequence of video frames and mathematically models the temporal and spatial information contained in the sequence, so as to analyze the video content in depth. Video captioning builds on video understanding: a machine model deeply mines, analyzes and understands the information contained in a video, and the natural language the model outputs is called the description of the video.
Recently, interest in video common-sense knowledge reasoning research has increased, because it provides deeper underlying associations between video and language and thus facilitates higher-level visual-language reasoning. The Video2Commonsense task takes a piece of video and generates a video description together with three types of common-sense knowledge: attributes (attribute), intentions (intent) and results (effect). However, currently studied video understanding models have the following problems: 1) different kinds of knowledge are modeled by independent modules, which runs counter to common sense and intuition, cannot bridge the implicit associations among the various kinds of common-sense information, and introduces a large number of redundant parameters; 2) the internal logical closed loop of common-sense knowledge is neglected, so the models lack reasoning capability, cannot cope with the semantic interpretation of complex videos, and can hardly realize video common-sense knowledge reasoning.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a video common-sense knowledge reasoning implementation method based on multi-modal fusion: a hybrid reasoning network based on a multi-head attention mechanism is designed to jointly perform word-level and semantic-level reasoning on video content, forming a logical closed loop and sharing knowledge, with high prediction accuracy and interpretability.
In order to achieve the above and other objects, the present invention provides a video common-sense knowledge inference method based on multi-modal fusion. The technical solution is as follows: a hybrid inference network framework (HybridNet) based on a multi-head attention mechanism is designed, comprising a summary decoder, common-sense decoders (attribute, result and intention decoders), and both word-level and semantic-level reasoning. The fused video multi-modal information includes static frame information (extracted with ResNet152), dynamic temporal information (extracted with I3D) and sound information (extracted with SoundNet). For word-level reasoning, a specially designed memory module (MMHA) is introduced, which realizes word-level prediction by dynamically merging multi-head attention maps with an attention map analyzed from historical information. For semantic-level reasoning, multiple kinds of common-sense knowledge are learned jointly, where the different kinds of common-sense information form a logical closed loop through implicit cross-semantic learning and share knowledge.
The video common sense knowledge reasoning implementation method comprises the following main steps:
Step S1: extract the intra-frame spatial features V_i, inter-frame temporal features V_t and sound features V_s from the input video;
Step S2: fuse the three video features of step S1 to obtain the multi-modal video feature vector V_E;
Step S3: extract features from the descriptive text of the input video to obtain the language feature vector C_cap, and fuse the video feature V_E of step S2 with C_cap to obtain the updated complete context feature [V_E, C_cap];
Step S4: take the context feature [V_E, C_cap] obtained in S3 as the input of the common-sense inference decoder, obtain the probability distribution of the answer through a specially designed multi-head attention model, and predict the common-sense knowledge text sequence of the video according to this answer probability distribution.
As a preferable scheme: in step S1, extracting the multi-modal feature vectors from the input video includes extracting the intra-frame spatial information V_i with ResNet152, the inter-frame temporal information V_t with I3D, and the sound features V_s with SoundNet. The specific formulas are as follows:
V_i = ResNet(V)
V_t = I3D(V)
V_s = SoundNet(V)
wherein a video V is given and divided into K segments {S_1, S_2, …, S_K} at equal intervals; each frame T_k is obtained by random sampling from the corresponding segment S_k, and (T_1, T_2, …, T_K) denotes the sampled frame sequence, which is taken as the input for feature extraction. The resulting spatial feature dimension is L_i × D, the temporal feature dimension is L_t × D, and the sound feature dimension is L_s × D, where the lengths L_i = 20, L_t = 10, L_s = 10 and the hidden-layer dimension D = 1024.
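The equal-interval segmentation and per-segment random sampling described above can be sketched as follows. This is an illustrative sketch only, not the patented implementation; the function name and the frame counts are chosen for the example:

```python
import random

def sample_segments(num_frames, k):
    """Divide frame indices [0, num_frames) into k equal-interval
    segments and randomly sample one frame index from each segment,
    mirroring the segment-wise sampling described above."""
    seg_len = num_frames // k
    samples = []
    for s in range(k):
        start = s * seg_len
        samples.append(random.randrange(start, start + seg_len))
    return samples

# e.g. a 200-frame video sampled into K = 20 segment representatives
frames = sample_segments(num_frames=200, k=20)
```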
As a preferable scheme: in step S2, the three modal features of step S1 are fused. Each modal feature is mapped to a new feature space through a linear layer and a long short-term memory network, and then the position code PE and the segment code SE are added to obtain the multi-modal video feature vector V_E. The formulas are as follows:
E_i = SE_i + PE_i + LSTM(FC(V_i))
E_t = SE_t + PE_t + LSTM(FC(V_t))
E_s = SE_s + PE_s + LSTM(FC(V_s))
V_E = [E_i, E_t, E_s]
the position code PE uses a fixed trigonometric-function encoding, while the segment code SE is learned dynamically through an embedding layer to distinguish the three kinds of modal information.
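The fixed trigonometric position code and the per-modality fusion can be illustrated as follows. This is a hedged sketch: the LSTM(FC(·)) outputs are replaced by random stand-ins, and the learned segment embedding is replaced by a constant per-modality scalar purely for illustration:

```python
import numpy as np

def sinusoidal_pe(length, d_model):
    """Fixed trigonometric position code PE: sine on even
    dimensions, cosine on odd dimensions."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

D = 1024
rng = np.random.default_rng(0)
# Random stand-ins for the LSTM(FC(V_m)) outputs of each modality;
# the scalar added at the end stands in for a learned segment code SE.
E_i = rng.normal(size=(20, D)) + sinusoidal_pe(20, D) + 0  # spatial, SE id 0
E_t = rng.normal(size=(10, D)) + sinusoidal_pe(10, D) + 1  # temporal, SE id 1
E_s = rng.normal(size=(10, D)) + sinusoidal_pe(10, D) + 2  # sound, SE id 2
V_E = np.concatenate([E_i, E_t, E_s], axis=0)  # (L_i + L_t + L_s) x D
```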
As a preferable scheme: in step S3, the descriptive text of the input video is passed through embedding-layer coding and position coding to obtain the text summary code T_cap. T_cap serves as the query vector Q of the subsequent summary decoder, and the video feature V_E of step S2 serves as the key K and value V vectors of the summary decoder; the multi-head attention mechanism is then computed as follows:
Z = Attention(Q, K, V) = softmax(QK^T / √d_k) V
y_t = FFN(Z)
where Z is the output of the attention mechanism, d_k equals the dimension of the key K, and FFN is a feed-forward network consisting of a linear layer and a normalization layer. The maximum likelihood estimate is the optimization function of the summary decoder, where y_t is the current lemma to be predicted, v is the input video, and Θ_cap denotes the parameters of the summary decoder, which is mainly used for summary feature extraction. Trained with this loss function, the summary decoder obtains the high-dimensional feature expression of the text code T_cap, namely C_cap = [y_t | 1 < t < MaxLength]. Finally, the video feature V_E of step S2 and the language feature C_cap are concatenated at the feature level to obtain the updated complete context feature [V_E, C_cap].
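The summary decoder's cross-attention, with T_cap as query and V_E as key/value, can be sketched as scaled dot-product attention; the shapes and dimensions below are illustrative assumptions, not those of the trained model:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V: cross-attention with the text
    code as query and the multi-modal video feature as key/value."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
T_cap = rng.normal(size=(12, 64))   # query: text summary code (toy size)
V_E = rng.normal(size=(40, 64))     # key/value: multi-modal video feature
Z = attention(T_cap, V_E, V_E)      # one attention head's output
```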
As a preferable scheme: in step S4, the context feature [V_E, C_cap] obtained in step S3 serves as the input of the common-sense inference decoder. The multi-head attention mechanism of the common-sense decoder contains an independently designed memory module (MMHA); when predicting the probability distribution of the common-sense knowledge answer, the historically predicted lemma y_{t-1} passes through the MMHA module to obtain a conditional attention map A_condition. To avoid gradient vanishing and gradient explosion of the memory module in long-sequence decoding, the MMHA further includes specially designed gate operations and residual connections. The specific formula is as follows:
M′_t = f_mlp(Z + M_{t-1}) + Z + M_{t-1}
For each prediction of a text-sequence lemma, the MMHA serves as a bypass network: the query vector is obtained from the history prediction sequence as Q = M_{t-1} W_q, with K = [M_{t-1}; y_{t-1}] and V = [M_{t-1}; y_{t-1}] W_v, where y_{t-1} is the vector obtained by passing the lemma predicted at the previous time step through an embedding layer. M′_t is the temporarily updated memory vector, obtained from the memory vector M_{t-1} of the previous time step through a multilayer perceptron. The forget-gate value and the input-gate value are obtained from the historical memory value through linear transformation and tanh-function activation, and finally the updated M_t is obtained through the dot-product and dot-addition operations of the gate mechanism.
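The gated memory update can be sketched as follows. The exact gate parameterization is not fully specified above, so the sigmoid forget gate and tanh input gate below are assumptions consistent with the description (gates computed from the historical memory, then blended by dot-product and dot-addition):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_update(Z, M_prev, W_mlp, W_f, W_i):
    """One MMHA memory step: a residual MLP proposes a temporary
    memory M'_t, and gates computed from the historical memory blend
    it with M_{t-1}. Gate parameterization is an assumption."""
    h = Z + M_prev
    M_tmp = np.tanh(h @ W_mlp) + h          # M'_t = f_mlp(Z+M) + Z + M
    g_f = sigmoid(M_prev @ W_f)             # forget gate from history
    g_i = np.tanh(M_prev @ W_i)             # input gate from history
    return g_f * M_prev + g_i * M_tmp       # gated dot-product + dot-add

rng = np.random.default_rng(2)
d = 32
M = memory_update(rng.normal(size=(4, d)), rng.normal(size=(4, d)),
                  rng.normal(size=(d, d)) * 0.1,
                  rng.normal(size=(d, d)) * 0.1,
                  rng.normal(size=(d, d)) * 0.1)
```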
When predicting the common-sense text sequence, the historical lemmas are first fed into the decoder network (comprising the attribute decoder, result decoder and intention decoder) and the bypass network MMHA to obtain their respective feature vectors X, which are sent to the attention layers to obtain A_o and A_c. A_c then passes through a convolution layer whose kernel center is shifted to the lower-right corner to obtain A_triangle; A_triangle and A_c are summed with weights to obtain the updated conditional attention map A_condition, which is fused with the attention map A_o generated by the current multi-head attention module to obtain the guiding attention map A_guide; A_guide bridges the upper-level attention map A_previous by means of a residual connection to obtain the fused attention A_merge. The specific formulas are as follows:
A_triangle = Conv_triangle(A_c)
A_condition = α·A_triangle + (1-α)·A_c
A_guide = β·A_condition + (1-β)·A_o
A_merge = γ·A_previous + (1-γ)·A_guide
wherein A_o is the feature map generated by the multi-head attention of the common-sense decoder, A_c is the feature map generated by the MMHA from the history, and A_triangle is the feature map obtained by mapping A_c through the special mask convolution layer, which is obtained from a standard convolution by shifting the center position to the lower-right corner; A_previous is the feature map generated by the multi-head attention mechanism of the previous layer; α, β and γ are hyper-parameters that adjust the weight of each attention map. Finally, the fused attention map A_merge is sent to the subsequent attention computation operations and linear layers.
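The weighted fusion of the attention maps follows directly from the formulas above; a minimal sketch (map shapes and default weights are illustrative, with the defaults matching the embodiment's hyper-parameters):

```python
import numpy as np

def fuse_attention(A_triangle, A_c, A_o, A_previous,
                   alpha=0.1, beta=0.4, gamma=0.1):
    """Weighted fusion of attention maps per the formulas above."""
    A_condition = alpha * A_triangle + (1 - alpha) * A_c
    A_guide = beta * A_condition + (1 - beta) * A_o
    A_merge = gamma * A_previous + (1 - gamma) * A_guide
    return A_merge

rng = np.random.default_rng(3)
maps = [rng.normal(size=(8, 8)) for _ in range(4)]
A_merge = fuse_attention(*maps)
```

Since each stage is a convex combination, fusing four identical maps returns the same map unchanged, which is a quick sanity check on the weights.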
In step S4, the generated A_merge feature map is masked to cover the attention values after the current sequence position and is then sent to the softmax normalization function to obtain the normalized attention values; this attention is dot-multiplied with the context feature vector [V_E, C_cap] and passed through a linear layer to obtain the probability distribution of the answer. The specific formulas are as follows:
A_out = softmax(MASK(A_merge))
Y = FFN(A_out [V_E, C_cap]^T)
wherein A_out denotes the final attention map and Y is the resulting vector generated by the specially designed multi-head attention mechanism.
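The MASK-then-softmax step can be sketched as follows, assuming the mask sets all positions after the current one to -inf before normalization so that future lemmas receive zero attention weight:

```python
import numpy as np

def masked_softmax(A, cur_pos):
    """MASK covers attention values after the current sequence
    position with -inf before softmax, so future positions get
    exactly zero weight."""
    A = A.copy()
    A[:, cur_pos + 1:] = -np.inf
    A -= A.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(A)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
A_merge = rng.normal(size=(5, 10))
A_out = masked_softmax(A_merge, cur_pos=3)  # only positions 0..3 visible
```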
Training the common-sense decoders optimizes three maximum likelihood estimates, L_att, L_eff and L_int; taking the attribute decoder as an example, the optimization function is as follows:
Θ*_att = argmax_Θ Σ_t log p(y_t | y_<t, V_E, C_cap; Θ_att)
wherein y_t is the current lemma to be predicted, V_E is the input video feature, C_cap is the text feature of the video summary, and Θ_att denotes the model parameters of the attribute decoder. Likewise, L_eff and L_int follow the same formula. The hybrid inference network (HybridNet) is optimized with a multi-task learning loss function, the sum of the individual decoder losses, computed with cross entropy.
In step S4, the prediction of the common-sense knowledge text sequence is obtained from the final answer probability distribution: the trained attribute decoder D_ATT, result decoder D_EFF and intention decoder D_INT generate the sequences C_att, C_eff and C_int, where V_E denotes the multi-modal features of the input video and C_cap denotes the features of the video descriptive text; each current knowledge lemma is generated sequentially from the history lemmas of C_att, C_eff and C_int in an autoregressive manner.
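The autoregressive generation of each common-sense sequence can be illustrated with a generic greedy decoding loop; `step_fn` is a hypothetical stand-in for one trained decoder (attribute, result or intention), and the token ids are purely illustrative:

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len):
    """Generic autoregressive loop: the sequence is extended lemma
    by lemma from its own history until EOS, as each common-sense
    decoder does in turn. step_fn(history) -> logits."""
    history = [bos_id]
    while len(history) < max_len:
        logits = step_fn(history)
        nxt = int(np.argmax(logits))
        history.append(nxt)
        if nxt == eos_id:
            break
    return history

# Toy decoder that always predicts (last token + 1), stopping at id 5.
seq = greedy_decode(lambda h: np.eye(10)[h[-1] + 1],
                    bos_id=0, eos_id=5, max_len=20)
```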
Compared with the prior art, the invention has the following positive effects:
the invention is based on a hybrid reasoning network framework (hybrid Net) of a multi-head attention mechanism, can execute common knowledge reasoning of word-level and semantic-level, introduces multi-modal characteristic information during video characteristic extraction, can provide abundant video-level semantic characteristics, and simultaneously, the models share a video encoder and a text encoder, thereby greatly reducing the number of parameters and improving the reasoning speed of the models. In addition, the memory storage module (MMHA) designed by the invention can effectively bridge historical word element information, enhance the generalization of the common sense reasoning method and improve the prediction accuracy of the model.
Drawings
FIG. 1 is a flow chart of the steps of a video common sense knowledge reasoning implementation method based on multi-mode fusion;
FIG. 2 is a system architecture diagram of a video common sense knowledge reasoning implementation method based on multi-mode fusion according to the present invention;
FIG. 3 is a diagram illustrating an internal structure of a memory module according to an embodiment of the present invention;
FIG. 4 is a diagram of a particular multi-headed attention mechanism map in an embodiment of the invention.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.
The invention provides a video common knowledge reasoning implementation method based on multi-mode fusion, which comprises the following specific processes:
1. extraction of video multimodal features and text description features
FIG. 1 is a flow chart of steps of a video common sense knowledge reasoning implementation method based on multi-modal fusion, aiming at feature extraction of videos and texts, the method comprises the following steps:
In step S1, extracting the multi-modal feature vectors from the input video includes extracting the intra-frame spatial information V_i with ResNet152, the inter-frame temporal information V_t with I3D, and the sound features V_s with SoundNet. The specific formulas are as follows:
V_i = ResNet(V)
V_t = I3D(V)
V_s = SoundNet(V)
wherein a video V is given and divided into K segments {S_1, S_2, …, S_K} at equal intervals; each frame T_k is obtained by random sampling from the corresponding segment S_k, and (T_1, T_2, …, T_K) denotes the sampled frame sequence. The spatial feature dimension of the input video is L_i × D, twenty frames of image features in total; the temporal feature dimension is L_t × D, ten frames of temporal features in total; the sound feature dimension is L_s × D, extracted with the SoundNet pre-trained model, with a sound feature sequence length of ten. The feature dimension of all three sequences is D = 1024.
In step S2, the three video features of step S1 are fused: each modal feature is mapped to a new feature space through a linear layer and a long short-term memory network, and finally the position code PE and the segment code SE are added to obtain the multi-modal video feature vector V_E. The formulas are as follows:
E_i = SE_i + PE_i + LSTM(FC(V_i))
E_t = SE_t + PE_t + LSTM(FC(V_t))
E_s = SE_s + PE_s + LSTM(FC(V_s))
V_E = [E_i, E_t, E_s]
The segment codes are learned with an embedding layer, while the position codes use a fixed encoding given by:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos denotes the position in the current sequence and d_model the model output dimension. The three modal features are concatenated to obtain the multi-modal video feature V_E = [E_i, E_t, E_s] of dimension (L_i + L_t + L_s) × D, with L_i = 20, L_t = 10, L_s = 10.
In step S3, the text summary of the input video is passed through embedding-layer coding and position coding to obtain the text summary code T_cap. T_cap serves as the query vector Q of the subsequent summary decoder, and the video feature V_E of step S2 serves as the key K and value V vectors of the decoder; multi-head attention is computed by the following formulas:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
wherein the maximum likelihood estimate is the optimization function of the summary decoder; training with this loss function yields the high-dimensional feature expression of the text code T_cap, namely C_cap = [y_t | 1 < t < MaxLength]. The video feature V_E = [E_i, E_t, E_s] of step S2 and the language feature C_cap of step S3 are fused by concatenation at the feature level to obtain the updated complete context feature [V_E, C_cap].
2. Common sense knowledge reasoning at word level
FIG. 3 is a diagram of the internal structure of the memory module (MMHA), which performs word-level common-sense knowledge reasoning, according to an embodiment of the invention. To avoid gradient vanishing and gradient explosion of the memory module in long-sequence decoding, the MMHA includes specially designed gate operations and residual connections. The specific formula is as follows:
M′_t = f_mlp(Z + M_{t-1}) + Z + M_{t-1}
For each prediction of a sequence text lemma, the MMHA serves as a bypass network: the query vector is obtained from the history prediction sequence as Q = M_{t-1} W_q, with K = [M_{t-1}; y_{t-1}] and V = [M_{t-1}; y_{t-1}] W_v, where y_{t-1} is the vector obtained by passing the lemma predicted at the previous time step through an embedding layer. M′_t is the temporarily updated memory vector, obtained from the memory vector M_{t-1} of the previous time step through a multilayer perceptron. The forget-gate value and the input-gate value are obtained from the historical memory value through linear transformation and tanh-function activation, where the σ activation function uses sigmoid; finally, the updated M_t is obtained through the dot-product and dot-addition operations of the gate mechanism.
FIG. 4 is a diagram of a particular multi-headed attention mechanism map in an embodiment of the invention.
In step S4, when the probability distribution of the common-sense knowledge answer is predicted by the independently designed memory module (MMHA) within the model's multi-head attention mechanism, the historically predicted lemmas pass through the MMHA module, and the conditional attention map A_condition is obtained from the current memory state M_t by linear transformation. A_condition is fused with the attention map A_o generated by the current multi-head attention module to obtain the guiding attention map A_guide; A_guide bridges the upper-level attention map A_previous by means of a residual connection to obtain the final attention A_merge. The specific formulas are as follows:
A_condition = α·A_triangle + (1-α)·A_c
A_guide = β·A_condition + (1-β)·A_o
A_merge = γ·A_previous + (1-γ)·A_guide
wherein A_o is the feature map generated by the multi-head attention module in the decoder, A_c is the feature map generated by the MMHA from the history, and A_triangle is the feature map obtained by mapping A_c through a special convolution layer, obtained from a standard convolution with kernel size 3 × 3 by shifting the center position to the lower-right corner; the hyper-parameters α, β and γ are set to 0.1, 0.4 and 0.1 respectively to adjust the weight of each attention map. Finally, the fused attention map A_merge is sent to the subsequent attention computation operations and linear layers. The specific formula is as follows:
A_out = softmax(MASK(A_merge))
in which A is producedmergeMasking the invisible attention value of the feature map by masking operation, and obtaining the normalized attention A by the softmax normalization function againoutThe attention is given by the probability distribution X of the answer obtained by dot product with the value V vector and the linear layerout。
3. Semantic level common sense knowledge reasoning
In step S4, the context feature [V_E, C_cap] obtained in step S3 serves as the input of the common-sense inference decoder; the probability distribution of the answer is obtained through the specially designed multi-head attention model, and the common-sense knowledge text sequence of the video is predicted according to this answer probability distribution. Training the common-sense decoders optimizes three maximum likelihood estimates; the different kinds of common-sense information form a logical closed loop through implicit cross-semantic learning and share knowledge. Taking the attribute decoder as an example, the optimization function is as follows:
Θ*_att = argmax_Θ Σ_t log p(y_t | y_<t, V_E, C_cap; Θ_att)
wherein y_t is the current lemma to be predicted, V_E is the input video feature, C_cap is the text feature of the video summary, and Θ_att denotes the model parameters of the attribute decoder. Likewise, the result and intention losses follow the same formula. The model finally optimizes the hybrid inference network (HybridNet) through a multi-task learning loss function, the sum of the individual decoder losses computed with cross entropy, and predicts the text sequence in an autoregressive manner.
wherein D_ATT, D_EFF and D_INT are the trained attribute, result and intention decoders respectively, V_E denotes the multi-modal features of the input video, and C_cap denotes the features of the input video summary; the text lemmas of the current attribute, intention and result sequences are generated in turn by autoregression to produce the final common-sense knowledge text sequence prediction.
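The multi-task objective, assumed here to be the sum of per-decoder cross-entropy losses as described, can be sketched as follows; the vocabulary size and sequence length are illustrative:

```python
import numpy as np

def cross_entropy(logits, target):
    """Token-level cross entropy for one decoder's sequence."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(target)), target].mean()

def multitask_loss(per_decoder):
    """Sum of the attribute/result/intention cross-entropy losses,
    the assumed form of the multi-task learning objective."""
    return sum(cross_entropy(l, t) for l, t in per_decoder)

rng = np.random.default_rng(5)
# One (logits, target) pair per decoder: 6 tokens over a 100-word vocab.
batches = [(rng.normal(size=(6, 100)), rng.integers(0, 100, size=6))
           for _ in range(3)]
loss = multitask_loss(batches)
```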
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.
Claims (6)
1. A video common sense knowledge reasoning implementation method based on multi-mode fusion comprises the following steps:
1) extracting the intra-frame spatial features V_i, inter-frame temporal features V_t and sound features V_s from the input video;
2) fusing the intra-frame spatial features V_i, inter-frame temporal features V_t and sound features V_s to obtain the multi-modal video feature V_E of the input video;
3) extracting features from the descriptive text of the input video to obtain the language feature C_cap, and fusing the video feature V_E with the language feature C_cap to obtain the context feature [V_E, C_cap];
4) inputting the context feature [V_E, C_cap] into a common-sense inference decoder to obtain a probability distribution of the answers, and then predicting the common-sense knowledge text sequence of the input video based on the obtained probability distribution.
2. The method of claim 1, wherein obtaining the video feature V_E comprises: mapping the intra-frame spatial feature V_i to a feature space through a linear layer and a long short-term memory network, and adding its corresponding position code PE_i and segment code SE_i to obtain the feature E_i; mapping the inter-frame temporal feature V_t to a feature space through a linear layer and a long short-term memory network, and adding its corresponding position code PE_t and segment code SE_t to obtain the feature E_t; mapping the sound feature V_s to a feature space through a linear layer and a long short-term memory network, and adding its corresponding position code PE_s and segment code SE_s to obtain the feature E_s; and then fusing E_i, E_t and E_s to obtain the video feature V_E = [E_i, E_t, E_s].
3. The method of claim 1, wherein obtaining the language feature C_cap comprises: passing the descriptive text of the input video through embedding-layer coding and position coding to obtain the text summary code T_cap, then taking T_cap as the query vector Q of the summary decoder and the video feature V_E as the key K and value V vectors of the summary decoder, and computing the multi-head attention mechanism to obtain the language feature C_cap.
4. The method of claim 3, wherein the multi-head attention mechanism is calculated by the formula:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(QK^T / √d_k) V
5. The method of claim 4, wherein the probability distribution over answers is obtained by: using the historically predicted lexical element y_{t-1}, the common-sense reasoning decoder obtains a conditional attention map A_condition through a memory module MMHA; the conditional attention map A_condition is then fused with the feature map A_o generated by the multi-head attention module from the history lemmas, yielding a guiding attention map A_guide; the guiding attention map A_guide bridges the upper-level attention map A_previous through a residual connection to obtain the fused attention A_merge; a masking operation is then applied to A_merge to cover the attention values after the current sequence position, and the result is passed through a normalization function to obtain the normalized attention; finally, the normalized attention is dot-multiplied with the video feature V_E and passed through a linear layer to obtain the probability distribution over answers.
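The attention-fusion and masking pipeline of claim 5 can be sketched as below. This is a loose numpy illustration under stated assumptions: both fusion steps are modeled as simple addition, the attention maps are random stand-ins for the outputs of the memory module and multi-head attention module, and all dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
T, S, d, vocab = 4, 12, 8, 50      # decoder steps, context tokens, feature dim, answer vocabulary

A_condition = rng.normal(size=(T, S))  # conditional attention map from the memory module (MMHA)
A_o = rng.normal(size=(T, S))          # feature map generated from history lemmas
A_guide = A_condition + A_o            # fusion into the guiding attention map (sum as a stand-in)
A_previous = rng.normal(size=(T, S))   # upper-level attention map
A_merge = A_guide + A_previous         # residual bridge

# Mask: at step t, cover attention values after the current sequence position,
# then normalize so each row is a proper attention distribution.
future = np.triu(np.ones((T, S), dtype=bool), k=1)
A_masked = np.where(future, -1e9, A_merge)
A_norm = softmax(A_masked)

V_E = rng.normal(size=(S, d))          # multi-modal video feature
W_out = rng.normal(size=(d, vocab))    # final linear layer (hypothetical weights)
answer_dist = softmax(A_norm @ V_E @ W_out)   # probability distribution over answers
print(answer_dist.shape)  # (4, 50)
```

The masking step is what keeps decoding causal: at step t the decoder can only attend to positions up to t, so the answer distribution at each step depends only on already-generated history.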
6. The method of claim 1, wherein the common-sense knowledge text sequence of the input video is predicted using an attribute decoder D_ATT, an effect decoder D_EFF, and an intention decoder D_INT; wherein V_E denotes the multi-modal feature of the input video, C_cap denotes the feature of the video descriptive text, and the current knowledge lemma is generated sequentially, in an autoregressive manner, from the history lemmas C_att, C_eff, and C_int.
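The autoregressive generation scheme of claim 6 can be sketched with toy stand-ins. The three decoder functions below are deliberately trivial placeholders for D_ATT, D_EFF, and D_INT (the patent's actual decoders are attention networks); only the control flow, feeding each decoder's own history back in at every step, reflects the claim.

```python
def make_decoder(offset):
    """Toy decoder factory: the returned function emits a deterministic
    next-lemma id from the history length. A stand-in for D_ATT / D_EFF / D_INT."""
    def decode(V_E, C_cap, history):
        return offset + len(history)
    return decode

D_ATT, D_EFF, D_INT = make_decoder(100), make_decoder(200), make_decoder(300)

V_E, C_cap = object(), object()   # placeholders for the fused video and text features

def generate(decoder, V_E, C_cap, steps=3):
    """Autoregressive loop: each new lemma is conditioned on the lemmas
    generated so far, plus the video and text features."""
    history = []
    for _ in range(steps):
        history.append(decoder(V_E, C_cap, history))
    return history

C_att = generate(D_ATT, V_E, C_cap)   # [100, 101, 102]
C_eff = generate(D_EFF, V_E, C_cap)   # [200, 201, 202]
C_int = generate(D_INT, V_E, C_cap)   # [300, 301, 302]
print(C_att, C_eff, C_int)
```

Running the three decoders sequentially, each in its own autoregressive loop, yields the attribute, effect, and intention token sequences that together form the common-sense knowledge text.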
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110954600.1A CN113869324A (en) | 2021-08-19 | 2021-08-19 | Video common-sense knowledge reasoning implementation method based on multi-mode fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113869324A true CN113869324A (en) | 2021-12-31 |
Family
ID=78990660
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110954600.1A Pending CN113869324A (en) | 2021-08-19 | 2021-08-19 | Video common-sense knowledge reasoning implementation method based on multi-mode fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113869324A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114339450A (en) * | 2022-03-11 | 2022-04-12 | 中国科学技术大学 | Video comment generation method, system, device and storage medium |
CN116012374A (en) * | 2023-03-15 | 2023-04-25 | 译企科技(成都)有限公司 | Three-dimensional PET-CT head and neck tumor segmentation system and method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113034401A (en) * | 2021-04-08 | 2021-06-25 | 中国科学技术大学 | Video denoising method and device, storage medium and electronic equipment |
CN113191230A (en) * | 2021-04-20 | 2021-07-30 | 内蒙古工业大学 | Gait recognition method based on gait space-time characteristic decomposition |
Non-Patent Citations (1)
Title |
---|
WEIJIANG YU et al.: "Hybrid Reasoning Network for Video-based Commonsense Captioning", arXiv:2108.02365v1, 5 August 2021, pages 1-3 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xu et al. | Multimodal learning with transformers: A survey | |
Zhou et al. | A comprehensive survey on pretrained foundation models: A history from bert to chatgpt | |
WO2021233112A1 (en) | Multimodal machine learning-based translation method, device, equipment, and storage medium | |
Wu et al. | Video sentiment analysis with bimodal information-augmented multi-head attention | |
WO2021169745A1 (en) | User intention recognition method and apparatus based on statement context relationship prediction | |
CN118349673A (en) | Training method of text processing model, text processing method and device | |
WO2023160472A1 (en) | Model training method and related device | |
Zhang et al. | Explicit contextual semantics for text comprehension | |
CN113869324A (en) | Video common-sense knowledge reasoning implementation method based on multi-mode fusion | |
Zhou et al. | Learning with annotation of various degrees | |
CN116432019A (en) | Data processing method and related equipment | |
Pang et al. | A novel syntax-aware automatic graphics code generation with attention-based deep neural network | |
CN111597816A (en) | Self-attention named entity recognition method, device, equipment and storage medium | |
Pal et al. | R-GRU: Regularized gated recurrent unit for handwritten mathematical expression recognition | |
CN114677631A (en) | Cultural resource video Chinese description generation method based on multi-feature fusion and multi-stage training | |
CN112738647A (en) | Video description method and system based on multi-level coder-decoder | |
Chen et al. | Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion | |
Zhou et al. | An image captioning model based on bidirectional depth residuals and its application | |
D’Ulizia | Exploring multimodal input fusion strategies | |
CN115964497A (en) | Event extraction method integrating attention mechanism and convolutional neural network | |
CN116341564A (en) | Problem reasoning method and device based on semantic understanding | |
Cao et al. | Predict, pretrained, select and answer: Interpretable and scalable complex question answering over knowledge bases | |
Wang et al. | A stack-propagation framework with slot filling for multi-domain dialogue state tracking | |
CN115470327A (en) | Medical question-answering method based on knowledge graph and related equipment | |
CN114936564A (en) | Multi-language semantic matching method and system based on alignment variational self-coding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||