CN113869324A - Video common-sense knowledge reasoning implementation method based on multi-mode fusion - Google Patents

Video common-sense knowledge reasoning implementation method based on multi-mode fusion Download PDF

Info

Publication number
CN113869324A
Authority
CN
China
Prior art keywords
feature
video
cap
attention
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110954600.1A
Other languages
Chinese (zh)
Inventor
方跃坚
梁健
余伟江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202110954600.1A priority Critical patent/CN113869324A/en
Publication of CN113869324A publication Critical patent/CN113869324A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video common-sense knowledge reasoning implementation method based on multi-modal fusion, which comprises the following steps: 1) extracting the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s from an input video; 2) fusing the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s to obtain the multi-modal video feature V_E of the input video; 3) extracting features from the descriptive text of the input video to obtain the language feature C_cap, and fusing the video feature V_E with the language feature C_cap to obtain the context feature [V_E, C_cap]; 4) feeding the context feature [V_E, C_cap] into a common-sense reasoning decoder to obtain a probability distribution over answers, and then predicting the common-sense knowledge text sequence of the input video from the obtained probability distribution. The results obtained by the method have higher prediction accuracy and interpretability.

Description

Video common-sense knowledge reasoning implementation method based on multi-mode fusion
Technical Field
The invention relates to the technical field of computer vision and natural language processing, and in particular to a method that fuses multi-modal video information and uses a multi-head attention mechanism to perform word-level and semantic-level common-sense knowledge reasoning.
Background
Video understanding is a cross-disciplinary technology combining computer vision and natural language processing: a computer represents an input sequence of video frames and mathematically models the temporal and spatial information contained in that sequence in order to analyze the video content in depth. Video captioning builds on video understanding: a machine model deeply mines, analyzes and understands the information contained in a video and then outputs a natural-language description of that video.
Recently, interest in video common-sense knowledge reasoning has increased because it provides deeper underlying associations between video and language and thus facilitates higher-level visual-language reasoning. The "Video2Commonsense" task takes a piece of video and generates a video description together with three types of common-sense knowledge: attributes, intentions and effects. However, currently studied video understanding models have the following problems: 1) they model different kinds of knowledge with independent modules, which runs counter to common sense and intuition, cannot bridge the implicit associations among the various kinds of common-sense information, and introduces a large number of redundant parameters; 2) they ignore the internal logical closed loop of common-sense knowledge, lack reasoning capability, cannot cope with the semantic interpretation of complex videos, and therefore struggle to realize video common-sense knowledge reasoning.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a video common-sense knowledge reasoning implementation method based on multi-modal fusion. A hybrid reasoning network based on a multi-head attention mechanism is designed to jointly perform word-level and semantic-level reasoning over video content, forming a logical closed loop and sharing knowledge, which yields higher prediction accuracy and interpretability.
In order to achieve the above and other objects, the present invention provides a video common-sense knowledge reasoning method based on multi-modal fusion. The technical solution is as follows: a hybrid reasoning network framework (HybridNet) based on a multi-head attention mechanism is designed, comprising a summary decoder, common-sense decoders (attribute, effect and intention decoders), and both word-level and semantic-level reasoning. The fused multi-modal video information includes static frame information (extracted with ResNet152), dynamic temporal information (extracted with I3D) and sound information (extracted with SoundNet). For word-level reasoning, a specially designed memory module (MMHA) is introduced, which realizes word-level prediction by dynamically merging multi-head attention maps with an attention map distilled from historical information. For semantic-level reasoning, multiple kinds of common-sense knowledge are learned jointly, where the different kinds of common-sense information form a logical closed loop through implicit cross-semantic learning and share knowledge.
The video common-sense knowledge reasoning implementation method comprises the following main steps:
Step S1: extract the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s from the input video.
Step S2: fuse the three video features of step S1 to obtain the multi-modal video feature vector V_E.
Step S3: extract features from the descriptive text of the input video to obtain the language feature vector C_cap, and fuse the video feature V_E of step S2 with the language feature C_cap to obtain the updated complete context feature [V_E, C_cap].
Step S4: feed the context feature [V_E, C_cap] obtained in step S3 into the common-sense reasoning decoder, obtain the probability distribution over answers through a specially designed multi-head attention model, and predict the common-sense knowledge text sequence of the video from the answer probability distribution.
As a preferable scheme: in step S1, extracting the multi-modal feature vectors from the input video includes extracting the intra-frame spatial information V_i with ResNet152, the inter-frame temporal information V_t with I3D, and the sound feature V_s with SoundNet, with the specific formulas:
V_i = ResNet(V)
V_t = I3D(V)
V_s = SoundNet(V)
where a video V is given and divided at equal intervals into K segments {S_1, S_2, …, S_K}; each clip T_K is obtained by random sampling from the corresponding segment S_K, and (T_1, T_2, …, T_K) denotes the clip sequence sampled from the video segments. The sampled clip sequence is taken as the input for feature extraction, giving the spatial feature V_i ∈ R^(L_i × D), the temporal feature V_t ∈ R^(L_t × D) and the sound feature V_s ∈ R^(L_s × D), where the sequence lengths are L_i = 20, L_t = 10, L_s = 10 and the hidden-layer dimension is D = 1024.
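As an illustrative, non-limiting sketch of step S1 (not the claimed implementation), the following Python/PyTorch code outlines the three feature-extraction branches. The ResNet152 branch uses the torchvision model; the extract_timing_features and extract_sound_features functions are hypothetical placeholders, since I3D and SoundNet are external pretrained models whose interfaces are not specified in this document.

```python
import torch
import torchvision.models as models

# Intra-frame spatial branch: ResNet152 up to the global-average-pool layer,
# applied to the sampled frames; a linear projection maps the 2048-d ResNet
# output to the hidden size D = 1024 used in the patent (L_i = 20 frames).
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet_backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer
project_i = torch.nn.Linear(2048, 1024)

def extract_spatial_features(frames):
    """frames: (L_i, 3, 224, 224) tensor of sampled RGB frames -> V_i of shape (L_i, 1024)."""
    with torch.no_grad():
        feats = resnet_backbone(frames).flatten(1)   # (L_i, 2048)
    return project_i(feats)                          # (L_i, 1024)

# The temporal (I3D) and audio (SoundNet) branches are assumed to come from
# external pretrained models; these placeholders only document the expected
# output shapes (V_t: (10, 1024), V_s: (10, 1024)).
def extract_timing_features(clip):        # hypothetical I3D wrapper
    return torch.randn(10, 1024)

def extract_sound_features(waveform):     # hypothetical SoundNet wrapper
    return torch.randn(10, 1024)
```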
As a preferable scheme: in step S2, the three modal features of step S1 are fused. Each modal feature is mapped into a new feature space by a linear layer and a long short-term memory network, and a position encoding PE and a segment encoding SE are then added to obtain the multi-modal video feature vector V_E, with the formulas:
E_i = SE_i + PE_i + LSTM(FC(V_i))
E_t = SE_t + PE_t + LSTM(FC(V_t))
E_s = SE_s + PE_s + LSTM(FC(V_s))
V_E = [E_i, E_t, E_s]
The position encoding PE uses a fixed trigonometric (sinusoidal) encoding, while the segment encoding SE uses an embedding layer that is learned dynamically to distinguish the three kinds of modal information.
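A minimal sketch of the fusion in step S2, assuming the standard sinusoidal position encoding and a learned segment embedding; the class and parameter names below are illustrative and not taken from the original filing.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """One branch of step S2: linear projection, LSTM, then a fixed sinusoidal
    positional encoding plus a learned segment embedding (sketch)."""
    def __init__(self, in_dim, d_model=1024, max_len=64, segment_id=0, num_segments=3):
        super().__init__()
        self.fc = nn.Linear(in_dim, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.segment = nn.Embedding(num_segments, d_model)   # SE: learned
        self.segment_id = segment_id
        # PE: fixed sinusoidal encoding, as in the standard Transformer
        pos = torch.arange(max_len).unsqueeze(1).float()
        i = torch.arange(0, d_model, 2).float()
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos / 10000 ** (i / d_model))
        pe[:, 1::2] = torch.cos(pos / 10000 ** (i / d_model))
        self.register_buffer("pe", pe)

    def forward(self, x):                      # x: (B, L, in_dim)
        h, _ = self.lstm(self.fc(x))           # (B, L, d_model)
        seg_ids = torch.full((x.size(1),), self.segment_id, dtype=torch.long, device=x.device)
        return h + self.pe[: x.size(1)] + self.segment(seg_ids)  # E_m = SE_m + PE_m + LSTM(FC(V_m))

# V_E = [E_i, E_t, E_s]: concatenate the three encoded modalities along the time axis.
enc_i, enc_t, enc_s = (ModalityEncoder(1024, segment_id=k) for k in range(3))
V_i, V_t, V_s = torch.randn(1, 20, 1024), torch.randn(1, 10, 1024), torch.randn(1, 10, 1024)
V_E = torch.cat([enc_i(V_i), enc_t(V_t), enc_s(V_s)], dim=1)   # (1, 40, 1024)
```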
As a preferable scheme: in step S3, the descriptive text of the input video is passed through an embedding layer and a position encoding to obtain the text summary encoding T_cap. T_cap serves as the query vector Q of the subsequent summary decoder, and the video feature V_E of step S2 serves as the key K and value V vectors of the summary decoder; the multi-head attention computation is then carried out with the formulas:
Z = softmax(Q K^T / √d_k) V
y_t = FFN(Z)
L_cap(Θ_cap) = Σ_t log P(y_t | y_{<t}, v; Θ_cap)
where Z is the output of the attention mechanism, d_k equals the dimension of the key K, FFN is a feed-forward network consisting of a linear layer and a normalization layer, and the maximum-likelihood estimate L_cap is the optimization function of the summary decoder; y_t is the current token to be predicted, v is the input video, and Θ_cap denotes the model parameters of the summary decoder, which is mainly used for feature extraction of the summary. Trained with this loss function, the summary decoder obtains the high-dimensional feature expression of the text encoding T_cap, namely C_cap = [y_t | 1 < t < MaxLength]. Finally, the video feature V_E of step S2 and the language feature C_cap are concatenated at the feature level to obtain the updated complete context feature [V_E, C_cap].
As a preferable scheme: in step S4, the context feature [V_E, C_cap] obtained in step S3 is used as the input of the common-sense reasoning decoder. The multi-head attention mechanism of the common-sense decoder contains an independently designed memory module (MMHA); when the probability distribution of the common-sense answer is predicted, the historically predicted token y_{t-1} passes through the MMHA module to obtain a conditional attention map A_condition. To avoid gradient vanishing and gradient explosion of the memory module during long-sequence decoding, the MMHA further includes specially designed gate operations and residual connections. The temporary memory update is
M'_t = f_mlp(Z + M_{t-1}) + Z + M_{t-1}
For each prediction of a token of the text sequence, the MMHA serves as a bypass network whose query vector is obtained from the history prediction sequence: Q = M_{t-1} W_q, K = [M_{t-1}; y_{t-1}], V = [M_{t-1}; y_{t-1}] W_v, where y_{t-1} is the vector obtained by passing the token predicted at the previous time step through an embedding layer. M'_t is the temporarily updated memory vector, obtained from the memory vector M_{t-1} of the previous time step through a multilayer perceptron. The forget-gate value g_f and the input-gate value g_i are obtained from the historical memory value by a linear transformation followed by a tanh activation, and the updated M_t is finally obtained through the dot-product and dot-add operations of the gate mechanism.
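Because the exact gate equations appear only as images in the original filing, the following sketch shows one plausible parameterization of the MMHA memory update that is consistent with the description above (MLP update with residual, sigmoid forget/input gates computed from the previous memory, tanh-activated candidate); the gate projections w_forget and w_input are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MemoryUpdate(nn.Module):
    """Gated memory update of the MMHA bypass network (sketch).
    M'_t = f_mlp(Z + M_{t-1}) + Z + M_{t-1}; sigmoid gates derived from the
    previous memory then decide how much of the tanh-activated candidate
    replaces the old memory. This is one consistent parameterization, not the
    patent's exact formula."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.f_mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                   nn.Linear(d_model, d_model))
        self.w_forget = nn.Linear(d_model, d_model)   # assumption: gates are
        self.w_input = nn.Linear(d_model, d_model)    # linear maps of M_{t-1}

    def forward(self, Z, M_prev):
        M_tmp = self.f_mlp(Z + M_prev) + Z + M_prev        # M'_t with residual
        g_f = torch.sigmoid(self.w_forget(M_prev))         # forget gate
        g_i = torch.sigmoid(self.w_input(M_prev))          # input gate
        return g_f * M_prev + g_i * torch.tanh(M_tmp)      # dot-product + dot-add update

mem = MemoryUpdate()
Z = torch.randn(1, 8, 1024)        # attention output at the current step
M_prev = torch.randn(1, 8, 1024)   # memory carried over from the previous step
M_t = mem(Z, M_prev)
```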
When predicting a common-sense text sequence, the historical tokens are first fed into the decoder network (comprising the attribute decoder, the effect decoder and the intention decoder) and into the bypass network MMHA to obtain their respective feature vectors X, which are each fed into an attention layer to obtain A_o and A_c. A_c then passes through a convolutional layer whose kernel is shifted toward the lower-right corner to give A_triangle; A_triangle and A_c are combined by weighted summation into the updated conditional attention map A_condition, which is fused with the attention map A_o generated by the current multi-head attention module into the guiding attention map A_guide. The guiding attention map A_guide bridges the attention map A_previous of the previous layer through a residual connection to obtain the fused attention A_merge. The specific formulas are:
A_triangle = Conv_triangle(A_c)
A_condition = α·A_triangle + (1-α)·A_c
A_guide = β·A_condition + (1-β)·A_o
A_merge = γ·A_previous + (1-γ)·A_guide
where A_o is the attention map generated by the multi-head attention of the common-sense decoder, A_c is the attention map generated by the MMHA from history, and A_triangle is the feature map obtained by mapping A_c through a special masked convolutional layer, which is a standard convolution whose center position is shifted toward the lower-right corner; A_previous is the attention map generated by the multi-head attention mechanism of the previous layer; α, β and γ are hyper-parameters that adjust the weight of each attention term. The fused attention map A_merge is finally passed to the subsequent attention computation and the linear layer.
In step S4, the generated A_merge feature map is masked to cover the attention values beyond the current sequence position and is then fed into the softmax normalization function to obtain the normalized attention. The probability distribution of the answer is obtained by taking the dot product of this attention with the context feature vector [V_E, C_cap] and passing the result through a linear layer, with the specific formulas:
A_out = softmax(MASK(A_merge))
Y = FFN(A_out [V_E, C_cap]^T)
where A_out denotes the final attention map and Y is the result vector generated by the specially designed multi-head attention mechanism.
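An illustrative sketch of the attention-map fusion and masked output described above, using the α = 0.1, β = 0.4, γ = 0.1 values given later in the embodiment; the padded 3 × 3 convolution is only a stand-in for the lower-right-shifted convolution, and all tensor shapes and the context length are hypothetical.

```python
import torch
import torch.nn.functional as F

def fuse_attention_maps(A_o, A_c, A_previous, conv_triangle, alpha=0.1, beta=0.4, gamma=0.1):
    """Blend the decoder attention map A_o with the memory-conditioned map A_c (sketch)."""
    A_triangle = conv_triangle(A_c)                       # stand-in for the shifted convolution
    A_condition = alpha * A_triangle + (1 - alpha) * A_c
    A_guide = beta * A_condition + (1 - beta) * A_o
    A_merge = gamma * A_previous + (1 - gamma) * A_guide
    return A_merge

def masked_answer_logits(A_merge, context, out_proj):
    """A_out = softmax(MASK(A_merge)); Y = FFN(A_out [V_E, C_cap])  (sketch)."""
    T = A_merge.size(-1)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # hide later positions
    A_out = F.softmax(A_merge.masked_fill(future, float("-inf")), dim=-1)
    return out_proj(A_out @ context)                      # answer logits per position

# Usage with hypothetical shapes: one head, T positions, vocabulary size 30522.
T, D, vocab = 40, 1024, 30522
conv_triangle = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
out_proj = torch.nn.Linear(D, vocab)
A_o, A_c, A_prev = (torch.randn(1, 1, T, T) for _ in range(3))
A_merge = fuse_attention_maps(A_o, A_c, A_prev, conv_triangle)
logits = masked_answer_logits(A_merge.squeeze(1), torch.randn(1, T, D), out_proj)
```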
The common-sense decoders are trained by optimizing three maximum-likelihood objectives L_att, L_eff and L_int, where the attribute objective L_att is:
L_att(Θ_att) = Σ_t log P(y_t | y_{<t}, V_E, C_cap; Θ_att)
where y_t is the current token to be predicted, V_E is the input video feature, C_cap is the text feature of the video summary, and Θ_att are the model parameters of the attribute decoder. Likewise, L_eff and L_int follow the same form. The hybrid reasoning network (HybridNet) is optimized by a multi-task learning loss function combining these objectives, computed with cross entropy.
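A minimal sketch of the multi-task training objective; summing the four cross-entropy terms with equal weights is an assumption, since the exact combination formula appears only as an image in the original filing.

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits, targets, pad_id=0):
    """Token-level cross entropy for one decoder head, i.e. the negative of the
    maximum-likelihood objective sum_t log P(y_t | y_<t, V_E, C_cap; Theta)."""
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)

def hybrid_net_loss(caption_out, attr_out, effect_out, intent_out,
                    caption_tgt, attr_tgt, effect_tgt, intent_tgt):
    """Multi-task loss over the summary decoder and the three common-sense
    decoders (equal weights assumed here)."""
    return (sequence_nll(caption_out, caption_tgt)
            + sequence_nll(attr_out, attr_tgt)
            + sequence_nll(effect_out, effect_tgt)
            + sequence_nll(intent_out, intent_tgt))

# Usage with hypothetical shapes: batch 2, sequence length 12, vocabulary 30522.
logits = torch.randn(2, 12, 30522, requires_grad=True)
targets = torch.randint(1, 30522, (2, 12))
loss = hybrid_net_loss(logits, logits, logits, logits, targets, targets, targets, targets)
loss.backward()
```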
In step S4, the prediction of the common-sense knowledge text sequence is obtained from the final answer probability distribution, with the specific formulas:
y_t^att = D_ATT(V_E, C_cap, C_att)
y_t^eff = D_EFF(V_E, C_cap, C_eff)
y_t^int = D_INT(V_E, C_cap, C_int)
where D_ATT, D_EFF and D_INT are the trained attribute decoder, effect decoder and intention decoder respectively, V_E denotes the multi-modal features of the input video, and C_cap denotes the text tokens of the features of the video descriptive text. The current knowledge tokens y_t^att, y_t^eff and y_t^int are generated sequentially, in an autoregressive manner, from the history tokens C_att, C_eff and C_int.
Compared with the prior art, the invention has the following positive effects:
the invention is based on a hybrid reasoning network framework (hybrid Net) of a multi-head attention mechanism, can execute common knowledge reasoning of word-level and semantic-level, introduces multi-modal characteristic information during video characteristic extraction, can provide abundant video-level semantic characteristics, and simultaneously, the models share a video encoder and a text encoder, thereby greatly reducing the number of parameters and improving the reasoning speed of the models. In addition, the memory storage module (MMHA) designed by the invention can effectively bridge historical word element information, enhance the generalization of the common sense reasoning method and improve the prediction accuracy of the model.
Drawings
FIG. 1 is a flow chart of the steps of a video common sense knowledge reasoning implementation method based on multi-mode fusion;
FIG. 2 is a system architecture diagram of a video common sense knowledge reasoning implementation method based on multi-mode fusion according to the present invention;
FIG. 3 is a diagram illustrating an internal structure of a memory module according to an embodiment of the present invention;
FIG. 4 is a diagram of a particular multi-headed attention mechanism map in an embodiment of the invention.
Detailed Description
The embodiments of the present invention are described below with specific examples in conjunction with the accompanying drawings; other advantages and effects of the present invention will be readily apparent to those skilled in the art from this disclosure. The invention is capable of other and different embodiments, and its several details may be modified in various respects, all without departing from the spirit and scope of the present invention.
The invention provides a video common-sense knowledge reasoning implementation method based on multi-modal fusion; the specific process is as follows:
1. Extraction of video multi-modal features and text description features
FIG. 1 is a flow chart of the steps of the video common-sense knowledge reasoning implementation method based on multi-modal fusion. For feature extraction from videos and text, the method comprises the following steps:
In step S1, extracting the multi-modal feature vectors from the input video includes extracting the intra-frame spatial information V_i with ResNet152, the inter-frame temporal information V_t with I3D, and the sound feature V_s with SoundNet, with the specific formulas:
V_i = ResNet(V)
V_t = I3D(V)
V_s = SoundNet(V)
where a video V is given and divided at equal intervals into K segments {S_1, S_2, …, S_K}; each clip T_K is obtained by random sampling from the corresponding segment S_K, and (T_1, T_2, …, T_K) denotes the sampled clip sequence. The spatial feature of the input video is obtained as V_i ∈ R^(L_i × D), i.e. twenty frames of image features in total; the temporal feature is V_t ∈ R^(L_t × D), i.e. ten frames of temporal features in total; and the sound feature V_s ∈ R^(L_s × D) is extracted with the SoundNet pre-trained model, with a sound feature sequence length of ten. The feature dimension of all three sequences is D = 1024.
In step S2, the three video features of step S1 are fused. Each modal feature is mapped into a new feature space by a linear layer and a long short-term memory network, and finally a position encoding PE and a segment encoding SE are added to obtain the multi-modal video feature vector V_E, with the formulas:
E_i = SE_i + PE_i + LSTM(FC(V_i))
E_t = SE_t + PE_t + LSTM(FC(V_t))
E_s = SE_s + PE_s + LSTM(FC(V_s))
V_E = [E_i, E_t, E_s]
The segment encoding is learned with an embedding layer, while the position encoding uses a fixed encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos denotes the position in the current sequence and d_model is the model output dimension. Concatenating the three modal features gives the multi-modal video feature V_E = [E_i, E_t, E_s] with dimension V_E ∈ R^((L_i+L_t+L_s) × D), where L_i = 20, L_t = 10, L_s = 10.
In step S3, the text summary of the input video is passed through an embedding layer and a position encoding to obtain the text summary encoding T_cap. T_cap serves as the query vector Q of the subsequent summary decoder, and the video feature V_E of step S2 serves as the key K and value V vectors of the decoder; multi-head attention is then computed with the formulas:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
The maximum-likelihood estimate L_cap is the optimization function of the summary decoder; training with this loss function yields the high-dimensional feature expression of the text encoding T_cap, namely C_cap = [y_t | 1 < t < MaxLength]. The video feature V_E = [E_i, E_t, E_s] of step S2 and the language feature C_cap of step S3 are fused by concatenation at the feature level to obtain the updated complete context feature [V_E, C_cap].
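The formulas above are the standard scaled dot-product multi-head attention; for illustration, a compact from-scratch sketch follows (the head count and dimensions are illustrative choices, not values stated in the filing).

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MultiHead(Q,K,V) = Concat(head_1..head_h) W^O with
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) and
    Attention(Q,K,V) = softmax(Q K^T / sqrt(d_k)) V."""
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q, self.w_k, self.w_v, self.w_o = (nn.Linear(d_model, d_model) for _ in range(4))

    def forward(self, Q, K, V):
        B = Q.size(0)
        def split(x, w):   # (B, L, d_model) -> (B, h, L, d_k)
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(Q, self.w_q), split(K, self.w_k), split(V, self.w_v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # (B, h, Lq, Lk)
        heads = torch.softmax(scores, dim=-1) @ v                # (B, h, Lq, d_k)
        concat = heads.transpose(1, 2).reshape(B, -1, self.h * self.d_k)
        return self.w_o(concat)

mha = MultiHeadAttention()
T_cap, V_E = torch.randn(1, 16, 1024), torch.randn(1, 40, 1024)
Z = mha(T_cap, V_E, V_E)   # caption tokens attend over the fused video feature
```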
2. Word-level common-sense knowledge reasoning
FIG. 3 shows the internal structure of the memory module (MMHA), which performs word-level common-sense knowledge reasoning, in an embodiment of the invention. To avoid gradient vanishing and gradient explosion of the memory module during long-sequence decoding, the MMHA includes specially designed gate operations and residual connections. The temporary memory update is
M'_t = f_mlp(Z + M_{t-1}) + Z + M_{t-1}
For each prediction of a token of the text sequence, the MMHA serves as a bypass network whose query vector is obtained from the history prediction sequence: Q = M_{t-1} W_q, K = [M_{t-1}; y_{t-1}], V = [M_{t-1}; y_{t-1}] W_v, where y_{t-1} is the vector obtained by passing the token predicted at the previous time step through an embedding layer. M'_t is the temporarily updated memory vector, obtained from the memory vector M_{t-1} of the previous time step through a multilayer perceptron. The forget-gate value g_f and the input-gate value g_i are obtained from the historical memory value by a linear transformation followed by a tanh activation, with sigmoid used as the σ activation function, and the updated M_t is finally obtained through the dot-product and dot-add operations of the gate mechanism.
FIG. 4 is a diagram of a particular multi-headed attention mechanism map in an embodiment of the invention.
In step S4, when the probability distribution of the common-sense answer is predicted, the historically predicted tokens pass through the independently designed memory module (MMHA) in the multi-head attention mechanism of the model; a linear transformation of the current memory state M_t yields the attention map A_c, from which the conditional attention map A_condition is obtained. A_condition is fused with the attention map A_o generated by the current multi-head attention module into the guiding attention map A_guide, and A_guide bridges the attention map A_previous of the previous layer through a residual connection to obtain the final attention A_merge, with the specific formulas:
A_triangle = Conv_triangle(A_c)
A_condition = α·A_triangle + (1-α)·A_c
A_guide = β·A_condition + (1-β)·A_o
A_merge = γ·A_previous + (1-γ)·A_guide
where A_o is the attention map generated by the multi-head attention module in the decoder, A_c is the attention map generated by the MMHA from history, and A_triangle is the feature map obtained by mapping A_c through a special convolutional layer, which is a standard convolution with kernel size 3 × 3 whose center position is shifted toward the lower-right corner. The hyper-parameters α, β and γ are set to 0.1, 0.4 and 0.1 respectively to adjust the weight of each attention term, and the fused attention map A_merge is finally passed to the subsequent attention computation and the linear layer, with the specific formulas:
A_out = softmax(MASK(A_merge))
X_out = FFN(A_out · V)
where the generated A_merge feature map is masked to cover the attention values that should not be visible, the softmax normalization function then gives the normalized attention A_out, and the probability distribution X_out of the answer is obtained by taking the dot product of this attention with the value vector V and passing the result through the linear layer.
3. Semantic-level common-sense knowledge reasoning
In step S4, the context feature [V_E, C_cap] obtained in step S3 is used as the input of the common-sense reasoning decoder; the probability distribution of the answer is obtained through the specially designed multi-head attention model, and the common-sense knowledge text sequence of the video is predicted from the answer probability distribution. The common-sense decoders are trained by optimizing the three maximum-likelihood objectives L_att, L_eff and L_int; the different kinds of common-sense information form a logical closed loop through implicit cross-semantic learning and share knowledge. The attribute objective L_att is:
L_att(Θ_att) = Σ_t log P(y_t | y_{<t}, V_E, C_cap; Θ_att)
where y_t is the current token to be predicted, V_E is the input video feature, C_cap is the text feature of the video summary, and Θ_att are the model parameters of the attribute decoder. Likewise, L_eff and L_int follow the same form. The model finally optimizes the hybrid reasoning network (HybridNet) through a multi-task learning loss function combining these objectives, computed with cross entropy, and predicts the text sequences in an autoregressive manner, with the specific formulas:
y_t^att = D_ATT(V_E, C_cap, C_att)
y_t^eff = D_EFF(V_E, C_cap, C_eff)
y_t^int = D_INT(V_E, C_cap, C_int)
where D_ATT, D_EFF and D_INT are the trained attribute decoder, effect decoder and intention decoder, V_E denotes the multi-modal features of the input video, C_cap denotes the features of the input video summary, and y_t^att, y_t^int and y_t^eff are the text tokens of the current attribute, intention and effect sequences respectively; the final common-sense text sequence prediction is generated sequentially by autoregression.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims (6)

1. A video common-sense knowledge reasoning implementation method based on multi-modal fusion, comprising the following steps:
1) extracting the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s from an input video respectively;
2) fusing the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s to obtain the multi-modal video feature V_E of the input video;
3) extracting features from the descriptive text of the input video to obtain the language feature C_cap, and fusing the video feature V_E with the language feature C_cap to obtain the context feature [V_E, C_cap];
4) feeding the context feature [V_E, C_cap] into a common-sense reasoning decoder to obtain a probability distribution over answers, and then predicting the common-sense knowledge text sequence of the input video from the obtained probability distribution.
2. The method of claim 1, wherein the video feature V_E is obtained by: mapping the intra-frame spatial feature V_i into a feature space through a linear layer and a long short-term memory network and adding the position encoding PE_i and the segment encoding SE_i corresponding to the intra-frame spatial feature V_i to obtain the feature E_i; mapping the inter-frame temporal feature V_t into the feature space through a linear layer and a long short-term memory network and adding the position encoding PE_t and the segment encoding SE_t corresponding to the inter-frame temporal feature V_t to obtain the feature E_t; mapping the sound feature V_s into the feature space through a linear layer and a long short-term memory network and adding the position encoding PE_s and the segment encoding SE_s corresponding to the sound feature V_s to obtain the feature E_s; and then fusing E_i, E_t and E_s to obtain the video feature V_E = [E_i, E_t, E_s].
3. The method of claim 1, wherein the language feature C_cap is obtained by: passing the descriptive text of the input video through an embedding layer encoding and a position encoding to obtain the text summary encoding T_cap; then taking T_cap as the query vector Q of a summary decoder and the video feature V_E as the key K and value V vectors of the summary decoder, and computing a multi-head attention mechanism to obtain the language feature C_cap.
4. The method of claim 3, wherein the multi-head attention mechanism is computed by the formulas:
Z = softmax(Q K^T / √d_k) V,
y_t = FFN(Z),
L_cap(Θ_cap) = Σ_t log P(y_t | y_{<t}, v; Θ_cap),
wherein d_k is the dimension of the key K, FFN is a feed-forward network, L_cap is the optimization function of the summary decoder, y_t is the token to be predicted, v is the input video, and Θ_cap are the model parameters of the summary decoder.
5. The method of claim 4, wherein the probability distribution of the answers is obtained by: passing the token y_{t-1} historically predicted by the common-sense reasoning decoder through a memory module MMHA to obtain a conditional attention map A_condition; fusing the conditional attention map A_condition with the feature map A_o generated by the multi-head attention module from the historical tokens to obtain a guiding attention map A_guide; bridging the guiding attention map A_guide with the attention map A_previous of the previous layer through a residual connection to obtain the fused attention A_merge; masking the attention A_merge to cover the attention values after the current sequence position and feeding it into a normalization function to obtain the normalized attention; and then taking the dot product of the normalized attention with the video feature V_E and passing it through a linear layer to obtain the probability distribution of the answers.
6. The method of claim 1, wherein the common-sense knowledge text sequence of the input video is predicted using
y_t^att = D_ATT(V_E, C_cap, C_att),
y_t^eff = D_EFF(V_E, C_cap, C_eff),
y_t^int = D_INT(V_E, C_cap, C_int),
wherein D_ATT is the attribute decoder, D_EFF is the effect decoder, D_INT is the intention decoder, V_E denotes the multi-modal features of the input video, C_cap denotes the text tokens of the features of the video descriptive text, and the current knowledge tokens y_t^att, y_t^eff and y_t^int are generated sequentially, in an autoregressive manner, from the history tokens C_att, C_eff and C_int.
CN202110954600.1A 2021-08-19 2021-08-19 Video common-sense knowledge reasoning implementation method based on multi-mode fusion Pending CN113869324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110954600.1A CN113869324A (en) 2021-08-19 2021-08-19 Video common-sense knowledge reasoning implementation method based on multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110954600.1A CN113869324A (en) 2021-08-19 2021-08-19 Video common-sense knowledge reasoning implementation method based on multi-mode fusion

Publications (1)

Publication Number Publication Date
CN113869324A true CN113869324A (en) 2021-12-31

Family

ID=78990660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110954600.1A Pending CN113869324A (en) 2021-08-19 2021-08-19 Video common-sense knowledge reasoning implementation method based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN113869324A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339450A (en) * 2022-03-11 2022-04-12 中国科学技术大学 Video comment generation method, system, device and storage medium
CN116012374A (en) * 2023-03-15 2023-04-25 译企科技(成都)有限公司 Three-dimensional PET-CT head and neck tumor segmentation system and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034401A (en) * 2021-04-08 2021-06-25 中国科学技术大学 Video denoising method and device, storage medium and electronic equipment
CN113191230A (en) * 2021-04-20 2021-07-30 内蒙古工业大学 Gait recognition method based on gait space-time characteristic decomposition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034401A (en) * 2021-04-08 2021-06-25 中国科学技术大学 Video denoising method and device, storage medium and electronic equipment
CN113191230A (en) * 2021-04-20 2021-07-30 内蒙古工业大学 Gait recognition method based on gait space-time characteristic decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEIJIANG YU 等: "Hybrid Reasoning Network for Video-based Commonsense Captioning", 《ARXIV:2108.02365V1》, 5 August 2021 (2021-08-05), pages 1 - 3 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339450A (en) * 2022-03-11 2022-04-12 中国科学技术大学 Video comment generation method, system, device and storage medium
CN116012374A (en) * 2023-03-15 2023-04-25 译企科技(成都)有限公司 Three-dimensional PET-CT head and neck tumor segmentation system and method

Similar Documents

Publication Publication Date Title
Xu et al. Multimodal learning with transformers: A survey
Zhou et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
Wu et al. Video sentiment analysis with bimodal information-augmented multi-head attention
WO2021169745A1 (en) User intention recognition method and apparatus based on statement context relationship prediction
CN118349673A (en) Training method of text processing model, text processing method and device
WO2023160472A1 (en) Model training method and related device
Zhang et al. Explicit contextual semantics for text comprehension
CN113869324A (en) Video common-sense knowledge reasoning implementation method based on multi-mode fusion
Zhou et al. Learning with annotation of various degrees
CN116432019A (en) Data processing method and related equipment
Pang et al. A novel syntax-aware automatic graphics code generation with attention-based deep neural network
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
Pal et al. R-GRU: Regularized gated recurrent unit for handwritten mathematical expression recognition
CN114677631A (en) Cultural resource video Chinese description generation method based on multi-feature fusion and multi-stage training
CN112738647A (en) Video description method and system based on multi-level coder-decoder
Chen et al. Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion
Zhou et al. An image captioning model based on bidirectional depth residuals and its application
D’Ulizia Exploring multimodal input fusion strategies
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN116341564A (en) Problem reasoning method and device based on semantic understanding
Cao et al. Predict, pretrained, select and answer: Interpretable and scalable complex question answering over knowledge bases
Wang et al. A stack-propagation framework with slot filling for multi-domain dialogue state tracking
CN115470327A (en) Medical question-answering method based on knowledge graph and related equipment
CN114936564A (en) Multi-language semantic matching method and system based on alignment variational self-coding

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination