CN113869324A - Video common-sense knowledge reasoning implementation method based on multi-mode fusion - Google Patents

Video common-sense knowledge reasoning implementation method based on multi-mode fusion Download PDF

Info

Publication number
CN113869324A
Authority
CN
China
Prior art keywords
feature
video
cap
attention
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110954600.1A
Other languages
Chinese (zh)
Inventor
方跃坚
梁健
余伟江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202110954600.1A priority Critical patent/CN113869324A/en
Publication of CN113869324A publication Critical patent/CN113869324A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video common-sense knowledge reasoning implementation method based on multi-modal fusion, which comprises the following steps: 1) extracting the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s from an input video; 2) fusing the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s to obtain the multi-modal video feature V_E of the input video; 3) extracting features from the descriptive text of the input video to obtain the language feature C_cap, and fusing the video feature V_E with the language feature C_cap to obtain the context feature [V_E, C_cap]; 4) feeding the context feature [V_E, C_cap] into a common-sense reasoning decoder to obtain a probability distribution over answers, and then predicting the common-sense knowledge text sequence of the input video from the obtained probability distribution. The results obtained by the method have higher prediction accuracy and interpretability.

Description

Video common-sense knowledge reasoning implementation method based on multi-mode fusion
Technical Field
The invention relates to the technical field of computer vision and natural language processing, and in particular to a method that fuses multi-modal video information and uses a multi-head attention mechanism to perform word-level and semantic-level common-sense knowledge reasoning.
Background
Video understanding is a cross-disciplinary technology combining computer vision and natural language processing: a computer represents an input sequence of video frames and mathematically models the temporal and spatial information contained in that sequence in order to analyze the video content in depth. Video captioning builds on video understanding: a machine model deeply mines, analyzes and understands the information contained in a video and then outputs a natural-language description of that video.
Recently, interest in video common-sense knowledge reasoning has increased because it provides deeper underlying associations between video and language and thus facilitates higher-level visual-language reasoning. The "Video2Commonsense" task takes a piece of video and generates a video description together with three types of common-sense knowledge: attributes, intentions and effects. However, currently studied video understanding models have the following problems: 1) they model different kinds of knowledge with independent modules, which runs counter to common sense and intuition, cannot bridge the implicit associations among the various kinds of common-sense information, and introduces a large number of redundant parameters; 2) they ignore the internal logical closed loop of common-sense knowledge, lack reasoning capability, cannot cope with the semantic interpretation of complex videos, and therefore struggle to realize video common-sense knowledge reasoning.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a video common-sense knowledge reasoning implementation method based on multi-modal fusion. A hybrid reasoning network based on a multi-head attention mechanism is designed to jointly perform word-level and semantic-level reasoning over video content, forming a logical closed loop and sharing knowledge, which yields higher prediction accuracy and interpretability.
In order to achieve the above and other objects, the present invention provides a video common-sense knowledge reasoning method based on multi-modal fusion. The technical solution is as follows: a hybrid reasoning network framework (HybridNet) based on a multi-head attention mechanism is designed, comprising a summary decoder, common-sense decoders (attribute, effect and intention decoders), and both word-level and semantic-level reasoning. The fused multi-modal video information includes static frame information (extracted with ResNet152), dynamic temporal information (extracted with I3D) and sound information (extracted with SoundNet). For word-level reasoning, a specially designed memory module (MMHA) is introduced, which realizes word-level prediction by dynamically merging multi-head attention maps with an attention map distilled from historical information. For semantic-level reasoning, multiple kinds of common-sense knowledge are learned jointly, where the different kinds of common-sense information form a logical closed loop through implicit cross-semantic learning and share knowledge.
The video common-sense knowledge reasoning implementation method comprises the following main steps:
Step S1: extract the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s from the input video.
Step S2: fuse the three video features of step S1 to obtain the multi-modal video feature vector V_E.
Step S3: extract features from the descriptive text of the input video to obtain the language feature vector C_cap, and fuse the video feature V_E of step S2 with the language feature C_cap to obtain the updated complete context feature [V_E, C_cap].
Step S4: feed the context feature [V_E, C_cap] obtained in step S3 into the common-sense reasoning decoder, obtain the probability distribution over answers through a specially designed multi-head attention model, and predict the common-sense knowledge text sequence of the video from the answer probability distribution.
As a preferable scheme: in step S1, extracting the multi-modal feature vectors from the input video includes extracting the intra-frame spatial information V_i with ResNet152, the inter-frame temporal information V_t with I3D, and the sound feature V_s with SoundNet, with the specific formulas:
V_i = ResNet(V)
V_t = I3D(V)
V_s = SoundNet(V)
where a video V is given and divided at equal intervals into K segments {S_1, S_2, …, S_K}; each clip T_K is obtained by random sampling from the corresponding segment S_K, and (T_1, T_2, …, T_K) denotes the clip sequence sampled from the video segments. The sampled clip sequence is taken as the input for feature extraction, giving the spatial feature V_i ∈ R^(L_i × D), the temporal feature V_t ∈ R^(L_t × D) and the sound feature V_s ∈ R^(L_s × D), where the sequence lengths are L_i = 20, L_t = 10, L_s = 10 and the hidden-layer dimension is D = 1024.
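As an illustrative, non-limiting sketch of step S1 (not the claimed implementation), the following Python/PyTorch code outlines the three feature-extraction branches. The ResNet152 branch uses the torchvision model; the extract_timing_features and extract_sound_features functions are hypothetical placeholders, since I3D and SoundNet are external pretrained models whose interfaces are not specified in this document.

```python
import torch
import torchvision.models as models

# Intra-frame spatial branch: ResNet152 up to the global-average-pool layer,
# applied to the sampled frames; a linear projection maps the 2048-d ResNet
# output to the hidden size D = 1024 used in the patent (L_i = 20 frames).
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet_backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer
project_i = torch.nn.Linear(2048, 1024)

def extract_spatial_features(frames):
    """frames: (L_i, 3, 224, 224) tensor of sampled RGB frames -> V_i of shape (L_i, 1024)."""
    with torch.no_grad():
        feats = resnet_backbone(frames).flatten(1)   # (L_i, 2048)
    return project_i(feats)                          # (L_i, 1024)

# The temporal (I3D) and audio (SoundNet) branches are assumed to come from
# external pretrained models; these placeholders only document the expected
# output shapes (V_t: (10, 1024), V_s: (10, 1024)).
def extract_timing_features(clip):        # hypothetical I3D wrapper
    return torch.randn(10, 1024)

def extract_sound_features(waveform):     # hypothetical SoundNet wrapper
    return torch.randn(10, 1024)
```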
As a preferable scheme: in step S2, the three modal features of step S1 are fused. Each modal feature is mapped into a new feature space by a linear layer and a long short-term memory network, and a position encoding PE and a segment encoding SE are then added to obtain the multi-modal video feature vector V_E, with the formulas:
E_i = SE_i + PE_i + LSTM(FC(V_i))
E_t = SE_t + PE_t + LSTM(FC(V_t))
E_s = SE_s + PE_s + LSTM(FC(V_s))
V_E = [E_i, E_t, E_s]
The position encoding PE uses a fixed trigonometric (sinusoidal) encoding, while the segment encoding SE uses an embedding layer that is learned dynamically to distinguish the three kinds of modal information.
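A minimal sketch of the fusion in step S2, assuming the standard sinusoidal position encoding and a learned segment embedding; the class and parameter names below are illustrative and not taken from the original filing.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """One branch of step S2: linear projection, LSTM, then a fixed sinusoidal
    positional encoding plus a learned segment embedding (sketch)."""
    def __init__(self, in_dim, d_model=1024, max_len=64, segment_id=0, num_segments=3):
        super().__init__()
        self.fc = nn.Linear(in_dim, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.segment = nn.Embedding(num_segments, d_model)   # SE: learned
        self.segment_id = segment_id
        # PE: fixed sinusoidal encoding, as in the standard Transformer
        pos = torch.arange(max_len).unsqueeze(1).float()
        i = torch.arange(0, d_model, 2).float()
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos / 10000 ** (i / d_model))
        pe[:, 1::2] = torch.cos(pos / 10000 ** (i / d_model))
        self.register_buffer("pe", pe)

    def forward(self, x):                      # x: (B, L, in_dim)
        h, _ = self.lstm(self.fc(x))           # (B, L, d_model)
        seg_ids = torch.full((x.size(1),), self.segment_id, dtype=torch.long, device=x.device)
        return h + self.pe[: x.size(1)] + self.segment(seg_ids)  # E_m = SE_m + PE_m + LSTM(FC(V_m))

# V_E = [E_i, E_t, E_s]: concatenate the three encoded modalities along the time axis.
enc_i, enc_t, enc_s = (ModalityEncoder(1024, segment_id=k) for k in range(3))
V_i, V_t, V_s = torch.randn(1, 20, 1024), torch.randn(1, 10, 1024), torch.randn(1, 10, 1024)
V_E = torch.cat([enc_i(V_i), enc_t(V_t), enc_s(V_s)], dim=1)   # (1, 40, 1024)
```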
As a preferable scheme: in step S3, the descriptive text of the input video is passed through an embedding layer and a position encoding to obtain the text summary encoding T_cap. T_cap serves as the query vector Q of the subsequent summary decoder, and the video feature V_E of step S2 serves as the key K and value V vectors of the summary decoder; the multi-head attention computation is then carried out with the formulas:
Z = softmax(Q K^T / √d_k) V
y_t = FFN(Z)
L_cap(Θ_cap) = Σ_t log P(y_t | y_{<t}, v; Θ_cap)
where Z is the output of the attention mechanism, d_k equals the dimension of the key K, FFN is a feed-forward network consisting of a linear layer and a normalization layer, and the maximum-likelihood estimate L_cap is the optimization function of the summary decoder; y_t is the current token to be predicted, v is the input video, and Θ_cap denotes the model parameters of the summary decoder, which is mainly used for feature extraction of the summary. Trained with this loss function, the summary decoder obtains the high-dimensional feature expression of the text encoding T_cap, namely C_cap = [y_t | 1 < t < MaxLength]. Finally, the video feature V_E of step S2 and the language feature C_cap are concatenated at the feature level to obtain the updated complete context feature [V_E, C_cap].
As a preferable scheme: in step S4, the context feature [V_E, C_cap] obtained in step S3 is used as the input of the common-sense reasoning decoder. The multi-head attention mechanism of the common-sense decoder contains an independently designed memory module (MMHA); when the probability distribution of the common-sense answer is predicted, the historically predicted token y_{t-1} passes through the MMHA module to obtain a conditional attention map A_condition. To avoid gradient vanishing and gradient explosion of the memory module during long-sequence decoding, the MMHA further includes specially designed gate operations and residual connections. The temporary memory update is
M'_t = f_mlp(Z + M_{t-1}) + Z + M_{t-1}
For each prediction of a token of the text sequence, the MMHA serves as a bypass network whose query vector is obtained from the history prediction sequence: Q = M_{t-1} W_q, K = [M_{t-1}; y_{t-1}], V = [M_{t-1}; y_{t-1}] W_v, where y_{t-1} is the vector obtained by passing the token predicted at the previous time step through an embedding layer. M'_t is the temporarily updated memory vector, obtained from the memory vector M_{t-1} of the previous time step through a multilayer perceptron. The forget-gate value g_f and the input-gate value g_i are obtained from the historical memory value by a linear transformation followed by a tanh activation, and the updated M_t is finally obtained through the dot-product and dot-add operations of the gate mechanism.
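Because the exact gate equations appear only as images in the original filing, the following sketch shows one plausible parameterization of the MMHA memory update that is consistent with the description above (MLP update with residual, sigmoid forget/input gates computed from the previous memory, tanh-activated candidate); the gate projections w_forget and w_input are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MemoryUpdate(nn.Module):
    """Gated memory update of the MMHA bypass network (sketch).
    M'_t = f_mlp(Z + M_{t-1}) + Z + M_{t-1}; sigmoid gates derived from the
    previous memory then decide how much of the tanh-activated candidate
    replaces the old memory. This is one consistent parameterization, not the
    patent's exact formula."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.f_mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                   nn.Linear(d_model, d_model))
        self.w_forget = nn.Linear(d_model, d_model)   # assumption: gates are
        self.w_input = nn.Linear(d_model, d_model)    # linear maps of M_{t-1}

    def forward(self, Z, M_prev):
        M_tmp = self.f_mlp(Z + M_prev) + Z + M_prev        # M'_t with residual
        g_f = torch.sigmoid(self.w_forget(M_prev))         # forget gate
        g_i = torch.sigmoid(self.w_input(M_prev))          # input gate
        return g_f * M_prev + g_i * torch.tanh(M_tmp)      # dot-product + dot-add update

mem = MemoryUpdate()
Z = torch.randn(1, 8, 1024)        # attention output at the current step
M_prev = torch.randn(1, 8, 1024)   # memory carried over from the previous step
M_t = mem(Z, M_prev)
```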
When predicting a common-sense text sequence, the historical tokens are first fed into the decoder network (comprising the attribute decoder, the effect decoder and the intention decoder) and into the bypass network MMHA to obtain their respective feature vectors X, which are each fed into an attention layer to obtain A_o and A_c. A_c then passes through a convolutional layer whose kernel is shifted toward the lower-right corner to give A_triangle; A_triangle and A_c are combined by weighted summation into the updated conditional attention map A_condition, which is fused with the attention map A_o generated by the current multi-head attention module into the guiding attention map A_guide. The guiding attention map A_guide bridges the attention map A_previous of the previous layer through a residual connection to obtain the fused attention A_merge. The specific formulas are:
A_triangle = Conv_triangle(A_c)
A_condition = α·A_triangle + (1-α)·A_c
A_guide = β·A_condition + (1-β)·A_o
A_merge = γ·A_previous + (1-γ)·A_guide
where A_o is the attention map generated by the multi-head attention of the common-sense decoder, A_c is the attention map generated by the MMHA from history, and A_triangle is the feature map obtained by mapping A_c through a special masked convolutional layer, which is a standard convolution whose center position is shifted toward the lower-right corner; A_previous is the attention map generated by the multi-head attention mechanism of the previous layer; α, β and γ are hyper-parameters that adjust the weight of each attention term. The fused attention map A_merge is finally passed to the subsequent attention computation and the linear layer.
In step S4, the generated A_merge feature map is masked to cover the attention values beyond the current sequence position and is then fed into the softmax normalization function to obtain the normalized attention. The probability distribution of the answer is obtained by taking the dot product of this attention with the context feature vector [V_E, C_cap] and passing the result through a linear layer, with the specific formulas:
A_out = softmax(MASK(A_merge))
Y = FFN(A_out [V_E, C_cap]^T)
where A_out denotes the final attention map and Y is the result vector generated by the specially designed multi-head attention mechanism.
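An illustrative sketch of the attention-map fusion and masked output described above, using the α = 0.1, β = 0.4, γ = 0.1 values given later in the embodiment; the padded 3 × 3 convolution is only a stand-in for the lower-right-shifted convolution, and all tensor shapes and the context length are hypothetical.

```python
import torch
import torch.nn.functional as F

def fuse_attention_maps(A_o, A_c, A_previous, conv_triangle, alpha=0.1, beta=0.4, gamma=0.1):
    """Blend the decoder attention map A_o with the memory-conditioned map A_c (sketch)."""
    A_triangle = conv_triangle(A_c)                       # stand-in for the shifted convolution
    A_condition = alpha * A_triangle + (1 - alpha) * A_c
    A_guide = beta * A_condition + (1 - beta) * A_o
    A_merge = gamma * A_previous + (1 - gamma) * A_guide
    return A_merge

def masked_answer_logits(A_merge, context, out_proj):
    """A_out = softmax(MASK(A_merge)); Y = FFN(A_out [V_E, C_cap])  (sketch)."""
    T = A_merge.size(-1)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # hide later positions
    A_out = F.softmax(A_merge.masked_fill(future, float("-inf")), dim=-1)
    return out_proj(A_out @ context)                      # answer logits per position

# Usage with hypothetical shapes: one head, T positions, vocabulary size 30522.
T, D, vocab = 40, 1024, 30522
conv_triangle = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
out_proj = torch.nn.Linear(D, vocab)
A_o, A_c, A_prev = (torch.randn(1, 1, T, T) for _ in range(3))
A_merge = fuse_attention_maps(A_o, A_c, A_prev, conv_triangle)
logits = masked_answer_logits(A_merge.squeeze(1), torch.randn(1, T, D), out_proj)
```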
The common-sense decoders are trained by optimizing three maximum-likelihood objectives L_att, L_eff and L_int, where the attribute objective L_att is:
L_att(Θ_att) = Σ_t log P(y_t | y_{<t}, V_E, C_cap; Θ_att)
where y_t is the current token to be predicted, V_E is the input video feature, C_cap is the text feature of the video summary, and Θ_att are the model parameters of the attribute decoder. Likewise, L_eff and L_int follow the same form. The hybrid reasoning network (HybridNet) is optimized by a multi-task learning loss function combining these objectives, computed with cross entropy.
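A minimal sketch of the multi-task training objective; summing the four cross-entropy terms with equal weights is an assumption, since the exact combination formula appears only as an image in the original filing.

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits, targets, pad_id=0):
    """Token-level cross entropy for one decoder head, i.e. the negative of the
    maximum-likelihood objective sum_t log P(y_t | y_<t, V_E, C_cap; Theta)."""
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)

def hybrid_net_loss(caption_out, attr_out, effect_out, intent_out,
                    caption_tgt, attr_tgt, effect_tgt, intent_tgt):
    """Multi-task loss over the summary decoder and the three common-sense
    decoders (equal weights assumed here)."""
    return (sequence_nll(caption_out, caption_tgt)
            + sequence_nll(attr_out, attr_tgt)
            + sequence_nll(effect_out, effect_tgt)
            + sequence_nll(intent_out, intent_tgt))

# Usage with hypothetical shapes: batch 2, sequence length 12, vocabulary 30522.
logits = torch.randn(2, 12, 30522, requires_grad=True)
targets = torch.randint(1, 30522, (2, 12))
loss = hybrid_net_loss(logits, logits, logits, logits, targets, targets, targets, targets)
loss.backward()
```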
In step S4, the prediction of the common-sense knowledge text sequence is obtained from the final answer probability distribution, with the specific formulas:
y_t^att = D_ATT(V_E, C_cap, C_att)
y_t^eff = D_EFF(V_E, C_cap, C_eff)
y_t^int = D_INT(V_E, C_cap, C_int)
where D_ATT, D_EFF and D_INT are the trained attribute decoder, effect decoder and intention decoder respectively, V_E denotes the multi-modal features of the input video, and C_cap denotes the text tokens of the features of the video descriptive text. The current knowledge tokens y_t^att, y_t^eff and y_t^int are generated sequentially, in an autoregressive manner, from the history tokens C_att, C_eff and C_int.
Compared with the prior art, the invention has the following positive effects:
the invention is based on a hybrid reasoning network framework (hybrid Net) of a multi-head attention mechanism, can execute common knowledge reasoning of word-level and semantic-level, introduces multi-modal characteristic information during video characteristic extraction, can provide abundant video-level semantic characteristics, and simultaneously, the models share a video encoder and a text encoder, thereby greatly reducing the number of parameters and improving the reasoning speed of the models. In addition, the memory storage module (MMHA) designed by the invention can effectively bridge historical word element information, enhance the generalization of the common sense reasoning method and improve the prediction accuracy of the model.
Drawings
FIG. 1 is a flow chart of the steps of a video common sense knowledge reasoning implementation method based on multi-mode fusion;
FIG. 2 is a system architecture diagram of a video common sense knowledge reasoning implementation method based on multi-mode fusion according to the present invention;
FIG. 3 is a diagram illustrating an internal structure of a memory module according to an embodiment of the present invention;
FIG. 4 is a diagram of a particular multi-headed attention mechanism map in an embodiment of the invention.
Detailed Description
The embodiments of the present invention are described below with specific examples in conjunction with the accompanying drawings; other advantages and effects of the present invention will be readily apparent to those skilled in the art from this disclosure. The invention is capable of other and different embodiments, and its several details may be modified in various respects, all without departing from the spirit and scope of the present invention.
The invention provides a video common-sense knowledge reasoning implementation method based on multi-modal fusion; the specific process is as follows:
1. Extraction of video multi-modal features and text description features
FIG. 1 is a flow chart of the steps of the video common-sense knowledge reasoning implementation method based on multi-modal fusion. For feature extraction from videos and text, the method comprises the following steps:
In step S1, extracting the multi-modal feature vectors from the input video includes extracting the intra-frame spatial information V_i with ResNet152, the inter-frame temporal information V_t with I3D, and the sound feature V_s with SoundNet, with the specific formulas:
V_i = ResNet(V)
V_t = I3D(V)
V_s = SoundNet(V)
where a video V is given and divided at equal intervals into K segments {S_1, S_2, …, S_K}; each clip T_K is obtained by random sampling from the corresponding segment S_K, and (T_1, T_2, …, T_K) denotes the sampled clip sequence. The spatial feature of the input video is obtained as V_i ∈ R^(L_i × D), i.e. twenty frames of image features in total; the temporal feature is V_t ∈ R^(L_t × D), i.e. ten frames of temporal features in total; and the sound feature V_s ∈ R^(L_s × D) is extracted with the SoundNet pre-trained model, with a sound feature sequence length of ten. The feature dimension of all three sequences is D = 1024.
In step S2, the three video features of step S1 are fused. Each modal feature is mapped into a new feature space by a linear layer and a long short-term memory network, and finally a position encoding PE and a segment encoding SE are added to obtain the multi-modal video feature vector V_E, with the formulas:
E_i = SE_i + PE_i + LSTM(FC(V_i))
E_t = SE_t + PE_t + LSTM(FC(V_t))
E_s = SE_s + PE_s + LSTM(FC(V_s))
V_E = [E_i, E_t, E_s]
The segment encoding is learned with an embedding layer, while the position encoding uses a fixed encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos denotes the position in the current sequence and d_model is the model output dimension. Concatenating the three modal features gives the multi-modal video feature V_E = [E_i, E_t, E_s] with dimension V_E ∈ R^((L_i+L_t+L_s) × D), where L_i = 20, L_t = 10, L_s = 10.
In step S3, the text summary of the input video is passed through an embedding layer and a position encoding to obtain the text summary encoding T_cap. T_cap serves as the query vector Q of the subsequent summary decoder, and the video feature V_E of step S2 serves as the key K and value V vectors of the decoder; multi-head attention is then computed with the formulas:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
The maximum-likelihood estimate L_cap is the optimization function of the summary decoder; training with this loss function yields the high-dimensional feature expression of the text encoding T_cap, namely C_cap = [y_t | 1 < t < MaxLength]. The video feature V_E = [E_i, E_t, E_s] of step S2 and the language feature C_cap of step S3 are fused by concatenation at the feature level to obtain the updated complete context feature [V_E, C_cap].
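The formulas above are the standard scaled dot-product multi-head attention; for illustration, a compact from-scratch sketch follows (the head count and dimensions are illustrative choices, not values stated in the filing).

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """MultiHead(Q,K,V) = Concat(head_1..head_h) W^O with
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) and
    Attention(Q,K,V) = softmax(Q K^T / sqrt(d_k)) V."""
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q, self.w_k, self.w_v, self.w_o = (nn.Linear(d_model, d_model) for _ in range(4))

    def forward(self, Q, K, V):
        B = Q.size(0)
        def split(x, w):   # (B, L, d_model) -> (B, h, L, d_k)
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(Q, self.w_q), split(K, self.w_k), split(V, self.w_v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # (B, h, Lq, Lk)
        heads = torch.softmax(scores, dim=-1) @ v                # (B, h, Lq, d_k)
        concat = heads.transpose(1, 2).reshape(B, -1, self.h * self.d_k)
        return self.w_o(concat)

mha = MultiHeadAttention()
T_cap, V_E = torch.randn(1, 16, 1024), torch.randn(1, 40, 1024)
Z = mha(T_cap, V_E, V_E)   # caption tokens attend over the fused video feature
```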
2. Word-level common-sense knowledge reasoning
FIG. 3 shows the internal structure of the memory module (MMHA), which performs word-level common-sense knowledge reasoning, in an embodiment of the invention. To avoid gradient vanishing and gradient explosion of the memory module during long-sequence decoding, the MMHA includes specially designed gate operations and residual connections. The temporary memory update is
M'_t = f_mlp(Z + M_{t-1}) + Z + M_{t-1}
For each prediction of a token of the text sequence, the MMHA serves as a bypass network whose query vector is obtained from the history prediction sequence: Q = M_{t-1} W_q, K = [M_{t-1}; y_{t-1}], V = [M_{t-1}; y_{t-1}] W_v, where y_{t-1} is the vector obtained by passing the token predicted at the previous time step through an embedding layer. M'_t is the temporarily updated memory vector, obtained from the memory vector M_{t-1} of the previous time step through a multilayer perceptron. The forget-gate value g_f and the input-gate value g_i are obtained from the historical memory value by a linear transformation followed by a tanh activation, with sigmoid used as the σ activation function, and the updated M_t is finally obtained through the dot-product and dot-add operations of the gate mechanism.
FIG. 4 is a diagram of a particular multi-headed attention mechanism map in an embodiment of the invention.
In step S4, when the probability distribution of the common-sense answer is predicted, the historically predicted tokens pass through the independently designed memory module (MMHA) in the multi-head attention mechanism of the model; a linear transformation of the current memory state M_t yields the attention map A_c, from which the conditional attention map A_condition is obtained. A_condition is fused with the attention map A_o generated by the current multi-head attention module into the guiding attention map A_guide, and A_guide bridges the attention map A_previous of the previous layer through a residual connection to obtain the final attention A_merge, with the specific formulas:
A_triangle = Conv_triangle(A_c)
A_condition = α·A_triangle + (1-α)·A_c
A_guide = β·A_condition + (1-β)·A_o
A_merge = γ·A_previous + (1-γ)·A_guide
where A_o is the attention map generated by the multi-head attention module in the decoder, A_c is the attention map generated by the MMHA from history, and A_triangle is the feature map obtained by mapping A_c through a special convolutional layer, which is a standard convolution with kernel size 3 × 3 whose center position is shifted toward the lower-right corner. The hyper-parameters α, β and γ are set to 0.1, 0.4 and 0.1 respectively to adjust the weight of each attention term, and the fused attention map A_merge is finally passed to the subsequent attention computation and the linear layer, with the specific formulas:
A_out = softmax(MASK(A_merge))
X_out = FFN(A_out · V)
where the generated A_merge feature map is masked to cover the attention values that should not be visible, the softmax normalization function then gives the normalized attention A_out, and the probability distribution X_out of the answer is obtained by taking the dot product of this attention with the value vector V and passing the result through the linear layer.
3. Semantic-level common-sense knowledge reasoning
In step S4, the context feature [V_E, C_cap] obtained in step S3 is used as the input of the common-sense reasoning decoder; the probability distribution of the answer is obtained through the specially designed multi-head attention model, and the common-sense knowledge text sequence of the video is predicted from the answer probability distribution. The common-sense decoders are trained by optimizing the three maximum-likelihood objectives L_att, L_eff and L_int; the different kinds of common-sense information form a logical closed loop through implicit cross-semantic learning and share knowledge. The attribute objective L_att is:
L_att(Θ_att) = Σ_t log P(y_t | y_{<t}, V_E, C_cap; Θ_att)
where y_t is the current token to be predicted, V_E is the input video feature, C_cap is the text feature of the video summary, and Θ_att are the model parameters of the attribute decoder. Likewise, L_eff and L_int follow the same form. The model finally optimizes the hybrid reasoning network (HybridNet) through a multi-task learning loss function combining these objectives, computed with cross entropy, and predicts the text sequences in an autoregressive manner, with the specific formulas:
y_t^att = D_ATT(V_E, C_cap, C_att)
y_t^eff = D_EFF(V_E, C_cap, C_eff)
y_t^int = D_INT(V_E, C_cap, C_int)
where D_ATT, D_EFF and D_INT are the trained attribute decoder, effect decoder and intention decoder, V_E denotes the multi-modal features of the input video, C_cap denotes the features of the input video summary, and y_t^att, y_t^int and y_t^eff are the text tokens of the current attribute, intention and effect sequences respectively; the final common-sense text sequence prediction is generated sequentially by autoregression.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims (6)

1. A video common-sense knowledge reasoning implementation method based on multi-modal fusion, comprising the following steps:
1) extracting the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s from an input video respectively;
2) fusing the intra-frame spatial feature V_i, the inter-frame temporal feature V_t and the sound feature V_s to obtain the multi-modal video feature V_E of the input video;
3) extracting features from the descriptive text of the input video to obtain the language feature C_cap, and fusing the video feature V_E with the language feature C_cap to obtain the context feature [V_E, C_cap];
4) feeding the context feature [V_E, C_cap] into a common-sense reasoning decoder to obtain a probability distribution over answers, and then predicting the common-sense knowledge text sequence of the input video from the obtained probability distribution.
2. The method of claim 1, wherein the video feature V_E is obtained by: mapping the intra-frame spatial feature V_i into a feature space through a linear layer and a long short-term memory network and adding the position encoding PE_i and the segment encoding SE_i corresponding to the intra-frame spatial feature V_i to obtain the feature E_i; mapping the inter-frame temporal feature V_t into the feature space through a linear layer and a long short-term memory network and adding the position encoding PE_t and the segment encoding SE_t corresponding to the inter-frame temporal feature V_t to obtain the feature E_t; mapping the sound feature V_s into the feature space through a linear layer and a long short-term memory network and adding the position encoding PE_s and the segment encoding SE_s corresponding to the sound feature V_s to obtain the feature E_s; and then fusing E_i, E_t and E_s to obtain the video feature V_E = [E_i, E_t, E_s].
3. The method of claim 1, wherein the language feature C_cap is obtained by: passing the descriptive text of the input video through an embedding layer encoding and a position encoding to obtain the text summary encoding T_cap; then taking T_cap as the query vector Q of a summary decoder and the video feature V_E as the key K and value V vectors of the summary decoder, and computing a multi-head attention mechanism to obtain the language feature C_cap.
4. The method of claim 3, wherein the multi-head attention mechanism is computed by the formulas:
Z = softmax(Q K^T / √d_k) V,
y_t = FFN(Z),
L_cap(Θ_cap) = Σ_t log P(y_t | y_{<t}, v; Θ_cap),
wherein d_k is the dimension of the key K, FFN is a feed-forward network, L_cap is the optimization function of the summary decoder, y_t is the token to be predicted, v is the input video, and Θ_cap are the model parameters of the summary decoder.
5. The method of claim 4, wherein the probability distribution of the answers is obtained by: passing the token y_{t-1} historically predicted by the common-sense reasoning decoder through a memory module MMHA to obtain a conditional attention map A_condition; fusing the conditional attention map A_condition with the feature map A_o generated by the multi-head attention module from the historical tokens to obtain a guiding attention map A_guide; bridging the guiding attention map A_guide with the attention map A_previous of the previous layer through a residual connection to obtain the fused attention A_merge; masking the attention A_merge to cover the attention values after the current sequence position and feeding it into a normalization function to obtain the normalized attention; and then taking the dot product of the normalized attention with the video feature V_E and passing it through a linear layer to obtain the probability distribution of the answers.
6. The method of claim 1, wherein the common-sense knowledge text sequence of the input video is predicted using
y_t^att = D_ATT(V_E, C_cap, C_att),
y_t^eff = D_EFF(V_E, C_cap, C_eff),
y_t^int = D_INT(V_E, C_cap, C_int),
wherein D_ATT is the attribute decoder, D_EFF is the effect decoder, D_INT is the intention decoder, V_E denotes the multi-modal features of the input video, C_cap denotes the text tokens of the features of the video descriptive text, and the current knowledge tokens y_t^att, y_t^eff and y_t^int are generated sequentially, in an autoregressive manner, from the history tokens C_att, C_eff and C_int.
CN202110954600.1A 2021-08-19 2021-08-19 Video common-sense knowledge reasoning implementation method based on multi-mode fusion Pending CN113869324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110954600.1A CN113869324A (en) 2021-08-19 2021-08-19 Video common-sense knowledge reasoning implementation method based on multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110954600.1A CN113869324A (en) 2021-08-19 2021-08-19 Video common-sense knowledge reasoning implementation method based on multi-mode fusion

Publications (1)

Publication Number Publication Date
CN113869324A true CN113869324A (en) 2021-12-31

Family

ID=78990660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110954600.1A Pending CN113869324A (en) 2021-08-19 2021-08-19 Video common-sense knowledge reasoning implementation method based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN113869324A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339450A (en) * 2022-03-11 2022-04-12 中国科学技术大学 Video comment generation method, system, device and storage medium
CN116012374A (en) * 2023-03-15 2023-04-25 译企科技(成都)有限公司 Three-dimensional PET-CT head and neck tumor segmentation system and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034401A (en) * 2021-04-08 2021-06-25 中国科学技术大学 Video denoising method and device, storage medium and electronic equipment
CN113191230A (en) * 2021-04-20 2021-07-30 内蒙古工业大学 Gait recognition method based on gait space-time characteristic decomposition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034401A (en) * 2021-04-08 2021-06-25 中国科学技术大学 Video denoising method and device, storage medium and electronic equipment
CN113191230A (en) * 2021-04-20 2021-07-30 内蒙古工业大学 Gait recognition method based on gait space-time characteristic decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEIJIANG YU 等: "Hybrid Reasoning Network for Video-based Commonsense Captioning", 《ARXIV:2108.02365V1》, 5 August 2021 (2021-08-05), pages 1 - 3 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339450A (en) * 2022-03-11 2022-04-12 中国科学技术大学 Video comment generation method, system, device and storage medium
CN116012374A (en) * 2023-03-15 2023-04-25 译企科技(成都)有限公司 Three-dimensional PET-CT head and neck tumor segmentation system and method

Similar Documents

Publication Publication Date Title
Xu et al. Multimodal learning with transformers: A survey
Zhou et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt
WO2021233112A1 (en) Multimodal machine learning-based translation method, device, equipment, and storage medium
Wu et al. Video sentiment analysis with bimodal information-augmented multi-head attention
WO2021169745A1 (en) User intention recognition method and apparatus based on statement context relationship prediction
CN118349673A (en) Training method of text processing model, text processing method and device
WO2023160472A1 (en) Model training method and related device
Zhang et al. Explicit contextual semantics for text comprehension
CN113869324A (en) Video common-sense knowledge reasoning implementation method based on multi-mode fusion
Zhou et al. Learning with annotation of various degrees
CN116432019A (en) Data processing method and related equipment
Pang et al. A novel syntax-aware automatic graphics code generation with attention-based deep neural network
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
Pal et al. R-GRU: Regularized gated recurrent unit for handwritten mathematical expression recognition
CN114677631A (en) Cultural resource video Chinese description generation method based on multi-feature fusion and multi-stage training
CN112738647A (en) Video description method and system based on multi-level coder-decoder
Chen et al. Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion
Zhou et al. An image captioning model based on bidirectional depth residuals and its application
D’Ulizia Exploring multimodal input fusion strategies
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN116341564A (en) Problem reasoning method and device based on semantic understanding
Cao et al. Predict, pretrained, select and answer: Interpretable and scalable complex question answering over knowledge bases
Wang et al. A stack-propagation framework with slot filling for multi-domain dialogue state tracking
CN115470327A (en) Medical question-answering method based on knowledge graph and related equipment
CN114936564A (en) Multi-language semantic matching method and system based on alignment variational self-coding

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination