CN113806587A - Multi-mode feature fusion video description text generation method - Google Patents
- Publication number
- CN113806587A (publication); application CN202110975443.2A
- Authority
- CN
- China
- Prior art keywords
- video
- word
- feature
- expression
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G06F16/7844 — Retrieval of video data characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
- G06F16/7847 — Retrieval of video data characterised by metadata automatically derived from the content, using low-level visual features of the video content
- G06F16/785 — Retrieval of video data using low-level visual features of the video content, using colour or luminescence
- G06F40/242 — Handling natural language data; Lexical tools; Dictionaries
- G06F40/253 — Grammatical analysis; Style critique
- G06F40/30 — Semantic analysis
- G06N3/044 — Neural network architectures; Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
Abstract
The invention discloses a multi-modal feature fusion video description text generation method, which comprises the following steps: 1) establishing a data set, a verification set and a semantic dictionary; 2) constructing a multi-modal feature fusion network to obtain aggregation features; 3) obtaining the subject, predicate and object of the description sentence with an encoder that performs syntax-aware perception of video action; 4) generating the description text of the video with the motion guidance decoder; 5) training the video text generation network model; 6) generating the text description sentence of the video: after the network training of steps 1 to 5 is finished, all parameters of the video text generation network model are obtained; the video to be described is then taken as the input video and passed through steps 2 to 4 to obtain its text description. The method generates video description text with higher accuracy.
Description
Technical Field
The invention belongs to the technical field of video text description generation, and relates to a multi-modal feature fusion video description text generation method.
Background
The task of video text description is to automatically generate a complete and natural sentence to describe video content, and accurately understanding the content contained in the video is of great significance and wide application in practice. For example, in the case of massive video data, video text description can be used for fast and efficient video retrieval, and the generated video text description can also be used for intelligent auditing of videos.
In the video description text generation process, if the semantic information contained in the multi-modal features of the video is not learned well, the generated description may be semantically inconsistent with the video content. At present, 2D and 3D convolutional neural networks have greatly improved the learning of representations from visual, audio and motion information, but how to aggregate the extracted multi-modal features of the video remains an open problem whose solution can improve the accuracy of the text description.
Disclosure of Invention
The invention aims to provide a multi-modal feature fusion video description text generation method, which solves the prior-art problem that, in video description text generation, the generated description is semantically inconsistent with the video content.
The technical scheme adopted by the invention is that the multi-modal feature fusion video description text generation method is implemented according to the following steps:
step 1, establishing a data set, a verification set and a semantic dictionary;
step 2, constructing a multi-mode feature fusion network to obtain an aggregation feature;
step 3, obtaining a subject, a predicate and an object of a description statement by using an encoder for sensing video action by grammar;
step 4, generating a description text of the video by using the motion guidance decoder;
step 5, training a video text to generate a network model;
step 6, generating a text description sentence of the video,
after completing the network training through the steps 1 to 5, obtaining all parameters of a video text generation network model; and taking the video to be described as an input video, and after the steps 2 to 4 are carried out, obtaining the text description of the video to be described.
The method has the advantages that, in the video description text generation network model, the RGB features obtained by the 2D convolutional neural network and the temporal features obtained by the 3D convolutional neural network are combined through the multi-modal feature fusion model into aggregation features that better meet the requirements of text description; the aggregation features are then combined with the predicate coding information generated by the syntax-aware predictive action module and sent into the decoding network model to produce the text description of the input video. Compared with the algorithm indexes of currently retrieved mainstream papers, the video description text generated by the method has higher accuracy.
Drawings
FIG. 1 is a flow chart of a multimodal feature fusion model of the method of the present invention;
FIG. 2 is a flow chart of a self-attention mechanism in the method of the present invention;
FIG. 3 is a flow chart of an encoder based on a syntax aware predictive action module in the method of the present invention;
FIG. 4 is a flow chart of the encoding layer in the syntax aware predictive action module of the method of the present invention;
FIG. 5 is a flow chart of the decoder model employed by the method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The method for generating the multi-modal feature fusion video description text is specifically implemented according to the following steps:
Step 1, establishing a data set, a verification set and a semantic dictionary,
1.1) constructing a data set and a validation set,
because video text description places demands on the generalization capability of the network over the data set, publicly available data sets are generally recommended, except in special cases that require self-made, self-labeled data sets;
the method preferably uses the video description text generation data sets with relatively high usage at present, MSR-VTT and MSVD; the MSVD data set contains 1967 public YouTube short videos, each mainly showing one activity and lasting 10 to 25 s; the videos cover different people, animals, actions, scenes and the like, the video contents are labeled by different people, and each video corresponds to 40 text description sentences on average; the MSR-VTT data set contains 10000 videos, and each video corresponds to 20 text description sentences on average;
a part of the MSR-VTT and MSVD data sets (preferably 80% of the total data in this step) is randomly selected as the data samples of the training set, and the remaining 20% of the samples are used as the verification set samples;
1.2) establishing a semantic dictionary,
from the sample labels of the training set and the verification set, all words are sorted from high to low by number of occurrences, and the first m words are selected to form the semantic concept set, where m is an empirical value, preferably 80% to 85% of the total number of words;
each word is assigned an integer serial number from 0 to m−1, and four additional marks are then added, namely the start mark <bos>, the end mark <eos>, the blank mark <pad> and the replacement mark <unk>, giving m+4 integer serial numbers in total and forming the semantic dictionary vocab = {0, 1, 2, ..., m+3}. The labeled sentences undergo minimal preprocessing: punctuation marks are deleted; <bos> and <eos> are added at the beginning and end of each sentence respectively; words not contained in the semantic dictionary are replaced with <unk>; and the sentence length is fixed to L, where L is an empirical value, preferably selected from [20, 40] according to the statistics of the sentence lengths of the video text descriptions. If a sentence is too long, the excess words are deleted; if it is too short, it is padded to the fixed length with <pad>;
assuming that the total number of training samples is N and i denotes the i-th video sample, i = 1, 2, ..., N, the data set samples are labeled with the established semantic dictionary, with the expression:
Yi = [yi,1, yi,2, ..., yi,L], i = 1, 2, ..., N (1)
wherein yi,t is the serial number (an integer) of the t-th word of the i-th video in the text semantic dictionary, t = 1, 2, ..., L;
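As a concrete illustration of steps 1.1) and 1.2), the dictionary construction and sentence labeling can be sketched as follows (a minimal sketch; the function names, the tie-breaking order for equally frequent words, and the exact serial-number convention are assumptions, not the patent's code):

```python
import re

def build_vocab(sentences, m):
    """Count word frequencies, keep the m most frequent words, then add
    the four special marks named in the text: <bos>, <eos>, <pad>, <unk>."""
    counts = {}
    for s in sentences:
        for w in re.sub(r"[^\w\s]", "", s.lower()).split():
            counts[w] = counts.get(w, 0) + 1
    # Most frequent first; ties broken alphabetically (an assumption).
    words = sorted(counts, key=lambda w: (-counts[w], w))[:m]
    vocab = {w: i for i, w in enumerate(words)}
    for mark in ("<bos>", "<eos>", "<pad>", "<unk>"):
        vocab[mark] = len(vocab)
    return vocab          # m + 4 entries in total

def encode_sentence(sentence, vocab, L):
    """Minimal preprocessing: strip punctuation, wrap with <bos>/<eos>,
    map out-of-dictionary words to <unk>, and pad/truncate to length L."""
    toks = re.sub(r"[^\w\s]", "", sentence.lower()).split()
    ids = [vocab["<bos>"]] + [vocab.get(w, vocab["<unk>"]) for w in toks] + [vocab["<eos>"]]
    ids = ids[:L]
    ids += [vocab["<pad>"]] * (L - len(ids))
    return ids
```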
Step 2, constructing a multi-mode feature fusion network to obtain the aggregation features,
2.1) extracting multi-modal features of the video,
firstly, the input video (during model training, the input video is a training sample video given in step 1; after training is finished, it is the video whose text description is required) is sampled at equal intervals to obtain a preprocessed video of length T frames, where T is an empirical value determined by the content of the video to be described, preferably T ∈ [16, 64]; then, the two-dimensional RGB features and target region features of the preprocessed video are extracted: the M1-dimensional (M1 = 1536) feature vector output by the last average pooling layer of the two-dimensional convolutional network Inception-ResNetV2 (IRV2, prior art) describes the two-dimensional RGB feature Vr of the video, and the M2-dimensional (M2 = 1024) feature vectors output by the RoI Pooling layer of the Faster R-CNN network (prior art) serve as the multiple target region features Vb of the video; then, every 16-frame video sequence is taken as a slice, with adjacent slices overlapping by 8 frames, and the M3-dimensional (M3 = 2048) feature vector output by the fully-connected FC6 layer of the three-dimensional convolutional network C3D (prior art) describes the three-dimensional temporal feature Vm of the video; together these constitute the visual multi-modal features of each video, with the expression:
V={Vr,Vm,Vb} (2)
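The equal-interval sampling and the 16-frame/8-frame-overlap slicing of step 2.1) can be sketched as follows (a minimal sketch under assumed parameter values; the feature extractors IRV2, Faster R-CNN and C3D themselves are not reproduced here):

```python
import numpy as np

def sample_frame_indices(num_frames, T=32):
    """Equal-interval sampling: pick T frame indices spread over the clip
    (T is an empirical value, chosen in [16, 64] per the text)."""
    return np.linspace(0, num_frames - 1, T).round().astype(int)

def c3d_slices(T=32, clip_len=16, stride=8):
    """16-frame slices with an 8-frame overlap, as used for the C3D branch;
    returns (start, end) index pairs over the T sampled frames."""
    return [(s, s + clip_len) for s in range(0, T - clip_len + 1, stride)]
```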
2.2) acquiring the aggregation characteristics by utilizing the multi-modal characteristic fusion model,
2.2.1) calculating the scene representation feature Vrm' with the self-attention mechanism module,
as shown in FIG. 1, using the framework of the multi-modal feature fusion model, the two-dimensional RGB feature Vr obtained in step 2.1) and the three-dimensional temporal feature Vm are spliced into a global feature Vrm, which is passed through three different linear transformations to obtain the query vector Qrm, key vector Krm and value vector Vrm of the same dimension, with the expression:
{Qrm, Krm, Vrm} = {Vrm·W^Q_rm, Vrm·W^K_rm, Vrm·W^V_rm} (3)
wherein W^Q_rm, W^K_rm and W^V_rm are parameters learned in the training process;
2.2.2) as shown in FIG. 2, the Self-Attention module first computes the dot product between the query vector Qrm and the key vector Krm; to prevent the result from being too large, it is divided by a scale √dQ, where dQ is the dimension of the matrix Qrm; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vrm to obtain the weighted scene representation feature Vrm', with the expression:

Vrm' = softmax(Qrm·Krm^T / √dQ)·Vrm (4)
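The scaled dot-product self-attention described above (and reused in step 3.1.3) can be sketched in NumPy; the function names are illustrative, as the patent provides no code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_Q)) V — scaled dot-product attention."""
    dq = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(dq)) @ V
```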
2.2.3) constructing the MA module to calculate the motion feature Vm',
the three-dimensional temporal feature Vm obtained in step 2.1) and the code Epos, obtained by passing the verbs labeled with part of speech (pos) in the semantic dictionary through an Embedding layer, are combined by the MA module, which generates, through two linear transformation layers and a ReLU layer, a motion feature Vm' that can focus on the interaction relationships between objects, with the expression:

Vm' = W^a_m·ReLU(Wm·[Vm; Epos] + bm) (5)

wherein Wm, W^a_m, bm are the weights of the linear transformations learned in the training process, and ReLU is the activation function;
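A minimal sketch of the MA module as described: two linear layers with a ReLU between them, acting on the concatenation of Vm and Epos. All shapes and the weight layout are illustrative assumptions:

```python
import numpy as np

def ma_module(Vm, Epos, Wm, bm, Wa):
    """Two linear transformation layers with a ReLU layer:
    Vm' = Wa · ReLU(Wm · [Vm; Epos] + bm) (weight names follow the text)."""
    x = np.concatenate([Vm, Epos], axis=-1)  # combine temporal and verb codes
    h = np.maximum(0.0, x @ Wm + bm)         # first linear layer + ReLU
    return h @ Wa                            # second linear layer
```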
2.2.4) constructing the dynamic attention module to solve for the aggregation feature V',
the scene representation feature Vrm' and the motion feature Vm' obtained in steps 2.2.2) and 2.2.3), together with the two-dimensional RGB feature Vr, undergo multi-modal feature fusion through the dynamic attention module to obtain the final aggregation feature V'; first, the dot product of Vrm' and Vm' is taken and divided by a scale √d, where d is the dimension of the matrix Vrm'; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vr to obtain the aggregation feature V', with the expression:

V' = softmax(Vrm'·Vm'^T / √d)·Vr (6)
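The dynamic attention fusion of step 2.2.4) can be sketched the same way; it is assumed for illustration that Vrm', Vm' and Vr have compatible dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_attention(Vrm_p, Vm_p, Vr):
    """Fuse scene representation, motion and RGB features: attention weights
    computed from Vrm' against Vm', then applied to Vr."""
    d = Vrm_p.shape[-1]
    w = softmax(Vrm_p @ Vm_p.T / np.sqrt(d))
    return w @ Vr
```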
step 3, obtaining the subject, predicate and object of the description sentence by using the encoder for syntax-aware perception of video action,
the encoder for syntax-aware perception of video action adopts the basic structure of the SAAT network model (the SAAT network model is prior art); it belongs to the syntax-aware predictive action module and is divided into a component extractor-encoder (Cxe) and a component extractor-decoder (Cxd);
as shown in FIG. 3, the component extractor-encoder (Cxe) and the component extractor-decoder (Cxd) are identical in structure, each consisting of an Embedding layer and a stack of three identical Encoding layers based on the self-attention mechanism;
as shown in FIG. 4, each Encoding layer consists of a self-attention mechanism (Self-Attention), layer normalization (LayerNorm) and a nonlinear feed-forward network (FFN); the specific process is as follows:
3.1) obtaining the scene semantic feature Vbs with the component extractor-encoder (Cxe),
the target region features Vb of the video obtained in step 2.1) and the target position code Rl are used as the input sequence of Cxe and encoded into the scene semantic feature Vbs, where K is the number of target regions;
3.1.1) the target region feature Vb of the video and the target position code Rl are spliced in a cascading manner to obtain the cascade feature Rc, with the expression:

Rc = [Vb; Rl] (7)

wherein Rl^k = [xk/wf, yk/hf, wk/wf, hk/hf], k = 1, 2, ..., K, encodes the target center coordinates and the width and height information of the target, and wf, hf are the width and height of the video frame respectively;
3.1.2) the cascade feature Rc obtained in step 3.1.1) is passed through the Embedding layer and three different linear transformations to obtain the mapping matrices Qc, Kc, Vc of the same dimension, with the expression:
{Qc, Kc, Vc} = {Rc·WQ, Rc·WK, Rc·WV} (8)
wherein WQ, WK, WV are all parameters learned in the training process;
3.1.3) the self-attention-based Encoding layer computes the dot product between Qc and Kc; to prevent the result from being too large, it is divided by a scale √dQ, where dQ is the dimension of the matrix Qc; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vc to obtain the weighted scene semantic feature Vbs, with the expression:

Vbs = softmax(Qc·Kc^T / √dQ)·Vc (9)
3.2) getting the video action, i.e. the predicate of the textual description sentence, with the component extractor-decoder (Cxd),
using the multi-modal features of the video and the scene semantic feature Vbs as the input of the component extractor-decoder (Cxd), the predicate, i.e. the action in the video, is decoded; the specific process is:
3.2.1) with the global RGB feature Vr (projected to Vrs) as the Query for predicting the subject and the scene semantic feature Vbs as the Key and Value, the self-attention mechanism (formula (10)) yields the feature code Es of the subject; a softmax layer (formula (11)) then gives the word probability matrix pθ(word|Vbs, Vrs) of the subject; finally the argmax function (formula (12)) selects the word with the maximum probability, which is the corresponding subject s; the expressions are:
Es = fatt(Vrs, Vbs, Vbs) (10)
pθ(word|Vbs, Vrs) = softmax(Ws^T·Es) (11)
s = argmax(pθ(word|Vbs, Vrs)) (12)
wherein Vrs = Wv·Vr, and Wv, Ws are parameters learned in the training process;
3.2.2) the feature code Es of the subject obtained in step 3.2.1) is used as the Query for predicting the action (i.e. the predicate) of the video, and the temporal feature Vm of the video (projected to Vma) as the Key and Value; the feature code Ea of the predicate is calculated with formula (13), the word probability matrix pθ(word|s, Vma) of the predicate with formula (14), and the corresponding predicate a with formula (15); the expressions are:
Ea = fatt(Es, Vma, Vma) (13)
pθ(word|s, Vma) = softmax(Wa^T·Ea) (14)
a = argmax(pθ(word|s, Vma)) (15)
wherein Vma = Wm·Vm, and Wm, Wa are parameters learned in the training process;
3.2.3) the feature code Ea of the predicate is used as the Query for predicting the object, and the scene semantic feature Vbs of the video as the Key and Value; the feature code Eo of the object is calculated with formula (16), the word probability matrix pθ(word|a, Vbs) of the object with formula (17), and the corresponding object o with formula (18); the expressions are:
Eo = fatt(Ea, Vbs, Vbs) (16)
pθ(word|a, Vbs) = softmax(Wo^T·Eo) (17)
o = argmax(pθ(word|a, Vbs)) (18)
wherein Wo is a parameter learned in the training process;
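Steps 3.2.1) to 3.2.3) chain three attention calls, each component's feature code serving as the Query for the next. A sketch with random illustrative weights (fatt follows the shape of formulas (10), (13) and (16); the projected features Vrs and Vma are assumed to be precomputed):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def f_att(q, K, V):
    """Single-query scaled dot-product attention, standing in for fatt."""
    return softmax(q @ K.T / np.sqrt(q.shape[-1])) @ V

def decode_svo(Vrs, Vbs, Vma, Ws, Wa, Wo):
    """Chained subject -> predicate -> object prediction (a sketch of Cxd)."""
    Es = f_att(Vrs, Vbs, Vbs)                 # subject feature code
    s = int(np.argmax(softmax(Ws.T @ Es)))    # subject word index
    Ea = f_att(Es, Vma, Vma)                  # predicate feature code
    a = int(np.argmax(softmax(Wa.T @ Ea)))    # predicate word index
    Eo = f_att(Ea, Vbs, Vbs)                  # object feature code
    o = int(np.argmax(softmax(Wo.T @ Eo)))    # object word index
    return s, a, o
```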
step 4, generating a description text of the video by using the motion guidance decoder,
as shown in fig. 5, using the decoder model, the specific process is as follows:
4.1) calculating the attention weight βt of the predicate feature code Ea through the Attention module,
the feature code Ea of the predicate calculated with formula (13) in step 3.2.2) is spliced with the output h(t−1) of the LSTMs at time t−1 (h0 represents the start mark <bos>) to obtain the embedded vector Eword,t; the attention distribution βt output by the Attention module at time t has the expression:

Eword,t = [Ea; h(t−1)] (19)
βt = softmax(Wβ·tanh(Wh·Eword,t + bβ)) (20)

wherein Wβ, Wh, bβ are parameters learned in the training process;
4.2) generating the word prediction ht with the LSTMs (a public technology), with the expression:

ht = LSTM(WLv·[βt⊙V'; Eword,t], h(t−1)) (23)

wherein WLv is a parameter learned in the training process, and ⊙ denotes weighting the aggregation feature V' by the attention distribution βt;
4.3) calculating the predicted word probability,
the prediction result is passed through a softmax function to obtain the word prediction probability pθ(wordt) at time t; the word with the maximum probability is the predicted word at the current time, with the expression:
pθ(wordt)=softmax(Ww·ht) (24)
wordt=argmax(pθ(wordt)) (25)
wherein, WwIs a parameter learned by the training process;
4.4) let t = t + 1 and cyclically execute steps 4.1) to 4.3) until the predicted wordt is the end mark <eos>;
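The decoding loop of steps 4.1) to 4.4) reduces to greedy argmax decoding with an <eos> stop condition. A sketch in which a stand-in step function replaces the Attention+LSTMs computation (names are illustrative):

```python
import numpy as np

def greedy_decode(step_fn, h0, bos_id, eos_id, max_len=20):
    """Greedy decoding loop of step 4: at each time step the decoder produces
    a word distribution; argmax picks word_t (formula (25)); stop at <eos>.
    `step_fn(word, h) -> (probs, h)` stands in for the attention+LSTM step."""
    words, word, h = [], bos_id, h0
    for _ in range(max_len):
        probs, h = step_fn(word, h)
        word = int(np.argmax(probs))   # word with the maximum probability
        if word == eos_id:
            break
        words.append(word)
    return words
```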
step 5, training the video text to generate a network model,
all the training samples given in step 1 are input into the video text generation network model, steps 2 to 4 are repeated, and training uses the standard cross-entropy loss; the total loss function of the video text generation network model combines the loss Ls of the SAAT module with the loss Lc of the video description text generator, with the expression:
L(θ)=Lc+λ·Ls (26)
wherein λ is a trade-off weight; wordt^(i) is the t-th word of the i-th training sample and L is the fixed sentence length given in step 1; (s, a, o)^(i) are the subject, predicate and object output by the network for the i-th training sample, and the corresponding label terms are the subject, predicate and object of the i-th training sample's annotation; Vb, Vr, Vm are respectively the target region features, two-dimensional RGB features and three-dimensional temporal features of the video extracted in step 2.1);
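A sketch of the loss computation: the standard cross-entropy of the description generator, and the weighted total of formula (26). The internals of the per-term SAAT loss are omitted; λ is passed in as `lam`:

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the target words; `probs` holds one
    predicted word distribution per time step, `targets` the label indices."""
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked + 1e-12)))  # epsilon guards log(0)

def total_loss(Lc, Ls, lam=1.0):
    """L(theta) = Lc + lambda * Ls, formula (26)."""
    return Lc + lam * Ls
```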
step 6, generating a text description sentence of the video,
after completing the network training through the steps 1 to 5, all parameters of the video text generation network model can be obtained; and taking the video to be described as an input video, and performing the steps 2 to 4 to obtain the text description of the video.
The constructed video text generation network model can generate video text description according to the video multi-modal features and the video motion guidance.
Claims (6)
1. A multi-modal feature-fused video description text generation method is implemented according to the following steps:
step 1, establishing a data set, a verification set and a semantic dictionary;
step 2, constructing a multi-mode feature fusion network to obtain an aggregation feature;
step 3, obtaining a subject, a predicate and an object of a description statement by using an encoder for sensing video action by grammar;
step 4, generating a description text of the video by using the motion guidance decoder;
step 5, training a video text to generate a network model;
step 6, generating a text description sentence of the video,
after completing the network training through the steps 1 to 5, obtaining all parameters of a video text generation network model; and taking the video to be described as an input video, and after the steps 2 to 4 are carried out, obtaining the text description of the video to be described.
2. The method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 1 is:
1.1) constructing a data set and a validation set,
except in special cases that require self-made, self-labeled data sets, publicly available data sets are generally recommended;
selecting a video description text to generate data sets MSR-VTT and MSVD, selecting 80% of all data as data samples of a training set, and using the rest 20% of the samples as verification set samples;
1.2) establishing a semantic dictionary,
sequencing all words from high to low according to the occurrence times from sample labels of the training set and the verification set, and selecting the first m words to form a semantic concept set;
assigning each word an integer serial number from 0 to m−1 and then adding four additional marks, namely the start mark <bos>, the end mark <eos>, the blank mark <pad> and the replacement mark <unk>, giving m+4 integer serial numbers in total and forming the semantic dictionary vocab = {0, 1, 2, ..., m+3}; performing minimal preprocessing on the labeled sentences, namely deleting punctuation marks, adding <bos> and <eos> at the beginning and end of each sentence respectively, replacing words not contained in the semantic dictionary with <unk>, and fixing the sentence length to L; if a sentence is too long the excess words are deleted, and if it is too short it is padded to the fixed length with <pad>;
assuming that the total number of training samples is N and i denotes the i-th video sample, i = 1, 2, ..., N, the data set samples are labeled with the established semantic dictionary, with the expression:
Yi = [yi,1, yi,2, ..., yi,L], i = 1, 2, ..., N (1)
wherein yi,t is the serial number (an integer) of the t-th word of the i-th video in the text semantic dictionary, t = 1, 2, ..., L.
3. The method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 2 is:
2.1) extracting multi-modal features of the video,
the multi-modal features of the video are extracted with three network structures jointly: firstly, the input video is sampled at equal intervals to obtain a preprocessed video of length T frames; then, the two-dimensional RGB features and target region features of the preprocessed video are extracted: the M1-dimensional feature vector output by the last average pooling layer of the two-dimensional convolutional network Inception-ResNetV2 describes the two-dimensional RGB feature Vr of the video, and the M2-dimensional feature vectors output by the RoI Pooling layer of the Faster R-CNN network serve as the multiple target region features Vb of the video; then, every 16-frame video sequence is taken as a segment, with adjacent segments overlapping by 8 frames, and the M3-dimensional feature vector output by the fully-connected FC6 layer of the three-dimensional convolutional network C3D describes the three-dimensional temporal feature Vm of the video; together these constitute the visual multi-modal features of each video, with the expression:
V={Vr,Vm,Vb} (2)
2.2) acquiring the aggregation characteristics by utilizing the multi-modal characteristic fusion model,
2.2.1) calculating the scene representation feature Vrm' with the self-attention mechanism module,
using the multi-modal feature fusion model, the two-dimensional RGB feature Vr obtained in step 2.1) and the three-dimensional temporal feature Vm are spliced into a global feature Vrm, which is passed through three different linear transformations to obtain the query vector Qrm, key vector Krm and value vector Vrm of the same dimension, with the expression:
{Qrm, Krm, Vrm} = {Vrm·W^Q_rm, Vrm·W^K_rm, Vrm·W^V_rm} (3)
wherein W^Q_rm, W^K_rm and W^V_rm are parameters learned in the training process;
2.2.2) the Self-Attention module first computes the dot product between the query vector Qrm and the key vector Krm; to prevent the result from being too large, it is divided by a scale √dQ, where dQ is the dimension of the matrix Qrm; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vrm to obtain the weighted scene representation feature Vrm', with the expression:

Vrm' = softmax(Qrm·Krm^T / √dQ)·Vrm (4)
2.2.3) constructing the MA module to calculate the motion feature Vm',
the three-dimensional temporal feature Vm obtained in step 2.1) and the code Epos, obtained by labeling the parts of speech in the semantic dictionary and passing the labeled verbs through an Embedding layer, are combined by the MA module, which generates, through two linear transformation layers and a ReLU layer, a motion feature Vm' that can focus on the interaction relationships between objects, with the expression:

Vm' = W^a_m·ReLU(Wm·[Vm; Epos] + bm) (5)

wherein Wm, W^a_m, bm are the weights of the linear transformations learned in the training process, and ReLU is the activation function;
2.2.4) constructing the dynamic attention module to solve for the aggregation feature V',
the scene representation feature Vrm' and the motion feature Vm' obtained in steps 2.2.2) and 2.2.3), together with the two-dimensional RGB feature Vr, undergo multi-modal feature fusion through the dynamic attention module to obtain the final aggregation feature V'; first, the dot product of Vrm' and Vm' is taken and divided by a scale √d, where d is the dimension of the matrix Vrm'; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vr to obtain the aggregation feature V', with the expression:

V' = softmax(Vrm'·Vm'^T / √d)·Vr (6)
4. the method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 3 is:
the encoder used for syntax-aware video action prediction adopts the basic structure of the SAAT network model and is divided into a component extractor-encoder and a component extractor-decoder;

the component extractor-encoder and the component extractor-decoder have the same structure, each formed by stacking an Embedding layer and three structurally identical Encoding layers based on the self-attention mechanism;

each Encoding layer consists of an attention mechanism, layer normalization and a nonlinear feed-forward network; the specific process is as follows:
3.1) obtaining the scene semantic feature V_bs through the component extractor-encoder:

The target region features V_b of the video obtained in step 2.1) and the target position code R_l serve as the input sequence of the component extractor-encoder and are encoded into the scene semantic feature V_bs, wherein K is the number of target regions;
3.1.1) the target region feature V_b of the video and the target position code R_l are spliced in a cascading manner to obtain the cascade feature R_c, the expression being:

R_c = [V_b; R_l], R_l = (x/w_f, y/h_f, w/w_f, h/h_f) (7)

wherein (x, y, w, h) are the target center coordinates and the width and height information of the target, and w_f, h_f are respectively the width and height of the video frame;
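A sketch of building the cascade feature: each box is normalized by the frame size and the position code is concatenated to the region feature. The exact layout of R_l is an assumption for illustration.

```python
import numpy as np

def cascade_feature(V_b, boxes, w_f, h_f):
    """Concatenate region features with frame-normalized position codes R_l."""
    x, y, w, h = boxes.T                       # center coords and box size per region
    R_l = np.stack([x / w_f, y / h_f, w / w_f, h / h_f], axis=1)
    return np.concatenate([V_b, R_l], axis=1)  # R_c = [V_b; R_l]

V_b = np.ones((3, 10))                         # 3 regions, 10-dim features (toy)
boxes = np.array([[320., 180., 64., 48.],
                  [100., 100., 50., 50.],
                  [600., 300., 30., 90.]])
R_c = cascade_feature(V_b, boxes, w_f=640.0, h_f=360.0)
```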
3.1.2) the cascade feature R_c obtained in step 3.1.1) is passed through the Embedding layer, and three different linear transformations are used to obtain the mapping matrices Q_c, K_c, V_c of the same dimensionality, the expression being:

{Q_c, K_c, V_c} = {R_c·W_Q, R_c·W_K, R_c·W_V} (8)

wherein W_Q, W_K, W_V are all parameters learned in the training process;
3.1.3) the Encoding layer based on the self-attention mechanism computes the dot product between Q_c and K_c; to prevent the result from being too large, it is divided by a scale √d_Q, wherein d_Q is the dimension of the matrix Q_c; the result is normalized into a probability distribution with a softmax operation and multiplied by the matrix V_c to obtain the weighted scene semantic feature V_bs, the expression being:

V_bs = softmax(Q_c·K_c^T / √d_Q)·V_c (9)
3.2) obtaining the video action, i.e. the predicate of the textual description statement:

Using the multi-modal features of the video and the scene semantic feature V_bs as the input of the component extractor-decoder, the predicate, i.e. the action in the video, is obtained by decoding; the specific process is as follows:
3.2.1) the global RGB feature V_r is set as the Query for predicting the subject and the scene semantic feature V_bs as the Key-Value; the feature code E_s of the subject is obtained with the self-attention mechanism, the word probability matrix p_θ(word|V_bs, V_rs) of the subject is obtained through a softmax layer, and the word with the maximum probability, i.e. the corresponding subject s, is then obtained with the argmax function; the expressions are:

E_s = f_att(V_rs, V_bs, V_bs) (10)
p_θ(word|V_bs, V_rs) = softmax(W_s^T·E_s) (11)
s = arg max(p_θ(word|V_bs, V_rs)) (12)

wherein W_s is a parameter learned by the training process;
3.2.2) the feature code E_s of the subject obtained in step 3.2.1) is set as the Query for predicting the action of the video and the temporal feature V_m of the video as the Key-Value; the feature code E_a of the predicate is calculated using formula (13), the word probability matrix p_θ(word|s, V_m) of the predicate using formula (14), and the corresponding predicate a using formula (15); the expressions are:

E_a = f_att(E_s, V_m, V_m) (13)
p_θ(word|s, V_m) = softmax(W_a^T·E_a) (14)
a = arg max(p_θ(word|s, V_m)) (15)

wherein W_a is a parameter learned by the training process;
3.2.3) the feature code E_a of the predicate is set as the Query for predicting the object and the scene semantic feature V_bs of the video as the Key-Value; the feature code E_o of the object is calculated using formula (16), the word probability matrix p_θ(word|a, V_bs) of the object using formula (17), and the corresponding object o using formula (18); the expressions are:

E_o = f_att(E_a, V_bs, V_bs) (16)
p_θ(word|a, V_bs) = softmax(W_o^T·E_o) (17)
o = arg max(p_θ(word|a, V_bs)) (18)

wherein W_o is a parameter learned by the training process.
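The chained decoding of subject → predicate → object in steps 3.2.1)-3.2.3) can be sketched as below. Here f_att is a single-query attention stand-in, and the vocabularies, projection matrices and feature sizes are all toy assumptions, not the patent's values.

```python
import numpy as np

def f_att(query, key, value):
    """Single-query attention stand-in: softmax(q·K^T / sqrt(d))·V."""
    d = query.shape[-1]
    scores = (key @ query) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ value

def predict_word(E, W):
    """softmax(W^T·E) followed by argmax over the toy vocabulary."""
    logits = W.T @ E
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(np.argmax(p))

rng = np.random.default_rng(3)
d, K, vocab = 16, 5, 20                        # toy sizes (assumed)
V_rs = rng.normal(size=d)                      # global RGB query
V_bs = rng.normal(size=(K, d))                 # scene semantic features
V_m  = rng.normal(size=(K, d))                 # temporal features
W_s, W_a, W_o = (rng.normal(size=(d, vocab)) for _ in range(3))

E_s = f_att(V_rs, V_bs, V_bs); s = predict_word(E_s, W_s)  # subject, eqs (10)-(12)
E_a = f_att(E_s, V_m, V_m);    a = predict_word(E_a, W_a)  # predicate, eqs (13)-(15)
E_o = f_att(E_a, V_bs, V_bs);  o = predict_word(E_o, W_o)  # object, eqs (16)-(18)
```

Note how each stage's feature code becomes the next stage's query, which is what ties the predicted triplet together.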
5. The method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 4 is:
4.1) calculating the attention weight β_t of the predicate feature code E_a through the Attention module:

The feature code E_a of the predicate calculated by formula (13) in step 3.2.2) is spliced with the output h_{t-1} of the LSTMs at time t-1, wherein h_0 represents the start symbol <bos>, to obtain the embedded vector x_t; the attention distribution β_t output by Attention at time t is expressed as:

x_t = [E_a; h_{t-1}] (19)
e_t = tanh(W_β·x_t + b_β) (20)
β_t = softmax(e_t) (21)

wherein W_β, b_β are parameters learned by the training process;
4.2) generating the word prediction result h_t through the LSTMs, the expression being:

c_t = β_t·E_a (22)
h_t = LSTM(W_Lv·c_t, h_{t-1}) (23)

wherein W_Lv is a parameter learned by the training process;
4.3) calculating the predicted word probability:

The prediction result is passed through a softmax function to obtain the word prediction probability p_θ(word_t) at time t; the word with the maximum probability is the predicted word at the current time; the expressions are:
p_θ(word_t) = softmax(W_w·h_t) (24)
word_t = arg max(p_θ(word_t)) (25)

wherein W_w is a parameter learned by the training process;
4.4) let t = t + 1 and repeat steps 4.1) to 4.3) until the predicted word_t is the end marker <eos>.
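The decoding loop of steps 4.1)-4.4) — attend over E_a, update the recurrent state, emit the argmax word, stop at <eos> — can be sketched with a simple tanh recurrent cell standing in for the LSTMs; every matrix, size and the <eos> index here is a toy assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
d, vocab, EOS, MAX_T = 16, 12, 0, 30           # toy sizes; word id 0 plays <eos>
E_a    = rng.normal(size=(5, d))               # predicate feature codes to attend over
W_beta = rng.normal(size=(2 * d, 5)) * 0.1     # attention parameters (assumed shapes)
b_beta = np.zeros(5)
W_rec  = rng.normal(size=(2 * d, d)) * 0.1     # recurrent cell standing in for the LSTMs
W_w    = rng.normal(size=(vocab, d))

h = np.zeros(d)                                # h_0 represents <bos>
words = []
for t in range(MAX_T):
    x = np.concatenate([E_a.mean(axis=0), h])          # embedded vector (simplified)
    e = np.tanh(x @ W_beta + b_beta)
    beta = np.exp(e - e.max()); beta /= beta.sum()     # attention distribution beta_t
    c = beta @ E_a                                     # attention-weighted context c_t
    h = np.tanh(np.concatenate([c, h]) @ W_rec)        # state update (LSTM stand-in)
    word = int(np.argmax(W_w @ h))                     # softmax argmax == logits argmax
    words.append(word)
    if word == EOS:                                    # step 4.4): stop at <eos>
        break
```

A MAX_T cap is used so the sketch terminates even if the random stand-in never emits <eos>.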
6. The method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 5 is:
all the training samples given in step 1 are input into the video text generation network model, steps 2 to 4 are repeated, and training is performed by minimizing the standard cross-entropy loss; the total loss function of the video text generation network model is the weighted sum of the loss L_c of the video description text generator and the loss L_s of the SAAT module, the expression being:
L(θ) = L_c + λ·L_s (26)
wherein word_t^(i) is the t-th label word of the i-th training sample given in step 1, (ŝ, â, ô)^(i) are the subject, predicate and object output for the i-th training sample, (s, a, o)^(i) are the subject, predicate and object of the i-th training sample's label, and V_b, V_r, V_m are respectively the target region feature, the two-dimensional RGB feature and the three-dimensional temporal feature of the video extracted in step 2.1).
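The total loss of formula (26) — caption loss plus a weighted SAAT-module loss — can be sketched as below. The per-term cross-entropy form is a standard assumption; the patent does not spell out the individual terms in this excerpt.

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target word under a predicted distribution."""
    return -float(np.log(probs[target_idx] + 1e-12))  # epsilon guards log(0)

def total_loss(caption_probs, caption_targets, saat_probs, saat_targets, lam=1.0):
    """L(theta) = L_c + lambda * L_s, as in formula (26)."""
    L_c = sum(cross_entropy(p, t) for p, t in zip(caption_probs, caption_targets))
    L_s = sum(cross_entropy(p, t) for p, t in zip(saat_probs, saat_targets))
    return L_c + lam * L_s

# Toy distributions over a 4-word vocabulary.
cap_p  = [np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.1, 0.8, 0.05, 0.05])]
saat_p = [np.array([0.25, 0.25, 0.25, 0.25])]
loss = total_loss(cap_p, [0, 1], saat_p, [2], lam=0.5)
# L_c = -(ln 0.7 + ln 0.8), L_s = -ln 0.25, loss = L_c + 0.5 * L_s
```

λ trades off how strongly the subject-predicate-object supervision of the SAAT module shapes training relative to the caption loss.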
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110975443.2A CN113806587A (en) | 2021-08-24 | 2021-08-24 | Multi-mode feature fusion video description text generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113806587A true CN113806587A (en) | 2021-12-17 |
Family
ID=78941767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110975443.2A Withdrawn CN113806587A (en) | 2021-08-24 | 2021-08-24 | Multi-mode feature fusion video description text generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113806587A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114398961A (en) * | 2021-12-28 | 2022-04-26 | 西南交通大学 | Visual question-answering method based on multi-mode depth feature fusion and model thereof |
CN114398961B (en) * | 2021-12-28 | 2023-05-05 | 西南交通大学 | Visual question-answering method based on multi-mode depth feature fusion and model thereof |
CN114387430A (en) * | 2022-01-11 | 2022-04-22 | 平安科技(深圳)有限公司 | Image description generation method, device, equipment and medium based on artificial intelligence |
CN114387430B (en) * | 2022-01-11 | 2024-05-28 | 平安科技(深圳)有限公司 | Image description generation method, device, equipment and medium based on artificial intelligence |
CN115175006A (en) * | 2022-06-09 | 2022-10-11 | 中国科学院大学 | Video description method and system based on hierarchical modularization |
CN115496134B (en) * | 2022-09-14 | 2023-10-03 | 北京联合大学 | Traffic scene video description generation method and device based on multi-mode feature fusion |
CN115496134A (en) * | 2022-09-14 | 2022-12-20 | 北京联合大学 | Traffic scene video description generation method and device based on multi-modal feature fusion |
CN116193275A (en) * | 2022-12-15 | 2023-05-30 | 荣耀终端有限公司 | Video processing method and related equipment |
CN116193275B (en) * | 2022-12-15 | 2023-10-20 | 荣耀终端有限公司 | Video processing method and related equipment |
CN116128043A (en) * | 2023-04-17 | 2023-05-16 | 中国科学技术大学 | Training method of video scene boundary detection model and scene boundary detection method |
CN116821417A (en) * | 2023-08-28 | 2023-09-29 | 中国科学院自动化研究所 | Video tag sequence generation method and device |
CN116821417B (en) * | 2023-08-28 | 2023-12-12 | 中国科学院自动化研究所 | Video tag sequence generation method and device |
CN116932803A (en) * | 2023-09-13 | 2023-10-24 | 浪潮(北京)电子信息产业有限公司 | Data set generation method and training method based on multi-mode pre-training model |
CN116932803B (en) * | 2023-09-13 | 2024-01-26 | 浪潮(北京)电子信息产业有限公司 | Data set generation method and training method based on multi-mode pre-training model |
CN117079081A (en) * | 2023-10-16 | 2023-11-17 | 山东海博科技信息系统股份有限公司 | Multi-mode video text processing model training method and system |
CN117079081B (en) * | 2023-10-16 | 2024-01-26 | 山东海博科技信息系统股份有限公司 | Multi-mode video text processing model training method and system |
CN117876941A (en) * | 2024-03-08 | 2024-04-12 | 杭州阿里云飞天信息技术有限公司 | Target multi-mode model system, construction method, video processing model training method and video processing method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113806587A (en) | Multi-mode feature fusion video description text generation method | |
CN111160008B (en) | Entity relationship joint extraction method and system | |
CN110134771B (en) | Implementation method of multi-attention-machine-based fusion network question-answering system | |
CN109874029B (en) | Video description generation method, device, equipment and storage medium | |
CN110275936B (en) | Similar legal case retrieval method based on self-coding neural network | |
CN112084314A (en) | Knowledge-introducing generating type session system | |
CN111694924A (en) | Event extraction method and system | |
CN112633364A (en) | Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism | |
CN113627266B (en) | Video pedestrian re-recognition method based on transform space-time modeling | |
CN115617955B (en) | Hierarchical prediction model training method, punctuation symbol recovery method and device | |
CN112989120B (en) | Video clip query system and video clip query method | |
CN116050401B (en) | Method for automatically generating diversity problems based on transform problem keyword prediction | |
CN116450796A (en) | Intelligent question-answering model construction method and device | |
CN111767697B (en) | Text processing method and device, computer equipment and storage medium | |
CN117609421A (en) | Electric power professional knowledge intelligent question-answering system construction method based on large language model | |
CN115019239A (en) | Real-time action positioning method based on space-time cross attention | |
CN115361595A (en) | Video bullet screen generation method | |
CN115098673A (en) | Business document information extraction method based on variant attention and hierarchical structure | |
CN113065027A (en) | Video recommendation method and device, electronic equipment and storage medium | |
CN116186562A (en) | Encoder-based long text matching method | |
CN114329005A (en) | Information processing method, information processing device, computer equipment and storage medium | |
CN114328910A (en) | Text clustering method and related device | |
CN114550272B (en) | Micro-expression recognition method and device based on video time domain dynamic attention model | |
CN113987187B (en) | Public opinion text classification method, system, terminal and medium based on multi-label embedding | |
US20240153247A1 (en) | Automatic data generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20211217 |