CN113806587A - Multi-mode feature fusion video description text generation method - Google Patents
- Publication number
- CN113806587A (publication); application CN202110975443.2A
- Authority
- CN
- China
- Prior art keywords
- video
- word
- feature
- expression
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G06F16/7844 — Retrieval of video data characterised by metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
- G06F16/7847 — Retrieval of video data characterised by metadata automatically derived from the content, using low-level visual features of the video content
- G06F16/785 — Retrieval of video data using low-level visual features of the video content, using colour or luminescence
- G06F40/242 — Handling natural language data; Lexical tools; Dictionaries
- G06F40/253 — Grammatical analysis; Style critique
- G06F40/30 — Semantic analysis
- G06N3/044 — Neural network architectures; Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
Abstract
The invention discloses a multi-modal feature fusion video description text generation method, which comprises the following steps: 1) establishing a data set, a verification set and a semantic dictionary; 2) constructing a multi-modal feature fusion network to obtain aggregation features; 3) obtaining the subject, predicate and object of the description sentence with an encoder that performs syntax-aware perception of video action; 4) generating the description text of the video with the motion guidance decoder; 5) training the video text generation network model; 6) generating the text description sentence of the video: after the network training of steps 1 to 5 is finished, all parameters of the video text generation network model are obtained; the video to be described is then taken as the input video and passed through steps 2 to 4 to obtain its text description. The method generates video description text with higher accuracy.
Description
Technical Field
The invention belongs to the technical field of video text description generation, and relates to a multi-modal feature fusion video description text generation method.
Background
The task of video text description is to automatically generate a complete and natural sentence to describe video content, and accurately understanding the content contained in the video is of great significance and wide application in practice. For example, in the case of massive video data, video text description can be used for fast and efficient video retrieval, and the generated video text description can also be used for intelligent auditing of videos.
In the video description text generation process, if the semantic information contained in the multi-modal features of the video is not learned well, the generated description may be semantically inconsistent with the video content. At present, 2D and 3D convolutional neural networks have greatly improved the learning of representations from visual, audio and motion information, but how to aggregate the extracted multi-modal features of the video remains an open problem whose solution can improve the accuracy of the text description.
Disclosure of Invention
The invention aims to provide a multi-modal feature fusion video description text generation method, which solves the prior-art problem that, in video description text generation, the generated description is semantically inconsistent with the video content.
The technical scheme adopted by the invention is that the multi-modal feature fusion video description text generation method is implemented according to the following steps:
step 1, establishing a data set, a verification set and a semantic dictionary;
step 2, constructing a multi-mode feature fusion network to obtain an aggregation feature;
step 3, obtaining a subject, a predicate and an object of a description statement by using an encoder for sensing video action by grammar;
step 4, generating a description text of the video by using the motion guidance decoder;
step 5, training a video text to generate a network model;
step 6, generating a text description sentence of the video,
after completing the network training through the steps 1 to 5, obtaining all parameters of a video text generation network model; and taking the video to be described as an input video, and after the steps 2 to 4 are carried out, obtaining the text description of the video to be described.
The method has the advantages that, in the video description text generation network model, the RGB features obtained by the 2D convolutional neural network and the temporal features obtained by the 3D convolutional neural network are combined through the multi-modal feature fusion model into aggregation features that better meet the requirements of text description; the aggregation features are then combined with the predicate coding information generated by the syntax-aware predictive action module and sent into the decoding network model to produce the text description of the input video. Compared with the algorithm indexes of currently retrieved mainstream papers, the video description text generated by the method has higher accuracy.
Drawings
FIG. 1 is a flow chart of a multimodal feature fusion model of the method of the present invention;
FIG. 2 is a flow chart of a self-attention mechanism in the method of the present invention;
FIG. 3 is a flow chart of an encoder based on a syntax aware predictive action module in the method of the present invention;
FIG. 4 is a flow chart of the encoding layer in the syntax aware predictive action module of the method of the present invention;
FIG. 5 is a flow chart of the decoder model employed by the method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The method for generating the multi-modal feature fusion video description text is specifically implemented according to the following steps:
Step 1, establishing a data set, a verification set and a semantic dictionary,
1.1) constructing a data set and a validation set,
because video text description places demands on the generalization capability of the network over the data set, publicly available data sets are generally recommended, except in special cases that require self-made, self-labeled data sets;
the method preferably uses the video description text generation data sets with relatively high usage at present, MSR-VTT and MSVD; the MSVD data set contains 1967 public YouTube short videos, each mainly showing one activity and lasting 10 to 25 s; the videos cover different people, animals, actions, scenes and the like, the video contents are labeled by different people, and each video corresponds to 40 text description sentences on average; the MSR-VTT data set contains 10000 videos, and each video corresponds to 20 text description sentences on average;
a part of the MSR-VTT and MSVD data sets (preferably 80% of the total data in this step) is randomly selected as the data samples of the training set, and the remaining 20% of the samples are used as the verification set samples;
1.2) establishing a semantic dictionary,
from the sample labels of the training set and the verification set, all words are sorted from high to low by number of occurrences, and the first m words are selected to form the semantic concept set, where m is an empirical value, preferably 80% to 85% of the total number of words;
each word is assigned an integer serial number from 0 to m−1, and four additional marks are then added, namely the start mark <bos>, the end mark <eos>, the blank mark <pad> and the replacement mark <unk>, giving m+4 integer serial numbers in total and forming the semantic dictionary vocab = {0, 1, 2, ..., m+3}. The labeled sentences undergo minimal preprocessing: punctuation marks are deleted; <bos> and <eos> are added at the beginning and end of each sentence respectively; words not contained in the semantic dictionary are replaced with <unk>; and the sentence length is fixed to L, where L is an empirical value, preferably selected from [20, 40] according to the statistics of the sentence lengths of the video text descriptions. If a sentence is too long, the excess words are deleted; if it is too short, it is padded to the fixed length with <pad>;
assuming that the total number of training samples is N and i denotes the i-th video sample, i = 1, 2, ..., N, the data set samples are labeled with the established semantic dictionary, with the expression:
Yi = [yi,1, yi,2, ..., yi,L], i = 1, 2, ..., N (1)
wherein yi,t is the serial number (an integer) of the t-th word of the i-th video in the text semantic dictionary, t = 1, 2, ..., L;
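As a concrete illustration of steps 1.1) and 1.2), the dictionary construction and sentence labeling can be sketched as follows (a minimal sketch; the function names, the tie-breaking order for equally frequent words, and the exact serial-number convention are assumptions, not the patent's code):

```python
import re

def build_vocab(sentences, m):
    """Count word frequencies, keep the m most frequent words, then add
    the four special marks named in the text: <bos>, <eos>, <pad>, <unk>."""
    counts = {}
    for s in sentences:
        for w in re.sub(r"[^\w\s]", "", s.lower()).split():
            counts[w] = counts.get(w, 0) + 1
    # Most frequent first; ties broken alphabetically (an assumption).
    words = sorted(counts, key=lambda w: (-counts[w], w))[:m]
    vocab = {w: i for i, w in enumerate(words)}
    for mark in ("<bos>", "<eos>", "<pad>", "<unk>"):
        vocab[mark] = len(vocab)
    return vocab          # m + 4 entries in total

def encode_sentence(sentence, vocab, L):
    """Minimal preprocessing: strip punctuation, wrap with <bos>/<eos>,
    map out-of-dictionary words to <unk>, and pad/truncate to length L."""
    toks = re.sub(r"[^\w\s]", "", sentence.lower()).split()
    ids = [vocab["<bos>"]] + [vocab.get(w, vocab["<unk>"]) for w in toks] + [vocab["<eos>"]]
    ids = ids[:L]
    ids += [vocab["<pad>"]] * (L - len(ids))
    return ids
```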
Step 2, constructing a multi-mode feature fusion network to obtain the aggregation features,
2.1) extracting multi-modal features of the video,
firstly, the input video (during model training, the input video is a training sample video given in step 1; after training is finished, it is the video whose text description is required) is sampled at equal intervals to obtain a preprocessed video of length T frames, where T is an empirical value determined by the content of the video to be described, preferably T ∈ [16, 64]; then, the two-dimensional RGB features and target region features of the preprocessed video are extracted: the M1-dimensional (M1 = 1536) feature vector output by the last average pooling layer of the two-dimensional convolutional network Inception-ResNetV2 (IRV2, prior art) describes the two-dimensional RGB feature Vr of the video, and the M2-dimensional (M2 = 1024) feature vectors output by the RoI Pooling layer of the Faster R-CNN network (prior art) serve as the multiple target region features Vb of the video; then, every 16-frame video sequence is taken as a slice, with adjacent slices overlapping by 8 frames, and the M3-dimensional (M3 = 2048) feature vector output by the fully-connected FC6 layer of the three-dimensional convolutional network C3D (prior art) describes the three-dimensional temporal feature Vm of the video; together these constitute the visual multi-modal features of each video, with the expression:
V={Vr,Vm,Vb} (2)
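The equal-interval sampling and the 16-frame/8-frame-overlap slicing of step 2.1) can be sketched as follows (a minimal sketch under assumed parameter values; the feature extractors IRV2, Faster R-CNN and C3D themselves are not reproduced here):

```python
import numpy as np

def sample_frame_indices(num_frames, T=32):
    """Equal-interval sampling: pick T frame indices spread over the clip
    (T is an empirical value, chosen in [16, 64] per the text)."""
    return np.linspace(0, num_frames - 1, T).round().astype(int)

def c3d_slices(T=32, clip_len=16, stride=8):
    """16-frame slices with an 8-frame overlap, as used for the C3D branch;
    returns (start, end) index pairs over the T sampled frames."""
    return [(s, s + clip_len) for s in range(0, T - clip_len + 1, stride)]
```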
2.2) acquiring the aggregation characteristics by utilizing the multi-modal characteristic fusion model,
2.2.1) calculating the scene representation feature Vrm' with the self-attention mechanism module,
as shown in FIG. 1, using the framework of the multi-modal feature fusion model, the two-dimensional RGB feature Vr obtained in step 2.1) and the three-dimensional temporal feature Vm are spliced into a global feature Vrm, which is passed through three different linear transformations to obtain the query vector Qrm, key vector Krm and value vector Vrm of the same dimension, with the expression:
{Qrm, Krm, Vrm} = {Vrm·W^Q_rm, Vrm·W^K_rm, Vrm·W^V_rm} (3)
wherein W^Q_rm, W^K_rm and W^V_rm are parameters learned in the training process;
2.2.2) as shown in FIG. 2, the Self-Attention module first computes the dot product between the query vector Qrm and the key vector Krm; to prevent the result from being too large, it is divided by a scale √dQ, where dQ is the dimension of the matrix Qrm; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vrm to obtain the weighted scene representation feature Vrm', with the expression:

Vrm' = softmax(Qrm·Krm^T / √dQ)·Vrm (4)
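The scaled dot-product self-attention described above (and reused in step 3.1.3) can be sketched in NumPy; the function names are illustrative, as the patent provides no code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_Q)) V — scaled dot-product attention."""
    dq = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(dq)) @ V
```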
2.2.3) constructing the MA module to calculate the motion feature Vm',
the three-dimensional temporal feature Vm obtained in step 2.1) and the code Epos, obtained by passing the verbs labeled with part of speech (pos) in the semantic dictionary through an Embedding layer, are combined by the MA module, which generates, through two linear transformation layers and a ReLU layer, a motion feature Vm' that can focus on the interaction relationships between objects, with the expression:

Vm' = W^a_m·ReLU(Wm·[Vm; Epos] + bm) (5)

wherein Wm, W^a_m, bm are the weights of the linear transformations learned in the training process, and ReLU is the activation function;
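A minimal sketch of the MA module as described: two linear layers with a ReLU between them, acting on the concatenation of Vm and Epos. All shapes and the weight layout are illustrative assumptions:

```python
import numpy as np

def ma_module(Vm, Epos, Wm, bm, Wa):
    """Two linear transformation layers with a ReLU layer:
    Vm' = Wa · ReLU(Wm · [Vm; Epos] + bm) (weight names follow the text)."""
    x = np.concatenate([Vm, Epos], axis=-1)  # combine temporal and verb codes
    h = np.maximum(0.0, x @ Wm + bm)         # first linear layer + ReLU
    return h @ Wa                            # second linear layer
```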
2.2.4) constructing the dynamic attention module to solve for the aggregation feature V',
the scene representation feature Vrm' and the motion feature Vm' obtained in steps 2.2.2) and 2.2.3), together with the two-dimensional RGB feature Vr, undergo multi-modal feature fusion through the dynamic attention module to obtain the final aggregation feature V'; first, the dot product of Vrm' and Vm' is taken and divided by a scale √d, where d is the dimension of the matrix Vrm'; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vr to obtain the aggregation feature V', with the expression:

V' = softmax(Vrm'·Vm'^T / √d)·Vr (6)
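The dynamic attention fusion of step 2.2.4) can be sketched the same way; it is assumed for illustration that Vrm', Vm' and Vr have compatible dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_attention(Vrm_p, Vm_p, Vr):
    """Fuse scene representation, motion and RGB features: attention weights
    computed from Vrm' against Vm', then applied to Vr."""
    d = Vrm_p.shape[-1]
    w = softmax(Vrm_p @ Vm_p.T / np.sqrt(d))
    return w @ Vr
```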
step 3, obtaining the subject, predicate and object of the description sentence by using the encoder for syntax-aware perception of video action,
the encoder for syntax-aware perception of video action adopts the basic structure of the SAAT network model (the SAAT network model is prior art); it belongs to the syntax-aware predictive action module and is divided into a component extractor-encoder (Cxe) and a component extractor-decoder (Cxd);
as shown in FIG. 3, the component extractor-encoder (Cxe) and the component extractor-decoder (Cxd) are identical in structure, each consisting of an Embedding layer and a stack of three identical Encoding layers based on the self-attention mechanism;
as shown in FIG. 4, each Encoding layer consists of a self-attention mechanism (Self-Attention), layer normalization (LayerNorm) and a nonlinear feed-forward network (FFN); the specific process is as follows:
3.1) obtaining the scene semantic feature Vbs with the component extractor-encoder (Cxe),
the target region features Vb of the video obtained in step 2.1) and the target position code Rl are used as the input sequence of Cxe and encoded into the scene semantic feature Vbs, where K is the number of target regions;
3.1.1) the target region feature Vb of the video and the target position code Rl are spliced in a cascading manner to obtain the cascade feature Rc, with the expression:

Rc = [Vb; Rl] (7)

wherein Rl^k = [xk/wf, yk/hf, wk/wf, hk/hf], k = 1, 2, ..., K, encodes the target center coordinates and the width and height information of the target, and wf, hf are the width and height of the video frame respectively;
3.1.2) the cascade feature Rc obtained in step 3.1.1) is passed through the Embedding layer and three different linear transformations to obtain the mapping matrices Qc, Kc, Vc of the same dimension, with the expression:
{Qc, Kc, Vc} = {Rc·WQ, Rc·WK, Rc·WV} (8)
wherein WQ, WK, WV are all parameters learned in the training process;
3.1.3) the self-attention-based Encoding layer computes the dot product between Qc and Kc; to prevent the result from being too large, it is divided by a scale √dQ, where dQ is the dimension of the matrix Qc; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vc to obtain the weighted scene semantic feature Vbs, with the expression:

Vbs = softmax(Qc·Kc^T / √dQ)·Vc (9)
3.2) getting the video action, i.e. the predicate of the textual description sentence, with the component extractor-decoder (Cxd),
using the multi-modal features of the video and the scene semantic feature Vbs as the input of the component extractor-decoder (Cxd), the predicate, i.e. the action in the video, is decoded; the specific process is:
3.2.1) with the global RGB feature Vr (projected to Vrs) as the Query for predicting the subject and the scene semantic feature Vbs as the Key and Value, the self-attention mechanism (formula (10)) yields the feature code Es of the subject; a softmax layer (formula (11)) then gives the word probability matrix pθ(word|Vbs, Vrs) of the subject; finally the argmax function (formula (12)) selects the word with the maximum probability, which is the corresponding subject s; the expressions are:
Es = fatt(Vrs, Vbs, Vbs) (10)
pθ(word|Vbs, Vrs) = softmax(Ws^T·Es) (11)
s = argmax(pθ(word|Vbs, Vrs)) (12)
wherein Vrs = Wv·Vr, and Wv, Ws are parameters learned in the training process;
3.2.2) the feature code Es of the subject obtained in step 3.2.1) is used as the Query for predicting the action (i.e. the predicate) of the video, and the temporal feature Vm of the video (projected to Vma) as the Key and Value; the feature code Ea of the predicate is calculated with formula (13), the word probability matrix pθ(word|s, Vma) of the predicate with formula (14), and the corresponding predicate a with formula (15); the expressions are:
Ea = fatt(Es, Vma, Vma) (13)
pθ(word|s, Vma) = softmax(Wa^T·Ea) (14)
a = argmax(pθ(word|s, Vma)) (15)
wherein Vma = Wm·Vm, and Wm, Wa are parameters learned in the training process;
3.2.3) the feature code Ea of the predicate is used as the Query for predicting the object, and the scene semantic feature Vbs of the video as the Key and Value; the feature code Eo of the object is calculated with formula (16), the word probability matrix pθ(word|a, Vbs) of the object with formula (17), and the corresponding object o with formula (18); the expressions are:
Eo = fatt(Ea, Vbs, Vbs) (16)
pθ(word|a, Vbs) = softmax(Wo^T·Eo) (17)
o = argmax(pθ(word|a, Vbs)) (18)
wherein Wo is a parameter learned in the training process;
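Steps 3.2.1) to 3.2.3) chain three attention calls, each component's feature code serving as the Query for the next. A sketch with random illustrative weights (fatt follows the shape of formulas (10), (13) and (16); the projected features Vrs and Vma are assumed to be precomputed):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def f_att(q, K, V):
    """Single-query scaled dot-product attention, standing in for fatt."""
    return softmax(q @ K.T / np.sqrt(q.shape[-1])) @ V

def decode_svo(Vrs, Vbs, Vma, Ws, Wa, Wo):
    """Chained subject -> predicate -> object prediction (a sketch of Cxd)."""
    Es = f_att(Vrs, Vbs, Vbs)                 # subject feature code
    s = int(np.argmax(softmax(Ws.T @ Es)))    # subject word index
    Ea = f_att(Es, Vma, Vma)                  # predicate feature code
    a = int(np.argmax(softmax(Wa.T @ Ea)))    # predicate word index
    Eo = f_att(Ea, Vbs, Vbs)                  # object feature code
    o = int(np.argmax(softmax(Wo.T @ Eo)))    # object word index
    return s, a, o
```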
step 4, generating a description text of the video by using the motion guidance decoder,
as shown in fig. 5, using the decoder model, the specific process is as follows:
4.1) calculating the attention weight βt of the predicate feature code Ea through the Attention module,
the feature code Ea of the predicate calculated with formula (13) in step 3.2.2) is spliced with the output h(t−1) of the LSTMs at time t−1 (h0 represents the start mark <bos>) to obtain the embedded vector Eword,t; the attention distribution βt output by the Attention module at time t has the expression:

Eword,t = [Ea; h(t−1)] (19)
βt = softmax(Wβ·tanh(Wh·Eword,t + bβ)) (20)

wherein Wβ, Wh, bβ are parameters learned in the training process;
4.2) generating the word prediction ht with the LSTMs (a public technology), with the expression:

ht = LSTM(WLv·[βt⊙V'; Eword,t], h(t−1)) (23)

wherein WLv is a parameter learned in the training process, and ⊙ denotes weighting the aggregation feature V' by the attention distribution βt;
4.3) calculating the predicted word probability,
the prediction result is passed through a softmax function to obtain the word prediction probability pθ(wordt) at time t; the word with the maximum probability is the predicted word at the current time, with the expression:
pθ(wordt)=softmax(Ww·ht) (24)
wordt=argmax(pθ(wordt)) (25)
wherein, WwIs a parameter learned by the training process;
4.4) let t = t + 1 and cyclically execute steps 4.1) to 4.3) until the predicted wordt is the end mark <eos>;
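The decoding loop of steps 4.1) to 4.4) reduces to greedy argmax decoding with an <eos> stop condition. A sketch in which a stand-in step function replaces the Attention+LSTMs computation (names are illustrative):

```python
import numpy as np

def greedy_decode(step_fn, h0, bos_id, eos_id, max_len=20):
    """Greedy decoding loop of step 4: at each time step the decoder produces
    a word distribution; argmax picks word_t (formula (25)); stop at <eos>.
    `step_fn(word, h) -> (probs, h)` stands in for the attention+LSTM step."""
    words, word, h = [], bos_id, h0
    for _ in range(max_len):
        probs, h = step_fn(word, h)
        word = int(np.argmax(probs))   # word with the maximum probability
        if word == eos_id:
            break
        words.append(word)
    return words
```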
step 5, training the video text to generate a network model,
all the training samples given in step 1 are input into the video text generation network model, steps 2 to 4 are repeated, and training uses the standard cross-entropy loss; the total loss function of the video text generation network model combines the loss Ls of the SAAT module with the loss Lc of the video description text generator, with the expression:
L(θ)=Lc+λ·Ls (26)
wherein λ is a trade-off weight; wordt^(i) is the t-th word of the i-th training sample and L is the fixed sentence length given in step 1; (s, a, o)^(i) are the subject, predicate and object output by the network for the i-th training sample, and the corresponding label terms are the subject, predicate and object of the i-th training sample's annotation; Vb, Vr, Vm are respectively the target region features, two-dimensional RGB features and three-dimensional temporal features of the video extracted in step 2.1);
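A sketch of the loss computation: the standard cross-entropy of the description generator, and the weighted total of formula (26). The internals of the per-term SAAT loss are omitted; λ is passed in as `lam`:

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the target words; `probs` holds one
    predicted word distribution per time step, `targets` the label indices."""
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked + 1e-12)))  # epsilon guards log(0)

def total_loss(Lc, Ls, lam=1.0):
    """L(theta) = Lc + lambda * Ls, formula (26)."""
    return Lc + lam * Ls
```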
step 6, generating a text description sentence of the video,
after completing the network training through the steps 1 to 5, all parameters of the video text generation network model can be obtained; and taking the video to be described as an input video, and performing the steps 2 to 4 to obtain the text description of the video.
The constructed video text generation network model can generate video text description according to the video multi-modal features and the video motion guidance.
Claims (6)
1. A multi-modal feature-fused video description text generation method is implemented according to the following steps:
step 1, establishing a data set, a verification set and a semantic dictionary;
step 2, constructing a multi-mode feature fusion network to obtain an aggregation feature;
step 3, obtaining a subject, a predicate and an object of a description statement by using an encoder for sensing video action by grammar;
step 4, generating a description text of the video by using the motion guidance decoder;
step 5, training a video text to generate a network model;
step 6, generating a text description sentence of the video,
after completing the network training through the steps 1 to 5, obtaining all parameters of a video text generation network model; and taking the video to be described as an input video, and after the steps 2 to 4 are carried out, obtaining the text description of the video to be described.
2. The method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 1 is:
1.1) constructing a data set and a validation set,
except in special cases that require self-made, self-labeled data sets, publicly available data sets are generally recommended;
selecting a video description text to generate data sets MSR-VTT and MSVD, selecting 80% of all data as data samples of a training set, and using the rest 20% of the samples as verification set samples;
1.2) establishing a semantic dictionary,
sequencing all words from high to low according to the occurrence times from sample labels of the training set and the verification set, and selecting the first m words to form a semantic concept set;
assigning each word an integer serial number from 0 to m−1 and then adding four additional marks, namely the start mark <bos>, the end mark <eos>, the blank mark <pad> and the replacement mark <unk>, giving m+4 integer serial numbers in total and forming the semantic dictionary vocab = {0, 1, 2, ..., m+3}; performing minimal preprocessing on the labeled sentences, namely deleting punctuation marks, adding <bos> and <eos> at the beginning and end of each sentence respectively, replacing words not contained in the semantic dictionary with <unk>, and fixing the sentence length to L; if a sentence is too long the excess words are deleted, and if it is too short it is padded to the fixed length with <pad>;
assuming that the total number of training samples is N and i denotes the i-th video sample, i = 1, 2, ..., N, the data set samples are labeled with the established semantic dictionary, with the expression:
Yi = [yi,1, yi,2, ..., yi,L], i = 1, 2, ..., N (1)
wherein yi,t is the serial number (an integer) of the t-th word of the i-th video in the text semantic dictionary, t = 1, 2, ..., L.
3. The method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 2 is:
2.1) extracting multi-modal features of the video,
the multi-modal features of the video are extracted with three network structures jointly: firstly, the input video is sampled at equal intervals to obtain a preprocessed video of length T frames; then, the two-dimensional RGB features and target region features of the preprocessed video are extracted: the M1-dimensional feature vector output by the last average pooling layer of the two-dimensional convolutional network Inception-ResNetV2 describes the two-dimensional RGB feature Vr of the video, and the M2-dimensional feature vectors output by the RoI Pooling layer of the Faster R-CNN network serve as the multiple target region features Vb of the video; then, every 16-frame video sequence is taken as a segment, with adjacent segments overlapping by 8 frames, and the M3-dimensional feature vector output by the fully-connected FC6 layer of the three-dimensional convolutional network C3D describes the three-dimensional temporal feature Vm of the video; together these constitute the visual multi-modal features of each video, with the expression:
V={Vr,Vm,Vb} (2)
2.2) acquiring the aggregation characteristics by utilizing the multi-modal characteristic fusion model,
2.2.1) calculating the scene representation feature Vrm' with the self-attention mechanism module,
using the multi-modal feature fusion model, the two-dimensional RGB feature Vr obtained in step 2.1) and the three-dimensional temporal feature Vm are spliced into a global feature Vrm, which is passed through three different linear transformations to obtain the query vector Qrm, key vector Krm and value vector Vrm of the same dimension, with the expression:
{Qrm, Krm, Vrm} = {Vrm·W^Q_rm, Vrm·W^K_rm, Vrm·W^V_rm} (3)
wherein W^Q_rm, W^K_rm and W^V_rm are parameters learned in the training process;
2.2.2) the Self-Attention module first computes the dot product between the query vector Qrm and the key vector Krm; to prevent the result from being too large, it is divided by a scale √dQ, where dQ is the dimension of the matrix Qrm; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vrm to obtain the weighted scene representation feature Vrm', with the expression:

Vrm' = softmax(Qrm·Krm^T / √dQ)·Vrm (4)
2.2.3) constructing the MA module to calculate the motion feature Vm',
the three-dimensional temporal feature Vm obtained in step 2.1) and the code Epos, obtained by labeling the parts of speech in the semantic dictionary and passing the labeled verbs through an Embedding layer, are combined by the MA module, which generates, through two linear transformation layers and a ReLU layer, a motion feature Vm' that can focus on the interaction relationships between objects, with the expression:

Vm' = W^a_m·ReLU(Wm·[Vm; Epos] + bm) (5)

wherein Wm, W^a_m, bm are the weights of the linear transformations learned in the training process, and ReLU is the activation function;
2.2.4) constructing the dynamic attention module to solve for the aggregation feature V',
the scene representation feature Vrm' and the motion feature Vm' obtained in steps 2.2.2) and 2.2.3), together with the two-dimensional RGB feature Vr, undergo multi-modal feature fusion through the dynamic attention module to obtain the final aggregation feature V'; first, the dot product of Vrm' and Vm' is taken and divided by a scale √d, where d is the dimension of the matrix Vrm'; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix Vr to obtain the aggregation feature V', with the expression:

V' = softmax(Vrm'·Vm'^T / √d)·Vr (6)
4. the method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 3 is:
the encoder used for syntax-aware video action prediction adopts the basic structure of the SAAT network model and is divided into a component extractor-encoder and a component extractor-decoder;

the component extractor-encoder and the component extractor-decoder have the same structure, each formed by stacking an Embedding layer and three structurally identical Encoding layers based on the self-attention mechanism;

each Encoding layer consists of an attention mechanism, layer normalization and a nonlinear feed-forward network; the specific process is as follows:
3.1) obtaining the scene semantic feature V_bs through the component extractor-encoder:

The target region features V_b of the video obtained in step 2.1) and the target position code R_l serve as the input sequence of the component extractor-encoder and are encoded into the scene semantic feature V_bs, wherein K is the number of target regions;
3.1.1) the target region feature V_b of the video and the target position code R_l are spliced in a cascading manner to obtain the cascade feature R_c, the expression being:

R_c = [V_b; R_l], R_l = (x/w_f, y/h_f, w/w_f, h/h_f) (7)

wherein (x, y, w, h) are the target center coordinates and the width and height information of the target, and w_f, h_f are respectively the width and height of the video frame;
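A sketch of building the cascade feature: each box is normalized by the frame size and the position code is concatenated to the region feature. The exact layout of R_l is an assumption for illustration.

```python
import numpy as np

def cascade_feature(V_b, boxes, w_f, h_f):
    """Concatenate region features with frame-normalized position codes R_l."""
    x, y, w, h = boxes.T                       # center coords and box size per region
    R_l = np.stack([x / w_f, y / h_f, w / w_f, h / h_f], axis=1)
    return np.concatenate([V_b, R_l], axis=1)  # R_c = [V_b; R_l]

V_b = np.ones((3, 10))                         # 3 regions, 10-dim features (toy)
boxes = np.array([[320., 180., 64., 48.],
                  [100., 100., 50., 50.],
                  [600., 300., 30., 90.]])
R_c = cascade_feature(V_b, boxes, w_f=640.0, h_f=360.0)
```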
3.1.2) the cascade feature R_c obtained in step 3.1.1) is passed through the Embedding layer, and three different linear transformations are used to obtain the mapping matrices Q_c, K_c, V_c of the same dimensionality, the expression being:

{Q_c, K_c, V_c} = {R_c·W_Q, R_c·W_K, R_c·W_V} (8)

wherein W_Q, W_K, W_V are all parameters learned in the training process;
3.1.3) the Encoding layer based on the self-attention mechanism computes the dot product between Q_c and K_c; to prevent the result from being too large, it is divided by a scale √d_Q, wherein d_Q is the dimension of the matrix Q_c; the result is normalized into a probability distribution with a softmax operation and multiplied by the matrix V_c to obtain the weighted scene semantic feature V_bs, the expression being:

V_bs = softmax(Q_c·K_c^T / √d_Q)·V_c (9)
3.2) obtaining the video action, i.e. the predicate of the textual description statement:

Using the multi-modal features of the video and the scene semantic feature V_bs as the input of the component extractor-decoder, the predicate, i.e. the action in the video, is obtained by decoding; the specific process is as follows:
3.2.1) the global RGB feature V_r is set as the Query for predicting the subject and the scene semantic feature V_bs as the Key-Value; the feature code E_s of the subject is obtained with the self-attention mechanism, the word probability matrix p_θ(word|V_bs, V_rs) of the subject is obtained through a softmax layer, and the word with the maximum probability, i.e. the corresponding subject s, is then obtained with the argmax function; the expressions are:

E_s = f_att(V_rs, V_bs, V_bs) (10)
p_θ(word|V_bs, V_rs) = softmax(W_s^T·E_s) (11)
s = arg max(p_θ(word|V_bs, V_rs)) (12)

wherein W_s is a parameter learned by the training process;
3.2.2) the feature code E_s of the subject obtained in step 3.2.1) is set as the Query for predicting the action of the video and the temporal feature V_m of the video as the Key-Value; the feature code E_a of the predicate is calculated using formula (13), the word probability matrix p_θ(word|s, V_m) of the predicate using formula (14), and the corresponding predicate a using formula (15); the expressions are:

E_a = f_att(E_s, V_m, V_m) (13)
p_θ(word|s, V_m) = softmax(W_a^T·E_a) (14)
a = arg max(p_θ(word|s, V_m)) (15)

wherein W_a is a parameter learned by the training process;
3.2.3) the feature code E_a of the predicate is set as the Query for predicting the object and the scene semantic feature V_bs of the video as the Key-Value; the feature code E_o of the object is calculated using formula (16), the word probability matrix p_θ(word|a, V_bs) of the object using formula (17), and the corresponding object o using formula (18); the expressions are:

E_o = f_att(E_a, V_bs, V_bs) (16)
p_θ(word|a, V_bs) = softmax(W_o^T·E_o) (17)
o = arg max(p_θ(word|a, V_bs)) (18)

wherein W_o is a parameter learned by the training process.
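The chained decoding of subject → predicate → object in steps 3.2.1)-3.2.3) can be sketched as below. Here f_att is a single-query attention stand-in, and the vocabularies, projection matrices and feature sizes are all toy assumptions, not the patent's values.

```python
import numpy as np

def f_att(query, key, value):
    """Single-query attention stand-in: softmax(q·K^T / sqrt(d))·V."""
    d = query.shape[-1]
    scores = (key @ query) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ value

def predict_word(E, W):
    """softmax(W^T·E) followed by argmax over the toy vocabulary."""
    logits = W.T @ E
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(np.argmax(p))

rng = np.random.default_rng(3)
d, K, vocab = 16, 5, 20                        # toy sizes (assumed)
V_rs = rng.normal(size=d)                      # global RGB query
V_bs = rng.normal(size=(K, d))                 # scene semantic features
V_m  = rng.normal(size=(K, d))                 # temporal features
W_s, W_a, W_o = (rng.normal(size=(d, vocab)) for _ in range(3))

E_s = f_att(V_rs, V_bs, V_bs); s = predict_word(E_s, W_s)  # subject, eqs (10)-(12)
E_a = f_att(E_s, V_m, V_m);    a = predict_word(E_a, W_a)  # predicate, eqs (13)-(15)
E_o = f_att(E_a, V_bs, V_bs);  o = predict_word(E_o, W_o)  # object, eqs (16)-(18)
```

Note how each stage's feature code becomes the next stage's query, which is what ties the predicted triplet together.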
5. The method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 4 is:
4.1) calculating the attention weight β_t of the predicate feature code E_a through the Attention module:

The feature code E_a of the predicate calculated by formula (13) in step 3.2.2) is spliced with the output h_{t-1} of the LSTMs at time t-1, wherein h_0 represents the start symbol <bos>, to obtain the embedded vector x_t; the attention distribution β_t output by Attention at time t is expressed as:

x_t = [E_a; h_{t-1}] (19)
e_t = tanh(W_β·x_t + b_β) (20)
β_t = softmax(e_t) (21)

wherein W_β, b_β are parameters learned by the training process;
4.2) generating the word prediction result h_t through the LSTMs, the expression being:

c_t = β_t·E_a (22)
h_t = LSTM(W_Lv·c_t, h_{t-1}) (23)

wherein W_Lv is a parameter learned by the training process;
4.3) calculating the predicted word probability:

The prediction result is passed through a softmax function to obtain the word prediction probability p_θ(word_t) at time t; the word with the maximum probability is the predicted word at the current time; the expressions are:
p_θ(word_t) = softmax(W_w·h_t) (24)
word_t = arg max(p_θ(word_t)) (25)

wherein W_w is a parameter learned by the training process;
4.4) let t = t + 1 and repeat steps 4.1) to 4.3) until the predicted word_t is the end marker <eos>.
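The decoding loop of steps 4.1)-4.4) — attend over E_a, update the recurrent state, emit the argmax word, stop at <eos> — can be sketched with a simple tanh recurrent cell standing in for the LSTMs; every matrix, size and the <eos> index here is a toy assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
d, vocab, EOS, MAX_T = 16, 12, 0, 30           # toy sizes; word id 0 plays <eos>
E_a    = rng.normal(size=(5, d))               # predicate feature codes to attend over
W_beta = rng.normal(size=(2 * d, 5)) * 0.1     # attention parameters (assumed shapes)
b_beta = np.zeros(5)
W_rec  = rng.normal(size=(2 * d, d)) * 0.1     # recurrent cell standing in for the LSTMs
W_w    = rng.normal(size=(vocab, d))

h = np.zeros(d)                                # h_0 represents <bos>
words = []
for t in range(MAX_T):
    x = np.concatenate([E_a.mean(axis=0), h])          # embedded vector (simplified)
    e = np.tanh(x @ W_beta + b_beta)
    beta = np.exp(e - e.max()); beta /= beta.sum()     # attention distribution beta_t
    c = beta @ E_a                                     # attention-weighted context c_t
    h = np.tanh(np.concatenate([c, h]) @ W_rec)        # state update (LSTM stand-in)
    word = int(np.argmax(W_w @ h))                     # softmax argmax == logits argmax
    words.append(word)
    if word == EOS:                                    # step 4.4): stop at <eos>
        break
```

A MAX_T cap is used so the sketch terminates even if the random stand-in never emits <eos>.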
6. The method for generating multi-modal feature-fused video description text according to claim 1, wherein the specific process of step 5 is:
all the training samples given in step 1 are input into the video text generation network model, steps 2 to 4 are repeated, and training is performed by minimizing the standard cross-entropy loss; the total loss function of the video text generation network model is the weighted sum of the loss L_c of the video description text generator and the loss L_s of the SAAT module, the expression being:
L(θ) = L_c + λ·L_s (26)
wherein word_t^(i) is the t-th label word of the i-th training sample given in step 1, (ŝ, â, ô)^(i) are the subject, predicate and object output for the i-th training sample, (s, a, o)^(i) are the subject, predicate and object of the i-th training sample's label, and V_b, V_r, V_m are respectively the target region feature, the two-dimensional RGB feature and the three-dimensional temporal feature of the video extracted in step 2.1).
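The total loss of formula (26) — caption loss plus a weighted SAAT-module loss — can be sketched as below. The per-term cross-entropy form is a standard assumption; the patent does not spell out the individual terms in this excerpt.

```python
import numpy as np

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target word under a predicted distribution."""
    return -float(np.log(probs[target_idx] + 1e-12))  # epsilon guards log(0)

def total_loss(caption_probs, caption_targets, saat_probs, saat_targets, lam=1.0):
    """L(theta) = L_c + lambda * L_s, as in formula (26)."""
    L_c = sum(cross_entropy(p, t) for p, t in zip(caption_probs, caption_targets))
    L_s = sum(cross_entropy(p, t) for p, t in zip(saat_probs, saat_targets))
    return L_c + lam * L_s

# Toy distributions over a 4-word vocabulary.
cap_p  = [np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.1, 0.8, 0.05, 0.05])]
saat_p = [np.array([0.25, 0.25, 0.25, 0.25])]
loss = total_loss(cap_p, [0, 1], saat_p, [2], lam=0.5)
# L_c = -(ln 0.7 + ln 0.8), L_s = -ln 0.25, loss = L_c + 0.5 * L_s
```

λ trades off how strongly the subject-predicate-object supervision of the SAAT module shapes training relative to the caption loss.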
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110975443.2A CN113806587A (en) | 2021-08-24 | 2021-08-24 | Multi-mode feature fusion video description text generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113806587A true CN113806587A (en) | 2021-12-17 |
Family
ID=78941767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110975443.2A Withdrawn CN113806587A (en) | 2021-08-24 | 2021-08-24 | Multi-mode feature fusion video description text generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113806587A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114398961A (en) * | 2021-12-28 | 2022-04-26 | 西南交通大学 | Visual question-answering method based on multi-mode depth feature fusion and model thereof |
CN114398961B (en) * | 2021-12-28 | 2023-05-05 | 西南交通大学 | Visual question-answering method based on multi-mode depth feature fusion and model thereof |
CN114387430A (en) * | 2022-01-11 | 2022-04-22 | 平安科技(深圳)有限公司 | Image description generation method, device, equipment and medium based on artificial intelligence |
CN114387430B (en) * | 2022-01-11 | 2024-05-28 | 平安科技(深圳)有限公司 | Image description generation method, device, equipment and medium based on artificial intelligence |
CN115175006A (en) * | 2022-06-09 | 2022-10-11 | 中国科学院大学 | Video description method and system based on hierarchical modularization |
CN115496134B (en) * | 2022-09-14 | 2023-10-03 | 北京联合大学 | Traffic scene video description generation method and device based on multi-mode feature fusion |
CN115496134A (en) * | 2022-09-14 | 2022-12-20 | 北京联合大学 | Traffic scene video description generation method and device based on multi-modal feature fusion |
CN116193275A (en) * | 2022-12-15 | 2023-05-30 | 荣耀终端有限公司 | Video processing method and related equipment |
CN116193275B (en) * | 2022-12-15 | 2023-10-20 | 荣耀终端有限公司 | Video processing method and related equipment |
CN116128043A (en) * | 2023-04-17 | 2023-05-16 | 中国科学技术大学 | Training method of video scene boundary detection model and scene boundary detection method |
CN116821417A (en) * | 2023-08-28 | 2023-09-29 | 中国科学院自动化研究所 | Video tag sequence generation method and device |
CN116821417B (en) * | 2023-08-28 | 2023-12-12 | 中国科学院自动化研究所 | Video tag sequence generation method and device |
CN116932803A (en) * | 2023-09-13 | 2023-10-24 | 浪潮(北京)电子信息产业有限公司 | Data set generation method and training method based on multi-mode pre-training model |
CN116932803B (en) * | 2023-09-13 | 2024-01-26 | 浪潮(北京)电子信息产业有限公司 | Data set generation method and training method based on multi-mode pre-training model |
CN117079081A (en) * | 2023-10-16 | 2023-11-17 | 山东海博科技信息系统股份有限公司 | Multi-mode video text processing model training method and system |
CN117079081B (en) * | 2023-10-16 | 2024-01-26 | 山东海博科技信息系统股份有限公司 | Multi-mode video text processing model training method and system |
CN117876941A (en) * | 2024-03-08 | 2024-04-12 | 杭州阿里云飞天信息技术有限公司 | Target multi-mode model system, construction method, video processing model training method and video processing method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113806587A (en) | Multi-mode feature fusion video description text generation method | |
CN111160008B (en) | Entity relationship joint extraction method and system | |
CN110134771B (en) | Implementation method of multi-attention-machine-based fusion network question-answering system | |
CN109874029B (en) | Video description generation method, device, equipment and storage medium | |
CN110275936B (en) | Similar legal case retrieval method based on self-coding neural network | |
CN112084314A (en) | Knowledge-introducing generating type session system | |
CN111694924A (en) | Event extraction method and system | |
CN112633364A (en) | Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism | |
CN113627266B (en) | Video pedestrian re-recognition method based on transform space-time modeling | |
CN115617955B (en) | Hierarchical prediction model training method, punctuation symbol recovery method and device | |
CN112989120B (en) | Video clip query system and video clip query method | |
CN116050401B (en) | Method for automatically generating diversity problems based on transform problem keyword prediction | |
CN116450796A (en) | Intelligent question-answering model construction method and device | |
CN111767697B (en) | Text processing method and device, computer equipment and storage medium | |
CN117609421A (en) | Electric power professional knowledge intelligent question-answering system construction method based on large language model | |
CN115019239A (en) | Real-time action positioning method based on space-time cross attention | |
CN115361595A (en) | Video bullet screen generation method | |
CN115098673A (en) | Business document information extraction method based on variant attention and hierarchical structure | |
CN113065027A (en) | Video recommendation method and device, electronic equipment and storage medium | |
CN116186562A (en) | Encoder-based long text matching method | |
CN114329005A (en) | Information processing method, information processing device, computer equipment and storage medium | |
CN114328910A (en) | Text clustering method and related device | |
CN114550272B (en) | Micro-expression recognition method and device based on video time domain dynamic attention model | |
CN113987187B (en) | Public opinion text classification method, system, terminal and medium based on multi-label embedding | |
US20240153247A1 (en) | Automatic data generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20211217 |