CN113806587A - Multi-mode feature fusion video description text generation method - Google Patents

Multi-mode feature fusion video description text generation method

Info

Publication number
CN113806587A
Authority
CN
China
Prior art keywords
video
word
feature
expression
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110975443.2A
Other languages
Chinese (zh)
Inventor
朱虹
刘媛媛
李阳辉
张雨嘉
王栋
史静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110975443.2A priority Critical patent/CN113806587A/en
Publication of CN113806587A publication Critical patent/CN113806587A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/785Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using colour or luminescence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-modal feature fusion video description text generation method comprising the following steps: 1) establishing the dataset, the validation set and the semantic dictionary; 2) constructing a multi-modal feature fusion network to obtain aggregated features; 3) obtaining the subject, predicate and object of the description sentence with a syntax-aware video-action encoder; 4) generating the description text of the video with the motion-guided decoder; 5) training the video text generation network model; 6) generating the text description sentence of the video. After the network training of steps 1 to 5 is completed, all parameters of the video text generation network model are obtained; the video to be described is then taken as the input video and passed through steps 2 to 4 to obtain its text description. The method achieves higher accuracy.

Description

Multi-mode feature fusion video description text generation method
Technical Field
The invention belongs to the technical field of video text description generation, and relates to a multi-modal feature fusion video description text generation method.
Background
The task of video text description is to automatically generate a complete, natural sentence that describes the video content; accurately understanding what a video contains is of great practical significance and has wide application. For example, with massive amounts of video data, video text description enables fast and efficient video retrieval, and the generated descriptions can also be used for intelligent video auditing.
During video description text generation, if the semantic information contained in the video's multi-modal features is not learned well, the generated description may be semantically inconsistent with the video content. 2D and 3D convolutional neural networks have markedly improved representation learning from visual, audio and motion information, but how to aggregate the extracted multi-modal video features remains an open problem whose solution can further improve the accuracy of the text description.
Disclosure of Invention
The invention aims to provide a multi-modal feature fusion video description text generation method that solves the prior-art problem of semantic inconsistency between the video content and the generated description.
The technical solution adopted by the invention is a multi-modal feature fusion video description text generation method implemented according to the following steps:
step 1, establishing the dataset, the validation set and the semantic dictionary;
step 2, constructing the multi-modal feature fusion network to obtain aggregated features;
step 3, obtaining the subject, predicate and object of the description sentence with a syntax-aware video-action encoder;
step 4, generating the description text of the video with the motion-guided decoder;
step 5, training the video text generation network model;
step 6, generating the text description sentence of the video,
after the network training of steps 1 to 5 is completed, all parameters of the video text generation network model are obtained; the video to be described is taken as the input video and, after steps 2 to 4, its text description is obtained.
The advantage of the method is that, in the video description text generation network model, the RGB features obtained by the 2D convolutional neural network and the temporal features obtained by the 3D convolutional neural network are combined by the multi-modal feature fusion model into aggregated features that better meet the requirements of text description; these aggregated features are combined with the predicate coding information produced by the syntax-aware action prediction module and fed into the decoding network model to generate the text description of the input video. Compared with the metrics reported by the current mainstream papers, the video description text generated by the method is more accurate.
Drawings
FIG. 1 is a flow chart of a multimodal feature fusion model of the method of the present invention;
FIG. 2 is a flow chart of a self-attention mechanism in the method of the present invention;
FIG. 3 is a flow chart of the encoder based on the syntax-aware action prediction module in the method of the present invention;
FIG. 4 is a flow chart of the encoding layer in the syntax-aware action prediction module of the method of the present invention;
FIG. 5 is a flow chart of the decoder model employed by the method of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The method for generating the multi-modal feature fusion video description text is specifically implemented according to the following steps:
step 1, establishing the dataset, the validation set and the semantic dictionary,
1.1) constructing a data set and a validation set,
because video text description requires a dataset that supports the generalization ability of the network, public datasets are generally recommended, except in special cases where a self-made, self-labeled dataset is needed;
the method preferably uses the currently most widely used video description datasets, MSR-VTT and MSVD. The MSVD dataset contains 1967 public YouTube short videos; each video mainly shows one activity, lasts 10-25 s, and covers different people, animals, actions, scenes, etc.; the video content is annotated by different people, and each video corresponds to 40 text description sentences on average. The MSR-VTT dataset contains 10000 videos, each corresponding to 20 text description sentences on average;
a portion of the MSR-VTT and MSVD data (preferably 80% of the total in this step) is randomly selected as training set samples, and the remaining 20% of the samples are used as validation set samples;
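For concreteness, the 80/20 split described above can be realized as in the following Python sketch; the function and variable names are illustrative assumptions, not part of the patent.

import random

def split_dataset(video_ids, train_ratio=0.8, seed=42):
    """Randomly split video ids into training and validation subsets."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n_train = int(len(ids) * train_ratio)     # 80% by default
    return ids[:n_train], ids[n_train:]

# Example: 1967 MSVD clips -> 1573 training / 394 validation samples
train_ids, val_ids = split_dataset(range(1967))
print(len(train_ids), len(val_ids))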
1.2) establishing a semantic dictionary,
from the sample labels of the training and validation sets, all words are sorted by frequency from high to low and the top m words are selected to form the semantic concept set, where m is an empirical value, preferably 80% to 85% of the total number of words;
each word is assigned an integer index, and four additional tokens are then added, namely the start token < bos >, the end token < eos >, the padding token < pad > and the replacement token < unk >, giving m+4 integer indices in total and forming the semantic dictionary vocab = {1, 2, ..., m+4}. The labeled sentences are minimally preprocessed: punctuation is deleted, < bos > and < eos > are added at the beginning and end of each sentence respectively, words not contained in the semantic dictionary are replaced with < unk >, and the sentence length is fixed to L, where L is an empirical value, preferably chosen from [20, 40] according to the statistics of the video description sentence lengths; if a sentence is too long the excess words are deleted, and if it is too short it is padded to the fixed length with < pad >;
assuming the total number of training samples is N and i denotes the i-th video sample, i = 1, 2, ..., N, the dataset samples are labeled with the established semantic dictionary as follows:
Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,L}], i = 1, 2, ..., N (1)
where y_{i,t} is the integer index in the semantic dictionary of the t-th word of the i-th video's label, t = 1, 2, ..., L.
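A minimal Python sketch of how the semantic dictionary and the label preprocessing of step 1.2) could be realized; the token names follow the patent, while helper names such as build_vocab and encode_sentence are illustrative assumptions.

from collections import Counter
import string

SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]

def build_vocab(sentences, m):
    """Keep the m most frequent words plus the four special tokens."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    words = [w for w, _ in counts.most_common(m)]
    return {w: i for i, w in enumerate(SPECIALS + words)}  # m + 4 entries

def encode_sentence(sentence, vocab, L=30):
    """Minimal preprocessing: strip punctuation, add <bos>/<eos>, map to indices,
    then truncate or pad to the fixed length L."""
    clean = sentence.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = ["<bos>"] + clean.split() + ["<eos>"]
    ids = [vocab.get(w, vocab["<unk>"]) for w in tokens][:L]
    ids += [vocab["<pad>"]] * (L - len(ids))
    return ids

captions = ["A man is playing a guitar.", "A dog runs on the grass."]
vocab = build_vocab(captions, m=20)
print(encode_sentence(captions[0], vocab, L=12))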
Step 2, constructing a multi-mode feature fusion network to obtain the aggregation features,
2.1) extracting multi-modal features of the video,
firstly, the input video (during model training the input video is a training sample video given in step 1; after training it is the video whose text description is required) is sampled at equal intervals to obtain a preprocessed video of length T frames, where T is an empirical value chosen according to the content to be described, preferably T ∈ [16, 64]; then, the two-dimensional RGB features and target region features of the preprocessed video are extracted: the M1-dimensional (M1 = 1536) feature vector output by the last average pooling layer of the two-dimensional convolutional network Inception-ResNet-V2 (IRV2, prior art) describes the two-dimensional RGB feature V_r of the video, and the M2-dimensional (M2 = 1024) feature vectors output by the RoI Pooling layer of the Faster R-CNN network (prior art) serve as the multiple target region features V_b of the video; then, every 16-frame video sequence is taken as a clip, with 8 frames repeated between adjacent clips, and the M3-dimensional (M3 = 2048) feature vector output by the fully connected FC6 layer of the three-dimensional convolutional network C3D (prior art) describes the three-dimensional temporal feature V_m of the video; together these constitute the visual multi-modal features of each video, expressed as:
V={Vr,Vm,Vb} (2)
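The equal-interval sampling of T frames and the 16-frame clips with an 8-frame overlap described in step 2.1) could be indexed as in the Python sketch below; the concrete feature extractors (IRV2, Faster R-CNN, C3D) are assumed to be available separately and are not reproduced here.

import numpy as np

def sample_frame_indices(num_frames, T=32):
    """Pick T frame indices at equal intervals over the whole video."""
    return np.linspace(0, num_frames - 1, num=T, dtype=int)

def clip_indices(T=32, clip_len=16, stride=8):
    """16-frame clips repeated every 8 frames, for the C3D temporal features."""
    starts = range(0, T - clip_len + 1, stride)
    return [list(range(s, s + clip_len)) for s in starts]

frames = sample_frame_indices(num_frames=300, T=32)   # indices into the raw video
clips = clip_indices(T=32)                            # e.g. [0..15], [8..23], [16..31]
print(frames[:5], len(clips))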
2.2) acquiring the aggregation characteristics by utilizing the multi-modal characteristic fusion model,
2.2.1) calculating the scene representation feature V_rm' with the self-attention module,
as shown in FIG. 1, under the framework of the multi-modal feature fusion model, the two-dimensional RGB feature V_r and the three-dimensional temporal feature V_m obtained in step 2.1) are concatenated into a global feature V_rm, and three different linear transformations are applied to obtain the query vector Q_rm, key vector K_rm and value vector V_rm of the same dimension, expressed as:
{Q_rm, K_rm, V_rm} = {V_rm·W^Q_rm, V_rm·W^K_rm, V_rm·W^V_rm} (3)
where W^Q_rm, W^K_rm, W^V_rm are parameters learned during training;
2.2.2) as shown in FIG. 2, the Self-Attention module first computes the dot product between the query vector Q_rm and the key vector K_rm; to prevent the result from becoming too large it is divided by the scale √d_Qrm, where d_Qrm is the dimension of the matrix Q_rm; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix V_rm to obtain the weighted scene representation feature V_rm', expressed as:
V_rm' = softmax((Q_rm·K_rm^T)/√d_Qrm)·V_rm (4)
where d_Qrm is the dimension of the matrix Q_rm;
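A short PyTorch sketch of the self-attention computation of formulas (3) and (4), assuming V_rm is a (sequence length × d_model) tensor; the module name SceneSelfAttention and the dimensions are illustrative assumptions.

import math
import torch
import torch.nn as nn

class SceneSelfAttention(nn.Module):
    """Formulas (3)-(4): project V_rm to Q, K, V, then scaled dot-product attention."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W^Q_rm
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W^K_rm
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W^V_rm

    def forward(self, v_rm):                 # v_rm: (seq_len, d_model)
        q, k, v = self.w_q(v_rm), self.w_k(v_rm), self.w_v(v_rm)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v   # weighted scene feature V_rm'

attn = SceneSelfAttention(d_model=512)
v_rm = torch.randn(32, 512)                  # concatenated [V_r; V_m] sequence
print(attn(v_rm).shape)                      # torch.Size([32, 512])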
2.2.3) constructing the MA module to calculate the motion feature V_m',
the three-dimensional temporal feature V_m obtained in step 2.1) is combined with the code E_pos obtained by embedding the verbs marked by part-of-speech (pos) tags in the semantic dictionary; the MA module, through two linear transformation layers and a ReLU layer, generates a motion feature V_m' that focuses on the interaction relationships between objects, expressed as:
V_m' = W_m^a·ReLU(W_m·[V_m; E_pos]) + b_m (5)
where W_m, W_m^a, b_m are the linear transformation weights learned during training and ReLU is the activation function;
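A sketch of the MA module of step 2.2.3) as described above (two linear layers and a ReLU applied to the concatenation of the temporal feature and the verb embedding); the exact layer sizes and the concatenation order are assumptions, since the patent names only the weights W_m, W_m^a, b_m.

import torch
import torch.nn as nn

class MAModule(nn.Module):
    """Fuse the C3D temporal feature V_m with the verb (pos-tagged) embedding E_pos."""
    def __init__(self, d_motion, d_verb, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_motion + d_verb, d_out)   # W_m
        self.fc2 = nn.Linear(d_out, d_out)               # W_m^a, bias plays the role of b_m

    def forward(self, v_m, e_pos):
        x = torch.cat([v_m, e_pos], dim=-1)              # [V_m; E_pos]
        return self.fc2(torch.relu(self.fc1(x)))         # motion feature V_m'

ma = MAModule(d_motion=2048, d_verb=300, d_out=512)
v_m_prime = ma(torch.randn(1, 2048), torch.randn(1, 300))
print(v_m_prime.shape)                                   # torch.Size([1, 512])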
2.2.4) constructing the dynamic attention module to obtain the aggregated feature V',
the scene representation feature V_rm' obtained in steps 2.2.1) and 2.2.2), the motion feature V_m' and the two-dimensional RGB feature V_r are fused by the dynamic attention module to obtain the final aggregated feature V'; first the dot product of V_rm' and V_m' is computed and divided by the scale √d_Vrm', where d_Vrm' is the dimension of the matrix V_rm'; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix V_r to obtain the aggregated feature V', expressed as:
V' = softmax((V_rm'·V_m'^T)/√d_Vrm')·V_r (6)
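The dynamic attention fusion of formula (6) can be sketched as below (PyTorch), assuming V_rm', V_m' and V_r have already been brought to compatible shapes; this is an interpretation of the formula, not the patent's reference code.

import math
import torch

def dynamic_attention(v_rm_p, v_m_p, v_r):
    """Formula (6): V' = softmax(V_rm' · V_m'^T / sqrt(d)) · V_r."""
    d = v_rm_p.size(-1)
    scores = v_rm_p @ v_m_p.transpose(-2, -1) / math.sqrt(d)
    return torch.softmax(scores, dim=-1) @ v_r           # aggregated feature V'

v_rm_p = torch.randn(32, 512)   # scene representation feature
v_m_p = torch.randn(32, 512)    # motion feature
v_r = torch.randn(32, 512)      # 2D RGB feature (projected to the common size)
print(dynamic_attention(v_rm_p, v_m_p, v_r).shape)       # torch.Size([32, 512])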
Step 3, obtaining the subject, predicate and object of the description sentence with the syntax-aware video-action encoder,
the syntax-aware video-action encoder adopts the basic structure of the SAAT network model (the SAAT network model is prior art); it belongs to the syntax-aware action prediction module and is divided into a component extractor-encoder (Cxe) and a component extractor-decoder (Cxd);
as shown in FIG. 3, the component extractor-encoder (Cxe) and the component extractor-decoder (Cxd) are identical in structure; each consists of an Embedding layer and a stack of three identical self-attention-based Encoding layers;
as shown in FIG. 4, each Encoding layer consists of a self-attention mechanism (Self-Attention), layer normalization (LayerNorm) and a nonlinear feed-forward network (FFN); the specific process is as follows:
3.1) obtaining the scene semantic feature V_bs with the component extractor-encoder (Cxe),
the target region features V_b = {V_b^k}, k = 1, 2, ..., K, of the video obtained in step 2.1) and the target position code R_l are used as the input sequence of Cxe and encoded into the scene semantic feature V_bs, where K is the number of target regions;
3.1.1) the target region features V_b of the video and the target position code R_l are spliced in a cascaded manner to obtain the cascade feature R_c, expressed as:
R_c = [V_b; R_l] (7)
where R_l^k = [x_k/w_f, y_k/h_f, w_k/w_f, h_k/h_f], k = 1, 2, ..., K, encodes the center coordinates and the width and height of the k-th target, and w_f, h_f are the width and height of the video frame respectively;
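The position code R_l of step 3.1.1), i.e. the center coordinates and size of each detected region normalized by the frame size, might be built as in this Python sketch; the exact normalization is an assumption reconstructed from the text above.

import numpy as np

def position_code(boxes, frame_w, frame_h):
    """boxes: (K, 4) array of [x1, y1, x2, y2] region boxes from Faster R-CNN.
    Returns (K, 4) codes [cx/w_f, cy/h_f, w/w_f, h/h_f]."""
    x1, y1, x2, y2 = boxes.T
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    return np.stack([cx / frame_w, cy / frame_h, w / frame_w, h / frame_h], axis=1)

boxes = np.array([[10, 20, 110, 220], [50, 60, 150, 160]], dtype=float)
r_l = position_code(boxes, frame_w=320, frame_h=240)
# Cascade feature R_c = [V_b; R_l]: concatenate with the region features along the last axis
v_b = np.random.randn(2, 1024)
r_c = np.concatenate([v_b, r_l], axis=1)    # shape (2, 1028)
print(r_l.shape, r_c.shape)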
3.1.2) the cascade feature R_c obtained in step 3.1.1) passes through the Embedding layer, and three different linear transformations are applied to obtain mapping matrices Q_c, K_c, V_c of the same dimension, expressed as:
{Q_c, K_c, V_c} = {R_c·W^Q, R_c·W^K, R_c·W^V} (8)
where W^Q, W^K, W^V are all parameters learned during training;
3.1.3) the self-attention-based Encoding layer (see FIG. 2) computes the dot product between Q_c and K_c; to prevent the result from becoming too large it is divided by the scale √d_Q, where d_Q is the dimension of the matrix Q_c; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix V_c to obtain the weighted scene semantic feature V_bs, expressed as:
V_bs = softmax((Q_c·K_c^T)/√d_Q)·V_c (9)
3.2) obtaining the video action, i.e. the predicate of the text description sentence, with the component extractor-decoder (Cxd),
the multi-modal features of the video and the scene semantic feature V_bs are used as the input of the component extractor-decoder (Cxd), and the predicate, i.e. the action in the video, is decoded; the specific process is:
3.2.1) the global RGB feature V_r is set as the Query for predicting the subject and the scene semantic feature V_bs as the Key-Value; the attention mechanism (formula (10)) yields the feature code E_s of the subject, the softmax layer (formula (11)) yields the word probability matrix p_θ(word|V_bs, V_rs) of the subject, and the argmax function (formula (12)) selects the word with the maximum probability, which is the corresponding subject s, expressed as:
E_s = f_att(V_rs, V_bs, V_bs) (10)
p_θ(word|V_bs, V_rs) = softmax(W_s^T·E_s) (11)
s = argmax(p_θ(word|V_bs, V_rs)) (12)
where V_rs denotes the global RGB feature V_r after linear projection, and W_v, W_s are parameters learned during training;
3.2.2) the feature code E_s of the subject obtained in step 3.2.1) is used as the Query for predicting the action (i.e. the predicate) of the video and the temporal feature V_m of the video as the Key-Value; the feature code E_a of the predicate is computed with formula (13), the word probability matrix p_θ(word|s, V_ma) of the predicate with formula (14), and the corresponding predicate a with formula (15), expressed as:
E_a = f_att(E_s, V_ma, V_ma) (13)
p_θ(word|s, V_ma) = softmax(W_a^T·E_a) (14)
a = argmax(p_θ(word|s, V_ma)) (15)
where V_ma denotes the temporal feature V_m after linear projection, and W_m, W_a are parameters learned during training;
3.2.3) the feature code E_a of the predicate is used as the Query for predicting the object and the scene semantic feature V_bs of the video as the Key-Value; the feature code E_o of the object is computed with formula (16), the word probability matrix p_θ(word|a, V_bs) of the object with formula (17), and the corresponding object o with formula (18), expressed as:
E_o = f_att(E_a, V_bs, V_bs) (16)
p_θ(word|a, V_bs) = softmax(W_o^T·E_o) (17)
o = argmax(p_θ(word|a, V_bs)) (18)
where W_o is a parameter learned during training;
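A condensed PyTorch sketch of the subject → predicate → object chain of formulas (10)-(18); f_att is the same scaled dot-product attention used above, and the projection layers, tensor shapes and vocabulary size are illustrative assumptions.

import math
import torch
import torch.nn as nn

def f_att(q, k, v):
    """Scaled dot-product attention, as in formulas (10), (13), (16)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

class SVOPredictor(nn.Module):
    def __init__(self, d, vocab_size):
        super().__init__()
        self.w_s = nn.Linear(d, vocab_size, bias=False)   # W_s
        self.w_a = nn.Linear(d, vocab_size, bias=False)   # W_a
        self.w_o = nn.Linear(d, vocab_size, bias=False)   # W_o

    def forward(self, v_rs, v_ma, v_bs):
        e_s = f_att(v_rs, v_bs, v_bs)                     # (10) subject code
        s = self.w_s(e_s).softmax(-1).argmax(-1)          # (11)-(12) subject word
        e_a = f_att(e_s, v_ma, v_ma)                      # (13) predicate code
        a = self.w_a(e_a).softmax(-1).argmax(-1)          # (14)-(15) predicate word
        e_o = f_att(e_a, v_bs, v_bs)                      # (16) object code
        o = self.w_o(e_o).softmax(-1).argmax(-1)          # (17)-(18) object word
        return (s, a, o), e_a                             # e_a guides the decoder

svo = SVOPredictor(d=512, vocab_size=10000)
(s, a, o), e_a = svo(torch.randn(1, 512), torch.randn(20, 512), torch.randn(36, 512))
print(s.shape, a.shape, o.shape, e_a.shape)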
Step 4, generating the description text of the video with the motion-guided decoder,
as shown in FIG. 5, the decoder model is used; the specific process is as follows:
4.1) calculating the attention weight β_t of the predicate feature code E_a with the Attention module,
the feature code E_a of the predicate computed by formula (13) in step 3.2.2) is spliced with the output h_{t-1} of the LSTMs at time t-1 (h_0 represents the start symbol <bos>) to obtain the embedding vector E_wordt; the attention distribution β_t output by the Attention module at time t is expressed as:
β_t = softmax(W_β·tanh(W_h·E_wordt) + b_β)
where W_β, W_h, b_β are parameters learned during training;
4.2) generating the word prediction result h_t with the LSTMs (a publicly known technique), expressed as:
h_t = LSTM([β_t·E_a; W_Lv·V'], h_{t-1})
where W_Lv is a parameter learned during training;
4.3) calculating the predicted word probability,
the prediction result is passed through a softmax function to obtain the word prediction probability p_θ(word_t) at time t; the word with the maximum probability is the predicted word at the current time, expressed as:
p_θ(word_t) = softmax(W_w·h_t) (24)
word_t = argmax(p_θ(word_t)) (25)
where W_w is a parameter learned during training;
4.4) let t = t + 1 and repeat steps 4.1) to 4.3) until the predicted word_t is the end token <eos>;
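A greedy decoding loop in the spirit of steps 4.1)-4.4), sketched with a standard LSTMCell; the way the predicate code E_a, the aggregated feature V' and the previous hidden state are combined is an assumption, since the corresponding formulas are only partially reproduced in the text above.

import torch
import torch.nn as nn

class MotionGuidedDecoder(nn.Module):
    def __init__(self, d, vocab_size):
        super().__init__()
        self.attn = nn.Linear(2 * d, 1)            # scores E_wordt = [E_a; h_{t-1}]
        self.lstm = nn.LSTMCell(2 * d, d)          # input: attended E_a and projected V'
        self.w_lv = nn.Linear(d, d, bias=False)    # W_Lv
        self.w_w = nn.Linear(d, vocab_size)        # W_w

    def decode(self, e_a, v_agg, bos_id, eos_id, max_len=20):
        h = torch.zeros(1, e_a.size(-1))           # h_0 stands for <bos>
        c = torch.zeros_like(h)
        words = [bos_id]
        for _ in range(max_len):
            beta = torch.sigmoid(self.attn(torch.cat([e_a, h], dim=-1)))  # attention weight
            x = torch.cat([beta * e_a, self.w_lv(v_agg)], dim=-1)
            h, c = self.lstm(x, (h, c))
            word = self.w_w(h).softmax(-1).argmax(-1).item()   # formulas (24)-(25)
            words.append(word)
            if word == eos_id:                                  # stop at <eos>
                break
        return words

dec = MotionGuidedDecoder(d=512, vocab_size=10000)
print(dec.decode(torch.randn(1, 512), torch.randn(1, 512), bos_id=1, eos_id=2))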
Step 5, training the video text generation network model,
all training samples given in step 1 are input into the video text generation network model, steps 2 to 4 are repeated, and training uses the standard cross-entropy loss; the total loss function of the video text generation network model, which is minimized, combines the loss L_s of the SAAT module and the loss L_c of the video description text generator, and is expressed as:
L(θ) = L_c + λ·L_s (26)
L_c = -Σ_{i=1}^{N} Σ_{t=1}^{L} log p_θ(word_t^(i)* | word_{<t}^(i)*, V_r, V_m, V_b)
L_s = -Σ_{i=1}^{N} log p_θ((s, a, o)*^(i) | V_r, V_m, V_b)
where word_t^(i) is the t-th word output for the i-th training sample, word_t^(i)* is the corresponding label word given in step 1, (s, a, o)^(i) are the subject, predicate and object output for the i-th training sample, (s, a, o)*^(i) are the subject, predicate and object of the i-th training sample's label, and V_b, V_r, V_m are respectively the target region features, two-dimensional RGB features and three-dimensional temporal features of the video extracted in step 2.1);
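The combined objective of formula (26), i.e. the captioning cross-entropy plus the SAAT subject/predicate/object losses weighted by λ, could be computed as in the following PyTorch sketch; the tensor shapes and the use of three separate cross-entropy terms for s, a, o are assumptions consistent with the description.

import torch
import torch.nn.functional as F

def total_loss(word_logits, word_targets, svo_logits, svo_targets,
               lam=1.0, pad_id=0):
    """L(theta) = L_c + lambda * L_s  (formula (26)).

    word_logits:  (batch, L, vocab)  decoder outputs for each time step
    word_targets: (batch, L)         labeled word indices, <pad> ignored
    svo_logits:   3 tensors of shape (batch, vocab) for subject, predicate, object
    svo_targets:  (batch, 3)         labeled subject/predicate/object indices
    """
    l_c = F.cross_entropy(word_logits.flatten(0, 1), word_targets.flatten(),
                          ignore_index=pad_id)
    l_s = sum(F.cross_entropy(svo_logits[j], svo_targets[:, j]) for j in range(3))
    return l_c + lam * l_s

word_logits = torch.randn(4, 20, 10000)
word_targets = torch.randint(1, 10000, (4, 20))
svo_logits = [torch.randn(4, 10000) for _ in range(3)]
svo_targets = torch.randint(1, 10000, (4, 3))
print(total_loss(word_logits, word_targets, svo_logits, svo_targets).item())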
Step 6, generating the text description sentence of the video,
after the network training of steps 1 to 5 is completed, all parameters of the video text generation network model are obtained; the video to be described is taken as the input video and passed through steps 2 to 4 to obtain its text description.
The constructed video text generation network model can thus generate the video text description from the video's multi-modal features under the guidance of the video action.

Claims (6)

1. A multi-modal feature fusion video description text generation method, implemented according to the following steps:
step 1, establishing the dataset, the validation set and the semantic dictionary;
step 2, constructing the multi-modal feature fusion network to obtain aggregated features;
step 3, obtaining the subject, predicate and object of the description sentence with a syntax-aware video-action encoder;
step 4, generating the description text of the video with the motion-guided decoder;
step 5, training the video text generation network model;
step 6, generating the text description sentence of the video,
after the network training of steps 1 to 5 is completed, all parameters of the video text generation network model are obtained; the video to be described is taken as the input video and, after steps 2 to 4, its text description is obtained.
2. The multi-modal feature fusion video description text generation method according to claim 1, wherein the specific process of step 1 is:
1.1) constructing the dataset and the validation set,
public datasets are generally recommended, except in special cases where a self-made, self-labeled dataset is needed;
the video description datasets MSR-VTT and MSVD are selected, 80% of all data are taken as training set samples, and the remaining 20% of the samples are used as validation set samples;
1.2) establishing the semantic dictionary,
from the sample labels of the training and validation sets, all words are sorted by frequency from high to low and the top m words are selected to form the semantic concept set;
each word is assigned an integer index, and four additional tokens are then added, namely the start token < bos >, the end token < eos >, the padding token < pad > and the replacement token < unk >, giving m+4 integer indices in total and forming the semantic dictionary vocab = {1, 2, ..., m+4}; the labeled sentences are minimally preprocessed: punctuation is deleted, < bos > and < eos > are added at the beginning and end of each sentence respectively, words not contained in the semantic dictionary are replaced with < unk >, and the sentence length is fixed to L; if a sentence is too long the excess words are deleted, and if it is too short it is padded to the fixed length with < pad >;
assuming the total number of training samples is N and i denotes the i-th video sample, i = 1, 2, ..., N, the dataset samples are labeled with the established semantic dictionary as follows:
Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,L}], i = 1, 2, ..., N (1)
where y_{i,t} is the integer index in the semantic dictionary of the t-th word of the i-th video's label, t = 1, 2, ..., L.
3. The multi-modal feature fusion video description text generation method according to claim 1, wherein the specific process of step 2 is:
2.1) extracting the multi-modal features of the video,
three network structures are jointly used to extract the multi-modal features of the video; first, the input video is sampled at equal intervals to obtain a preprocessed video of length T frames; then, the two-dimensional RGB features and target region features of the preprocessed video are extracted: the M1-dimensional feature vector output by the last average pooling layer of the two-dimensional convolutional network Inception-ResNet-V2 describes the two-dimensional RGB feature V_r of the video, and the M2-dimensional feature vectors output by the RoI Pooling layer of the Faster R-CNN network serve as the multiple target region features V_b of the video; then, every 16-frame video sequence is taken as a clip, with 8 frames repeated between adjacent clips, and the M3-dimensional feature vector output by the fully connected FC6 layer of the three-dimensional convolutional network C3D describes the three-dimensional temporal feature V_m of the video; together these constitute the visual multi-modal features of each video, expressed as:
V = {V_r, V_m, V_b} (2)
2.2) obtaining the aggregated features with the multi-modal feature fusion model,
2.2.1) calculating the scene representation feature V_rm' with the self-attention module,
under the multi-modal feature fusion model, the two-dimensional RGB feature V_r and the three-dimensional temporal feature V_m obtained in step 2.1) are concatenated into a global feature V_rm, and three different linear transformations are applied to obtain the query vector Q_rm, key vector K_rm and value vector V_rm of the same dimension, expressed as:
{Q_rm, K_rm, V_rm} = {V_rm·W^Q_rm, V_rm·W^K_rm, V_rm·W^V_rm} (3)
where W^Q_rm, W^K_rm, W^V_rm are parameters learned during training;
2.2.2) the Self-Attention module first computes the dot product between the query vector Q_rm and the key vector K_rm; to prevent the result from becoming too large it is divided by the scale √d_Qrm, where d_Qrm is the dimension of the matrix Q_rm; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix V_rm to obtain the weighted scene representation feature V_rm', expressed as:
V_rm' = softmax((Q_rm·K_rm^T)/√d_Qrm)·V_rm (4)
where d_Qrm is the dimension of the matrix Q_rm;
2.2.3) constructing the MA module to calculate the motion feature V_m',
the three-dimensional temporal feature V_m obtained in step 2.1) is combined with the code E_pos obtained by embedding the verbs marked by part-of-speech tags in the semantic dictionary; the MA module, through two linear transformation layers and a ReLU layer, generates a motion feature V_m' that focuses on the interaction relationships between objects, expressed as:
V_m' = W_m^a·ReLU(W_m·[V_m; E_pos]) + b_m (5)
where W_m, W_m^a, b_m are the linear transformation weights learned during training and ReLU is the activation function;
2.2.4) constructing the dynamic attention module to obtain the aggregated feature V',
the scene representation feature V_rm' obtained in steps 2.2.1) and 2.2.2), the motion feature V_m' and the two-dimensional RGB feature V_r are fused by the dynamic attention module to obtain the final aggregated feature V'; first the dot product of V_rm' and V_m' is computed and divided by the scale √d_Vrm', where d_Vrm' is the dimension of the matrix V_rm'; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix V_r to obtain the aggregated feature V', expressed as:
V' = softmax((V_rm'·V_m'^T)/√d_Vrm')·V_r (6)
4. The multi-modal feature fusion video description text generation method according to claim 1, wherein the specific process of step 3 is:
the syntax-aware video-action encoder adopts the basic structure of the SAAT network model and is divided into a component extractor-encoder and a component extractor-decoder;
the component extractor-encoder and the component extractor-decoder are identical in structure; each consists of an Embedding layer and a stack of three identical self-attention-based Encoding layers;
each Encoding layer consists of a self-attention mechanism, layer normalization and a nonlinear feed-forward network; the specific process is as follows:
3.1) obtaining the scene semantic feature V_bs with the component extractor-encoder,
the target region features V_b = {V_b^k}, k = 1, 2, ..., K, of the video obtained in step 2.1) and the target position code R_l are used as the input sequence of the component extractor-encoder and encoded into the scene semantic feature V_bs, where K is the number of target regions;
3.1.1) the target region features V_b of the video and the target position code R_l are spliced in a cascaded manner to obtain the cascade feature R_c, expressed as:
R_c = [V_b; R_l] (7)
where R_l^k = [x_k/w_f, y_k/h_f, w_k/w_f, h_k/h_f] encodes the center coordinates and the width and height of the k-th target, and w_f, h_f are the width and height of the video frame respectively;
3.1.2) the cascade feature R_c obtained in step 3.1.1) passes through the Embedding layer, and three different linear transformations are applied to obtain mapping matrices Q_c, K_c, V_c of the same dimension, expressed as:
{Q_c, K_c, V_c} = {R_c·W^Q, R_c·W^K, R_c·W^V} (8)
where W^Q, W^K, W^V are all parameters learned during training;
3.1.3) the self-attention-based Encoding layer computes the dot product between Q_c and K_c; to prevent the result from becoming too large it is divided by the scale √d_Q, where d_Q is the dimension of the matrix Q_c; the result is normalized into a probability distribution by a softmax operation and multiplied by the matrix V_c to obtain the weighted scene semantic feature V_bs, expressed as:
V_bs = softmax((Q_c·K_c^T)/√d_Q)·V_c (9)
3.2) obtaining the video action, i.e. the predicate of the text description sentence,
the multi-modal features of the video and the scene semantic feature V_bs are used as the input of the component extractor-decoder, and the predicate, i.e. the action in the video, is decoded; the specific process is:
3.2.1) the global RGB feature V_r is set as the Query for predicting the subject and the scene semantic feature V_bs as the Key-Value; the feature code E_s of the subject is obtained with the attention mechanism, the word probability matrix p_θ(word|V_bs, V_rs) of the subject is obtained through the softmax layer, and the word with the maximum probability, i.e. the corresponding subject s, is obtained with the argmax function, expressed as:
E_s = f_att(V_rs, V_bs, V_bs) (10)
p_θ(word|V_bs, V_rs) = softmax(W_s^T·E_s) (11)
s = argmax(p_θ(word|V_bs, V_rs)) (12)
where V_rs denotes the global RGB feature V_r after linear projection, and W_v, W_s are parameters learned during training;
3.2.2) the feature code E_s of the subject obtained in step 3.2.1) is used as the Query for predicting the action of the video and the temporal feature V_m of the video as the Key-Value; the feature code E_a of the predicate is computed with formula (13), the word probability matrix p_θ(word|s, V_ma) of the predicate with formula (14), and the corresponding predicate a with formula (15), expressed as:
E_a = f_att(E_s, V_ma, V_ma) (13)
p_θ(word|s, V_ma) = softmax(W_a^T·E_a) (14)
a = argmax(p_θ(word|s, V_ma)) (15)
where V_ma denotes the temporal feature V_m after linear projection, and W_m, W_a are parameters learned during training;
3.2.3) the feature code E_a of the predicate is used as the Query for predicting the object and the scene semantic feature V_bs of the video as the Key-Value; the feature code E_o of the object is computed with formula (16), the word probability matrix p_θ(word|a, V_bs) of the object with formula (17), and the corresponding object o with formula (18), expressed as:
E_o = f_att(E_a, V_bs, V_bs) (16)
p_θ(word|a, V_bs) = softmax(W_o^T·E_o) (17)
o = argmax(p_θ(word|a, V_bs)) (18)
where W_o is a parameter learned during training.
5. The multi-modal feature fusion video description text generation method according to claim 1, wherein the specific process of step 4 is:
4.1) calculating the attention weight β_t of the predicate feature code E_a with the Attention module,
the feature code E_a of the predicate computed by formula (13) in step 3.2.2) is spliced with the output h_{t-1} of the LSTMs at time t-1 (h_0 represents the start symbol <bos>) to obtain the embedding vector E_wordt; the attention distribution β_t output by the Attention module at time t is expressed as:
β_t = softmax(W_β·tanh(W_h·E_wordt) + b_β)
where W_β, W_h, b_β are parameters learned during training;
4.2) generating the word prediction result h_t with the LSTMs, expressed as:
h_t = LSTM([β_t·E_a; W_Lv·V'], h_{t-1})
where W_Lv is a parameter learned during training;
4.3) calculating the predicted word probability,
the prediction result is passed through a softmax function to obtain the word prediction probability p_θ(word_t) at time t; the word with the maximum probability is the predicted word at the current time, expressed as:
p_θ(word_t) = softmax(W_w·h_t) (24)
word_t = argmax(p_θ(word_t)) (25)
where W_w is a parameter learned during training;
4.4) let t = t + 1 and repeat steps 4.1) to 4.3) until the predicted word_t is the end token <eos>.
6. The multi-modal feature fusion video description text generation method according to claim 1, wherein the specific process of step 5 is:
all training samples given in step 1 are input into the video text generation network model, steps 2 to 4 are repeated, and training uses the standard cross-entropy loss; the total loss function of the video text generation network model, which is minimized, combines the loss L_s of the SAAT module and the loss L_c of the video description text generator, and is expressed as:
L(θ) = L_c + λ·L_s (26)
L_c = -Σ_{i=1}^{N} Σ_{t=1}^{L} log p_θ(word_t^(i)* | word_{<t}^(i)*, V_r, V_m, V_b)
L_s = -Σ_{i=1}^{N} log p_θ((s, a, o)*^(i) | V_r, V_m, V_b)
where word_t^(i) is the t-th word output for the i-th training sample, word_t^(i)* is the corresponding label word given in step 1, (s, a, o)^(i) are the subject, predicate and object output for the i-th training sample, (s, a, o)*^(i) are the subject, predicate and object of the i-th training sample's label, and V_b, V_r, V_m are respectively the target region features, two-dimensional RGB features and three-dimensional temporal features of the video extracted in step 2.1).
CN202110975443.2A 2021-08-24 2021-08-24 Multi-mode feature fusion video description text generation method Withdrawn CN113806587A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110975443.2A CN113806587A (en) 2021-08-24 2021-08-24 Multi-mode feature fusion video description text generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110975443.2A CN113806587A (en) 2021-08-24 2021-08-24 Multi-mode feature fusion video description text generation method

Publications (1)

Publication Number Publication Date
CN113806587A true CN113806587A (en) 2021-12-17

Family

ID=78941767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110975443.2A Withdrawn CN113806587A (en) 2021-08-24 2021-08-24 Multi-mode feature fusion video description text generation method

Country Status (1)

Country Link
CN (1) CN113806587A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398961A (en) * 2021-12-28 2022-04-26 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN114398961B (en) * 2021-12-28 2023-05-05 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN114387430A (en) * 2022-01-11 2022-04-22 平安科技(深圳)有限公司 Image description generation method, device, equipment and medium based on artificial intelligence
CN114387430B (en) * 2022-01-11 2024-05-28 平安科技(深圳)有限公司 Image description generation method, device, equipment and medium based on artificial intelligence
CN115175006A (en) * 2022-06-09 2022-10-11 中国科学院大学 Video description method and system based on hierarchical modularization
CN115496134B (en) * 2022-09-14 2023-10-03 北京联合大学 Traffic scene video description generation method and device based on multi-mode feature fusion
CN115496134A (en) * 2022-09-14 2022-12-20 北京联合大学 Traffic scene video description generation method and device based on multi-modal feature fusion
CN116193275A (en) * 2022-12-15 2023-05-30 荣耀终端有限公司 Video processing method and related equipment
CN116193275B (en) * 2022-12-15 2023-10-20 荣耀终端有限公司 Video processing method and related equipment
CN116128043A (en) * 2023-04-17 2023-05-16 中国科学技术大学 Training method of video scene boundary detection model and scene boundary detection method
CN116821417A (en) * 2023-08-28 2023-09-29 中国科学院自动化研究所 Video tag sequence generation method and device
CN116821417B (en) * 2023-08-28 2023-12-12 中国科学院自动化研究所 Video tag sequence generation method and device
CN116932803A (en) * 2023-09-13 2023-10-24 浪潮(北京)电子信息产业有限公司 Data set generation method and training method based on multi-mode pre-training model
CN116932803B (en) * 2023-09-13 2024-01-26 浪潮(北京)电子信息产业有限公司 Data set generation method and training method based on multi-mode pre-training model
CN117079081A (en) * 2023-10-16 2023-11-17 山东海博科技信息系统股份有限公司 Multi-mode video text processing model training method and system
CN117079081B (en) * 2023-10-16 2024-01-26 山东海博科技信息系统股份有限公司 Multi-mode video text processing model training method and system
CN117876941A (en) * 2024-03-08 2024-04-12 杭州阿里云飞天信息技术有限公司 Target multi-mode model system, construction method, video processing model training method and video processing method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211217