CN117789076A - Video description generation method and system oriented to semantic characteristic selection and attention fusion - Google Patents
Video description generation method and system oriented to semantic characteristic selection and attention fusion
- Publication number: CN117789076A
- Application number: CN202311740703.3A
- Authority: CN (China)
- Prior art keywords: video, features, description, semantic, attention
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention relates to the technical field of natural language processing and computer vision, and in particular to a video description generation method and system for semantic characteristic selection and attention fusion. The method maps image features of a video to be described to the text space of the description sentence and acquires semantic object features in the video to be described; extracts action features in the video to be described according to the semantic object features, and fuses feature information of different modalities in the video to be described based on an attention mechanism, wherein the different modality feature information at least comprises the semantic object features and the action features; and, combining the embedding of the previous word in the description sentence, decodes the fused feature information step by step into the natural language description corresponding to the video to be described by utilizing the conditional probability distribution. The invention can enhance the semantic relevance between the textual description of entities in the video and the video content, effectively alleviate the semantic deviation between the visual space and the semantic space, and improve the output accuracy of the action encoder, and it has good application prospects in the field of automatic video description generation.
Description
Technical Field
The invention relates to the technical field of natural language processing and computer vision, in particular to a video description generation method and system for semantic property selection and attention fusion.
Background
The video description task, oriented to video understanding, needs to extract meaningful high-level semantic information such as objects, scenes and actions from the video and, on that basis, generate text with clear logic and fluent semantics. Video description generation is a cross-domain problem involving computer vision and natural language processing; it converts video content into textual or semantic representations and can help people retrieve and manage massive video resources more quickly and efficiently. Given the spatiotemporal nature of video data and the variety and complexity of its semantics, a major challenge is how to efficiently combine the visual information of the video with natural language to generate accurate, fluent descriptions. How to understand video content and describe it has become a research hotspot in the current video understanding field.
Video description generation methods are roughly classified into three types: methods based on manual feature extraction, video description methods based on deep learning, and generation methods based on reinforcement learning. Methods based on manual feature extraction rely on templates; the grammar of the generated description sentences is accurate, but the fixed templates limit the descriptions, so the generated sentences are inflexible and the content is not rich. In deep learning based approaches, a typical research framework describes video using the "encoder-decoder" paradigm. In reinforcement learning based methods, taking GAN networks as an example, the generator and the discriminator are trained so that the generator can generate new data similar to real data, which typically requires training the GAN model with additional data sets. These three types of methods still face several difficulties in video description generation: 1) There are a large number of static semantic objects in a video; many works do not filter these semantic objects but instead analyze the video content at the video-frame level, lacking fine-grained semantic object selection. 2) Because of the motion characteristics of video, some research works focus on capturing action semantics, breaking the limitation of the original video action recognition task, which can only output a limited number of action types according to the video content. However, these methods concentrate on action semantics, ignore the other components of the description sentence, and are greatly restricted in sentence structure, flexibility of expression and so on. 3) In terms of feature fusion, most existing methods simply concatenate the features of different modalities, lack correlation mining among high-level semantics, and use relatively simple, inefficient fusion schemes.
Disclosure of Invention
Therefore, the invention provides a video description generation method and system for semantic property selection and attention fusion, which solve the problems in the prior art that the application of video description is affected by the lack of fine-grained semantic object selection, the limitation to a fixed set of action types, the lack of correlation mining among high-level semantics, and the like.
According to the design scheme provided by the invention, on one hand, a video description generation method for semantic property selection and attention fusion is provided, which comprises the following steps:
mapping image features of the video to be described to a text space of the description sentence, and acquiring semantic object features in the video to be described;
extracting action features in the video to be described according to the semantic object features, and fusing feature information of different modalities in the video to be described based on an attention mechanism, wherein the different modality feature information at least comprises the semantic object features and the action features;
and combining the embedding of the previous word in the description sentence, decoding the fused feature information step by step into the natural language description corresponding to the video to be described by utilizing the conditional probability distribution.
As the video description generation method for semantic property selection and attention fusion of the present invention, further, mapping image features of a video to be described to a text space of a description sentence includes:
firstly, selecting a plurality of frames in a video to be described as key frames, taking the key frames and the video frames in adjacent areas thereof as input of a pre-training object detector model, and acquiring target detection features in the video to be described by using the object detector model, wherein the target detection features at least comprise initial object features, initial action features and video frame features;
then, the encoder is utilized to encode and output the initial object characteristics; and for the encoded output result, combining the video frame feature and the initial motion feature and mapping the input to the descriptive text space using a decoder to select and output video semantic object features, wherein the decoder comprises: an LSTM network represented by the input data vector is acquired and the LSTM network output is mapped to a fully connected layer of text space.
As the video description generation method for semantic property selection and attention fusion, the invention further maps LSTM network output to text space in a full connection layer, and supervises and learns the network by means of description sentences, wherein the learning process comprises the following steps:
firstly, constructing a loss function based on the minimized distance of the entity in the descriptive statement and the entity in the video;
the LSTM network is then supervised learning using the descriptive statements and based on the loss function to learn the vector representation of the video content in text space.
The method for generating the video description oriented to semantic property selection and attention fusion further extracts action features in the video to be described according to semantic object features, and comprises the following steps:
extracting initial action features from the sample data by using a pre-trained C3D three-dimensional convolutional neural network, and obtaining the action features in the video through cross-attention fusion with the semantic object features, wherein, during pre-training of the C3D three-dimensional convolutional neural network, the initial action features and the action-related features are concatenated, the concatenated features are mapped to the text space by an LSTM network and a fully connected layer, and learning optimization of the neural network is achieved by minimizing the distance between the action-related features in the video description supervision-signal text and the concatenated features after they are mapped to the text space.
The video description generation method for semantic property selection and attention fusion provided by the invention further fuses feature information of different modalities in the video to be described based on an attention mechanism, which comprises the following steps:
firstly, fusing action features and semantic object features by using an additive attention mechanism;
then, extracting depth level features of the fused features by using a multi-layer attention mechanism, and acquiring a similarity matrix;
and then, normalizing the similarity matrix by using a softmax function, acquiring attention weights, and carrying out weighted average on the attention weights and the depth level extraction features to acquire fusion features of different modal features.
As the video description generation method for semantic property selection and attention fusion of the present invention, further, combining the embedding of the previous word in the description sentence and decoding the fused feature information step by step into the natural language description corresponding to the video to be described by utilizing the conditional probability distribution comprises:
firstly, converting the fusion characteristic into a decoder input vector sequence containing time steps;
then, the word at the current time is predicted by a pre-trained LSTM decoder from the visual features and the embedding of the previous word in the description sentence, and the natural language description corresponding to the video is obtained by decoding step by step until a complete description sentence is generated or the predefined maximum sentence length is reached.
As the video description generation method for semantic property selection and attention fusion, the invention further comprises the following steps of:
the training loss function is set with cross entropy loss, and the LSTM decoder is optimized for training with a given video and video description and based on the training loss function.
Further, the invention also provides a video description generating system for semantic property selection and attention fusion, which comprises: the device comprises a preprocessing module, a fusion module and an output module, wherein,
the preprocessing module is used for mapping the image characteristics of the video to be described to a text space of the description sentence and acquiring semantic object characteristics in the video to be described;
the fusion module is used for extracting action features in the video to be described according to the semantic object features, and fusing feature information of different modalities in the video to be described based on an attention mechanism, wherein the different modality feature information at least comprises the semantic object features and the action features;
and the output module is used for decoding the fused characteristic information step by step into natural language description corresponding to the video to be described.
The invention has the beneficial effects that:
1. Aiming at the problems that a large number of static semantic objects exist in a video, that much existing work does not filter these semantic objects but analyzes the video content only at the video-frame level, and that fine-grained semantic object selection is lacking, the image features are mapped to the text space through efficient semantic feature selection and feature mapping; this encourages the model to learn the semantic representation of the semantic objects in the text space from the description sentences and can effectively alleviate the semantic deviation between the visual space and the semantic space.
2. Aiming at the problems that, in the aspect of feature fusion, features of different modalities are simply concatenated, correlation mining among high-level semantics is lacking, and the feature fusion schemes are relatively simple and inefficient, encoding the action features through feature fusion and using the semantic object features as queries to retrieve the action features can improve the accuracy of the output of the action encoder.
3. The results of comparative experiments on the MSVD and MSR-VTT data sets show that the scheme achieves advanced results; meanwhile, a large number of ablation experiments also verify the effectiveness of the scheme, which can be applied to automatic description scenarios for multimedia video and has good application prospects.
Description of the drawings:
FIG. 1 is a schematic flow diagram of video description generation facing semantic property selection and attention fusion in an embodiment;
FIG. 2 is a schematic diagram of an overall model framework of the video description generation principle in an embodiment;
FIG. 3 is a schematic illustration of the performance effects of a model in an embodiment;
fig. 4 is a schematic diagram of a variation trend of the ablation experiment BLEU4 index in the embodiment;
fig. 5 is a visual analysis schematic in the example.
The specific embodiment is as follows:
the present invention will be described in further detail with reference to the drawings and the technical scheme, in order to make the objects, technical schemes and advantages of the present invention more apparent.
Aiming at the problems that the existing video description generation still has semantic deviation, low feature fusion efficiency and the like between video content and generated description in the encoding-decoding process, the embodiment of the invention, referring to fig. 1, provides a video description generation method oriented to semantic characteristic selection and attention fusion, which comprises the following contents:
s101, mapping image features of the video to be described to a text space of a description sentence, and acquiring semantic object features in the video to be described.
Specifically, mapping image features of a video to be described to a text space of a description sentence can be designed to include the following:
firstly, selecting a plurality of frames in a video to be described as key frames, taking the key frames and the video frames in adjacent areas thereof as input of a pre-training object detector model, and acquiring target detection features in the video to be described by using the object detector model, wherein the target detection features at least comprise initial object features, initial action features and video frame features;
then, the encoder is utilized to encode and output the initial object characteristics; and for the encoded output result, combining the video frame feature and the initial motion feature and mapping the input to the descriptive text space using a decoder to select and output video semantic object features, wherein the decoder comprises: an LSTM network represented by the input data vector is acquired and the LSTM network output is mapped to a fully connected layer of text space.
The LSTM network output is mapped to a text space in the full connection layer, and the network is supervised and learned by means of descriptive sentences, and the learning process can comprise:
firstly, constructing a loss function based on the minimized distance of the entity in the descriptive statement and the entity in the video;
the LSTM network is then supervised learning using the descriptive statements and based on the loss function to learn the vector representation of the video content in text space.
In order to handle the above association relationships, this embodiment refers to the framework of the video description generation algorithm that fuses semantic feature selection and multi-attention stacking (A method for generating video descriptions by fusing semantic selection and multi-attention stacking, SSMA) shown in fig. 2, which mainly comprises four parts: semantic feature selection, action feature extraction, multi-modal information fusion and a decoder. The video is represented as a series of visual features v, where each feature v_i is obtained by a visual feature extractor represented by a convolutional neural network. Meanwhile, each word of the natural language description can be converted through one-hot encoding into an embedding vector w_t. The task of the scheme is then to use the visual features and the previous word w_{t-1} to predict the output word w_t at the current time, i.e., to generate the natural language description sentence. In a specific implementation, this can be done by maximizing the conditional probability distribution of the output sequence given the input sequence, selecting at each time step the word that maximizes the conditional probability as the predicted word at that time, and proceeding until either a complete description sentence is generated or the predefined maximum sentence length is reached.
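For clarity, the word-by-word prediction just described can be written under the standard autoregressive assumption; the symbols v and w_t follow the text, while the factorization itself is a standard formulation rather than one reproduced verbatim from the patent:

```latex
% Assumed standard autoregressive decoding objective: at each step the word
% maximizing the conditional probability given the visual features v and the
% previously generated words is selected.
p\left(w_{1:T} \mid v\right) = \prod_{t=1}^{T} p\left(w_t \mid v,\, w_{1:t-1}\right),
\qquad
w_t = \arg\max_{w \in \mathcal{V}} \; p\left(w \mid v,\, w_{1:t-1}\right)
```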
In order to make the model learn human-like judgments, in the embodiment of the present invention semantic objects that match human visual intuition are first screened from the video; the corresponding structure can be designed to comprise two parts, target detection and a language model. Only those objects that are intuitive to humans are selected; image features can be mapped to the text space by means of a DETR-like structure with a pre-trained Transformer model and then optimized using the description sentences as labels. The description sentence can be applied directly as a supervisory signal to learn the vector representation of the video content in the text space.
For a single video, T frames may be selected as key frames and, together with the video frames surrounding the key frames, input to a pre-trained object detector. Object regions are captured from each key frame and clustered according to their appearance and the intersections between their bounding boxes to obtain the object detection features, where M denotes the number of objects. The initial object features are then input into the encoder to obtain the hidden layer representation.
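As an illustration of the region clustering just described, the following sketch groups detector outputs from a key frame and its neighbouring frames by bounding-box IoU and mean-pools the appearance features of each group into one initial object feature; the IoU threshold, the pooling choice and the function names are assumptions, not details fixed by the patent.

```python
# Hedged sketch: group detected object regions across a key-frame
# neighbourhood by bounding-box IoU and pool each group into one feature.
import torch
from torchvision.ops import box_iou

def group_regions_by_iou(boxes, feats, iou_thr=0.5):
    """boxes: (N, 4) xyxy boxes from the key frame and its neighbours.
    feats: (N, d) per-region appearance features from the detector.
    Returns (M, d): one pooled feature per group (the initial object features)."""
    n = boxes.size(0)
    group_id = -torch.ones(n, dtype=torch.long)      # -1 means "unassigned"
    iou = box_iou(boxes, boxes)                      # (N, N) pairwise IoU
    next_group = 0
    for i in range(n):
        if group_id[i] >= 0:
            continue
        members = (iou[i] >= iou_thr).nonzero(as_tuple=True)[0]
        members = members[group_id[members] < 0]     # keep only unassigned boxes
        group_id[members] = next_group
        next_group += 1
    return torch.stack([feats[group_id == g].mean(dim=0)
                        for g in range(next_group)])
```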
The semantic selection decoder accepts three types of inputs: the encoder output X^O_h, a randomly generated query parameter θ, and a video context vector X^V that may come from a 2D CNN. As the output of the decoder, Φ is mapped to the semantic space through the fully connected layer and the LSTM, and supervised learning is performed by means of the video description.
When calculating the loss, in the present embodiment, synonym representations of each entity in the description sentence may be obtained from WordNet; after removing the abstract nouns, the entity embedding features are computed using SBERT (Sentence-BERT) as the text encoder. The entities extracted from the description sentence and the entities extracted from the video constitute two sets, so the optimization problem is in fact an optimal assignment problem: the optimal assignment is the one that minimizes the total matching cost, where Dist denotes the matching cost between a sentence-entity embedding and the corresponding video-entity feature. The semantic selection step is finally optimized by minimizing the distance between the matched entity pairs.
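A minimal sketch of the entity-matching objective described above, assuming cosine distance as Dist and the Hungarian algorithm for the optimal assignment; the SBERT sentence-entity embeddings and the video-entity features mapped to the text space are assumed to be given.

```python
# Hedged sketch of the optimal-assignment entity loss: sentence entities
# (SBERT embeddings) are matched to video entity features, and the loss is
# the mean distance over the optimal matching (Dist assumed to be cosine).
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def entity_assignment_loss(sent_entities, video_entities):
    """sent_entities: (n, d) entity embeddings from the description sentence.
    video_entities: (m, d) semantic object features mapped to text space."""
    s = F.normalize(sent_entities, dim=-1)
    v = F.normalize(video_entities, dim=-1)
    cost = 1.0 - s @ v.t()                            # (n, m) cosine distances
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    return cost[row, col].mean()                      # minimized during training
```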
s102, extracting action features in the video to be described according to the semantic object features, and fusing different mode feature information in the video to be described based on an attention mechanism, wherein the different mode feature information at least comprises the semantic object features and the action features.
Specifically, extracting action features in the video to be described according to semantic object features can be designed to include the following contents:
extracting initial action features from the sample data by using a pre-trained C3D three-dimensional convolutional neural network, and obtaining the action features in the video through cross-attention fusion with the semantic object features, wherein, during pre-training of the C3D three-dimensional convolutional neural network, the initial action features and the action-related features are concatenated, the concatenated features are mapped to the text space by an LSTM network and a fully connected layer, and learning optimization of the neural network is achieved by minimizing the distance between the action-related features in the video description supervision-signal text and the concatenated features after they are mapped to the text space.
The fusing of the different modality characteristic information in the video to be described based on the attention mechanism may include:
firstly, fusing action features and semantic object features by using an additive attention mechanism;
then, extracting depth level features of the fused features by using a multi-layer attention mechanism, and acquiring a similarity matrix;
and then, normalizing the similarity matrix by using a softmax function, acquiring attention weights, and carrying out weighted average on the attention weights and the depth level extraction features to acquire fusion features of different modal features.
In the action feature extraction, the semantic object features can be used as queries to retrieve the action features, so that the accuracy of the output of the action encoder can be improved.
Motion features in the video are extracted using a C3D network pre-trained on Kinetics-400. The C3D network is a three-dimensional convolutional neural network that can effectively capture the spatio-temporal information in video and extract the corresponding action features; these features are used for subsequent feature fusion, text generation and so on. The C3D network extracts T initial action features from each sample, which are combined with the output of the semantic selection step to obtain the action features through cross-attention fusion. Similar to the semantic feature selection step, in order to use the video description text as a supervisory signal, the initial action features and the action-related features may be concatenated, then input into the LSTM and mapped to the text space through the fully connected layer.
The action feature extraction step is optimized by minimizing the distance between the action features S_a from the text supervision signal and the concatenated features mapped to the text space.
In order to relate the action features to the corresponding semantic objects, the verb embeddings of the video description may still be extracted as a supervisory signal using the SBERT encoder. In this way the subsequent multi-modal information fusion can obtain a richer feature representation, improving the accuracy and efficiency of video content understanding and analysis.
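The cross-attention fusion can be sketched as follows, with the selected semantic object features acting as queries over the T initial C3D action features (keys and values); the single-head form, the scaling and the dimension names are assumptions.

```python
# Hedged sketch of the cross-attention step: semantic object features query
# the initial C3D action features to produce object-conditioned action features.
import torch
import torch.nn as nn

class ActionCrossAttention(nn.Module):
    def __init__(self, obj_dim, act_dim, hid_dim):
        super().__init__()
        self.q = nn.Linear(obj_dim, hid_dim)
        self.k = nn.Linear(act_dim, hid_dim)
        self.v = nn.Linear(act_dim, hid_dim)

    def forward(self, obj_feats, c3d_feats):
        """obj_feats: (M, obj_dim) selected semantic object features (queries).
        c3d_feats: (T, act_dim) initial action features from the C3D network."""
        q, k, v = self.q(obj_feats), self.k(c3d_feats), self.v(c3d_feats)
        attn = torch.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)   # (M, T)
        return attn @ v                 # (M, hid_dim) fused action features
```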
In order to acquire semantic representation of the whole video in a text space, in multi-modal feature fusion, different modal information in the video can be fused by utilizing a mechanism of additive attention and self-attention superposition.
First, the video action features and the video semantic object features are fused using an additive attention mechanism, giving a fused feature vector z_i for each action feature, where W_1, W_2, W_3 and b are learnable parameters and M is the number of semantic objects.
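Since the additive-attention formula itself is not reproduced in the text, the following is only one plausible form; the parameter names W1, W2, W3 and b follow the text, but their exact arrangement and the way the fused vector z_i is formed are assumptions.

```python
# Hedged sketch of additive attention over the M semantic object features,
# conditioned on an action feature; z_i is taken here as the attention-weighted
# sum of object features (an assumption about the fused vector).
import torch
import torch.nn as nn

class AdditiveFusion(nn.Module):
    def __init__(self, act_dim, obj_dim, hid_dim):
        super().__init__()
        self.W1 = nn.Linear(act_dim, hid_dim, bias=False)
        self.W2 = nn.Linear(obj_dim, hid_dim, bias=False)
        self.W3 = nn.Linear(hid_dim, 1, bias=False)
        self.b = nn.Parameter(torch.zeros(hid_dim))

    def forward(self, action_feat, obj_feats):
        """action_feat: (act_dim,) one action feature; obj_feats: (M, obj_dim)."""
        scores = self.W3(torch.tanh(self.W1(action_feat) + self.W2(obj_feats) + self.b))
        alpha = torch.softmax(scores.squeeze(-1), dim=0)        # (M,) weights
        return (alpha.unsqueeze(-1) * obj_feats).sum(dim=0)     # fused z_i
```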
Then, deep feature extraction is performed on the fused features using a multi-layer self-attention mechanism. The fused result for each video feature is z_i ∈ R^(s×h), where s denotes the sequence length and h denotes the hidden state dimension of each word.
A three-layer self-attention mechanism is used for the deep feature extraction: the input is multiplied by three learnable weight matrices W_Q ∈ R^(h×h), W_K ∈ R^(h×h) and W_V ∈ R^(h×h) to obtain the Query, Key and Value matrices, and a dot product is used to compute the similarity score matrix S between the Query and the Key.
S = QK^T    (10)
The similarity matrix is normalized by a Softmax function to obtain the attention weights, and the attention weights are used to compute a weighted average with the Value matrix to obtain the output sequence H.
H = V softmax(S)    (11)
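A minimal sketch of the three-layer self-attention stack with the dot-product similarity of eq. (10) and the softmax-weighted output of eq. (11), written here in the row-wise convention softmax(S) V; residual connections and normalization layers, if any, are omitted.

```python
# Hedged sketch of the stacked self-attention used for deep feature extraction:
# S = Q K^T (eq. 10), H = attention-weighted Values (eq. 11), repeated 3 times.
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    def __init__(self, h):
        super().__init__()
        self.Wq = nn.Linear(h, h, bias=False)
        self.Wk = nn.Linear(h, h, bias=False)
        self.Wv = nn.Linear(h, h, bias=False)

    def forward(self, x):                      # x: (s, h) fused features
        Q, K, V = self.Wq(x), self.Wk(x), self.Wv(x)
        S = Q @ K.t()                          # similarity score matrix, eq. (10)
        A = torch.softmax(S, dim=-1)           # attention weights
        return A @ V                           # output sequence H, eq. (11)

class MultiLayerSelfAttention(nn.Module):
    def __init__(self, h, layers=3):
        super().__init__()
        self.layers = nn.ModuleList(SelfAttentionLayer(h) for _ in range(layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```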
S103, combining the embedding of the previous word in the description sentence, the fused feature information is decoded step by step into the natural language description corresponding to the video to be described by utilizing the conditional probability distribution.
Specifically, the method can be designed to comprise the following steps:
firstly, converting the fusion characteristic into a decoder input vector sequence containing time steps;
then, the word at the current time is predicted by a pre-trained LSTM decoder from the visual features and the embedding of the previous word in the description sentence, and the natural language description corresponding to the video is obtained by decoding step by step until a complete description sentence is generated or the predefined maximum sentence length is reached.
In training the LSTM decoder, a training loss function may be set using cross entropy loss, and the LSTM decoder may be optimized for training using a given video and video description and based on the training loss function.
The decoder is responsible for extracting the vector after the previous fusion and decoding it into text, whose training goal is to minimize cross entropy loss to optimize the performance of the description generator.
The output features of the multi-modal encoder are H ∈ R^(s×h), where s is the length of the input sequence and h is the dimension of the feature vectors. These features can be mapped to the text space with a single LSTM decoder, as follows:
first, the output characteristic H of the self-attention model is converted into a decoder input vector sequence containing each time step using a linear layerWhere h is the hidden state size of the LSTM decoder:
Z=HW+b (12)
wherein the method comprises the steps ofIs the weight matrix of the fully connected layer>Is the bias vector.
The decoder is then initialized with an all-zero hidden state vector and the embedding vector of a single start token, where e is the dimension of the embedding vector; the hidden state h_t at each time step is computed as
h_t = LSTM(x_t, h_{t-1})    (13)
where LSTM denotes the operation of the LSTM layer. At each time step, the hidden state h_t is mapped with a fully connected layer to a vector y_t whose dimension equals the vocabulary size, where v is the vocabulary. This can be expressed by the following formula:
y_t = h_t V + c    (14)
where V is the weight matrix of the fully connected layer and c is the bias vector. The Softmax function is then used to convert y_t into a probability distribution p_t:
p_t = softmax(y_t)    (15)
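The decoding steps of eqs. (12)-(15) can be sketched as follows; how the per-step visual input Z and the previous word embedding are combined (concatenation here), the <bos> token id and the greedy selection are assumptions, and the patent additionally applies beam search with k = 5 as noted below.

```python
# Hedged sketch of the LSTM decoder: Z = HW + b (eq. 12), h_t = LSTM(x_t, h_{t-1})
# (eq. 13), y_t = h_t V + c (eq. 14) and p_t = softmax(y_t) (eq. 15).
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, enc_dim, hid_dim, vocab_size, emb_dim):
        super().__init__()
        self.proj = nn.Linear(enc_dim, hid_dim)       # Z = HW + b
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)     # y_t = h_t V + c

    def step(self, word_id, z_t, h, c):
        x_t = torch.cat([self.embed(word_id), z_t], dim=-1)
        h, c = self.lstm(x_t, (h, c))                 # h_t = LSTM(x_t, h_{t-1})
        p_t = torch.softmax(self.out(h), dim=-1)      # p_t = softmax(y_t)
        return p_t, h, c

    def forward(self, H, max_len=20, bos_id=1):
        Z = self.proj(H)                              # (s, hid_dim) per-step inputs
        h = torch.zeros(1, self.lstm.hidden_size)     # all-zero initial state
        c = torch.zeros_like(h)
        word, out_ids = torch.tensor([bos_id]), []
        for t in range(min(max_len, Z.size(0))):
            p_t, h, c = self.step(word, Z[t].unsqueeze(0), h, c)
            word = p_t.argmax(dim=-1)                 # greedy pick (beam search in the text)
            out_ids.append(word.item())
        return out_ids
```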
Next, a beam search is used to select the k most likely sequence outputs to optimize the output sequence, where k=5.
When cross entropy loss is selected to optimize the description generator, a video and its video description [w_1, ..., w_L] are given, where L is the length of the description. ζ(w_t) denotes the representation of the word w_t in the vocabulary. The loss function can be expressed as:
L_CE = -Σ_{t=1}^{L} log p_t(ζ(w_t))
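A teacher-forced training step with the cross-entropy loss above could look like the following, reusing the hypothetical CaptionDecoder sketched earlier; ζ(w_t) corresponds to target_ids, and the <bos> id is an assumption.

```python
# Hedged sketch of training with the cross-entropy loss: at each step the
# decoder is fed the ground-truth previous word (teacher forcing).
import torch
import torch.nn.functional as F

def caption_ce_loss(decoder, H, target_ids):
    """H: (s, enc_dim) fused encoder output; target_ids: (L,) word indices ζ(w_t)."""
    Z = decoder.proj(H)
    h = torch.zeros(1, decoder.lstm.hidden_size)
    c = torch.zeros_like(h)
    word = target_ids.new_tensor([1])                 # assumed <bos> id
    loss = 0.0
    for t in range(target_ids.size(0)):
        z_t = Z[min(t, Z.size(0) - 1)].unsqueeze(0)
        x_t = torch.cat([decoder.embed(word), z_t], dim=-1)
        h, c = decoder.lstm(x_t, (h, c))
        logits = decoder.out(h)                       # y_t before the softmax
        loss = loss + F.cross_entropy(logits, target_ids[t:t + 1])
        word = target_ids[t:t + 1]                    # teacher forcing
    return loss / target_ids.size(0)
```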
further, based on the above method, the embodiment of the present invention further provides a video description generating system for semantic property selection and attention fusion, which includes: the device comprises a preprocessing module, a fusion module and an output module, wherein,
the preprocessing module is used for mapping the image characteristics of the video to be described to a text space of the description sentence and acquiring semantic object characteristics in the video to be described;
the fusion module is used for extracting action features in the video to be described according to the semantic object features, and fusing feature information of different modalities in the video to be described based on an attention mechanism, wherein the different modality feature information at least comprises the semantic object features and the action features;
and the output module is used for decoding the fused characteristic information step by step into natural language description corresponding to the video to be described.
To verify the effectiveness of this scheme, further explanation is given below in connection with experimental data:
Experiments were performed on the MSVD (Microsoft Video Description) dataset and the MSR-VTT (Microsoft Research Video to Text) dataset. The MSVD dataset is a small-scale dataset containing 1970 YouTube video clips; each video clip is about 10 seconds long, with about 40 reference descriptions per video. The MSR-VTT dataset is a large-scale dataset containing 10000 video clips; each video clip is about 20 seconds long, with 20 reference descriptions per video.
Model evaluation was performed using four main evaluation metrics: BLEU (Bilingual Evaluation Understudy, a precision-based similarity measure), ROUGE (Recall-Oriented Understudy for Gisting Evaluation, a recall-based similarity measure), CIDEr (Consensus-based Image Description Evaluation) and METEOR (Metric for Evaluation of Translation with Explicit Ordering).
The model of this scheme adopts the framework shown in fig. 2. In the semantic feature selection process, the idea of target detection is adopted: the video content is first analyzed with a target detection method to extract the semantic objects worth describing, such as people, vehicles and animals. In the C3D-network-based action feature extraction, the semantic objects are mapped into the text space by means of action feature mapping so that they can serve as natural-language labels, and the natural language description sentences are used as labels for optimization to further improve the performance of the model. The multi-modal video information fusion based on the multi-attention mechanism fuses the outputs from the visual image and the language text with a multi-modal semantic fusion method; the information of the visual and language modalities enriches the model so that the semantic information in the video can be described more accurately, and the fused semantic information is decoded into the natural language description with an LSTM decoder.
1. Comparative test
In order to further compare model performance, the experiments compare models such as RestNet, SAAT, ORG-TRL and Uni-receiver on the MSVD and MSR-VTT data sets, and better results are obtained on the BLEU4, ROUGE, METEOR and CIDEr indexes; the effects are shown in fig. 3, Table 1 and Table 2. In fig. 3, the red font indicates the result generated by the model, and the black font indicates the ground-truth label.
As can be seen from Table 1, ORG-TRL, which introduces an object detector, performs significantly better than SAAT of the same year, which illustrates the importance of object detection features in the video description task.
Table 1 model performance on MSVD dataset
Table 2 shows the performance of the model on MSR-VTT; it can be seen that every index decreases on the MSR-VTT dataset. This is because MSR-VTT contains more diverse video scenes and more actions, which exposes the shortcoming of the C3D network in modeling longer temporal information. Taken as a whole, however, the invention still achieves better results.
Table 2 model behavior in MSR-VTT datasets
2. Ablation experiments
To verify the effectiveness of the motion encoder and feature fusion module, ablation experiments were performed based on the MSR-VTT data set, with specific experimental data as shown in Table 3.
Table 3 ablation experimental results
In an experiment, the scheme directly inputs the C3D features into a feature fusion step, and the table shows that under the condition that feature selection is not performed, a large number of features which are not helpful to semantic generation are input into the model, so that the quality of the generated text is greatly reduced.
In terms of multi-modal fusion, after the attention mechanisms are entirely replaced by a DNN, every index drops markedly and the BLEU index is only 25.3, so the multi-modal fusion in this scheme is indispensable. When the additive attention mechanism is replaced by a DNN while the self-attention mechanism is kept, no good effect is obtained; and when only the self-attention layers are removed and the additive attention is kept, the BLEU index is only 40.2. This indicates that the additive attention plays an effective feature-fusion role and that self-attention is an essential part of the feature fusion module. Figure 4 shows the trend of the BLEU4 index over the training epochs in the ablation experiments.
To analyze the multi-modal fusion qualitatively, the Grad-CAM tool is used in the experiments for visual analysis of the multi-modal fusion; the effect is shown in fig. 5. Regions with darker color in the picture represent the regions the model attends to. It can be seen that, after deep feature extraction, the model focuses well on the key parts of the visual region, and on this basis the decoder can generate the description in a targeted manner.
The above experimental data show that the scheme can enhance the semantic relevance between the textual description of the entities in the video and the video content, effectively alleviate the semantic deviation between the visual space and the semantic space, and improve the output accuracy of the action encoder, and it has good application prospects in the field of automatic video description generation.
The relative steps, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The elements and method steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or a combination thereof, and the elements and steps of the examples have been generally described in terms of functionality in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality using different methods for each particular application, but such implementation is not considered to be beyond the scope of the present invention.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the above methods may be performed by a program that instructs associated hardware, and that the program may be stored on a computer readable storage medium, such as: read-only memory, magnetic or optical disk, etc. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits, and accordingly, each module/unit in the above embodiments may be implemented in hardware or may be implemented in a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
Finally, it should be noted that: the above examples are only specific embodiments of the present invention, and are not intended to limit the scope of the present invention, but it should be understood by those skilled in the art that the present invention is not limited thereto, and that the present invention is described in detail with reference to the foregoing examples: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A video description generation method for semantic property selection and attention fusion, comprising:
mapping image features of the video to be described to a text space of the description sentence, and acquiring semantic object features in the video to be described;
extracting action features in the video to be described according to the semantic object features, and fusing feature information of different modalities in the video to be described based on an attention mechanism, wherein the different modality feature information at least comprises the semantic object features and the action features;
and combining the embedding of the previous word in the description sentence, decoding the fused feature information step by step into the natural language description corresponding to the video to be described by utilizing the conditional probability distribution.
2. The method for generating a video description for semantic property selection and attention fusion according to claim 1, wherein mapping image features of a video to be described to a text space of a description sentence comprises:
firstly, selecting a plurality of frames in a video to be described as key frames, taking the key frames and the video frames in adjacent areas thereof as input of a pre-training object detector model, and acquiring target detection features in the video to be described by using the object detector model, wherein the target detection features at least comprise initial object features, initial action features and video frame features;
then, the encoder is utilized to encode and output the initial object characteristics; and for the encoded output result, combining the video frame feature and the initial motion feature and mapping the input to the descriptive text space using a decoder to select and output video semantic object features, wherein the decoder comprises: an LSTM network represented by the input data vector is acquired and the LSTM network output is mapped to a fully connected layer of text space.
3. The video description generation method for semantic property selection and attention fusion according to claim 2, wherein the LSTM network output is mapped to a text space in a fully connected layer, and the network is supervised and learned by means of description sentences, and the learning process comprises:
firstly, constructing a loss function based on the minimized distance of the entity in the descriptive statement and the entity in the video;
the LSTM network is then supervised learning using the descriptive statements and based on the loss function to learn the vector representation of the video content in text space.
4. The method for generating a video description for semantic property selection and attention fusion according to claim 1, wherein extracting action features in a video to be described according to semantic object features comprises:
extracting initial action features from the sample data by using a pre-trained C3D three-dimensional convolutional neural network, and obtaining the action features in the video through cross-attention fusion with the semantic object features, wherein, during pre-training of the C3D three-dimensional convolutional neural network, the initial action features and the action-related features are concatenated, the concatenated features are mapped to the text space by an LSTM network and a fully connected layer, and learning optimization of the neural network is achieved by minimizing the distance between the action-related features in the video description supervision-signal text and the concatenated features after they are mapped to the text space.
5. The method for generating the video description for semantic property selection and attention fusion according to claim 1, wherein the fusing of feature information of different modalities in the video to be described based on an attention mechanism comprises:
firstly, fusing action features and semantic object features by using an additive attention mechanism;
then, extracting depth level features of the fused features by using a multi-layer attention mechanism, and acquiring a similarity matrix;
and then, normalizing the similarity matrix by using a softmax function, acquiring attention weights, and carrying out weighted average on the attention weights and the depth level extraction features to acquire fusion features of different modal features.
6. The method for generating a video description for semantic property selection and attention fusion according to claim 1, wherein combining the embedding of the previous word in the description sentence and decoding the fused feature information step by step into the natural language description corresponding to the video to be described by utilizing the conditional probability distribution comprises the steps of:
firstly, converting the fusion characteristic into a decoder input vector sequence containing time steps;
then, the word at the current time is predicted by a pre-trained LSTM decoder from the visual features and the embedding of the previous word in the description sentence, and the natural language description corresponding to the video is obtained by decoding step by step until a complete description sentence is generated or the predefined maximum sentence length is reached.
7. The semantic property selection and attention fusion oriented video description generation method of claim 6, wherein the LSTM decoder training process comprises:
setting a training loss function using the cross entropy loss, using the given video and video description, and causing the LSTM decoder, based on the training loss function, to learn the mapping relationship from the input video feature sequence to the text space of the video description.
8. A semantic property selection and attention fusion oriented video description generation system comprising: the device comprises a preprocessing module, a fusion module and an output module, wherein,
the preprocessing module is used for mapping the image characteristics of the video to be described to a text space of the description sentence and acquiring semantic object characteristics in the video to be described;
the fusion module is used for extracting action features in the video to be described according to the semantic object features, and fusing feature information of different modalities in the video to be described based on an attention mechanism, wherein the different modality feature information at least comprises the semantic object features and the action features;
and the output module is used for decoding the fused characteristic information step by step into natural language description corresponding to the video to be described.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1 to 7.
10. An electronic device, comprising:
at least one processor, and a memory coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor to implement the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311740703.3A CN117789076A (en) | 2023-12-18 | 2023-12-18 | Video description generation method and system oriented to semantic characteristic selection and attention fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311740703.3A CN117789076A (en) | 2023-12-18 | 2023-12-18 | Video description generation method and system oriented to semantic characteristic selection and attention fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117789076A true CN117789076A (en) | 2024-03-29 |
Family
ID=90388449
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311740703.3A Pending CN117789076A (en) | 2023-12-18 | 2023-12-18 | Video description generation method and system oriented to semantic characteristic selection and attention fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117789076A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||