CN115033736A - Video summarization method guided by natural language - Google Patents

Video summarization method guided by natural language

Info

Publication number
CN115033736A
Authority
CN
China
Prior art keywords
video
sequence
frame
natural language
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210652477.2A
Other languages
Chinese (zh)
Inventor
金永刚
郑婧
马海钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210652477.2A priority Critical patent/CN115033736A/en
Publication of CN115033736A publication Critical patent/CN115033736A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/738 Presentation of query results
    • G06F16/739 Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a natural-language-guided video summarization method, which comprises the following steps: decomposing a video file into a frame sequence and extracting frame image features; extracting frame semantic features and text semantic features and computing their spatial cosine similarity to obtain attention weights; building a natural-language-guided video summarization model; training the whole network model; and selecting the video summary according to the frame importance score sequence. The invention introduces a natural-language-guided attention mechanism, designed for the video summarization task, into a video summarization framework, and ablation experiments show that this attention mechanism significantly improves the performance of the summarization model. In addition, the proposed natural-language-guided attention mechanism is more objective, fully attends to the video segments related to the title text, and title text is readily available for Internet videos at no cost.

Description

Natural-language-guided video summarization method
Technical Field
The invention belongs to the technical field of video summarization, and particularly relates to a natural-language-guided video summarization method.
Background
With the rapid development of multimedia and network information technology in recent years, video has become an increasingly mainstream medium for communicating information. How to condense a lengthy video to several minutes or even a few seconds while retaining its key information, i.e., video summarization, has become an important research topic in the field of video technology. Video summarization techniques use computer algorithms to automatically select the important segments of a video as its summary, which reduces the storage space required for the video and lets users browse video information quickly.
The idea of mimicking human attention first appeared in the field of computer vision: by introducing an attention mechanism that focuses only on parts of an image rather than the entire image, the computational complexity of image processing is reduced and performance is improved. Similarly, each frame in a video has its own level of importance, and some segments contain the key information of the video content, which is exactly what people attend to when selecting a summary. Many researchers have therefore introduced attention mechanisms into video summarization methods, assigning different importance weights to different frames of the input sequence instead of treating all input frames equally, thereby modeling the internal relationship between the input video sequence and the output importance scores.
If a video segment is interesting to the user, it is more likely to be an important segment of the whole sequence, and many researchers have modeled summarization on user attention. For example, Ma et al. [A user attention model for video summarization [C]// Proceedings of the tenth ACM international conference on Multimedia. 2002: 533-542] score the importance of video segments using low-level features such as motion change and human faces, combine these scores into an attention curve, and extract the parts at the curve peaks as key shots to construct the summary. With the development of deep learning, several attention-based deep video summarization methods have been proposed. For example, Zhong et al. propose the AVS video summarization model [Video summarization with attention-based encoder-decoder networks [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 30(6): 1709-1717], which casts supervised video summarization as a sequence-to-sequence learning problem, explores two attention-based decoder networks using additive and multiplicative attention, obtains the attention weights from the properties of the video sequence itself, and learns them in a supervised manner. Apostolidis et al. [Combining Global and Local Attention with Positional Encoding for Video Summarization [C]// 2021 IEEE International Symposium on Multimedia (ISM). IEEE, 2021: 226-234] combine global and local multi-head attention mechanisms to model frame dependencies at different granularities, and their attention mechanism integrates components that encode the temporal position of each video frame.
However, existing attention-based deep video summarization methods generally use only the video image information to generate the summary and do not consider information from other modalities, even though key information highly related to the video content is often contained in descriptive texts such as the title.
Disclosure of Invention
In view of the above, the invention provides a natural-language-guided video summarization method. It introduces a natural-language-guided attention mechanism designed for the video summarization task, obtains attention weights by computing the similarity between the video sequence and the text, and incorporates this attention mechanism into an Encoder-Decoder video summarization framework; ablation experiments show that the natural-language-guided attention mechanism significantly improves the performance of the summarization model.
A natural-language-guided video summarization method comprises the following steps:
(1) decomposing the video files in the training set into frame sequences, and extracting image features from each frame sequence with a pre-trained deep image network to obtain the corresponding frame image feature sequence (f_1, …, f_n);
(2) extracting semantic features from the frame sequence with the image encoding network of a pre-trained multi-modal model to obtain the frame semantic feature sequence (x_1, …, x_n); extracting semantic features from the natural language text of the video file with the text encoding network of the pre-trained multi-modal model to obtain the text semantic feature t;
(3) computing the spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t to obtain the attention weight sequence (α_1, …, α_n);
(4) constructing a video summarization model based on the natural-language-guided attention mechanism, whose input is the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) and whose output is the frame importance score sequence (y_1, …, y_n), and training the video summarization model;
(5) feeding the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) of a test-set video file into the trained video summarization model, which outputs the frame importance score sequence (y_1, …, y_n) of the video file;
(6) selecting key shots according to the frame importance score sequence (y_1, …, y_n) and synthesizing them into the video summary.
Further, since the natural language text of a considerable portion of the video files in the training set does not describe the video content well, the natural language text needs to be optimized before its semantic features are extracted: candidate titles are first drafted according to the video content, several users then watch the videos through a questionnaire and choose the suitable titles, and finally the title chosen by the highest proportion of users is taken as the optimized natural language text.
Further, step (3) is implemented as follows: the cosine similarity between the text semantic feature t and each feature vector in the frame semantic feature sequence (x_1, …, x_n) is first calculated to obtain a similarity sequence (sim_1, …, sim_n); softmax normalization is then applied to the similarity sequence to obtain the attention weight sequence (α_1, …, α_n). The attention weight sequence reflects the importance of each video segment relative to the title text and provides an effective attention mechanism to guide summary generation.
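A minimal sketch of this step, assuming the frame semantic features are stacked into an array X of shape (n, d) and the text semantic feature is a vector t of length d (the function name and array layout are illustrative, not prescribed by the patent):

```python
import numpy as np

def attention_weights(X: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Natural-language-guided attention weights.

    X: frame semantic features, shape (n, d)
    t: text semantic feature, shape (d,)
    Returns: attention weights alpha of shape (n,), summing to 1.
    """
    # Cosine similarity between the text feature and every frame feature
    sims = X @ t / (np.linalg.norm(X, axis=1) * np.linalg.norm(t) + 1e-8)
    # Softmax normalization over the similarity sequence
    exp = np.exp(sims - sims.max())  # subtract the max for numerical stability
    return exp / exp.sum()
```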
Furthermore, the video summarization model consists of an Encoder module, an Attention module and a Decoder module connected in sequence. The Encoder module encodes the input frame image feature sequence (f_1, …, f_n) into a hidden sequence (h_1, …, h_n); the Attention module performs a weighted summation of the hidden sequence (h_1, …, h_n) using the attention weight sequence (α_1, …, α_n) to obtain a fused variable h; the Decoder module decodes the fused variable h and outputs the frame importance score sequence (y_1, …, y_n).
Further, the video summarization model in step (4) is trained as follows: the model parameters are first initialized; the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) are then fed into the model, which predicts the frame importance score sequence (y_1, …, y_n); the model parameters are iteratively updated by gradient descent and back-propagation according to the loss function L until L converges to a minimum or the maximum number of iterations is reached, at which point training is complete.
Further, the loss function L is the mean squared error between the predicted and annotated scores:

L = (1/n) Σ_{i=1}^{n} (y_i − s_i)²

where y_i denotes the importance score of the i-th video frame, s_i denotes the annotation score of the i-th video frame, and n is the total number of frames in the video file.
Further, the annotation score s_i of the i-th video frame is obtained by having multiple users label the importance of the video frame, where label 1 denotes important and label 0 denotes unimportant; the annotation score s_i is the fraction of users who assigned label 1 among all users.
Further, step (6) is implemented as follows: visually continuous frames are first grouped into shots, and the frame importance score sequence (y_1, …, y_n) is converted into a shot importance score sequence by averaging the importance scores of all frames within each shot; key shots are then selected with a 0/1 knapsack algorithm according to the shot importance scores and shot lengths, subject to the constraint that the total length of the key shots does not exceed 15% of the length of the whole video; finally, all key shots are synthesized into the video summary, as sketched in the example below.
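A minimal sketch of this key-shot selection, assuming shot boundaries and per-frame scores are already available; the shot segmentation itself is outside the sketch, and all names are illustrative:

```python
def select_key_shots(frame_scores, shot_bounds, budget_ratio=0.15):
    """Select key shots with a 0/1 knapsack over shot scores and lengths.

    frame_scores: per-frame importance scores y_1..y_n
    shot_bounds:  list of (start, end) frame indices, end exclusive
    budget_ratio: summary length budget as a fraction of the full video
    Returns: indices of the selected shots.
    """
    lengths = [end - start for start, end in shot_bounds]
    # Shot score = mean of the frame scores inside the shot
    scores = [sum(frame_scores[s:e]) / (e - s) for s, e in shot_bounds]
    capacity = int(budget_ratio * len(frame_scores))

    # Standard 0/1 knapsack dynamic programme over shot lengths
    n = len(shot_bounds)
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                cand = dp[i - 1][c - lengths[i - 1]] + scores[i - 1]
                if cand > dp[i][c]:
                    dp[i][c] = cand

    # Backtrack to recover which shots were selected
    selected, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)
```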
Based on the technical scheme, the invention has the following beneficial technical effects:
1. after 150 epochs of training, the five-fold cross-validation average F1-score for the natural language-guided video summary model was 47.0%. The ablation experiment for removing the natural language guidance is carried out, and the five-fold cross validation average F1-score of the video abstract model for removing the natural language guidance is 44.5%, which shows that the natural language guidance improves the model performance.
2. Compared with user-attention-based mechanisms, the natural-language-guided attention mechanism proposed by the invention is more objective and can fully attend to the video segments related to the title text; at the same time, title text is readily available for Internet videos at no cost, whereas user attention data must be collected and analyzed and is not easily obtained.
3. While learning how humans summarize videos, the summarization model fully attends to the key content related to the natural language text. When a user provides a natural language text together with a video, the summary generated by the method is strongly correlated with the text provided by the user and fully reflects the user's interest.
Drawings
FIG. 1 is a schematic diagram of attention weight generation based on natural language guidance according to the present invention.
FIG. 2 is a schematic structural diagram of the attention-based Encoder-Decoder video summarization model.
Detailed Description
To describe the present invention more specifically, the technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the natural-language-guided video summarization method of the present invention includes the following steps:
s1: decomposing the video file into a frame sequence, and extracting image features of the frame sequence by using a pre-training depth image network to obtain a frame image feature sequence (f) 1 ,…,f n )。
In this embodiment, the video is sampled into a frame sequence at 2 fps, and image features are then extracted from the pool5 layer of a GoogLeNet model pre-trained on the large-scale image dataset ImageNet, yielding the frame image feature sequence (f_1, …, f_n), where n is the video length and each frame feature has dimension 1024.
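A minimal sketch of this step under common tooling assumptions (OpenCV for frame decoding and torchvision's ImageNet-pretrained GoogLeNet as the deep image network; the patent does not prescribe these particular libraries):

```python
import cv2
import numpy as np
import torch
from torchvision import models, transforms

# GoogLeNet pre-trained on ImageNet; replacing the classifier with Identity
# exposes the 1024-dimensional pooled (pool5) feature for each frame.
net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
net.fc = torch.nn.Identity()
net.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_features(video_path: str, sample_fps: float = 2.0) -> np.ndarray:
    """Sample a video at `sample_fps` and return frame features of shape (n, 1024)."""
    cap = cv2.VideoCapture(video_path)
    step = max(int(round(cap.get(cv2.CAP_PROP_FPS) / sample_fps)), 1)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                feats.append(net(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy())
        idx += 1
    cap.release()
    return np.stack(feats)
```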
S2: semantic feature extraction is carried out on the frame sequence and the natural language text by utilizing an image coding network and a text coding network of a pre-training multi-modal model to obtain a frame semantic feature sequence (x) 1 ,…,x n ) And text semantic features t, and performing space cosine similarity calculation to obtain an attention weight sequence (alpha) 1 ,…,α n )。
This embodiment uses the image encoding network and the text encoding network of a pre-trained CLIP model for semantic feature extraction from the frame sequence and the natural language text. During training, CLIP maps the feature spaces of its image encoder and text encoder into a common semantic space; this property provides the conditions needed to compute the similarity between video frame images and text. The frame semantic feature sequence (x_1, …, x_n) extracted by the pre-trained CLIP model and the text semantic feature t are compared by spatial cosine similarity to obtain a similarity sequence (sim_1, …, sim_n), which is normalized with softmax to yield the attention weight sequence (α_1, …, α_n). The resulting attention weight sequence reflects the importance of each video segment relative to the title text and provides an effective attention mechanism to guide summary generation:

sim_i = (t · x_i) / (‖t‖ ‖x_i‖)

α_i = exp(sim_i) / Σ_{j=1}^{n} exp(sim_j)
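A minimal sketch of this step, assuming the open-source CLIP package and its "ViT-B/32" variant, neither of which is prescribed by the patent (the patent only requires a pre-trained multi-modal model); frames are PIL images from the sampled frame sequence:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed CLIP variant

def clip_attention_weights(frames, title: str) -> torch.Tensor:
    """Natural-language-guided attention weights from CLIP features.

    frames: list of PIL.Image video frames (the sampled frame sequence)
    title:  natural language text of the video (e.g. its title)
    Returns: attention weights alpha of shape (n,), summing to 1.
    """
    with torch.no_grad():
        imgs = torch.stack([preprocess(f) for f in frames]).to(device)
        x = model.encode_image(imgs)                              # frame semantic features (n, d)
        t = model.encode_text(clip.tokenize([title]).to(device))  # text semantic feature (1, d)
        # Spatial cosine similarity followed by softmax normalization
        sims = torch.nn.functional.cosine_similarity(x, t)        # (n,)
        return torch.softmax(sims.float(), dim=0)
```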
S3: constructing an Encoder-Decoder video abstract model based on an attention mechanism and realized by an LSTM network, and inputting a frame image feature sequence (f) 1 ,…,f n ) And attention weight sequence (alpha) 1 ,…,α n ) Outputting a sequence of frame importance scores (y) 1 ,…,y n )。
The Encoder and Decoder in the adopted model may be implemented with a convolutional neural network, a recurrent neural network, or similar architectures; LSTM is chosen here because it handles long-range dependencies well and has advantages for long-sequence modeling. As shown in FIG. 2, the Encoder-Decoder video summarization model with natural-language-guided attention implemented by the LSTM network is divided into 3 modules: Encoder, Attention, and Decoder, implemented as follows:
the Encoder module is realized by n stacked LSTM unitsThe last moment memory state c is input into the LSTM cell t-1 And input f of the current time t Outputting the hidden state h at the moment t (ii) a Encoder input frame feature sequence (f) 1 ,…,f n ) Output hidden sequence (h) 1 ,…,h n ):
(h 1 ,…,h n )=Encoder(f 1 ,…,f n )
Figure BDA0003680495160000061
Attention Module Pair hidden sequence (h) 1 ,…,h n ) Performing weighted accumulation, wherein the attention weight sequence is (alpha) 1 ,…,α n ) And outputting a variable h:
h=∑h ii
the Decoder module is realized by n stacked LSTM units, inputs the obtained fusion variable h, and outputs an importance score sequence (y) 1 ,…,y n ):
(y 1 ,…,y n )=Decoder(h)
Figure BDA0003680495160000062
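A minimal PyTorch sketch of the three modules; the hidden size, the sigmoid score head, and feeding the fused variable h at every Decoder time step are assumptions made for illustration, since FIG. 2 is not reproduced here:

```python
import torch
import torch.nn as nn

class NLGuidedSummarizer(nn.Module):
    """Encoder-Attention-Decoder summarizer driven by external attention weights."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.score = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, frames: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # frames: (B, n, feat_dim) frame image features f_1..f_n
        # alpha:  (B, n) natural-language-guided attention weights
        h_seq, _ = self.encoder(frames)                           # (B, n, hidden)
        # Attention module: weighted sum of hidden states -> fused variable h
        fused = torch.sum(alpha.unsqueeze(-1) * h_seq, dim=1)     # (B, hidden)
        # Decoder: the fused variable is fed at every time step (an assumption)
        dec_in = fused.unsqueeze(1).repeat(1, frames.size(1), 1)  # (B, n, hidden)
        d_seq, _ = self.decoder(dec_in)                           # (B, n, hidden)
        return self.score(d_seq).squeeze(-1)                      # (B, n) scores y_1..y_n
```

Calling the module with a (1, n, 1024) feature tensor and a (1, n) weight tensor returns the (1, n) frame importance score sequence used in the training step below.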
S4: and carrying out model training on the whole network model.
In this embodiment, the SumMe video summarization dataset is selected for training. However, the titles of some videos in the SumMe dataset do not reflect the video content well; for example, "Jumps" does not fully convey what the video shows, which harms training. The video title text of the SumMe dataset therefore needs to be optimized: candidate titles are drafted according to the video content, several users watch the videos through a questionnaire and choose the suitable titles, and for each video the title chosen by the highest proportion of users, whether the original SumMe title or one of the optimized candidates, is taken as the training text, as shown in Table 1.

TABLE 1 (original SumMe video titles and the optimized titles used as training text; the table contents are provided as an image in the original filing and are not reproduced here)
The SumMe dataset contains 25 videos. Because the dataset is small, cross-validation is suitable for training: it makes full use of the 25 videos and reduces the adverse effect of an unbalanced split. In this embodiment, 80% of the data is used as the training set and 20% as the test set under five-fold cross-validation, so that each split uses 20 videos for training and 5 for testing. The model is trained with a mean loss, where (y_1, …, y_n) are the scores predicted by the model and (s_1, …, s_n) are the annotation scores of the training set; the loss function is as follows:

L = (1/n) Σ_{i=1}^{n} (y_i − s_i)²
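A minimal training-loop sketch tying the pieces together; the Adam optimizer and learning rate are assumptions, while the 150 epochs, the mean (MSE) loss, and gradient descent with back-propagation follow the description above:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, dataset, epochs: int = 150, lr: float = 1e-3):
    """dataset: list of (frames, alpha, labels) tuples with
       frames: (1, n, 1024) frame features, alpha: (1, n) attention weights,
       labels: (1, n) user annotation scores s_1..s_n."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    mse = nn.MSELoss()                                 # mean loss over frames
    for epoch in range(epochs):
        total = 0.0
        for frames, alpha, labels in dataset:
            opt.zero_grad()
            pred = model(frames, alpha)    # frame importance scores y_1..y_n
            loss = mse(pred, labels)       # L = (1/n) * sum_i (y_i - s_i)^2
            loss.backward()                # back-propagation
            opt.step()                     # gradient-descent update
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / max(len(dataset), 1):.4f}")
```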
s5: according to the sequence of frame importance scores (y) 1 ,…,y n ) And reasonably selecting the video abstract.
In this embodiment, visually continuous frames are grouped into shots, and the frame importance score sequence is converted into a shot importance score sequence by averaging the importance scores of all frames within each shot; key shots are then selected from the shot importance scores and shot lengths with a 0/1 knapsack algorithm, with the total key-shot length limited to 15% of the original video length, and the key shots are synthesized into the video summary.
The embodiments described above are presented to enable one of ordinary skill in the art to make and use the invention, and they may be modified in various obvious respects without departing from the invention. Therefore, the present invention is not limited to the above embodiments, and improvements and modifications made by those skilled in the art based on the disclosure of the present invention fall within the protection scope of the present invention.

Claims (8)

1. A natural-language-guided video summarization method, comprising the following steps:
(1) decomposing the video files in a training set into frame sequences, and extracting image features from the frame sequences with a pre-trained deep image network to obtain corresponding frame image feature sequences (f_1, …, f_n);
(2) extracting semantic features from the frame sequence with an image encoding network of a pre-trained multi-modal model to obtain a frame semantic feature sequence (x_1, …, x_n); extracting semantic features from the natural language text of the video file with a text encoding network of the pre-trained multi-modal model to obtain a text semantic feature t;
(3) computing the spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t to obtain an attention weight sequence (α_1, …, α_n);
(4) constructing a video summarization model based on a natural-language-guided attention mechanism, whose input is the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) and whose output is a frame importance score sequence (y_1, …, y_n), and training the video summarization model;
(5) feeding the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) of a test-set video file into the trained video summarization model, which outputs the frame importance score sequence (y_1, …, y_n) of the video file;
(6) selecting key shots according to the frame importance score sequence (y_1, …, y_n) and synthesizing the key shots into a video summary.
2. The video summarization method of claim 1, wherein: since the natural language text of some of the video files in the training set does not describe the video content well, the natural language text is optimized before its semantic features are extracted: candidate titles are first drafted according to the video content, several users then watch the videos through a questionnaire and choose the suitable titles, and finally the title chosen by the highest proportion of users is taken as the optimized natural language text.
3. The video summarization method of claim 1, wherein step (3) is implemented as follows: the cosine similarity between the text semantic feature t and each feature vector in the frame semantic feature sequence (x_1, …, x_n) is first calculated to obtain a similarity sequence (sim_1, …, sim_n); softmax normalization is then applied to the similarity sequence to obtain the attention weight sequence (α_1, …, α_n).
4. The video summarization method of claim 1, wherein: the video summarization model consists of an Encoder module, an Attention module and a Decoder module connected in sequence; the Encoder module encodes the input frame image feature sequence (f_1, …, f_n) into a hidden sequence (h_1, …, h_n); the Attention module performs a weighted summation of the hidden sequence (h_1, …, h_n) with the attention weight sequence (α_1, …, α_n) to obtain a fused variable h; the Decoder module decodes the fused variable h and outputs the frame importance score sequence (y_1, …, y_n).
5. The video summarization method of claim 1, wherein the video summarization model in step (4) is trained as follows: the model parameters are first initialized; the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) are fed into the model, which predicts the output frame importance score sequence (y_1, …, y_n); the model parameters are iteratively updated by gradient descent and back-propagation according to the loss function L until L converges to a minimum or the maximum number of iterations is reached, at which point training is complete.
6. The video summarization method of claim 5, wherein the loss function L is expressed as follows:

L = (1/n) Σ_{i=1}^{n} (y_i − s_i)²

where y_i denotes the importance score of the i-th video frame, s_i denotes the annotation score of the i-th video frame, and n is the total number of frames in the video file.
7. The video summarization method of claim 6, wherein the annotation score s_i of the i-th video frame is obtained by having multiple users label the importance of the video frame, with label 1 denoting important and label 0 denoting unimportant; the annotation score s_i is the percentage of users assigning label 1 among all users.
8. The video summarization method of claim 1, wherein step (6) is implemented as follows: visually continuous frames are first grouped into shots, and the frame importance score sequence (y_1, …, y_n) is converted into a shot importance score sequence by averaging the importance scores of all frames within each shot; key shots are then selected with a 0/1 knapsack algorithm according to the shot importance score sequence and the shot lengths, with the total length of the key shots not exceeding 15% of the length of the whole video; finally, all key shots are synthesized into the video summary.
CN202210652477.2A 2022-06-07 2022-06-07 Video abstraction method guided by natural language Pending CN115033736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210652477.2A CN115033736A (en) 2022-06-07 2022-06-07 Video abstraction method guided by natural language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210652477.2A CN115033736A (en) 2022-06-07 2022-06-07 Video abstraction method guided by natural language

Publications (1)

Publication Number Publication Date
CN115033736A true CN115033736A (en) 2022-09-09

Family

ID=83122726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210652477.2A Pending CN115033736A (en) 2022-06-07 2022-06-07 Video abstraction method guided by natural language

Country Status (1)

Country Link
CN (1) CN115033736A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116208772A (en) * 2023-05-05 2023-06-02 浪潮电子信息产业股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN117835012A (en) * 2023-12-27 2024-04-05 北京智象未来科技有限公司 Controllable video generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Xiong et al. A unified framework for multi-modal federated learning
CN111930992B (en) Neural network training method and device and electronic equipment
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN115033736A (en) Video abstraction method guided by natural language
CN112800292B (en) Cross-modal retrieval method based on modal specific and shared feature learning
Ning et al. Semantics-consistent representation learning for remote sensing image–voice retrieval
CN111242033B (en) Video feature learning method based on discriminant analysis of video and text pairs
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
Cao et al. Visual consensus modeling for video-text retrieval
CN113392265A (en) Multimedia processing method, device and equipment
Lin et al. PS-mixer: A polar-vector and strength-vector mixer model for multimodal sentiment analysis
CN115934951A (en) Network hot topic user emotion prediction method
CN115775349A (en) False news detection method and device based on multi-mode fusion
CN115731498A (en) Video abstract generation method combining reinforcement learning and contrast learning
Duan et al. A web knowledge-driven multimodal retrieval method in computational social systems: Unsupervised and robust graph convolutional hashing
CN111222010A (en) Method for solving video time sequence positioning problem by using semantic completion neural network
CN117539999A (en) Cross-modal joint coding-based multi-modal emotion analysis method
CN116628261A (en) Video text retrieval method, system, equipment and medium based on multi-semantic space
CN116975347A (en) Image generation model training method and related device
Qi et al. Video captioning via a symmetric bidirectional decoder
CN112199531A (en) Cross-modal retrieval method and device based on Hash algorithm and neighborhood map
Kim et al. SWAG-Net: Semantic Word-Aware Graph Network for Temporal Video Grounding
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination