CN116992079A - Multi-mode video abstract extraction method based on video captions - Google Patents


Info

Publication number
CN116992079A
CN116992079A (application CN202310767163.1A)
Authority
CN
China
Prior art keywords
video
ith
frame
mth
caption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310767163.1A
Other languages
Chinese (zh)
Inventor
胡珍珍
王振山
宋子杰
洪日昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202310767163.1A priority Critical patent/CN116992079A/en
Publication of CN116992079A publication Critical patent/CN116992079A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Studio Circuits (AREA)

Abstract

The invention discloses a multi-mode video abstract extraction method based on video captions, which comprises the following steps: 1, acquiring a frame feature representation of a video; 2, acquiring a feature representation of the captions; 3, performing automated video frame importance assessment; 4, constructing and optimizing a summarizer model; 5, selecting key frames with the trained summarizer; and 6, optimizing a key-frame-based video caption generator. The invention can rapidly output the key frame set of a short video together with the corresponding captions: the key frame set reflects the whole content of the video in visual form with a small number of video frames, and the matched captions summarize the video pictures in text form. This helps users screen short videos more efficiently, saves storage space and computing resources, and is more favorable for deployment and application on terminal devices.

Description

Multi-mode video abstract extraction method based on video captions
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a multi-mode video abstract extraction method based on video captions.
Background
The explosive growth of short-video social software and self-media has led to a surge of internet videos, so how to quickly acquire the key information in a video has become an important problem. The goal of the video summarization task is to retrieve key frames or video clips, such as key shots, that contain as much information as possible with minimal redundancy. One straightforward application of video summarization is the cover display of videos on video websites, where a reasonable summary segment can help the user decide whether to click on the video. The specificities of the video summarization task, such as the strong subjectivity of the results, the great difficulty of annotating datasets and the variation of video resolution, pose great challenges to the improvement of video summarization technology.
The difficulty of annotating datasets results in a shortage of high-quality datasets in the field of video summarization, and conventional video summarization methods, such as MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization published by Xu et al. in 2022, tend to be based on the TVSum and SumMe datasets. For example, the TVSum dataset has 20 annotators score the importance of each frame for every video and contains 50 videos, while SumMe consists of key video segments selected by 15 to 20 annotators and contains only 20 videos. The cost of manually annotating a large-scale video summarization dataset is enormous and therefore impractical, so previous work has generally selected several low-quality datasets as supplementary training data. How to train a high-quality video summarization model with existing datasets without additional annotation cost, and how to use the summarized video frames in a reasonable way, remain problems to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a multi-mode video abstract extraction method based on video captions, which can output the video abstract and the video captions simultaneously, thereby helping users screen short videos more effectively, saving storage space and computing resources, and being more favorable for deployment and application on terminal devices.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
the invention discloses a multimode video abstract extraction method based on video captions, which is characterized by comprising the following steps:
step 1, acquiring frame characteristic representation of a video:
for a video subtitle data set D= { V, Y }, wherein V represents a video set and Y represents an English subtitle sentence set corresponding to each video in the video set V;
processing any ith video in the video set V with the visual encoder of the CLIP model to obtain the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video, wherein f_{i,n} represents the nth frame feature representation of the ith video and N represents the total number of frames of video i;
step 2, acquiring the characteristic representation of the caption:
adopting the text encoder of the CLIP model to process the English caption sentences Y_i = {y_{i,1,1}, ..., y_{i,1,W}; ...; y_{i,m,1}, y_{i,m,2}, ..., y_{i,m,t}, ..., y_{i,m,W}; ...; y_{i,M,1}, ..., y_{i,M,W}} corresponding to the ith video, thereby obtaining the English caption text vectors T_i = {t_{i,1}, t_{i,2}, ..., t_{i,m}, ..., t_{i,M}} corresponding to video i, wherein y_{i,m,t} represents the tth word in the mth caption sentence corresponding to the ith video, t_{i,m} represents the mth caption vector of the English caption sentences corresponding to the ith video, M represents the total number of caption sentences, and W represents the total number of words in a caption sentence;
Step 3, obtaining, by formula (1), the average similarity s(f_{i,n}, T_i) between the nth frame feature representation f_{i,n} of the ith video and the caption text vectors T_i, and taking it as the automated score s_{i,n} of the nth frame feature f_{i,n} of video i:
s(f_{i,n}, T_i) = (1/M) Σ_{m=1}^{M} (f_{i,n})^{Tr} t_{i,m}   (1)
In formula (1), Tr represents vector transposition;
Step 4, constructing a video summarizer comprising a self-attention mechanism layer, a local attention enhancement layer and a fully-connected network MLP, and training it;
Step 4.1, the self-attention mechanism layer calculates, by formula (2), the cross-relation score r(f_{i,n}, f_{i,j}) between the nth frame feature representation f_{i,n} and the jth frame feature representation f_{i,j} of the ith video:
r(f_{i,n}, f_{i,j}) = P × tanh(W_1 f_{i,n} + W_2 f_{i,j} + b)   (2)
In formula (2), P, W_1 and W_2 are three parameter matrices to be learned, b is a bias vector, and tanh represents an activation function;
Step 4.2, the local attention enhancement layer calculates, by formula (3), the locally attention-enhanced video frame feature f̃_{i,n} for the nth frame feature representation f_{i,n} of the ith video, thereby obtaining the locally attention-enhanced feature representation F̃_i = {f̃_{i,1}, ..., f̃_{i,N}} of the ith video:
f̃_{i,n} = Σ_{j=1}^{N} α_{i,n,j} ⊙ f_{i,j}   (3)
In formula (3), α_{i,n,j} represents the relation weight between the jth frame feature representation f_{i,j} and the nth frame feature representation f_{i,n} of the ith video, ⊙ represents element-wise multiplication of vectors, and the relation weight is given by formula (4):
α_{i,n,j} = exp(r(f_{i,n}, f_{i,j})) / Σ_{j'=1}^{N} exp(r(f_{i,n}, f_{i,j'}))   (4)
Step 4.3, the fully-connected network MLP calculates, by formula (5), the prediction score p_{i,n} of the nth frame feature representation f_{i,n} of the ith video:
p_{i,n} = MLP(f̃_{i,n} + f_{i,n})   (5)
In formula (5), the MLP uses the GeLU activation function, and + represents the residual connection between the locally enhanced feature and the original frame feature;
Step 4.4, constructing the binary cross-entropy loss L_vsum by formula (7):
L_vsum = -(1/(B×N)) Σ_{i=1}^{B} Σ_{n=1}^{N} [ s_{i,n} log p_{i,n} + (1 - s_{i,n}) log(1 - p_{i,n}) ]   (7)
In formula (7), B represents the number of videos in the video subtitle data set D;
in the first training stage, training the video summarizer with the back-propagation and gradient-descent method based on the video caption data set D, and stopping training when the binary cross-entropy loss L_vsum reaches its minimum, thereby obtaining the trained video summarizer model;
Step 5, inputting the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video into the trained video summarizer model, and selecting the top K frame feature representations with the highest prediction scores to form the optimal video frame set F*_i = {f*_{i,1}, ..., f*_{i,k}, ..., f*_{i,K}}, wherein f*_{i,k} represents the kth optimal frame feature representation of the ith video and K represents the number of selected optimal video frames;
Step 6, constructing a decoder consisting of a lightweight long short-term memory network LSTM, and training it;
Step 6.1, when t = 1, inputting the optimal video frame set F*_i corresponding to the ith video into the decoder to obtain the predicted word ŷ_{i,m,1} of the mth caption sentence corresponding to the ith video output at the 1st time step;
when t = 2, 3, ..., W, randomly initializing the tth-step control factor ξ_t; if ξ_t falls below a set threshold, the predicted word ŷ_{i,m,t-1} of the mth caption sentence corresponding to the ith video output at the (t-1)th time step is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step; otherwise, the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step;
Step 6.2, constructing the cross-entropy loss L_XE by formula (8):
L_XE = -Σ_{i=1}^{B} Σ_{m=1}^{M} Σ_{t=1}^{W} log p_θ(y_{i,m,t})   (8)
In formula (8), p_θ(y_{i,m,t}) represents the prediction probability output by the decoder at the tth step for the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video, and θ represents the learnable parameters;
Step 6.3, in the second training stage, training the decoder with the back-propagation and gradient-descent method based on the English caption sentences Y_i, and stopping training when the cross-entropy loss L_XE reaches its minimum, thereby obtaining the trained decoder model, which performs caption output for the optimal video frames output by the trained video summarizer model.
The electronic device of the invention comprises a memory and a processor, wherein the memory is used for storing a program that supports the processor in executing the multi-mode video abstract extraction method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when run by a processor, performs the steps of the multi-mode video abstract extraction method.
Compared with the prior art, the invention has the beneficial effects that:
1. Through the coupling of the two components of the dual video summarization framework, namely the summarizer and the decoder, the invention can rapidly and efficiently compress video content while ensuring the accuracy of semantic information, thereby improving the efficiency with which users browse short videos; it can also serve as the cover display of short videos on short-video websites.
2. The present invention summarizes video content using both the visual and text modalities: visual representations of the frames are extracted, frame-level scores are obtained from them, and the key frames are selected accordingly. The cross-modal video summarization model selects the most meaningful and semantically consistent frames so as to compress the video content without reducing the quality of the video description, thereby eliminating redundant frames in the video and improving the efficiency of video representation.
3. The invention uses a lightweight LSTM decoder to generate the description and can convey the same semantic information without a large number of key frames, thereby bringing beneficial application value to the fields of video coding and video-text data processing.
Drawings
FIG. 1 is a framework diagram of the multi-modal video summary model of the present invention;
FIG. 2 is a block diagram of a summarizer according to the present invention;
FIG. 3 is a block diagram of a caption generator according to the present invention;
FIG. 4 is a flow chart of the multi-modal video summary model training of the present invention.
Detailed Description
In this embodiment, a method for extracting a multi-mode video summary based on video subtitles, as shown in fig. 1 and fig. 4, is performed according to the following steps:
step 1, acquiring frame characteristic representation of a video:
for a video subtitle data set D= { V, Y }, wherein V represents a video set and Y represents an English subtitle sentence set corresponding to each video in the video set V;
processing any ith video in the video set V with the visual encoder of the CLIP model to obtain the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video, wherein f_{i,n} represents the nth frame feature representation of the ith video and N represents the total number of frames of video i; in this embodiment, N = 12;
step 2, acquiring the characteristic representation of the caption:
adopting the text encoder of the CLIP model to process the English caption sentences Y_i = {y_{i,1,1}, ..., y_{i,1,W}; ...; y_{i,m,1}, y_{i,m,2}, ..., y_{i,m,t}, ..., y_{i,m,W}; ...; y_{i,M,1}, ..., y_{i,M,W}} corresponding to the ith video, thereby obtaining the caption text vectors T_i = {t_{i,1}, t_{i,2}, ..., t_{i,m}, ..., t_{i,M}} corresponding to video i, wherein y_{i,m,t} represents the tth word in the mth caption sentence corresponding to the ith video and t_{i,m} represents the mth caption vector of the English caption sentences corresponding to the ith video; in this embodiment, M = 20 and W = 30;
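For illustration only, steps 1 and 2 of this embodiment can be sketched with the open-source OpenAI CLIP package roughly as follows; the ViT-B/32 checkpoint, the frame-path interface, the L2 normalization and all variable names are assumptions of the sketch rather than part of the disclosure:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video_frames(frame_paths):
    """Step 1: frame feature representation F_i = {f_i,1, ..., f_i,N} (N = 12 here)."""
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        frame_feats = model.encode_image(images)              # (N, d)
    return frame_feats / frame_feats.norm(dim=-1, keepdim=True)

def encode_captions(sentences):
    """Step 2: caption text vectors T_i = {t_i,1, ..., t_i,M} (M = 20, W <= 30 here)."""
    tokens = clip.tokenize(sentences, truncate=True).to(device)
    with torch.no_grad():
        text_feats = model.encode_text(tokens)                # (M, d)
    return text_feats / text_feats.norm(dim=-1, keepdim=True)
```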
Step 3, as shown in FIG. 2, obtaining, by formula (1), the average similarity s(f_{i,n}, T_i) between the nth frame feature representation f_{i,n} of the ith video and the caption text vectors T_i, and taking it as the automated score s_{i,n} of the nth frame feature f_{i,n} of video i:
s(f_{i,n}, T_i) = (1/M) Σ_{m=1}^{M} (f_{i,n})^{Tr} t_{i,m}   (1)
In formula (1), Tr represents vector transposition;
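A corresponding sketch of formula (1), computing the per-frame automated scores from the features produced above (the assumption that both feature sets are L2-normalized is carried over from the previous sketch):

```python
import torch

def automated_frame_scores(frame_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Formula (1): average similarity between each frame feature f_i,n and the
    M caption vectors in T_i, used as the automated score of that frame."""
    # frame_feats: (N, d), text_feats: (M, d)
    sim = frame_feats @ text_feats.t()        # (N, M) pairwise transposed-product similarities
    return sim.mean(dim=1)                    # (N,) average over the M captions
```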
Step 4, constructing a video summarizer comprising a self-attention mechanism layer, a local attention enhancement layer and a fully-connected network MLP, and training it;
Step 4.1, the self-attention mechanism layer calculates, by formula (2), the cross-relation score r(f_{i,n}, f_{i,j}) between the nth frame feature representation f_{i,n} and the jth frame feature representation f_{i,j} of the ith video:
r(f_{i,n}, f_{i,j}) = P × tanh(W_1 f_{i,n} + W_2 f_{i,j} + b)   (2)
In formula (2), P, W_1 and W_2 are three parameter matrices to be learned, b is a bias vector, and tanh represents an activation function;
Step 4.2, the local attention enhancement layer calculates, by formula (3), the locally attention-enhanced video frame feature f̃_{i,n} for the nth frame feature representation f_{i,n} of the ith video, thereby obtaining the locally attention-enhanced feature representation F̃_i = {f̃_{i,1}, ..., f̃_{i,N}} of the ith video:
f̃_{i,n} = Σ_{j=1}^{N} α_{i,n,j} ⊙ f_{i,j}   (3)
In formula (3), α_{i,n,j} represents the relation weight between the jth frame feature representation f_{i,j} and the nth frame feature representation f_{i,n} of the ith video, ⊙ represents element-wise multiplication of vectors, and the relation weight is given by formula (4):
α_{i,n,j} = exp(r(f_{i,n}, f_{i,j})) / Σ_{j'=1}^{N} exp(r(f_{i,n}, f_{i,j'}))   (4)
Step 4.3, the fully-connected network MLP calculates, by formula (5), the prediction score p_{i,n} of the nth frame feature representation f_{i,n} of the ith video:
p_{i,n} = MLP(f̃_{i,n} + f_{i,n})   (5)
In formula (5), the MLP uses the GeLU activation function, and + represents the residual connection between the locally enhanced feature and the original frame feature;
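For illustration only, steps 4.1 to 4.3 can be sketched in PyTorch roughly as follows; treating the relation weights of formulas (3) and (4) as scalar softmax weights, adding a sigmoid on the output, and the chosen layer sizes are assumptions of the sketch rather than the claimed structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoSummarizer(nn.Module):
    """Sketch of the step-4 summarizer: a self-attention layer producing the
    cross-relation scores of formula (2), a local attention enhancement layer,
    and an MLP head with GeLU and a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=True)      # its bias plays the role of b
        self.P = nn.Linear(dim, 1, bias=False)        # parameter matrix P
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:   # frames: (N, d)
        # formula (2): r(f_i,n, f_i,j) = P x tanh(W1 f_i,n + W2 f_i,j + b)
        r = self.P(torch.tanh(self.W1(frames).unsqueeze(1)
                              + self.W2(frames).unsqueeze(0))).squeeze(-1)   # (N, N)
        attn = F.softmax(r, dim=-1)                   # relation weights, cf. formula (4)
        enhanced = attn @ frames                      # locally enhanced features, cf. formula (3)
        # formula (5): MLP prediction with a residual connection to the frame feature
        return torch.sigmoid(self.mlp(enhanced + frames)).squeeze(-1)        # (N,) scores
```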
Step 4.4, in the first training stage, training the video summarizer with the back-propagation and gradient-descent method based on the video subtitle data set D, and optimizing the video summarizer by minimizing the binary cross-entropy loss L_vsum shown in formula (7), thereby obtaining the trained video summarizer model:
L_vsum = -(1/(B×N)) Σ_{i=1}^{B} Σ_{n=1}^{N} [ s_{i,n} log p_{i,n} + (1 - s_{i,n}) log(1 - p_{i,n}) ]   (7)
In formula (7), B represents the number of videos in the video subtitle data set D.
In this embodiment, a maximum iteration number epoch_number is set 1 10, adopting an Adam optimization algorithm with learning rate and exponential decay rate by a gradient descent method, and when the iteration number reaches epoch_number 1 When the training is stopped, the objective function loses L vsum To the minimum;
Step 5, inputting the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video into the trained video summarizer model, and selecting the top K frame feature representations with the highest prediction scores to form the optimal video frame set F*_i = {f*_{i,1}, ..., f*_{i,k}, ..., f*_{i,K}}, wherein f*_{i,k} represents the kth optimal frame feature representation of the ith video and K represents the number of selected optimal video frames;
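Step 5 then reduces to a top-K selection over the prediction scores, sketched as follows:

```python
import torch

def select_key_frames(summarizer, frame_feats, k):
    """Step 5: keep the K frame features with the highest prediction scores as the
    optimal video frame set that is fed to the caption decoder."""
    with torch.no_grad():
        scores = summarizer(frame_feats)                       # (N,)
    topk = torch.topk(scores, k=min(k, scores.numel()))
    return frame_feats[topk.indices], topk.indices
```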
Step 6, constructing a decoder consisting of a lightweight long short-term memory network LSTM and training it, as shown in FIG. 3. The description is generated by the lightweight LSTM decoder: video frames thinned at a certain sampling rate are used to generate the ViT features of the video, which serve as the input of the summarizer of the model; the summarizer judges the information content of the input video frames and gives a specific quantitative evaluation, and the frame feature set with the highest information content is then screened out according to this evaluation and sent to the LSTM decoder to generate the language description.
Step 6.1, when t = 1, inputting the optimal video frame set F*_i corresponding to the ith video into the decoder to obtain the predicted word ŷ_{i,m,1} of the mth caption sentence corresponding to the ith video output at the 1st time step;
when t = 2, 3, ..., W, randomly initializing the tth-step control factor ξ_t; if ξ_t falls below a set threshold, the predicted word ŷ_{i,m,t-1} of the mth caption sentence corresponding to the ith video output at the (t-1)th time step is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step; otherwise, the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step;
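For illustration only, step 6.1 can be sketched as the following lightweight LSTM decoder; mean-pooling the key-frame features into the initial state, the embedding and hidden sizes, and the 0.5 threshold on the control factor ξ_t are assumptions of the sketch, since they are not specified above:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Sketch of the step-6 lightweight LSTM decoder with scheduled sampling:
    at each step t >= 2 a random control factor xi_t chooses between the
    previous prediction and the ground-truth word."""
    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(feat_dim, hidden)
        self.cell = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, key_frames, gt_words):                    # gt_words: (W,) token ids
        h = torch.tanh(self.init_h(key_frames.mean(dim=0, keepdim=True)))   # (1, hidden)
        c = torch.zeros_like(h)
        logits, prev = [], gt_words[:1]                          # t = 1: start from the first word
        for t in range(gt_words.numel()):
            h, c = self.cell(self.embed(prev), (h, c))
            step_logits = self.out(h)                            # (1, vocab)
            logits.append(step_logits)
            if t + 1 < gt_words.numel():
                xi = torch.rand(1).item()                        # control factor xi_t
                prev = (step_logits.argmax(dim=-1) if xi < 0.5
                        else gt_words[t + 1:t + 2])              # scheduled sampling
        return torch.cat(logits, dim=0)                          # (W, vocab) word logits
```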
Step 6.2, constructing the cross-entropy loss L_XE by formula (8):
L_XE = -Σ_{i=1}^{B} Σ_{m=1}^{M} Σ_{t=1}^{W} log p_θ(y_{i,m,t})   (8)
In formula (8), p_θ(y_{i,m,t}) represents the prediction probability output by the decoder at the tth step for the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video, and θ represents the learnable parameters;
Step 6.3, in the second training stage, training the decoder with the back-propagation and gradient-descent method based on the English caption sentences Y_i and computing L_XE to update the network parameters; the maximum iteration number epoch_number_2 is set to 30; in this step, the gradient-descent method adopts the Adam optimization algorithm with a learning rate and an exponential decay rate, and training stops when the iteration number reaches epoch_number_2, thereby obtaining the trained decoder, which performs caption output for the optimal video frames output by the trained video summarizer model.
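A minimal sketch of this second training stage, assuming a data loader that yields the selected key-frame features together with the token ids of one ground-truth caption sentence:

```python
import torch
import torch.nn.functional as F

def train_decoder(decoder, loader, epochs=30, lr=1e-4):
    """Second training stage: Adam optimisation of the cross-entropy loss L_XE of
    formula (8) on the ground-truth caption words, for epoch_number_2 = 30 epochs.
    The loader layout and the learning rate are assumptions of this sketch."""
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for key_frames, gt_words in loader:            # key frames from the trained summarizer
            logits = decoder(key_frames, gt_words)     # (W, vocab)
            loss = F.cross_entropy(logits, gt_words)   # -(1/W) * sum_t log p_theta(y_i,m,t)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return decoder
```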
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
In summary, the method aims at the hot tide of short video, and aims at a video output key frame set and corresponding subtitles thereof, wherein the key frame set reflects the whole content of the video in a visual form by a small number of video frames, and the matched subtitles summarize the video pictures in a text form and reflect the video content from two angles of visual text. The method has the advantages that the number of the used model parameters is small, the requirements on the storage space and the computing resources are limited, and the application can be effectively deployed.

Claims (3)

1. A multi-mode video abstract extraction method based on video captions is characterized by comprising the following steps:
step 1, acquiring frame characteristic representation of a video:
for a video subtitle data set D= { V, Y }, wherein V represents a video set and Y represents an English subtitle sentence set corresponding to each video in the video set V;
processing any ith video in the video set V with the visual encoder of the CLIP model to obtain the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video, wherein f_{i,n} represents the nth frame feature representation of the ith video and N represents the total number of frames of video i;
step 2, acquiring the characteristic representation of the caption:
adopting the text encoder of the CLIP model to process the English caption sentences Y_i = {y_{i,1,1}, ..., y_{i,1,W}; ...; y_{i,m,1}, y_{i,m,2}, ..., y_{i,m,t}, ..., y_{i,m,W}; ...; y_{i,M,1}, ..., y_{i,M,W}} corresponding to the ith video, thereby obtaining the English caption text vectors T_i = {t_{i,1}, t_{i,2}, ..., t_{i,m}, ..., t_{i,M}} corresponding to video i, wherein y_{i,m,t} represents the tth word in the mth caption sentence corresponding to the ith video, t_{i,m} represents the mth caption vector of the English caption sentences corresponding to the ith video, M represents the total number of caption sentences, and W represents the total number of words in a caption sentence;
Step 3, obtaining, by formula (1), the average similarity s(f_{i,n}, T_i) between the nth frame feature representation f_{i,n} of the ith video and the caption text vectors T_i, and taking it as the automated score s_{i,n} of the nth frame feature f_{i,n} of video i:
s(f_{i,n}, T_i) = (1/M) Σ_{m=1}^{M} (f_{i,n})^{Tr} t_{i,m}   (1)
In formula (1), Tr represents vector transposition;
Step 4, constructing a video summarizer comprising a self-attention mechanism layer, a local attention enhancement layer and a fully-connected network MLP, and training it;
Step 4.1, the self-attention mechanism layer calculates, by formula (2), the cross-relation score r(f_{i,n}, f_{i,j}) between the nth frame feature representation f_{i,n} and the jth frame feature representation f_{i,j} of the ith video:
r(f_{i,n}, f_{i,j}) = P × tanh(W_1 f_{i,n} + W_2 f_{i,j} + b)   (2)
In formula (2), P, W_1 and W_2 are three parameter matrices to be learned, b is a bias vector, and tanh represents an activation function;
Step 4.2, the local attention enhancement layer calculates, by formula (3), the locally attention-enhanced video frame feature f̃_{i,n} for the nth frame feature representation f_{i,n} of the ith video, thereby obtaining the locally attention-enhanced feature representation F̃_i = {f̃_{i,1}, ..., f̃_{i,N}} of the ith video:
f̃_{i,n} = Σ_{j=1}^{N} α_{i,n,j} ⊙ f_{i,j}   (3)
In formula (3), α_{i,n,j} represents the relation weight between the jth frame feature representation f_{i,j} and the nth frame feature representation f_{i,n} of the ith video, ⊙ represents element-wise multiplication of vectors, and the relation weight is given by formula (4):
α_{i,n,j} = exp(r(f_{i,n}, f_{i,j})) / Σ_{j'=1}^{N} exp(r(f_{i,n}, f_{i,j'}))   (4)
Step 4.3, the fully-connected network MLP calculates, by formula (5), the prediction score p_{i,n} of the nth frame feature representation f_{i,n} of the ith video:
p_{i,n} = MLP(f̃_{i,n} + f_{i,n})   (5)
In formula (5), the MLP uses the GeLU activation function, and + represents the residual connection between the locally enhanced feature and the original frame feature;
Step 4.4, constructing the binary cross-entropy loss L_vsum by formula (7):
L_vsum = -(1/(B×N)) Σ_{i=1}^{B} Σ_{n=1}^{N} [ s_{i,n} log p_{i,n} + (1 - s_{i,n}) log(1 - p_{i,n}) ]   (7)
In formula (7), B represents the number of videos in the video subtitle data set D;
in the first training stage, training the video summarizer with the back-propagation and gradient-descent method based on the video caption data set D, and stopping training when the binary cross-entropy loss L_vsum reaches its minimum, thereby obtaining the trained video summarizer model;
Step 5, inputting the frame feature representation F_i = {f_{i,1}, f_{i,2}, ..., f_{i,n}, ..., f_{i,N}} of the ith video into the trained video summarizer model, and selecting the top K frame feature representations with the highest prediction scores to form the optimal video frame set F*_i = {f*_{i,1}, ..., f*_{i,k}, ..., f*_{i,K}}, wherein f*_{i,k} represents the kth optimal frame feature representation of the ith video and K represents the number of selected optimal video frames;
Step 6, constructing a decoder consisting of a lightweight long short-term memory network LSTM, and training it;
Step 6.1, when t = 1, inputting the optimal video frame set F*_i corresponding to the ith video into the decoder to obtain the predicted word ŷ_{i,m,1} of the mth caption sentence corresponding to the ith video output at the 1st time step;
when t = 2, 3, ..., W, randomly initializing the tth-step control factor ξ_t; if ξ_t falls below a set threshold, the predicted word ŷ_{i,m,t-1} of the mth caption sentence corresponding to the ith video output at the (t-1)th time step is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step; otherwise, the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video is processed by the decoder to obtain the predicted word ŷ_{i,m,t} output at the tth time step;
Step 6.2, constructing the cross-entropy loss L_XE by formula (8):
L_XE = -Σ_{i=1}^{B} Σ_{m=1}^{M} Σ_{t=1}^{W} log p_θ(y_{i,m,t})   (8)
In formula (8), p_θ(y_{i,m,t}) represents the prediction probability output by the decoder at the tth step for the tth word y_{i,m,t} of the mth caption sentence corresponding to the ith video, and θ represents the learnable parameters;
Step 6.3, in the second training stage, training the decoder with the back-propagation and gradient-descent method based on the English caption sentences Y_i, and stopping training when the cross-entropy loss L_XE reaches its minimum, thereby obtaining the trained decoder model, which performs caption output for the optimal video frames output by the trained video summarizer model.
2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor to perform the multimodal video summary extraction method of claim 1, the processor being configured to execute the program stored in the memory.
3. A computer readable storage medium having a computer program stored thereon, which when executed by a processor performs the steps of the multimodal video summary extraction method of claim 1.
CN202310767163.1A 2023-06-27 2023-06-27 Multi-mode video abstract extraction method based on video captions Pending CN116992079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310767163.1A CN116992079A (en) 2023-06-27 2023-06-27 Multi-mode video abstract extraction method based on video captions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310767163.1A CN116992079A (en) 2023-06-27 2023-06-27 Multi-mode video abstract extraction method based on video captions

Publications (1)

Publication Number Publication Date
CN116992079A true CN116992079A (en) 2023-11-03

Family

ID=88520376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310767163.1A Pending CN116992079A (en) 2023-06-27 2023-06-27 Multi-mode video abstract extraction method based on video captions

Country Status (1)

Country Link
CN (1) CN116992079A (en)

Similar Documents

Publication Publication Date Title
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
KR102480323B1 (en) Method and system for retrieving video time segments
CN106919646B (en) Chinese text abstract generating system and method
CN106777125B (en) Image description generation method based on neural network and image attention point
CN109344288A (en) A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
JP7431833B2 (en) Language sequence labeling methods, devices, programs and computing equipment
CN109543820B (en) Image description generation method based on architecture phrase constraint vector and double vision attention mechanism
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
US12050983B2 (en) Attention neural networks with parallel attention and feed-forward layers
Bilkhu et al. Attention is all you need for videos: Self-attention based video summarization using universal transformers
CN113392265A (en) Multimedia processing method, device and equipment
RU2712101C2 (en) Prediction of probability of occurrence of line using sequence of vectors
CN114743143A (en) Video description generation method based on multi-concept knowledge mining and storage medium
CN114387537A (en) Video question-answering method based on description text
CN114925232B (en) Cross-modal time domain video positioning method under text segment question-answering framework
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN112651225B (en) Multi-item selection machine reading understanding method based on multi-stage maximum attention
CN114330352A (en) Named entity identification method and system
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
Patankar et al. Image Captioning with Audio Reinforcement using RNN and CNN
CN116681078A (en) Keyword generation method based on reinforcement learning
CN115937641A (en) Method, device and equipment for intermodal joint coding based on Transformer
CN115906879A (en) Translation model training method for vertical domain and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination