CN114020964A - Method for implementing video summarization using a memory network and gated recurrent units - Google Patents

Method for implementing video summarization using a memory network and gated recurrent units

Info

Publication number
CN114020964A
Authority
CN
China
Prior art keywords
video
output
memory
input
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111346288.4A
Other languages
Chinese (zh)
Inventor
马然
苏敏
张冰
安平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202111346288.4A priority Critical patent/CN114020964A/en
Publication of CN114020964A publication Critical patent/CN114020964A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides a method for implementing video summarization using a memory network and gated recurrent units, which comprises the following steps: preprocessing a data set to obtain the video frame features of each video; designing an overall network model that takes the video frame features as input and the frame importance scores as output; training the overall network model; converting the frame importance scores obtained from the trained model into key shots; and combining the key shots into a dynamic video summary and testing the summarization result. In the invention, the overall network model performs feature processing on the input video, and the dynamic video summary is obtained using the frame importance scores, which improves the overall effectiveness of the model.

Description

Method for implementing video summarization using a memory network and gated recurrent units
Technical Field
The invention relates to the technical field of video summarization, and in particular to a method for implementing video summarization using a memory network and gated recurrent units.
Background
With the development of the internet and the emergence of various mobile portable shooting devices, people can record the people and things around them on video anytime and anywhere and upload the recordings to the internet, so the number of videos has grown explosively. This massive volume of video data inevitably puts pressure on video storage and management. Meanwhile, when users search for the information they need among a large number of related videos, they often have to browse the entire content of each video to filter out irrelevant information; although double-speed playback can accelerate browsing, it still costs considerable time and effort. Therefore, a means of extracting the important content of a video is urgently needed, and video summarization emerged to meet this need.
Video summarization technology condenses video content by extracting meaningful frames or meaningful shots from a video. It provides a way for people to efficiently access and manage massive video data, and relieves the pressure on video storage, transmission, archiving and retrieval caused by the explosive growth of online videos in recent years. At the same time, as a means for users to quickly browse and grasp the important content of a video, video summaries can greatly improve user experience.
The hierarchical structure of a video can generally be divided, from top to bottom, into video stream, scene, shot and frame, and starting from this hierarchy, video summaries can be divided into static and dynamic video summaries. A static video summary is composed of selected meaningful frames, while a dynamic video summary is composed of meaningful shots. For a static video summary, i.e. a summary in key-frame form, the summarization process can be divided into 3 steps: feature extraction, video content importance calculation and video summary generation. The features of the video frames are extracted first, the content importance of the frames is then calculated, and finally the important frames are selected according to their importance to form the static summary. For a dynamic video summary, i.e. a summary in key-shot form, the process can generally be divided into four steps: feature extraction, video shot segmentation, video content importance calculation and video summary generation. The frame features are extracted first, the video is then segmented into shots according to the extracted features, the content importance of each shot is calculated, and finally a certain number of shots are selected according to their importance to form the summary.
Early conventional video summarization methods mostly used hand-crafted features or clustering to decide whether to select video frames or video shots for the summary, e.g. Gygli M, Grabner H, Riemenschneider H, et al. Creating Summaries from User Videos. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T, editors. Computer Vision - ECCV 2014, Pt VI. Cham: Springer International Publishing AG, 2014: 505-520.
With the excellent performance of deep learning in various application fields, the first deep-learning-based video summarization work, Zhang K, Chao W-L, Sha F, et al. Video Summarization with Long Short-Term Memory [C]. Computer Vision - ECCV 2016, 2016: 766-782, appeared in 2016. The authors first model the variable-range temporal dependency between video frames using the long short-term memory network (LSTM), then estimate frame importance with a multilayer perceptron (MLP), and increase the visual diversity of the generated summary based on the determinantal point process (DPP).
Deep-learning-based video summarization can be classified into unsupervised, supervised and weakly supervised video summarization according to whether the algorithm needs supervision information and how strong that information is. Unsupervised methods evaluate the importance, representativeness, interestingness and so on of the video content using specific criteria, and then generate the video summary from the evaluation result. Hu M, Hu R, Wang X, et al. Unsupervised Temporal Attention Summarization Model for User Created Videos [C]. MultiMedia Modeling, 2021: 519-530 propose the temporal-attention-based summarization model TASM, in which more comprehensive information is obtained by combining the outputs of the encoder and decoder; a reward is designed to enhance the diversity of the selected key frames, and the deep deterministic policy gradient (DDPG) is adopted to train TASM. This network structure alleviates the high redundancy and information distortion between key frames found in existing user-video summaries. Supervised methods train a neural network on human-annotated data, so the model can usually capture video content with richer semantic information, and the resulting summaries are generally better than those of unsupervised methods. Zhao B, Li X, Lu X. TTH-RNN: Tensor-Train Hierarchical Recurrent Neural Network for Video Summarization [J]. IEEE Transactions on Industrial Electronics, 2021, 68(4): 3629-3637 propose the TTH-RNN network, which contains a Tensor-Train embedding layer to avoid an overly large mapping matrix from high-dimensional video features to the hidden layer, thereby reducing training difficulty, and uses hierarchical RNNs to explore long-range temporal correlations between video frames. Although supervised methods generally outperform unsupervised ones, the existing video summarization datasets are few in number and small in scale, and building a new large-scale dataset is time-consuming and laborious. A VESD (Variational Encoder-Summarizer-Decoder) model has been proposed, which consists of two parts, a variational encoder and an encoder-attention-decoder: the variational encoder learns latent semantics from web videos, and the encoder-attention-decoder performs importance evaluation and summary generation. Most of these methods use convolutional or recurrent neural networks to acquire the spatio-temporal information of the video, and their ability to acquire long-range memory is poor.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a method for implementing video summarization using a memory network and gated recurrent units.
According to one aspect of the present invention, there is provided a method for implementing video summarization using a memory network and gated recurrent units, comprising:
preprocessing a data set to obtain video frame characteristics of a video;
designing an overall network model which takes the video frame features as input and the frame importance scores as output;
training the whole network model;
converting the frame importance scores output from the trained overall network model into key shots;
and combining the key shots into a dynamic video abstract.
Preferably, preprocessing the data set to obtain the video frame features of the video comprises:
sampling a video into a sequence of video frames;
and extracting features of the video frame sequence using a pre-trained GoogLeNet model to obtain the video frame features corresponding to each video.
Preferably, the overall network model with the video frame features as input and the frame importance scores as output comprises:
an input module for fusing the forward and backward information of the input video frame features;
a memory module, which takes the fused result output by the input module as its input, repeatedly retrieves information in memory using a multi-hop structure, and outputs a memory output that maximally retains useful information;
and an output module, the memory output of the memory module being the input of the output module, the output module regressing the memory output to the frame importance scores.
preferably, the input module is configured to process the input video frame feature sequence using a layer of bidirectional GRUs, including:
the forward output of the GRU contains past information of the current time;
backward output of the GRU comprises future information of the current moment;
and the bidirectional GRU fuses and adds the forward output and the backward output element by element to obtain complete past and future information of the current moment.
Preferably, constructing the memory module comprises:
the coding vector obtained by passing the fused sequence output by the bidirectional GRU through a fully connected layer is used as the original problem vector and input into the first hop;
respectively inputting the fusion sequence output by the bidirectional GRU to two different full-connection layers of the first hop to obtain input memory and output memory;
the problem vector and the input memory are subjected to inner product, and self-attention weight is obtained through Softmax;
the self-attention weights and the output memory are combined by weighted summation, the result is multiplied element by element with the problem vector, and the memory output is obtained through a fully connected layer with a normalization layer and an activation function;
inputting the memory output as a new problem vector into the next hop to calculate self-attention weight and memory output; until the memory output of the last hop is used as the input of the output module.
Preferably, between adjacent hops, the fully connected matrix of the output memory of one hop is equal to the fully connected matrix of the input memory of the next hop; and the fully connected matrix of the original problem vector is equal to the fully connected matrix of the input memory of the first hop.
Preferably, constructing the output module comprises:
the memory output is residually connected with the fused sequence output by the bidirectional GRU;
sequentially passing the output of the residual connection through an LN layer and a Dropout layer;
and inputting the output result of the Dropout layer into a two-layer fully-connected network, wherein the node number of the output layer of the last fully-connected layer is 1, and the frame importance score corresponding to each video frame is obtained.
Preferably, the training the whole network model includes:
the network model is trained using the mean square error as a loss function, and an Adam optimizer.
Preferably, converting the frame importance scores obtained from the trained overall network model into key shots comprises:
detecting scene change points using a KTS algorithm, regarding the video clip between two adjacent scene change points as a video shot, and dividing the video into a plurality of video shots;
for each video shot, calculating the average value of the importance scores of all frames in the video shot as a shot score;
and executing a knapsack algorithm to select key shots for the plurality of divided shots and the shot scores.
Preferably, the knapsack algorithm maximizes the total score of the key shots, whose total length needs to be limited to 15% of the original video length.
Compared with the prior art, the invention has the following beneficial effects:
the method for realizing video abstraction by using the memory network and the gate control cycle unit has the advantages that the overall network model carries out feature processing on input videos, and dynamic video abstraction is obtained by using the frame importance scores, so that the overall effectiveness of the model is improved.
Aiming at the deficiency that convolutional neural networks and recurrent neural networks cannot acquire long-range memory, the method uses the memory network to repeatedly retrieve information in memory, so that useful information is retained to the maximum extent. Meanwhile, a bidirectional GRU is used in the input module to fuse the forward and backward information of the video, which improves the F1 score of the model. Experiments on the two benchmark datasets TVSum and SumMe verify the effectiveness of the method of the invention.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart of a method for implementing video summarization using a memory network and gated recurrent units according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an overall network model according to a preferred embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an input module according to a preferred embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a memory module according to a preferred embodiment of the present invention;
fig. 5 is a schematic structural diagram of an output module according to a preferred embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Fig. 1 shows a flowchart of a method for implementing video summarization using a memory network and gated recurrent units according to an embodiment of the present invention; the method comprises:
S1, preprocessing the data set to obtain the video frame features of the video;
S2, designing an overall network model with the video frame features as input and the frame importance scores as output;
S3, training the overall network model;
S4, converting the frame importance scores obtained from the trained overall network model into key shots;
and S5, combining the key shots into a dynamic video summary and testing the video summarization result.
For better data processing, the present invention provides a preferred embodiment to perform S1. The video is first sampled into a sequence of video frames at a sampling rate of 2 fps, and the features of the frame sequence are then extracted using the pool5 layer of a GoogLeNet model pre-trained on the large-scale image dataset ImageNet. The feature dimension of each video frame is 1024, so that a feature vector sequence {Xk} is obtained for each video. The length of {Xk} is N, where N denotes the number of video frames sampled from the video; the dimension of each vector Xk is D, where D denotes the feature dimension of the video frame.
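As an illustration of this preprocessing step, the following is a minimal sketch assuming PyTorch with a recent torchvision (whose pre-trained GoogLeNet exposes the 1024-dimensional global-average-pooling output corresponding to pool5) and OpenCV for frame decoding; the function name extract_features and the sampling logic are illustrative, not taken from the patent.

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained GoogLeNet; replacing the classifier with Identity exposes
# the 1024-d output of the global average pooling ("pool5") layer.
net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
net.fc = torch.nn.Identity()
net.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, fps=2):
    """Sample the video at `fps` frames per second; return an (N, 1024) tensor {Xk}."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / fps)), 1)
    feats, idx = [], 0
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                feats.append(net(preprocess(rgb).unsqueeze(0)).squeeze(0))
            idx += 1
    cap.release()
    return torch.stack(feats)
```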
The present invention provides a preferred embodiment for performing S2. This embodiment aims to use the memory network to retain useful information to the maximum extent, while fusing, in the input module, the forward and backward video information at each time step using a Bidirectional Gated Recurrent Unit (BiGRU). Fig. 2 is a schematic diagram of the overall network model structure of this embodiment. As can be seen, the network model is divided into 3 modules: an input module, a memory module and an output module. The specific construction process of the 3 modules is as follows:
s21, constructing an input module, as shown in fig. 3, which is a schematic structural diagram of the input module. The input module is used for fusing the front and back information of the input video frame.
A layer of bidirectional GRU is selected to process the input sequence, and the output feature dimension is still 1024, as shown in formula (1):

Vk = →hk + ←hk (1)

where →hk denotes the hidden state of the forward GRU at time step k, and ←hk denotes the hidden state of the backward GRU at time step k; their element-wise sum Vk serves as the output of the input module and contains both past and future information of the current video frame.
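As a sketch of this input module (assuming PyTorch; the class name, and the choice of a hidden size equal to the feature dimension so that the element-wise sum keeps 1024 dimensions, are the only assumptions beyond the description above):

```python
import torch
import torch.nn as nn

class InputModule(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        # One bidirectional GRU layer; hidden size equals feat_dim so the
        # fused output Vk keeps the 1024-d feature dimension.
        self.bigru = nn.GRU(feat_dim, feat_dim, num_layers=1,
                            batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (B, N, 1024) frame features {Xk}
        h, _ = self.bigru(x)            # (B, N, 2 * 1024)
        fwd, bwd = h.chunk(2, dim=-1)   # forward / backward hidden states
        return fwd + bwd                # Vk = →hk + ←hk, formula (1)
```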
S22, constructing a memory module, as shown in FIG. 4, which is a schematic structural diagram of the memory module. The function of the memory module is to repeatedly retrieve information in memory using a multi-hop structure to obtain a memory output {ok} that maximally retains useful information. The memory here refers to the input memory {ak} and the output memory {bk} in FIG. 4, which are obtained by mapping the fused forward-backward information of the video frame features through different fully connected layers.
The output {Vk} of the bidirectional GRU is passed through 3 different fully connected layers to obtain the problem vector {Qk}, the input memory {ak} and the output memory {bk}, with a Dropout layer added after each fully connected layer; temporal information coding matrices TA and TB are then added to the input memory {ak} and the output memory {bk}, respectively. The Dropout layers alleviate overfitting, and the temporal coding matrices are added because the video frame sequence is a time series and temporal encoding can improve model performance, i.e.:
Qk=dropout(linearQ(Vk)) (2)
ak=dropout(linearA(Vk))+TA(k) (3)
bk=dropout(linearB(Vk))+TB(k) (4)
An inner product of the problem vector {Qk} and the input memory {ak} is computed and passed through a Softmax layer to obtain the self-attention weight pk, as shown below:

pk = Softmax(Qᵀak) (5)

where Q denotes the matrix composed of the N vectors Qk.
The self-attention weights {pk} are used to compute a weighted sum of the output memory {bk}; the result is multiplied element by element with the problem vector Qk, and the memory output {ok} is obtained through a fully connected layer with an LN layer and a ReLU activation function, as shown below:

ok = relu(layernorm(linear((Σj pk,j bj) ⊙ Qk))) (6)

The memory output is then used as the new problem vector for computing the weights of the next hop:

Qk = ok (7)
the formulas (3) to (7) are a hop.
In order to reduce the number of parameters, the present invention provides a preferred embodiment in which, between adjacent hops, the fully connected matrix of the input memory of the next hop is equal to the fully connected matrix of the output memory of the current hop, i.e.:
linearA(k+1)=linearB(k) (8)
Furthermore, the fully connected matrix of the original problem vector is equal to the fully connected matrix of the input memory of the first hop.
The number of neurons in the output layers of the fully connected layers in the memory module is embed_size = 512, so the size of the output memory {ok} is embed_size × N.
S23, constructing an output module, as shown in fig. 5, which is a schematic structural diagram of the output module. The output module is used for regressing the memory output to the frame importance scores {sk}.
First, the memory output {ok} and the output {Vk} of the bidirectional GRU are connected residually; since {ok} and {Vk} have different dimensions, a linear mapping must be applied to Vk to match the dimensions. The output of the residual connection passes through the LN layer and the Dropout layer in turn and is then input into a two-layer fully connected network. The first layer consists of a fully connected layer, an LN layer, a ReLU layer and a Dropout layer, and the output dimension of its fully connected layer is embed_size. The second layer consists of a fully connected layer and a Sigmoid layer; the output dimension of the fully connected layer is 1, and the Sigmoid limits the output to between 0 and 1, thereby yielding the frame importance scores {sk}.
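A minimal sketch of this output module, assuming PyTorch (the Dropout probability and class names are assumptions):

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    def __init__(self, feat_dim=1024, embed_size=512, p_drop=0.5):
        super().__init__()
        self.match = nn.Linear(feat_dim, embed_size)  # maps Vk to ok's dimension
        self.norm = nn.LayerNorm(embed_size)
        self.drop = nn.Dropout(p_drop)
        self.fc1 = nn.Sequential(                     # first FC layer
            nn.Linear(embed_size, embed_size), nn.LayerNorm(embed_size),
            nn.ReLU(), nn.Dropout(p_drop))
        self.fc2 = nn.Sequential(nn.Linear(embed_size, 1), nn.Sigmoid())

    def forward(self, o, v):            # o: (N, embed_size), v: (N, feat_dim)
        x = o + self.match(v)           # residual connection with matched dims
        x = self.drop(self.norm(x))     # LN layer, then Dropout layer
        return self.fc2(self.fc1(x)).squeeze(-1)  # frame scores {sk} in (0, 1)
```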
The present invention provides an embodiment performing S3. In this embodiment, the mean square error is used as the loss function. Specifically, the mean square error loss function is as follows:
loss(si, si′) = (si − si′)² (9)
where si and si′ denote the frame importance score output by the network and the ground-truth frame importance score, respectively. The mean square error is one of the most commonly used loss functions in regression problems.
In this embodiment, the batch size is set to 1, and an Adam optimizer with a learning rate of 0.0001 and a weight decay coefficient of 0.00001 is used to train the network.
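A minimal training-loop sketch under these hyperparameters (batch size 1, MSE loss, Adam with learning rate 1e-4 and weight decay 1e-5); model and dataset are placeholders for the network and data described above:

```python
import torch
import torch.nn as nn

def train(model, dataset, epochs=50):   # epoch count is an assumption
    criterion = nn.MSELoss()            # mean square error, formula (9)
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=1e-4, weight_decay=1e-5)
    model.train()
    for _ in range(epochs):
        for feats, gt_scores in dataset:   # one video per step (batch size 1)
            pred = model(feats)            # predicted frame importance scores
            loss = criterion(pred, gt_scores)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```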
Based on the above embodiments, the present invention provides a preferred embodiment executing S4, in which the frame importance scores are converted into key shots. This conversion requires two steps.
In the first step, scene change points are detected using the KTS algorithm; the video clip between two adjacent scene change points is regarded as a video shot, dividing the video into a plurality of shots, and for each shot the average of the importance scores of all frames in the shot is computed as the shot score.
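Assuming the KTS change points are already available, the shot-scoring part of this first step can be sketched as follows; the function name and the (start, end, score) representation are illustrative:

```python
import numpy as np

def shot_scores(frame_scores, change_points):
    """Average the frame importance scores within each shot.

    frame_scores: (N,) array of per-frame scores.
    change_points: sorted frame indices where scenes change, starting at 0.
    Returns a list of (start, end, score) tuples, one per shot.
    """
    bounds = list(change_points) + [len(frame_scores)]
    return [(s, e, float(np.mean(frame_scores[s:e])))
            for s, e in zip(bounds[:-1], bounds[1:])]
```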
In the second step, given the divided shots and their shot scores, a knapsack algorithm is executed to select key shots so as to maximize the total score of the key shots, with their total duration limited to 15% of the original video duration.
Specifically, the knapsack algorithm is formulated as follows:

max Σi=1..K ui yi  s.t.  Σi=1..K ui li ≤ L, ui ∈ {0, 1} (10)

where yi denotes the score of the i-th shot, ui indicates whether the i-th shot is selected as a key shot, li is the length of the i-th shot, L is equal to 15% of the original video duration, and K is the number of video shots.
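The selection itself is a standard 0/1 knapsack; below is a minimal dynamic-programming sketch in which shot lengths are measured in frames (an implementation assumption) and the budget L is 15% of the total frame count:

```python
def select_key_shots(shots, n_frames, budget_ratio=0.15):
    """shots: list of (start, end, score); returns indices of selected shots."""
    capacity = int(n_frames * budget_ratio)          # L in formula (10)
    lengths = [end - start for start, end, _ in shots]
    values = [score for _, _, score in shots]
    k = len(shots)
    dp = [0.0] * (capacity + 1)                      # best score at capacity c
    keep = [[False] * (capacity + 1) for _ in range(k)]
    for i in range(k):                               # 0/1 knapsack recurrence
        for c in range(capacity, lengths[i] - 1, -1):
            if dp[c - lengths[i]] + values[i] > dp[c]:
                dp[c] = dp[c - lengths[i]] + values[i]
                keep[i][c] = True
    selected, c = [], capacity                       # backtrack the choices
    for i in range(k - 1, -1, -1):
        if keep[i][c]:
            selected.append(i)
            c -= lengths[i]
    return sorted(selected)
```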
Based on S4 in the above embodiment, S5 is performed to combine the key shots into a dynamic summary, and the F1 score between the generated summary and the ground-truth summary is calculated.
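Assuming both summaries are represented as binary per-frame selection masks (the usual protocol on TVSum and SumMe), the F1 computation can be sketched as follows; this overlap-based definition is standard practice rather than quoted from the patent:

```python
import numpy as np

def f1_score(pred_mask, gt_mask):
    """pred_mask, gt_mask: (N,) binary arrays marking selected frames."""
    overlap = float(np.sum(pred_mask * gt_mask))     # jointly selected frames
    if overlap == 0:
        return 0.0
    precision = overlap / pred_mask.sum()
    recall = overlap / gt_mask.sum()
    return 2 * precision * recall / (precision + recall)
```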
To verify the effectiveness of the above embodiments, the invention applies them in practice and compares them with other methods. Specifically, 4 datasets are used: TVSum and SumMe, plus the auxiliary datasets YouTube and OVP. Canonical and Augmented denote two different dataset settings: Canonical uses 80% of the current dataset as the training set and 20% as the test set; Augmented uses 80% of the current dataset plus the remaining 3 datasets as the training set and 20% of the current dataset as the test set. For example, when the current dataset is SumMe, the training set consists of YouTube, OVP, TVSum and 80% of SumMe, and the test set consists of the remaining 20% of SumMe. The comparison of summarization results with other algorithms is shown in Table 1. The results in Table 1 are F1 scores (in %); the larger the F1 score, the better the performance of the algorithm. The last column is the average performance of each algorithm, and the effectiveness of the method of the invention can be seen from Table 1.
[Table 1: comparison of F1 scores (%) between the method of the invention and other algorithms on TVSum and SumMe under the Canonical and Augmented settings; table image not reproduced]
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The above-described preferred features may be used in any combination without conflict with each other.

Claims (10)

1. A method for implementing video summarization using a memory network and gated recurrent units, characterized by comprising:
preprocessing a data set to obtain video frame characteristics of a video;
designing an overall network model which takes the video frame features as input and the frame importance scores as output;
training the whole network model;
converting the frame importance scores output from the trained overall network model into key shots;
and combining the key shots into a dynamic video abstract.
2. The method according to claim 1, wherein preprocessing the data set to obtain the video frame features of the video comprises:
sampling a video into a sequence of video frames;
and extracting features of the video frame sequence using a pre-trained GoogLeNet model to obtain the video frame features corresponding to each video.
3. The method for implementing video summarization using a memory network and gated recurrent units according to claim 1, wherein the overall network model with the video frame features as input and the frame importance scores as output comprises:
an input module for fusing the forward and backward information of the input video frame features;
a memory module which takes the fused result output by the input module as its input; the memory module repeatedly retrieves the information in memory using a multi-hop structure and outputs a memory output that retains useful information;
and an output module, the memory output of the memory module being used as the input of the output module, the output module regressing the memory output to the frame importance scores.
4. The method according to claim 3, wherein the input module is configured to process the input video frame feature sequence using a layer of bidirectional GRUs, comprising:
the forward output of the GRU contains past information of the current time;
backward output of the GRU comprises future information of the current moment;
and the bidirectional GRU fuses and adds the forward output and the backward output element by element to obtain complete past and future information of the current moment.
5. The method for implementing video summarization using a memory network and gated recurrent units according to claim 3 or 4, wherein constructing the memory module comprises:
the coding vector obtained by passing the fused sequence output by the bidirectional GRU through a fully connected layer is used as the original problem vector and input into the first hop;
respectively inputting the fusion sequence output by the bidirectional GRU to two different full-connection layers of the first hop to obtain input memory and output memory;
the problem vector and the input memory are subjected to inner product, and self-attention weight is obtained through Softmax;
the self-attention weights and the output memory are combined by weighted summation, the result is multiplied element by element with the problem vector, and the memory output is obtained through a fully connected layer with a normalization layer and an activation function;
inputting the memory output as a new problem vector into the next hop to calculate self-attention weight and memory output; until the memory output of the last hop is used as the input of the output module.
6. The method for implementing video summarization using a memory network and gated recurrent units according to claim 5, wherein, between adjacent hops, the fully connected matrix of the output memory of one hop is equal to the fully connected matrix of the input memory of the next hop; and the fully connected matrix of the original problem vector is equal to the fully connected matrix of the input memory of the first hop.
7. The method for implementing video summarization using a memory network and gated recurrent units according to claim 3, wherein constructing the output module comprises:
the memory output is residually connected with the fused sequence output by the bidirectional GRU;
sequentially passing the output of the residual connection through an LN layer and a Dropout layer;
and inputting the output result of the Dropout layer into a two-layer fully-connected network, wherein the node number of the output layer of the last fully-connected layer is 1, and the frame importance score corresponding to each video frame is obtained.
8. The method according to claim 1, wherein training the overall network model comprises:
the network model is trained using the mean square error as a loss function, and an Adam optimizer.
9. The method for implementing video summarization using a memory network and gated recurrent units according to claim 1, wherein converting the frame importance scores obtained from the trained overall network model into key shots comprises:
detecting scene change points using a KTS algorithm, regarding the video clip between two adjacent scene change points as a video shot, and dividing the video into a plurality of video shots;
for each video shot, calculating the average value of the importance scores of all frames in the video shot as a shot score;
and executing a knapsack algorithm to select key shots for the plurality of divided shots and the shot scores.
10. The method according to claim 9, wherein the knapsack algorithm aims to maximize the total score of the key shots and limits the total length of the key shots to 15% of the original video length.
CN202111346288.4A 2021-11-15 2021-11-15 Method for implementing video summarization using a memory network and gated recurrent units Pending CN114020964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111346288.4A CN114020964A (en) 2021-11-15 2021-11-15 Method for implementing video summarization using a memory network and gated recurrent units

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111346288.4A CN114020964A (en) 2021-11-15 2021-11-15 Method for implementing video summarization using a memory network and gated recurrent units

Publications (1)

Publication Number Publication Date
CN114020964A true CN114020964A (en) 2022-02-08

Family

ID=80064038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111346288.4A Pending CN114020964A (en) Method for implementing video summarization using a memory network and gated recurrent units

Country Status (1)

Country Link
CN (1) CN114020964A (en)


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979801A (en) * 2022-05-10 2022-08-30 上海大学 Dynamic video abstraction algorithm and system based on bidirectional convolution long-short term memory network
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN115002559B (en) * 2022-05-10 2024-01-05 上海大学 Video abstraction algorithm and system based on gating multi-head position attention mechanism
CN116894115A (en) * 2023-06-12 2023-10-17 国网湖北省电力有限公司经济技术研究院 Automatic archiving method for power grid infrastructure files
CN116894115B (en) * 2023-06-12 2024-05-24 国网湖北省电力有限公司经济技术研究院 Automatic archiving method for power grid infrastructure files
CN117376502A (en) * 2023-12-07 2024-01-09 翔飞(天津)智能科技有限公司 Video production system based on AI technology
CN117376502B (en) * 2023-12-07 2024-02-13 翔飞(天津)智能科技有限公司 Video production system based on AI technology

Similar Documents

Publication Publication Date Title
CN109492157B (en) News recommendation method and theme characterization method based on RNN and attention mechanism
Jiang et al. Exploiting feature and class relationships in video categorization with regularized deep neural networks
Pan et al. Hierarchical recurrent neural encoder for video representation with application to captioning
Zhao et al. TTH-RNN: Tensor-train hierarchical recurrent neural network for video summarization
CN114020964A (en) Method for implementing video summarization using a memory network and gated recurrent units
Wang et al. Collaborative deep learning for recommender systems
Xiao et al. Convolutional hierarchical attention network for query-focused video summarization
CN111291261A (en) Cross-domain recommendation method integrating label and attention mechanism and implementation system thereof
Wang et al. Towards accurate and interpretable sequential prediction: A CNN & attention-based feature extractor
Shi et al. Star: Sparse transformer-based action recognition
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN111222039B (en) Session recommendation method and system based on long-term and short-term interest combination
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN114519145A (en) Sequence recommendation method for mining long-term and short-term interests of users based on graph neural network
CN113761359B (en) Data packet recommendation method, device, electronic equipment and storage medium
CN112434159A (en) Method for classifying thesis multiple labels by using deep neural network
Khoali et al. Advanced recommendation systems through deep learning
CN112699310A (en) Cold start cross-domain hybrid recommendation method and system based on deep neural network
CN113326384A (en) Construction method of interpretable recommendation model based on knowledge graph
CN115695950A (en) Video abstract generation method based on content perception
CN116975615A (en) Task prediction method and device based on video multi-mode information
Li et al. Application of Dual‐Channel Convolutional Neural Network Algorithm in Semantic Feature Analysis of English Text Big Data
Guo [Retracted] Intelligent Sports Video Classification Based on Deep Neural Network (DNN) Algorithm and Transfer Learning
CN114996566A (en) Intelligent recommendation system and method for industrial internet platform
CN116932862A (en) Cold start object recommendation method, cold start object recommendation device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination