CN109885728A - Video summarization method based on meta learning - Google Patents

Video summarization method based on meta learning

Info

Publication number
CN109885728A
CN109885728A (application CN201910037959.5A; granted as CN109885728B)
Authority
CN
China
Prior art keywords
video
model
parameter
learner
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910037959.5A
Other languages
Chinese (zh)
Other versions
CN109885728B (en)
Inventor
Li Xuelong
Li Hongli
Dong Yongsheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201910037959.5A priority Critical patent/CN109885728B/en
Publication of CN109885728A publication Critical patent/CN109885728A/en
Application granted granted Critical
Publication of CN109885728B publication Critical patent/CN109885728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Image Analysis (AREA)

Abstract

The present invention relates to a video summarization method based on meta-learning. Following the idea of meta-learning, the method treats the summarization of each video as an independent video-summarization task and trains a learner model over the space of such tasks, thereby improving the generalization ability of the model and exploring the mechanism of video summarization. Specifically, the invention uses the video summarization long short-term memory neural network (vsLSTM) as the learner model. The method of the invention mainly comprises: (1) randomly dividing all tasks (the summarization problem of each video in the datasets) into a training set and a test set; (2) training the learner model across the tasks of the training set with the two-stage learning scheme proposed by this method, thereby exploring the video summarization mechanism; (3) evaluating the performance of the model on the test set to complete the performance evaluation.

Description

Video summarization method based on meta learning
Technical field
The invention belongs to the technical field of computer vision and addresses a key problem in machine learning and pattern recognition. The invention summarizes a video by extracting its key frames, which reduces the time people spend browsing videos, and can be applied to video retrieval, video management, and the like.
Background art
With the wide availability of capture devices such as mobile phones and portable cameras, massive amounts of video data have emerged, and a large volume of video data is produced and propagated every day. On the one hand, these data provide people with rich information; on the other hand, the time consumed in browsing and retrieving them is considerable. Against this background, video summarization, as a video condensation technique, has received extensive attention from researchers in the field of computer vision.
Video summarization refers to analyzing the spatio-temporal redundancy present in video structure and content in a semi-automatic or fully automatic way, removing the redundant segments (frames) of the original video, and extracting the meaningful segments (frames). It not only improves the efficiency with which people browse videos but also lays a foundation for subsequent video analysis and processing, and it is widely used in video retrieval, video management, and so on. It has attracted attention since its emergence, and many representative methods have been proposed. However, because different people focus on different content when browsing a video, no summarization method so far is universal or fully satisfies people's needs; the study of video summarization algorithms therefore still has broad room for exploration.
Because of the inherent structure and sequential nature of video data and the excellent sequence-modeling performance of the long short-term memory neural network (LSTM), most recent methods adopt LSTM as the basic model. For example, Zhang et al., in K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016, proposed the video summarization LSTM (vsLSTM) and the determinantal point process LSTM (dppLSTM), two representative video summarization network models of recent years refined from the basic LSTM model, which model the temporal dependencies of different lengths in a video well. Zhou and Qiao, in K. Zhou and Y. Qiao, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," arXiv:1801.00054, 2017, proposed unsupervised and supervised versions of the deep summarization network (DSN), incorporating the idea of deep reinforcement learning into the training of the LSTM network to better capture the structural characteristics of video data. Ji et al., in Z. Ji, K. Xiong, Y. Pang, and X. Li, "Video summarization with attention-based encoder-decoder networks," arXiv:1708.09545, 2017, proposed an attention-based encoder-decoder video summarization architecture (AVS), which combines an encoder with LSTM as the basic model and a decoder based on an attention mechanism to extract the key frames of a video.
The existing methods have the following problems:
1) they focus more on the structure or sequential nature of the video data than on the video summarization task itself;
2) they do not yet explicitly require the model to explore the mechanism of video summarization, so the generalization ability of the model is limited.
Summary of the invention
Technical problem to be solved
To address the deficiencies of the existing methods above, the present invention provides a video summarization method based on meta-learning. Following the idea of meta-learning, the method treats the summarization of each video as an independent video-summarization task and trains the model over the space of such tasks, so that it focuses more on the video summarization task itself; through learning in the space of video summarization tasks, the method explicitly requires the model to explore a video summarization mechanism, thereby improving the generalization ability of the model.
Technical solution
A video summarization method based on meta-learning, characterized by the following steps:
Step 1: prepare the datasets
Use the open-source video summarization datasets SumMe, TVSum, Youtube, and OVP: when SumMe is the test set, Youtube and OVP form the training set and TVSum the validation set; when TVSum is the test set, Youtube and OVP form the training set and SumMe the validation set;
Step 2: extract video frame features
Input each video frame to the GoogLeNet network and take the output of its penultimate layer as the frame's deep feature; use a color histogram, GIST, HOG, and dense SIFT as traditional features, where the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale image;
Step 3: train the video summarization model
Use the two-stage network training algorithm based on the meta-learning idea to learn the parameter θ of the learner model, the vsLSTM network f_θ. Before training, randomly initialize the model parameter θ to θ_0; the i-th iteration updates the model parameter from θ_{i-1} to θ_i, and each training iteration consists of a two-stage stochastic gradient descent process:
The first stage updates the parameter from θ_{i-1} to θ'_n. Randomly select a task T_j from the training set, compute the performance f_{θ_{i-1}}(T_j) of the learner on this task under the current parameter θ_{i-1} and the loss L_{T_j}(f_{θ_{i-1}}), take the derivative of L_{T_j}(f_{θ_{i-1}}) with respect to θ_{i-1}, and update the learner parameter from θ_{i-1} to θ'_1; the performance of the learner model f_{θ'_1} on the same task can then be computed again to update the parameter to θ'_2. This parameter update is carried out n times, where n is a positive integer, as shown in formula (1):

θ'_k = θ'_{k-1} − α∇_{θ'_{k-1}} L_{T_j}(f_{θ'_{k-1}}),  k = 1, …, n,  θ'_0 = θ_{i-1}    (1)

where α denotes the learning rate, and L_{T_j}(f_{θ'_{k-1}}) denotes the L1 loss of the learner model f_{θ'_{k-1}} on task T_j, the parameter of the learner model being θ'_{k-1}. The L1 loss is defined as:

L(y, x) = (1/N) Σ_{i=1}^{N} |y_i − x_i|    (2)

where y denotes the output vector of the model, x denotes the ground-truth vector, and N denotes the number of elements in the vector;
The second stage updates the parameter from θ'_n to θ_i: randomly select a task T_m from the training set, compute the performance f_{θ'_n}(T_m) of the learner on this task under the parameter θ'_n and the loss L_{T_m}(f_{θ'_n}), take the derivative of L_{T_m}(f_{θ'_n}) with respect to θ_{i-1}, and update the learner parameter to θ_i, as shown in formula (3):

θ_i = θ_{i-1} − β∇_{θ_{i-1}} L_{T_m}(f_{θ'_n})    (3)

where β denotes the meta-learning rate, used as a hyperparameter in the method of the invention, and L_{T_m}(f_{θ'_n}) denotes the L1 loss of the learner model f_{θ'_n} on task T_m, the parameter of the learner model being θ'_n;
This two-stage training algorithm acts as the meta-learner model that guides the training of the learner model vsLSTM so as to explore the video summarization mechanism; by maximizing the generalization ability of the learner model on the test set, i.e., minimizing the expected generalization error of the learner model on the test set, the parameter θ of the learner model is obtained after multiple iterations;
Step 4: input the video frame features from step 2 into the learner model vsLSTM trained in step 3 to obtain, for each frame, the probability of being selected into the video summary.
The specific procedure of step 4 is as follows: first, according to the probabilities or scores output by vsLSTM, divide the video into temporally disjoint segments; then take the average of the frame scores within each segment as the score of that segment, and sort the segments in descending order of score; keep segments in order starting from the highest-scoring one and, to keep the selected summary from becoming too long, stop when the total length of the kept segments reaches 15% of the original video length; the video segments selected at this point form the summary of the original video.
Beneficial effects
The video summarization method based on meta-learning proposed by the present invention has the following beneficial effects:
1) it is the first to apply the idea of meta-learning to the video summarization problem;
2) it proposes a simple and effective training method for video summarization models that makes the model focus more on the video summarization task itself;
3) aiming to improve the generalization ability of the model, it explicitly requires the video summarization model to explore the video summarization mechanism;
4) qualitative and quantitative comparison experiments prove that the algorithm of the invention is advanced and effective and has high practical application value.
Brief description of the drawings
Fig. 1 is the overall conceptual flow chart of the present invention
Fig. 2 is a schematic diagram of one iteration of the training method proposed by the present invention
Fig. 3 is a schematic diagram of the performance of the present invention under different hyperparameters
Fig. 4 shows visualization results of the present invention
Specific embodiment
The invention will now be further described in conjunction with the embodiments and the accompanying drawings:
The technical solution of the present invention comprises the following steps:
1) Prepare the datasets
This method uses the open-source video summarization datasets SumMe (M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating summaries from user videos," in Proc. Eur. Conf. Comput. Vis., pp. 505-520, 2014), TVSum (Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: summarizing web videos using titles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5179-5187, 2015), Youtube (S. E. F. De Avila, A. P. B. Lopes, A. da Luz Jr., and A. de Albuquerque Araújo, "VSUMM: a mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognit. Lett., vol. 32, no. 1, pp. 56-68, 2011), and OVP (Open Video Project, http://www.open-video.org/).
To examine the generalization ability of the model, SumMe and TVSum are used in turn as the test set, with the other three datasets used for training and validation. When SumMe is the test set, TVSum, Youtube, and OVP serve for training and validation: Youtube and OVP form the training set and TVSum the validation set. When TVSum is the test set, Youtube and OVP form the training set and SumMe the validation set.
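For illustration only, the dataset roles described above can be written down as a small configuration mapping; this is a minimal sketch, and the key names are illustrative rather than taken from the patent:

    # Dataset roles for the two evaluation settings described above.
    # Key names are illustrative only.
    SPLITS = {
        "SumMe_as_test": {"train": ["Youtube", "OVP"], "val": ["TVSum"], "test": ["SumMe"]},
        "TVSum_as_test": {"train": ["Youtube", "OVP"], "val": ["SumMe"], "test": ["TVSum"]},
    }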
2) Extract video frame features
To verify the effectiveness of the model, the method of the present invention uses two kinds of features, deep and traditional, respectively. Each video frame is input to the GoogLeNet network model (C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015) and the output of its penultimate layer is taken as the frame's deep feature. The traditional features are a color histogram, GIST, HOG (Histogram of Oriented Gradients), and dense SIFT (Scale-Invariant Feature Transform), where the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale image.
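As an illustration of the deep-feature step, the following minimal Python sketch extracts penultimate-layer (1024-dimensional) GoogLeNet features per frame. It assumes a recent torchvision with an ImageNet-pretrained GoogLeNet, obtains the penultimate activation by replacing the final classifier with an identity, and uses a hypothetical frame format (HxWx3 uint8 RGB arrays) that the patent does not specify:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # Assumed setup: torchvision GoogLeNet pretrained on ImageNet; replacing
    # the classifier with Identity makes the forward pass return the 1024-d
    # penultimate-layer feature instead of class logits.
    model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
    model.fc = torch.nn.Identity()
    model.eval()

    preprocess = T.Compose([
        T.ToPILImage(),
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def frame_features(frames):
        # frames: list of HxWx3 uint8 RGB arrays -> (num_frames, 1024) tensor
        batch = torch.stack([preprocess(f) for f in frames])
        return model(batch)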
3) Train the video summarization model
This method proposes the two-stage network training algorithm MetaL-VS based on the meta-learning idea. Each training iteration consists of a two-stage stochastic gradient descent procedure; this two-stage training algorithm acts as the meta-learner model that guides the training of the learner model, with vsLSTM serving as the learner model, so as to explore the video summarization mechanism.
As shown in Fig. 1, following the idea of meta-learning, this method treats the summarization of each video as an independent video-summarization task, and the model is trained over the space of such tasks; finally, by treating the summarization of a test video as a new task, the model produces the summary of that video. Specifically, the method proposes the two-stage network training algorithm based on the meta-learning idea to learn the parameter θ of the learner model f_θ (the implementation uses the vsLSTM network as the learner model). As shown in Fig. 2, the i-th iteration updates the model parameter from θ_{i-1} to θ_i (before training, the model parameter is randomly initialized to θ_0), and each training iteration consists of a two-stage stochastic gradient descent process.
The first stage updates the parameter from θ_{i-1} to θ'_n (n = 2 in the illustrated case): randomly select a task T_j from the training set, compute the performance f_{θ_{i-1}}(T_j) of the learner on this task under the current parameter θ_{i-1} and the loss L_{T_j}(f_{θ_{i-1}}), take the derivative of L_{T_j}(f_{θ_{i-1}}) with respect to θ_{i-1}, and update the learner parameter from θ_{i-1} to θ'_1; the performance of the learner model f_{θ'_1} on the same task can then be computed again to update the parameter to θ'_2. In principle this parameter update can be carried out n times (n a positive integer), as shown in formula (1):

θ'_k = θ'_{k-1} − α∇_{θ'_{k-1}} L_{T_j}(f_{θ'_{k-1}}),  k = 1, …, n,  θ'_0 = θ_{i-1}    (1)

where α denotes the learning rate, used as a hyperparameter in the method of the invention, and L_{T_j}(f_{θ'_{k-1}}) denotes the L1 loss of the learner model f_{θ'_{k-1}} on task T_j, the parameter of the learner model being θ'_{k-1}. The L1 loss is defined as:

L(y, x) = (1/N) Σ_{i=1}^{N} |y_i − x_i|    (2)

where y denotes the output vector of the model, x denotes the ground-truth vector, and N denotes the number of elements in the vector.
The second stage updates the parameter from θ'_n to θ_i: randomly select a task T_m from the training set, compute the performance f_{θ'_n}(T_m) of the learner on this task under the parameter θ'_n and the loss L_{T_m}(f_{θ'_n}), take the derivative of L_{T_m}(f_{θ'_n}) with respect to θ_{i-1}, and update the learner parameter to θ_i, as shown in formula (3):

θ_i = θ_{i-1} − β∇_{θ_{i-1}} L_{T_m}(f_{θ'_n})    (3)

where β denotes the meta-learning rate, used as a hyperparameter in the method of the invention, and L_{T_m}(f_{θ'_n}) denotes the L1 loss of the learner model f_{θ'_n} on task T_m, the parameter of the learner model being θ'_n.
This two-stage training algorithm acts as the meta-learner model that guides the training of the learner model (vsLSTM) so as to explore the video summarization mechanism. By maximizing the generalization ability of the learner model on the test set (minimizing its expected generalization error on the test set), the parameter θ of the learner model is obtained after multiple iterations.
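As a concrete illustration of one MetaL-VS iteration, the following PyTorch-style sketch follows formulas (1)-(3) with a first-order approximation: formula (3) differentiates through the inner updates with respect to θ_{i-1}, which the sketch approximates by applying the adapted model's gradient directly to the original parameters. The names vslstm and tasks are assumed stand-ins for the learner network and a list of (frame features, ground-truth scores) video tasks; the default hyperparameters follow the values reported in the experiments below:

    import copy
    import random
    import torch
    import torch.nn.functional as F

    def meta_iteration(vslstm, tasks, alpha=1e-4, beta=1e-3, n=1):
        # Stage 1 (formula (1)): n inner SGD steps on a random task T_j,
        # starting from a copy of the current parameters theta_{i-1}.
        features_j, gt_j = random.choice(tasks)
        adapted = copy.deepcopy(vslstm)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
        for _ in range(n):
            loss = F.l1_loss(adapted(features_j), gt_j)  # L1 loss, formula (2)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

        # Stage 2 (formula (3), first-order approximation): evaluate the
        # adapted model on another random task T_m and apply its gradient to
        # the ORIGINAL parameters with the meta-learning rate beta.
        features_m, gt_m = random.choice(tasks)
        meta_loss = F.l1_loss(adapted(features_m), gt_m)
        grads = torch.autograd.grad(meta_loss, list(adapted.parameters()))
        with torch.no_grad():
            for p, g in zip(vslstm.parameters(), grads):
                p -= beta * g
        return float(meta_loss)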
4) Output the video summary
The input of the video summarization model is the video frame features (deep or traditional features), and the output is, for each frame, the probability of being selected into the summary (the output is a vector whose length equals the number of frames and whose elements each lie between 0 and 1; each element of the vector indicates the probability that the corresponding video frame is selected into the video summary, which can also be understood as an importance score for that frame). Following the method of K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016, the output of this method is converted into a summary. Inputting the features of each frame of a test video into the trained learner model and processing the output yields the video summary.
Specific steps: first, according to the probabilities or scores output by vsLSTM, divide the video into temporally disjoint segments with Kernel Temporal Segmentation (KTS) (following K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016); then take the average of the frame scores within each segment as the score of that segment and sort the segments in descending order of score; keep segments in order starting from the highest-scoring one (in descending order of segment score) and, to keep the selected summary from becoming too long, stop when the total length of the kept segments reaches 15% of the original video length. The video segments selected at this point form the summary of the original video.
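A minimal sketch of this selection step, assuming frame_scores is the per-frame vsLSTM output for one video and segments is a list of (start, end) frame-index pairs produced by a temporal segmentation such as KTS (the segmentation itself is not shown); all names are illustrative:

    import numpy as np

    def build_summary(frame_scores, segments, budget=0.15):
        # Score each segment by the mean score of its frames.
        seg_scores = [frame_scores[s:e].mean() for s, e in segments]
        order = np.argsort(seg_scores)[::-1]  # descending by segment score
        mask = np.zeros(len(frame_scores), dtype=bool)
        kept = 0
        for idx in order:
            s, e = segments[idx]
            if kept + (e - s) > budget * len(frame_scores):
                break  # stop at the 15% length budget
            mask[s:e] = True  # keep this segment in the summary
            kept += e - s
        return mask  # per-frame indicator of the summary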
1) Simulation conditions
The simulations of the present invention were carried out as Python programs under Anaconda on a machine with an Intel i5-3470 3.2 GHz CPU, 16 GB of memory, and the CentOS operating system. The datasets used in the experiments were obtained from the following public databases:
SumMe dataset(http://classif.ai/dataset/ethz-cvl-video-summe)
TVSum dataset(https://github.com/yalesong/tvsum)
Youtube dataset(http://www.npdi.dcc.ufmg.br/VSUMM)
OVP dataset(http://www.open-video.org)
The SumMe dataset contains 25 annotated videos, and TVSum, Youtube, and OVP each contain 50 annotated videos. When training the learner model, the training set includes the ground truth, while the ground truth of the test set is hidden. When SumMe is the test set, 10 videos are randomly selected from TVSum as the validation set, and the remaining TVSum videos together with the videos in Youtube and OVP form the training set; when TVSum is the test set, 25 videos are randomly selected from TVSum as the test set, the remaining TVSum videos serve as the validation set, and the other three datasets form the training set. In our experiments, the test set is used to verify the effectiveness of our method. The performance metric is the F-score F:

F = 2PR / (P + R)

where P denotes precision and R denotes recall:

P = |A ∩ B| / |A|,   R = |A ∩ B| / |B|

where A denotes the summary generated by the model and B denotes the ground truth.
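A short sketch of this metric, computing precision, recall, and F-score from per-frame boolean masks for the generated summary A and the ground truth B; measuring the overlap per frame is a common convention in this literature, assumed here since the patent does not spell out the units:

    import numpy as np

    def f_score(pred_mask, gt_mask):
        # Overlap between generated summary A (pred_mask) and ground truth B (gt_mask).
        overlap = np.logical_and(pred_mask, gt_mask).sum()
        precision = overlap / max(pred_mask.sum(), 1)  # P = |A ∩ B| / |A|
        recall = overlap / max(gt_mask.sum(), 1)       # R = |A ∩ B| / |B|
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)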
2) Simulation content
(1) To show the process of exploring the hyperparameters that make the method of the present invention perform better (the learning rate lr, the meta-learning rate mlr, and the number n of first-stage parameter updates), we evaluated the model's performance under different hyperparameters in our experiments.
Fig. 3 shows the model's performance under different hyperparameters. It can be seen from the figure that the model performs best on both datasets when lr is 0.0001 and mlr is 0.001.
Table 1 shows the F-score of the model on both datasets for different values of the hyperparameter n; bold numbers indicate the best result. Because the experiments were limited by the memory of the graphics card used, the maximum value of n was 2; an out-of-memory error occurred when n was greater than 2. As can be seen from the table, the model performs best on both datasets when the value of the hyperparameter n is 1.
Table 1. Performance (F-score) of the model on both datasets for different values of the hyperparameter n

n        1        2
SumMe    44.1%    42.5%
TVSum    58.2%    58.1%
(2) To prove the effectiveness of the proposed algorithm, in experiment 2 we compare it with representative methods of recent years. The first comparison method was proposed by Gygli et al. in 2015; for details see M. Gygli, H. Grabner, and L. Van Gool, "Video summarization by learning submodular mixtures of objectives," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3090-3098. The second comparison method is vsLSTM; for details see K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766-782. The third comparison method was proposed by Zhang et al. in 2016; for details see K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Summary transfer: exemplar-based subset selection for video summarization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1059-1067. The fourth comparison method is SUM-GANsup; for details see B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization with adversarial LSTM networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017. The fifth comparison method is DR-DSNsup; for details see K. Zhou and Y. Qiao, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," arXiv:1801.00054, 2017. The sixth comparison method was proposed by Li et al. in 2017; for details see X. Li, B. Zhao, and X. Lu, "A general framework for edited video and raw video summarization," IEEE Trans. Image Process., vol. 26, no. 8, pp. 3652-3664, 2017. Table 2 compares the quantitative F-score results; bold numbers indicate the best result. As can be seen from the table, the method MetaL-VS proposed herein performs best in the comparison. The comparison with representative methods of recent years in this field thus further demonstrates the advanced nature of the invention.
Fig. 4 shows visualization results of MetaL-VS, where the Air_Force_One and car_over_camera videos are from the SumMe dataset and the AwmHb44_ouw and qqR6AEXwxoQ videos are from the TVSum dataset. The blue part of each histogram is the ground truth, i.e., the manually annotated probability of each frame being a summary frame; the red part is the result of MetaL-VS; the pictures below the histograms are a few example frames from the MetaL-VS summary results. It can be seen from the figure that, although there are some deviations, MetaL-VS selects frames of high importance from the original video and ignores frames that are not important enough. The visualization thus shows the effectiveness of the invention.
Table 2. F-score comparison of seven video summarization methods

Method                          SumMe     TVSum
Gygli et al.                    39.7%     -
vsLSTM                          40.7%     56.9%
Zhang et al.                    40.9%     -
SUM-GANsup                      41.7%     56.3%
DR-DSNsup                       42.1%     58.1%
Li et al.                       43.1%     52.7%
MetaL-VS (present invention)    44.1%     58.2%
(3) To test the robustness of the method MetaL-VS of the present invention to traditional features, we compared its video summarization performance on traditional features with two representative methods of the last two years. The first comparison method is SUM-GANsup; for details see B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization with adversarial LSTM networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017. The second comparison method is dppLSTM; for details see K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766-782. Table 3 compares the quantitative F-scores; bold numbers indicate the best result. As can be seen from the table, MetaL-VS achieves performance on par with the classic methods of the last two years and, on the SumMe dataset, exceeds the two comparison methods by 4 and 2.8 percentage points, respectively. The performance of MetaL-VS on traditional features shows that the invention has a certain robustness and generalization ability with respect to traditional features.
Table 3. F-score comparison when using traditional features

Method                          SumMe     TVSum
SUM-GANsup                      39.5%     59.5%
dppLSTM                         40.7%     57.9%
MetaL-VS (present invention)    43.5%     57.9%
The method of the present invention is the first to explore applying meta-learning to the field of video summarization. Based on the idea of meta-learning, the video summarization model is trained over the space of video summarization tasks. This helps the model focus more on the video summarization task itself rather than merely on the structured, sequential video data, better supports the model's exploration of the video summarization mechanism, and helps improve the generalization ability of the model. Qualitative and quantitative comparison experiments prove that the algorithm of the invention is advanced and effective.

Claims (2)

1. A video summarization method based on meta-learning, characterized by the following steps:
Step 1: prepare the datasets
Use the open-source video summarization datasets SumMe, TVSum, Youtube, and OVP: when SumMe is the test set, Youtube and OVP form the training set and TVSum the validation set; when TVSum is the test set, Youtube and OVP form the training set and SumMe the validation set;
Step 2: extract video frame features
Input each video frame to the GoogLeNet network and take the output of its penultimate layer as the frame's deep feature; use a color histogram, GIST, HOG, and dense SIFT as traditional features, where the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale image;
Step 3: train the video summarization model
Use the two-stage network training algorithm based on the meta-learning idea to learn the parameter θ of the learner model, the vsLSTM network f_θ; before training, randomly initialize the model parameter θ to θ_0; the i-th iteration updates the model parameter from θ_{i-1} to θ_i, and each training iteration consists of a two-stage stochastic gradient descent process:
The first stage updates the parameter from θ_{i-1} to θ'_n: randomly select a task T_j from the training set, compute the performance f_{θ_{i-1}}(T_j) of the learner on this task under the current parameter θ_{i-1} and the loss L_{T_j}(f_{θ_{i-1}}), take the derivative of L_{T_j}(f_{θ_{i-1}}) with respect to θ_{i-1}, and update the learner parameter from θ_{i-1} to θ'_1; the performance of the learner model f_{θ'_1} on the same task can then be computed again to update the parameter to θ'_2; this parameter update is carried out n times, where n is a positive integer, as shown in the following formula:

θ'_k = θ'_{k-1} − α∇_{θ'_{k-1}} L_{T_j}(f_{θ'_{k-1}}),  k = 1, …, n,  θ'_0 = θ_{i-1}    (1)

where α denotes the learning rate, and L_{T_j}(f_{θ'_{k-1}}) denotes the L1 loss of the learner model f_{θ'_{k-1}} on task T_j, the parameter of the learner model being θ'_{k-1}; the L1 loss is defined as:

L(y, x) = (1/N) Σ_{i=1}^{N} |y_i − x_i|    (2)

where y denotes the output vector of the model, x denotes the ground-truth vector, and N denotes the number of elements in the vector;
The second stage updates the parameter from θ'_n to θ_i: randomly select a task T_m from the training set, compute the performance f_{θ'_n}(T_m) of the learner on this task under the parameter θ'_n and the loss L_{T_m}(f_{θ'_n}), take the derivative of L_{T_m}(f_{θ'_n}) with respect to θ_{i-1}, and update the learner parameter to θ_i, as shown in formula (3):

θ_i = θ_{i-1} − β∇_{θ_{i-1}} L_{T_m}(f_{θ'_n})    (3)

where β denotes the meta-learning rate, used as a hyperparameter in the method of the invention, and L_{T_m}(f_{θ'_n}) denotes the L1 loss of the learner model f_{θ'_n} on task T_m, the parameter of the learner model being θ'_n;
This two-stage training algorithm acts as the meta-learner model that guides the training of the learner model vsLSTM so as to explore the video summarization mechanism; by maximizing the generalization ability of the learner model on the test set, i.e., minimizing the expected generalization error of the learner model on the test set, the parameter θ of the learner model is obtained after multiple iterations;
Step 4: input the video frame features from step 2 into the learner model vsLSTM trained in step 3 to obtain, for each frame, the probability of being selected into the video summary.
2. The video summarization method based on meta-learning according to claim 1, characterized in that the specific procedure of step 4 is as follows: first, according to the probabilities or scores output by vsLSTM, divide the video into temporally disjoint segments; then take the average of the frame scores within each segment as the score of that segment and, according to the segment scores, sort the segments in descending order; keep segments in order starting from the highest-scoring one and, to keep the selected summary from becoming too long, stop when the total length of the kept segments reaches 15% of the original video length; the video segments selected at this point form the summary of the original video.
CN201910037959.5A 2019-01-16 2019-01-16 Video abstraction method based on meta-learning Active CN109885728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910037959.5A CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910037959.5A CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Publications (2)

Publication Number Publication Date
CN109885728A true CN109885728A (en) 2019-06-14
CN109885728B CN109885728B (en) 2022-06-07

Family

ID=66926054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910037959.5A Active CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Country Status (1)

Country Link
CN (1) CN109885728B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239501A (en) * 2014-09-10 2014-12-24 中国电子科技集团公司第二十八研究所 Mass video semantic annotation method based on Spark
US20180357543A1 (en) * 2016-01-27 2018-12-13 Bonsai AI, Inc. Artificial intelligence system configured to measure performance of artificial intelligence over time
CN107103614A (en) * 2017-04-12 2017-08-29 合肥工业大学 The dyskinesia detection method encoded based on level independent element
CN107590505A (en) * 2017-08-01 2018-01-16 天津大学 The learning method of joint low-rank representation and sparse regression
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 A kind of method for tracking target and device based on meta learning
CN109213896A (en) * 2018-08-06 2019-01-15 杭州电子科技大学 Underwater video abstraction generating method based on shot and long term memory network intensified learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BEHROOZ MAHASSENI et al.: "Unsupervised Video Summarization with Adversarial LSTM Networks", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
K. ZHANG et al.: "Video summarization with long short-term memory", Proceedings of the European Conference on Computer Vision *
SACHIN RAVI et al.: "Optimization as a Model for Few-Shot Learning", published as a conference paper at ICLR 2017 *
XUELONG LI et al.: "Meta Learning for Task-Driven Video Summarization", IEEE Transactions on Industrial Electronics *
CUI Jianshuang et al.: "An automatic selection framework for optimization algorithms based on meta-learning recommendation and an empirical analysis", Journal of Computer Applications *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062284A (en) * 2019-12-06 2020-04-24 浙江工业大学 Visual understanding and diagnosing method of interactive video abstract model
CN111062284B (en) * 2019-12-06 2023-09-29 浙江工业大学 Visual understanding and diagnosis method for interactive video abstract model
CN111031390A (en) * 2019-12-17 2020-04-17 南京航空航天大学 Dynamic programming-based method for summarizing video of determinant point process with fixed output size
CN111031390B (en) * 2019-12-17 2022-10-21 南京航空航天大学 Method for summarizing process video of outputting determinant point with fixed size
CN111526434A (en) * 2020-04-24 2020-08-11 西北工业大学 Converter-based video abstraction method
CN112884160A (en) * 2020-12-31 2021-06-01 北京爱笔科技有限公司 Meta learning method and related device
CN112884160B (en) * 2020-12-31 2024-03-12 北京爱笔科技有限公司 Meta learning method and related device
JP7378172B2 (en) 2022-02-03 2023-11-13 インハ インダストリー パートナーシップ インスティテュート Unsupervised video summarization method and apparatus with efficient keyframe selection reward function

Also Published As

Publication number Publication date
CN109885728B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN109885728A (en) Video summarization method based on meta learning
Qi et al. Attentive relational networks for mapping images to scene graphs
Kordopatis-Zilos et al. Near-duplicate video retrieval by aggregating intermediate cnn layers
Escorcia et al. Daps: Deep action proposals for action understanding
Mei et al. Patch based video summarization with block sparse representation
CN106033426A (en) A latent semantic min-Hash-based image retrieval method
CN105718532A (en) Cross-media sequencing method based on multi-depth network structure
Yang et al. Cascaded split-and-aggregate learning with feature recombination for pedestrian attribute recognition
Chen et al. Multi-scale adaptive task attention network for few-shot learning
Li et al. Meta learning for task-driven video summarization
CN113239159B (en) Cross-modal retrieval method for video and text based on relational inference network
Shen et al. Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description.
Zhang et al. Self-guided adaptation: Progressive representation alignment for domain adaptive object detection
CN105701516B (en) A kind of automatic image marking method differentiated based on attribute
CN103473308B (en) High-dimensional multimedia data classifying method based on maximum margin tensor study
CN105701225A (en) Cross-media search method based on unification association supergraph protocol
Song et al. A weighted topic model learned from local semantic space for automatic image annotation
Lin et al. Scene recognition using multiple representation network
Chen et al. Learning to focus: cascaded feature matching network for few-shot image recognition
Wang et al. Fast and accurate action detection in videos with motion-centric attention model
Papagiannopoulou et al. Concept-based image clustering and summarization of event-related image collections
Li et al. Meta-reweighted regularization for unsupervised domain adaptation
Fu et al. Video summarization with a dual attention capsule network
Mithun et al. Generating diverse image datasets with limited labeling
Lv et al. Retrieval oriented deep feature learning with complementary supervision mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant