CN110110140A

CN110110140A - Video summarization method based on attention expansion coding and decoding network

Info

Publication number: CN110110140A
Application number: CN201910319879.9A
Authority: CN
Inventors: 冀中; 焦放; 庞彦伟
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2019-04-19
Filing date: 2019-04-19
Publication date: 2019-08-09

Abstract

A kind of video summarization method based on attention expansion coding and decoding network: regard as sequence to Sequence Learning process video frequency abstract, and utilize the relativity of time domain of video interframe, original video in SumMe or TVSum is obtained into video frame characteristic sequence by pre-training network, using video frame characteristic sequence as the input of encoder network in attention expansion coding and decoding network, obtain the semantic information sequence of video frame, again by the decoding network of multiplying property attention, the score for corresponding to each video frame is obtained；Then the score of all video frames is constituted into abstract sequence, the semantic information sequence of abstract sequence is obtained by retrospective encoder, the global semantic differentiation loss of building, introduce moving average model, the semantic dependency of study abstract sequence and video frame characteristic sequence, the new abstract sequence for retaining original video important information is obtained, set final abstract is selected finally by new abstract sequence.Invention enhances the robustness of model.

Description

Video summarization method based on attention expansion coding and decoding network

Technical field

The present invention relates to a kind of video frequency abstracts.More particularly to it is a kind of for video processing and index based on attention expand Open up the video summarization method of encoding and decoding network.

Background technique

With the fast development of information technology, video data explosive increase, there are redundancy and again in multitude of video data Multiple information, this makes every user's quick obtaining information needed become more difficult.In this case, video summarization technique Come into being, its target be generate one it is compact and comprehensively make a summary, provide target video within the shortest time for user Maximum information, want the needs of fast and accurately browsing video important information to meet people, improve people and obtain information Ability.

The research of video frequency abstract is generally divided into two classes: the method for supervised learning and unsupervised learning.It is wherein unsupervised to pluck Method is wanted to lay particular emphasis on the immanent structure of learning data, using lower-level vision feature, with the pith of positioning video.That studies is each In kind method, including cluster, sparse optimization and energy minimum etc..Main research at this stage is mostly based on the prison manually marked Educational inspector's learning method makes the abstract generated have original video by maximizing the similarity degree generated between abstract and artificial mark More information, and the performance of algorithm is generally better than the video summarization technique based on unsupervised learning.

The research of video summarization technique at present mainly regards video frequency abstract as sequence to Sequence Learning process, using length Short-term memory network (LSTM) and its variant model the relativity of time domain of video interframe.Using original video frame sequence as defeated Enter, export the importance score of each corresponding video frame, then importance score is sorted, is finally selected to close according to the score Key frame or crucial camera lens, obtain final abstract.

But supervised video summarization method requires the abstract generated and original video as close as will generate at present Abstract and corresponding ground-truth construct loss function, then continue to optimize the abstract of generation by backpropagation, make to give birth to At abstract and corresponding manual tag as close possible to finally making the abstract generated rich in original video information.This constraint The local corresponding relationship generated between abstract and true mark is only focused on, so that the abstract generated depends entirely on true mark. But concentrate the data containing supervision message less in disclosed reference data, so that model is easy to out during training Existing over-fitting, hardly results in a preferable depth model, influences the performance for ultimately generating abstract.And video frequency abstract Process is essentially mapping process of the original video to abstract, and many key messages may be lost in mapping process, therefore How making full use of the semantic information of original video and losing slowing down in mapping process information is also our problems to be solved, And when using stochastic gradient descent algorithm training neural network, parameter will avoid parameter from mutating when updating, and prevent from joining Influence of the number fluctuation to result.

Above method is concerned only with the local corresponding relationship generated between abstract and true mark, and does not comprehensively consider to generation The global restriction of abstract, and fail to make full use of the semantic information of video.And at the exceptional value of parameter renewal process Reason is without explicitly proposing that solution, this point also will affect final digest performance.

Summary of the invention

The technical problem to be solved by the invention is to provide a kind of video frequency abstracts based on attention expansion coding and decoding network Method.

The technical scheme adopted by the invention is that: a kind of video summarization method based on attention expansion coding and decoding network, It include: to regard video frequency abstract as sequence to Sequence Learning process, and using the relativity of time domain of video interframe, by SumMe Or the original video in TVSum obtains video frame characteristic sequence by pre-training network, using video frame characteristic sequence as attention The input of encoder network in power expansion coding and decoding network, obtains the semantic information sequence of video frame, then pass through multiplying property attention Decoding network, obtain the score for corresponding to each video frame；Then the score of all video frames is constituted into abstract sequence, by returning Gu property encoder obtains the semantic information sequence of abstract sequence, constructs global semanteme and differentiates loss, introduces moving average model, learn The semantic dependency for practising abstract sequence and video frame characteristic sequence obtains the new abstract sequence for retaining original video important information Column select set final abstract finally by new abstract sequence.

The original video important information is the importance score information marked in SumMe or TVSum.

Specifically comprise the following steps:

1) video frame is obtained with the polydispersity index of 2fps to original video in SumMe or TVSum, the video frame is passed through Pre-training obtains GoogLeNet network on ImageNet data set, extracts video frame characteristic sequence X={ x₁,x₂,...,x_T}；

2) by video frame characteristic sequence be input in attention expansion coding and decoding network by two-way shot and long term memory network In the encoder of composition, coding obtains the semantic information sequence V={ v of video frame₁,v₂,...,v_T}；

3) by the semantic information sequence inputting of video frame into the decoder being made of shot and long term memory network, in decoder Middle introducing attention mechanism, decoding obtain corresponding to the score of each video frame, and the score of all video frames is constituted abstract sequence Y ={ y₁,y₂,...,y_L}；

4) by the abstract sequence inputting of generation into the retrospective encoder being made of two-way shot and long term memory network, coding Obtain the semantic information sequence U={ u of abstract sequence₁,u₂,...,u_T, then by abstract sequence semantic information sequence with it is corresponding The importance score information that is marked in SumMe or TVSum constitute local discriminant loss on the basis of, be re-introduced into video The global of the semantic information Sequence composition of the semantic information sequence of frame and abstract sequence differentiates loss, generates representative new Abstract sequence；Wherein local discriminant loss and the global loss function for differentiating loss building are as follows:

L=L_O+λL_s

Wherein: local discriminant lossg_iIndicate that i-th of video frame is marked in SumMe or TVSum The importance score of note, y_iIndicate the score of each video frame generated；

The overall situation differentiates lossV indicates the semantic information sequence of video frame, and U indicates the semanteme of abstract sequence Information sequence, λ are tradeoff parameter, and being worth is 0.001；

5) moving average model is introduced, record constitutes the network of encoder, constitutes the network of decoder and constitute retrospective The average value of each parameter in the network of encoder passing parameter value within the set time, changes each parameter smoothing, suppression Parameter mutation processed, repeats step 1)~step 4) until obtaining the score of all video frames, constitutes new abstract sequence, finally lead to New abstract sequence is crossed to select set final abstract.

The semantic information sequence V={ v of video frame described in step 2)₁,v₂,...,v_TIt is that net is remembered by forward direction shot and long term The hidden state of networkWith the hidden state of backward shot and long term memory networkFusion obtains, whereinV has merged the contextual information of video frame, and t takes 1~T.

5. the video summarization method according to claim 3 based on attention expansion coding and decoding network, feature exist Attention mechanism is introduced into decoding process for improving the accurate of each video frame score prediction in decoding process in, step 3) Property, the attention mechanism is the prediction of fusion abstract sequence guidance current video frame importance scores, that is, passes through similarity Function measures the similarity that previous shot and long term memory network hides the semantic information sequence of layer state and current video frame, similar Spending function isAgain by normalizing the power weight that gains attentionWherein attention weightThe input for corresponding to decoder at this time is the new semantic information sequence V of video frame_t, whereinObtain corresponding to the score of each video frame by the new semantic information sequence of video frame.

Video summarization method based on attention expansion coding and decoding network of the invention is generating abstract and artificial mark structure On the basis of local restriction, it is introduced into the semantic information by original video and generates the semantic information of abstract in semantic space The global restriction of building incorporates attention mechanism, and smoothing parameter in the training process in decoding process, enhances model Robustness.Advantage is mainly reflected in:

1, it novelty: proposes a kind of based on attention and retrospective to encode the coding/decoding model that combines.It considers not only Time domain relevance before and after video frame preferably merges the effective information of video interframe, so that mapping of the original video to abstract Process is as complete as possible.And global semantic differentiation loss is introduced, the life of abstract is guided using the semantic information of original video At slowing down the less problem of supervision message in data set.

2, validity: the experiment in experiment and corresponding enhancing data set on SumMe, TVSum.The result shows that Method of the invention achieves current advanced level, the experiment knot in experiment and its enhancing data set especially on TVSum 3.1% and 2.8% is respectively increased than current fresh approach in fruit.

3, it practicability: can be used in multimedia signal processing field, reduction user obtains related resource as far as possible Time can preferably improve the search experience of user.

Detailed description of the invention

Fig. 1 is that the present invention is based on the flow charts of the video summarization method of attention expansion coding and decoding network.

Specific embodiment

Below with reference to embodiment and attached drawing to the video summarization method of the invention based on attention expansion coding and decoding network It is described in detail.

Abstract is generated for enhancing and retains the important ability with correlation information of original video, and the present invention uses for reference retrospective coding Thought is introduced global semantic differentiation loss, the generation of abstract is guided using the semantic information of original video, is constrained on the whole Summarization generation process, and label information is not needed in the process, model is slowed down to the data dependence of tape label.But with Unlike retrospective coding, the present invention is constrained to and sets out to obtain video frame range information and the maximum semantic information of construction Point has merged video frame contextual information, during which and is not introduced into mismatch loss between video, but regards video frequency abstract as single Video sequence regression process inputs the importance score for exporting the corresponding every frame of video for sequence of frames of video, reduces the ginseng of model Number, improves trained efficiency.Based on this present invention in a model decoder section be added multiplying property attention mechanism, not only Input of the last hiding layer state of encoder as decoder, but in the input and decoder for passing through current time decoder One moment output building similarity function, obtains the importance score of current time decoder input, by each of decoder Input is endowed different weights according to its significance level, so that model is obtained the more semantic informations of original video, was decoding Cheng Zhongneng preferably predicts the importance score of each video frame, obtains ideal set of video, generates last Abstract.

As shown in Figure 1, the video summarization method of the invention based on attention expansion coding and decoding network, comprising: by video Abstract regards sequence as to Sequence Learning process, and the relativity of time domain of utilization video interframe will be in SumMe or TVSum Original video obtains video frame characteristic sequence by pre-training network, using video frame characteristic sequence as attention expansion coding and decoding The input of encoder network in network obtains the semantic information sequence of video frame, then by the decoding network of multiplying property attention, obtains To the score of each video frame of correspondence；Then the score of all video frames is constituted into abstract sequence, is obtained by retrospective encoder To the semantic information sequence of abstract sequence, construct it is global semantic differentiate loss, introduce moving average model, study abstract sequence with The semantic dependency of video frame characteristic sequence obtains the new abstract sequence for retaining original video important information, and described is original Video important information is the importance score information marked in SumMe or TVSum.Come finally by new abstract sequence Select set final abstract.

Video summarization method based on attention expansion coding and decoding network of the invention, specifically comprises the following steps:

2) by video frame characteristic sequence be input in attention expansion coding and decoding network by two-way shot and long term memory network (Bi-LSTM) in the encoder constituted, coding obtains the semantic information sequence V={ v of video frame₁,v₂,...,v_T}；

The semantic information sequence V={ v of the video frame₁,v₂,...,v_TIt is by forward direction shot and long term memory network (LSTM) hidden stateWith the hidden state of backward shot and long term memory network (LSTM)Melt Conjunction obtains, whereinV has merged the contextual information of video frame, and t takes 1~T.

The present invention is introduced into attention mechanism for improving each video frame score prediction in decoding process in decoding process Accuracy, the attention mechanism be fusion abstract sequence guidance current video frame importance scores prediction, that is, pass through It is similar to the semantic information sequence of current video frame that similarity function measures the hiding layer state of previous shot and long term memory network Degree, similarity function areAgain by normalizing the power weight that gains attentionWherein pay attention to Power weightThe input for corresponding to decoder at this time is the new semantic information sequence V of video frame_t, InObtain corresponding to the score of each video frame by the new semantic information sequence of video frame.

4) by the abstract sequence inputting of generation into the retrospective encoder being made of two-way shot and long term memory network, coding Obtain the semantic information sequence U={ u of abstract sequence₁,u₂,...,u_T, then by abstract sequence semantic information sequence with it is corresponding The importance score information that is marked in SumMe or TVSum constitute local discriminant loss on the basis of, be re-introduced into video The semantic information sequence of the semantic information sequence of frame and abstract sequence at it is global differentiate loss, generate representative new plucks Want sequence；Wherein local discriminant loss and the global loss function for differentiating loss building are as follows:

L=L_O+λL_s

The overall situation differentiates lossV indicates the semantic information sequence of video frame, and U indicates the semanteme of abstract sequence Information sequence, to weigh parameter, value 0.001；

5) moving average model is introduced, record constitutes the network of encoder, constitutes the network of decoder and constitute retrospective The average value of each parameter in the network of encoder passing parameter value within the set time, changes each parameter smoothing, suppression Parameter mutation processed, increases the robustness of the video summarization method based on attention expansion coding and decoding network, repeats step 1)~step It is rapid to constitute new abstract sequence 4) until obtain the score of all video frames, it is selected finally by new abstract sequence set Fixed final abstract.

The moving average model is to take turns number by updating when using stochastic gradient descent algorithm training neural network The size of attenuation rate is set dynamically, carrys out the update amplitude of control parameter with this so that model training just period parameters update compared with Fastly, close at optimal value parameter update slower, amplitude is smaller, parameter by training, can finally it is stable at one close to true Near weighted value.It in test phase, is predicted using smoothing parameter, improves performance of the final mask in test data. Parameter updates as follows: s_r=d*s_r-1+ (1-d) * v, wherein s_rIt indicates by the updated parameter of r wheel training, s_r-1For it is passing more New parameter, v are the parameter when front-wheel number is updated.Wherein d=min { id, (1+r)/(10+r) }, wherein d indicates attenuation rate, Id is that the initial attenuation rate r of setting is the wheel number that model parameter updates.

Claims

1. a kind of video summarization method based on attention expansion coding and decoding network characterized by comprising see video frequency abstract Work is sequence to Sequence Learning process, and using the relativity of time domain of video interframe, by the original view in SumMe or TVSum Frequency obtains video frame characteristic sequence by pre-training network, using video frame characteristic sequence as in attention expansion coding and decoding network The input of encoder network obtains the semantic information sequence of video frame, then by the decoding network of multiplying property attention, is corresponded to The score of each video frame；Then the score of all video frames is constituted into abstract sequence, is made a summary by retrospective encoder The semantic information sequence of sequence constructs global semantic differentiation loss, introduces moving average model, study abstract sequence and video frame The semantic dependency of characteristic sequence obtains the new abstract sequence for retaining original video important information, finally by new abstract Sequence selects set final abstract.

2. the video summarization method according to claim 1 based on attention expansion coding and decoding network, which is characterized in that institute The original video important information stated is the importance score information marked in SumMe or TVSum.

3. the video summarization method according to claim 1 based on attention expansion coding and decoding network, which is characterized in that tool Body includes the following steps:

1) video frame is obtained with the polydispersity index of 2fps to original video in SumMe or TVSum, the video frame is passed through Pre-training obtains GoogLeNet network on ImageNet data set, extracts video frame characteristic sequence X={ x1, x₂,...,x_T}；

2) video frame characteristic sequence is input to being made of in attention expansion coding and decoding network two-way shot and long term memory network Encoder in, coding obtain the semantic information sequence V={ v of video frame₁,v₂,...,v_T}；

3) the semantic information sequence inputting of video frame is drawn in a decoder into the decoder being made of shot and long term memory network Enter attention mechanism, decoding obtains corresponding to the score of each video frame, and the score of all video frames is constituted abstract sequence Y= {y₁,y₂,...,y_L}；

4) by the abstract sequence inputting of generation into the retrospective encoder being made of two-way shot and long term memory network, coding is obtained The semantic information sequence U={ u for sequence of making a summary₁,u₂,...,u_T, then by abstract sequence semantic information sequence with it is corresponding On the basis of the local discriminant loss that the importance score information marked in SumMe or TVSum is constituted, it is re-introduced into video frame The global of the semantic information Sequence composition of semantic information sequence and abstract sequence differentiates loss, generates representative new abstract Sequence；Wherein local discriminant loss and the global loss function for differentiating loss building are as follows:

L=L_O+λL_s

Wherein: local discriminant lossg_iIndicate what i-th of video frame was marked in SumMe or TVSum Importance score, y_iIndicate the score of each video frame generated；

The overall situation differentiates lossV indicates the semantic information sequence of video frame, and U indicates the semantic information of abstract sequence Sequence, λ are tradeoff parameter, and being worth is 0.001；

5) moving average model is introduced, record constitutes the network of encoder, constitutes the network of decoder and constitutes retrospective coding The average value of each parameter in the network of device passing parameter value within the set time, changes each parameter smoothing, inhibits ginseng Numerical mutation repeats step 1)~step 4) until obtaining the score of all video frames, new abstract sequence is constituted, finally by new Abstract sequence select set final abstract.

4. the video summarization method according to claim 3 based on attention expansion coding and decoding network, which is characterized in that step It is rapid 2) described in video frame semantic information sequence V={ v₁,v₂,...,v_TIt is by the hiding shape of forward direction shot and long term memory network StateWith the hidden state of backward shot and long term memory networkFusion obtains, whereinV has merged the contextual information of video frame, and t takes 1~T.

5. the video summarization method according to claim 3 based on attention expansion coding and decoding network, which is characterized in that step The rapid attention mechanism that 3) is introduced into decoding process is for improving the accuracy of each video frame score prediction in decoding process, institute The attention mechanism stated is the prediction of fusion abstract sequence guidance current video frame importance scores, that is, passes through similarity function degree Measure the similarity that previous shot and long term memory network hides the semantic information sequence of layer state and current video frame, similarity function ForAgain by normalizing the power weight that gains attentionWherein attention weightThe input for corresponding to decoder at this time is the new semantic information sequence V of video frame_t, whereinObtain corresponding to the score of each video frame by the new semantic information sequence of video frame.