CN104113789B - Online video summary generation method based on deep learning - Google Patents

Online video summary generation method based on deep learning

Info

Publication number
CN104113789B
Authority
CN
China
Prior art keywords
video
frame
frame block
dictionary
parameter
Prior art date
Legal status
Active
Application number
CN201410326406.9A
Other languages
Chinese (zh)
Other versions
CN104113789A (en)
Inventor
李平
俞俊
李黎
徐向华
Current Assignee
Hangzhou Huicui Intelligent Technology Co ltd
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201410326406.9A priority Critical patent/CN104113789B/en
Publication of CN104113789A publication Critical patent/CN104113789A/en
Application granted granted Critical
Publication of CN104113789B publication Critical patent/CN104113789B/en

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an online video summary generation method based on deep learning. The original video is processed as follows: 1) the video is uniformly cut into a group of small frame blocks, and statistical features of each frame image are extracted to form corresponding vectorized representations; 2) a multi-layer deep network is pre-trained on the video frames to obtain a nonlinear representation of each frame; 3) the first m frame blocks are taken as the initial condensed video, which is reconstructed with a group sparse coding algorithm to obtain an initial dictionary and reconstruction coefficients; 4) the deep network parameters are updated according to the next frame block, that block is reconstructed and its reconstruction error computed, and if the error exceeds a set threshold the block is added to the condensed video and the dictionary is updated; 5) new frame blocks are processed online in sequence according to step 4) until the video ends, and the updated condensed video is the generated video summary. With this method, latent high-level semantic information in the video can be mined in depth and the video summary generated quickly, saving users' time and improving the visual experience.

Description

A Deep Learning-Based Online Video Summary Generation Method

Technical Field

The invention belongs to the technical field of video summary generation, and in particular relates to an online video summary generation method based on deep learning.

Background Art

In recent years, with the growing popularity of portable devices such as digital cameras, smartphones, and handheld computers, the volume of video of all kinds has grown explosively. For example, a medium-sized city may have tens of thousands of video-capture channels serving socially important fields such as intelligent transportation, security surveillance, and public-security deployment, and the video data these devices produce reaches the petabyte level. To locate a target person or vehicle, police and traffic officers must spend large amounts of time reviewing long, tedious surveillance streams, which severely reduces working efficiency and hinders the building of a safe city. Therefore, effectively selecting the video frames that carry key information from lengthy video streams, namely video summarization, has drawn wide attention from both academia and industry.

Traditional video summarization techniques mainly target edited, structured videos. A movie, for example, can be divided into multiple scenes, each scene consists of several plots occurring at the same location, and each plot is composed of a series of smooth, continuous video frames. Unlike structured videos such as movies, TV dramas, and news reports, surveillance video is generally unedited and unstructured, which poses a considerable challenge to the application of video summarization technology.

At present, the main video summarization approaches are keyframe-based methods, new-image synthesis, video frame blocks, and conversion to natural language. Keyframe-based methods include strategies such as plot edge detection, video frame clustering, color histograms, and motion stability. New-image synthesis generates an image from a few consecutive frames containing important content, and is susceptible to blurring between different frames. The video-frame-block approach uses scene edge detection, dialogue analysis, and similar techniques for structured video to cut the original into a short thematic movie. Conversion to natural language uses the subtitles and speech in a video to turn the video summary into a text summary, and is not suitable for surveillance video without subtitles or sound.

Important fields such as intelligent transportation and security deployment continuously produce large volumes of unstructured video, and traditional video summarization methods cannot meet the requirements of processing streaming video online. There is therefore an urgent need for a method that can process video streams online while efficiently and accurately selecting a video summary containing the key content.

Summary of the Invention

To condense and streamline long, tedious video streams online, efficiently and accurately, thereby saving users' time and enhancing the visual presentation of video content, the present invention proposes an online video summary generation method based on deep learning, comprising the following steps:

1. After obtaining the original video data, perform the following operations:

1) Uniformly cut the video into a group of small frame blocks, each containing multiple frames; extract statistical features from each frame image to form a corresponding vectorized representation;

2) Pre-train a multi-layer deep network on the video frames to obtain a nonlinear representation of each frame;

3) Take the first m frame blocks as the initial condensed video and reconstruct it with a group sparse coding algorithm to obtain an initial dictionary and reconstruction coefficients;

4) Update the deep network parameters according to the next frame block while reconstructing that block and computing its reconstruction error; if the error exceeds a set threshold, add the block to the condensed video and update the dictionary;

5) Process new frame blocks online in sequence according to step 4) until the video ends; the updated condensed video is the generated video summary.

Further, extracting the statistical features of each frame image in step 1) to form the corresponding vectorized representation specifically comprises:

1) Suppose the original video is uniformly divided into n frame blocks, i.e. $X = \{X_1, \dots, X_n\}$, each containing t frame images (e.g. t = 80); scale each frame image to a uniform pixel size while keeping the original aspect ratio;

2) Extract global features of each frame image, such as the color histogram, color moments, edge orientation histogram, Gabor wavelet transform, and local binary patterns (LBP), together with local features such as the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF);

3) Concatenate the above image features of each frame in order, forming a vectorized representation of dimension $n_f$ (a minimal sketch of this step is given below).
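For illustration only (this sketch is not part of the patent text), the following Python fragment shows the shape of step 1): uniform cutting into frame blocks and a per-frame descriptor. It assumes frames arrive as H×W×3 uint8 NumPy arrays; the color and edge-orientation histograms stand in for the fuller feature set (Gabor, LBP, SIFT, SURF) named above, and all function names are hypothetical.

```python
import numpy as np

def split_into_blocks(frames, t=80):
    """Step 1): uniformly cut the frame list into blocks of t frames."""
    n_full = len(frames) - len(frames) % t
    return [frames[i:i + t] for i in range(0, n_full, t)]

def frame_features(frame, bins=16):
    """Toy per-frame descriptor: per-channel color histogram plus an
    edge-orientation histogram; a stand-in for the global/local
    features (color moments, Gabor, LBP, SIFT, SURF) in the text."""
    f = frame.astype(np.float64) / 255.0
    color = np.concatenate(
        [np.histogram(f[..., c], bins=bins, range=(0, 1))[0] for c in range(3)])
    gray = f.mean(axis=2)
    gy, gx = np.gradient(gray)                      # finite-difference gradients
    ang = np.arctan2(gy, gx)
    edge = np.histogram(ang, bins=bins, range=(-np.pi, np.pi),
                        weights=np.hypot(gx, gy))[0]
    v = np.concatenate([color, edge]).astype(np.float64)
    return v / (np.linalg.norm(v) + 1e-12)          # n_f-dimensional vector
```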

Further, pre-training the multi-layer deep network on video frames in step 2) to obtain the nonlinear representation of each frame specifically comprises:

Pre-training a multi-layer deep network (fewer than 10 layers) with stacked denoising autoencoders (SDA), as sketched below:

a. At each layer, process each frame image as follows: first, generate a noisy version of the frame, for example by adding small Gaussian noise or by randomly setting input variables to arbitrary values; then map the noisy image through an autoencoder (AE) to obtain its nonlinear representation;

b. Adjust and update the parameters of each layer of the deep network with a stochastic gradient descent algorithm.
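A minimal sketch of the SDA pre-training just described, under simplifying assumptions not fixed by the text (tied encoder/decoder weights, sigmoid units, squared-error loss, plain NumPy); it illustrates the layer-wise denoising scheme, not the patent's exact network:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_dae_layer(X, n_hidden, noise=0.1, lr=0.1, epochs=20):
    """Pre-train one denoising-autoencoder layer on the rows of X:
    corrupt the input with small Gaussian noise, reconstruct the clean
    input, and update the parameters by stochastic gradient descent."""
    n_vis = X.shape[1]
    W = rng.normal(0.0, 0.01, (n_vis, n_hidden))    # tied weights
    b, c = np.zeros(n_hidden), np.zeros(n_vis)      # hidden / visible biases
    for _ in range(epochs):
        for x in rng.permutation(X):
            x_noisy = x + noise * rng.standard_normal(n_vis)   # corruption
            h = sigmoid(x_noisy @ W + b)                       # encode
            r = sigmoid(h @ W.T + c)                           # decode
            dr = (r - x) * r * (1 - r)          # backprop of 0.5*||r - x||^2
            dh = (dr @ W) * h * (1 - h)
            W -= lr * (np.outer(x_noisy, dh) + np.outer(dr, h))
            b -= lr * dh
            c -= lr * dr
    return W, b

def encode_stack(X, layers):
    """Nonlinear representation of each frame: pass through all layers."""
    for W, b in layers:
        X = sigmoid(X @ W + b)
    return X
```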

Reconstructing the initial condensed video with the group sparse coding algorithm in step 3) specifically comprises:

1) The initial condensed video consists of the first m frame blocks of the original video (m a positive integer less than 50), i.e. $\hat{X} = \{X_1, \dots, X_m\}$, comprising $n_{\text{init}} = m \times t$ frame images in total, where $X_k$ is the k-th original frame block; the pre-trained deep network yields the corresponding nonlinear representations $\hat{Y} = \{Y_1, \dots, Y_m\}$, where $Y_k$ is the nonlinear representation of the k-th frame block;

2) Let the initial dictionary D consist of $n_d$ atoms, i.e. $D = \{d_1, \dots, d_{n_d}\}$, where $d_j$ is the j-th atom; let the reconstruction coefficients be C, whose number of elements corresponds to the number of frames and whose dimension corresponds to the number of dictionary atoms, where $C_k$ holds the coefficients of the k-th frame block and $c^i$ those of the i-th frame image;

3) Optimize the group sparse coding objective of the regularized dictionary with the alternating direction method of multipliers, obtaining the initial dictionary D and the reconstruction coefficients C respectively, i.e. solve

$$\min_{D,\,C}\ \sum_{k=1}^{m} F(Y_k, C_k, D) + \lambda \sum_{j=1}^{n_d} \|d_j\|_2^2,$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm of a variable, the regularization parameter λ is a real number greater than 0, and the multivariate function $F(Y_k, C_k, D)$ is

$$F(Y_k, C_k, D) = \frac{1}{2 n_f} \sum_{y_i \in Y_k} \Big\| y_i - \sum_{j=1}^{n_d} c_j^i\, d_j \Big\|_2^2 + \gamma \sum_{j=1}^{n_d} \|c_j\|_2,$$

where the parameter γ is a real number greater than 0 and the expression inside the first norm is the reconstruction of the i-th frame image with dictionary D. The alternating direction method of multipliers here proceeds as follows: first fix D, which makes the objective convex in C; then fix C, which makes it convex in D; and iterate, updating the two parameters alternately (see the sketch after this step).
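The alternation between the two convex subproblems can be sketched as follows. One substitution is made for simplicity: where the text prescribes the alternating direction method of multipliers, this sketch solves the C-subproblem by proximal gradient descent (group soft-thresholding over atom rows) and the D-subproblem in closed form; the constant $1/(2n_f)$ is absorbed into γ and λ. $Y_k$ is taken to be an $n_f \times t$ matrix of frame representations, and all names are hypothetical.

```python
import numpy as np

def group_soft_threshold(C, tau):
    """Prox of tau * sum_j ||c_j||_2: shrink whole atom rows of C, so a
    frame block either uses a dictionary atom or drops it entirely."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    return C * np.maximum(0.0, 1.0 - tau / (norms + 1e-12))

def fit_coefficients(Y, D, gamma, n_iter=100):
    """Fix D and solve for the block's coefficients C (n_d x t) by
    proximal gradient on 0.5*||Y - D C||_F^2 + gamma * sum_j ||c_j||_2."""
    C = np.zeros((D.shape[1], Y.shape[1]))
    L = np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = D.T @ (D @ C - Y)
        C = group_soft_threshold(C - grad / L, gamma / L)
    return C

def fit_dictionary(Ys, Cs, n_atoms, lam):
    """Fix C and update D in closed form:
    min_D sum_k ||Y_k - D C_k||_F^2 + lam * ||D||_F^2."""
    A = sum(C @ C.T for C in Cs) + lam * np.eye(n_atoms)
    B = sum(Y @ C.T for Y, C in zip(Ys, Cs))
    return B @ np.linalg.inv(A)

def group_sparse_coding(Ys, n_atoms=64, gamma=0.1, lam=0.01, rounds=10):
    """Alternate the two convex subproblems to get D and the C_k."""
    n_f = Ys[0].shape[0]
    D = np.random.default_rng(0).standard_normal((n_f, n_atoms))
    D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
    Cs = []
    for _ in range(rounds):
        Cs = [fit_coefficients(Y, D, gamma) for Y in Ys]
        D = fit_dictionary(Ys, Cs, n_atoms, lam)
    return D, Cs
```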

Updating the deep network parameters according to the next frame block and reconstructing that block and computing its reconstruction error in step 4) specifically comprises:

1) For each frame image of the block, in turn:

a. Update the parameters of the last layer of the deep neural network, i.e. the weights W and bias b, with an online gradient descent algorithm;

b. Update the parameters of the other layers of the deep neural network with the backpropagation algorithm;

2) Update the nonlinear representation of each frame image according to the new parameters;

3) Based on the existing dictionary D, reconstruct the current frame block with group sparse coding and compute the error ε, i.e. reconstruct the nonlinear representation $Y_k$ of the current frame block $X_k$: first minimize the multivariate function $F(Y_k, C_k, D)$ to obtain the optimal reconstruction coefficients $C_k^*$; then substitute $C_k^*$ into the first term of F and evaluate it; the resulting value is the current reconstruction error ε (a sketch follows).
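Continuing the sketch above, the reconstruction error ε of sub-step 3), i.e. the first (data-fit) term of F evaluated at the optimal coefficients, might be computed as follows; the per-frame online update of W and b in sub-step 1) is a single SGD step of the same form as the pre-training loop and is omitted here:

```python
import numpy as np

def reconstruction_error(Y, D, gamma):
    """Reconstruct a new block's nonlinear representation Y with the
    current dictionary D and return the data-fit term of F as the
    error epsilon (fit_coefficients is defined in the sketch above)."""
    C = fit_coefficients(Y, D, gamma)
    n_f = Y.shape[0]
    eps = np.linalg.norm(Y - D @ C) ** 2 / (2.0 * n_f)
    return eps, C
```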

In step 4), if the error exceeds the set threshold, the current frame block is added to the condensed video and the dictionary is updated, specifically:

1) If the reconstruction error ε computed for the nonlinear representation $Y_k$ of the current frame block $X_k$ exceeds the set threshold θ (an empirical value), add the current frame block to the condensed video, i.e. $\hat{X} \leftarrow \hat{X} \cup \{X_k\}$;

2) If the current condensed video $\hat{X}$ contains q frame blocks, the set of nonlinear frame representations for the dictionary update is $\hat{Y} = \{Y_1, \dots, Y_q\}$; updating the dictionary D with $\hat{Y}$ then amounts to solving the objective

$$\min_{D}\ \sum_{k=1}^{q} F(Y_k, C_k^*, D) + \lambda \sum_{j=1}^{n_d} \|d_j\|_2^2,$$

where the parameter λ is a real number greater than 0 and adjusts the influence of the regularization term. (A driver sketch tying these steps together follows.)
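Putting the pieces together, a hypothetical driver for steps 3) to 5), reusing the sketches above. θ is the empirical threshold from the text; re-fitting the dictionary on the whole condensed video after each accepted block is a simplification of the described incremental update.

```python
def summarize_online(blocks, m=5, theta=0.1, gamma=0.1, lam=0.01):
    """Online summary loop: seed the condensed video with the first m
    blocks, then keep only blocks that the current dictionary
    reconstructs poorly, updating the dictionary after each addition."""
    summary = list(blocks[:m])                       # initial condensed video
    D, _ = group_sparse_coding(summary, gamma=gamma, lam=lam)
    for Y in blocks[m:]:
        eps, _ = reconstruction_error(Y, D, gamma)
        if eps > theta:                              # novel content: keep it
            summary.append(Y)
            D, _ = group_sparse_coding(summary, gamma=gamma, lam=lam)
    return summary, D
```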

The present invention proposes an online video summary generation method based on deep learning. Its advantages are: deep learning mines the high-level semantic features of the video, so that group sparse coding better reflects how well the dictionary reconstructs the current video frame block, and the most informative frame blocks thereby form a video summary containing the regions of interest and key people and events; the condensed summary saves users a great deal of time while enhancing the visual experience of the key content.

Brief Description of the Drawings

Fig. 1 is a flow chart of the method of the present invention.

Detailed Description

The present invention is further described with reference to Fig. 1:

1. After obtaining the original video data, steps 1) to 5) are performed exactly as set out in the Summary of the Invention above; the specifics of each step, including the feature extraction, the SDA pre-training, the group-sparse reconstruction with its objective functions, the online parameter updates, and the dictionary update, are identical to those described there.

Claims (4)

1. An online video summary generation method based on deep learning, characterized in that, after the original video is obtained, the following operations are performed:
1) uniformly cutting the video into a group of small frame blocks, each containing t frame images, and extracting the statistical features of each frame image to form a vectorized representation of dimension $n_f$;
2) pre-training a multi-layer deep network on the video frames to obtain a nonlinear representation of each frame;
3) taking the first m frame blocks as the initial condensed video and reconstructing it with a group sparse coding algorithm to obtain an initial dictionary and reconstruction coefficients;
4) updating the deep network parameters according to the next frame block while reconstructing that block and computing its reconstruction error, and, if the error exceeds a set threshold, adding the block to the condensed video and updating the dictionary;
5) processing new frame blocks online in sequence according to step 4) until the end; the updated condensed video is the generated video summary.
2. The online video summary generation method based on deep learning of claim 1, characterized in that extracting the statistical features of each frame image in step 1) to form the corresponding vectorized representation comprises the steps of:
1.1) uniformly dividing the original video into n frame blocks, i.e. $X = \{X_1, \dots, X_n\}$, each containing t frame images, and scaling each frame image to a uniform pixel size while keeping the original aspect ratio;
1.2) extracting the global features and local features of each frame image;
the global features comprising the color histogram, color moments, edge orientation histogram, Gabor wavelet transform, and local binary patterns;
the local features comprising the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF);
1.3) sequentially concatenating the above image features of each frame to form a vectorized representation of dimension $n_f$.
3. The online video summary generation method based on deep learning of claim 1, characterized in that pre-training the multi-layer deep network on video frames in step 2) to obtain the nonlinear representation of each frame specifically uses stacked denoising autoencoders (SDA) to pre-train the multi-layer deep network, comprising:
a. processing each frame image at each layer as follows: first, generating a noisy version of the frame by adding Gaussian noise or randomly setting input variables to arbitrary values; then mapping the noisy image through an autoencoder (AE) to obtain its nonlinear representation;
b. adjusting and updating the parameters of each layer of the deep network with a stochastic gradient descent algorithm.
4. The online video summary generation method based on deep learning of claim 1, characterized in that reconstructing the initial condensed video with the group sparse coding algorithm in step 3) comprises the steps of:
3.1) composing the initial condensed video of the first m frame blocks of the original video, i.e. $\hat{X} = \{X_1, \dots, X_m\}$, comprising $n_{\text{init}} = m \times t$ frame images in total, where $X_k$ is the k-th original frame block; obtaining the corresponding nonlinear representations $\hat{Y} = \{Y_1, \dots, Y_m\}$ from the pre-trained deep network, where $Y_k$ is the nonlinear representation of the k-th frame block;
3.2) letting the initial dictionary D consist of $n_d$ atoms, i.e. $D = \{d_1, \dots, d_{n_d}\}$, where $d_j$ is the j-th atom; letting the reconstruction coefficients be C, whose number of elements corresponds to the number of frames and whose dimension corresponds to the number of dictionary atoms, where $C_k$ holds the reconstruction coefficients of the k-th frame block and $c^i$ those of the i-th frame image;
3.3) optimizing the group sparse coding objective of the regularized dictionary with the alternating direction method of multipliers to obtain the initial dictionary D and the reconstruction coefficients C respectively, i.e. solving:

$$\min_{D,\,C}\ \sum_{k=1}^{m} F(Y_k, C_k, D) + \lambda \sum_{j=1}^{n_d} \|d_j\|_2^2,$$

wherein the symbol $\|\cdot\|_2$ denotes the $\ell_2$ norm of a variable, the regularization parameter λ is a real number greater than 0, and the multivariate function $F(Y_k, C_k, D)$ is:

$$F(Y_k, C_k, D) = \frac{1}{2 n_f} \sum_{y_i \in Y_k} \Big\| y_i - \sum_{j=1}^{n_d} c_j^i\, d_j \Big\|_2^2 + \gamma \sum_{j=1}^{n_d} \|c_j\|_2,$$

wherein the parameter γ is a real number greater than 0 and the expression inside the first norm is the reconstruction of the i-th frame image with dictionary D; the alternating direction method of multipliers here first fixes D, making the objective convex in C, then fixes C, making it convex in D, and iterates, updating the two parameters alternately;
and wherein updating the deep network parameters according to the next frame block and reconstructing that block and computing its reconstruction error in step 4) comprises the steps of:
4.1) for each frame image of the block, in turn:
4.1.1) updating the parameters of the last layer of the deep neural network, i.e. the weights W and bias b, with an online gradient descent algorithm;
4.1.2) updating the parameters of the other layers of the deep neural network with the backpropagation algorithm;
4.2) updating the nonlinear representation of each frame image according to the new parameters;
4.3) based on the existing dictionary D, reconstructing the current frame block with group sparse coding and computing the error ε, i.e. reconstructing the nonlinear representation $Y_k$ of the current frame block $X_k$ by first minimizing the multivariate function $F(Y_k, C_k, D)$ to obtain the optimal reconstruction coefficients $C_k^*$, then substituting $C_k^*$ into the first term of F and evaluating it; the resulting value is the current reconstruction error ε;
and wherein, in step 4), if the error exceeds the set threshold, adding the current frame block to the condensed video and updating the dictionary comprises:
(1) if the reconstruction error ε computed for the nonlinear representation $Y_k$ of the current frame block $X_k$ exceeds the set threshold θ, adding the current frame block to the condensed video, i.e. $\hat{X} \leftarrow \hat{X} \cup \{X_k\}$;
(2) if the current condensed video $\hat{X}$ contains q frame blocks, taking the set of nonlinear frame representations for the dictionary update as $\hat{Y} = \{Y_1, \dots, Y_q\}$ and updating the dictionary D with $\hat{Y}$ by solving the objective:

$$\min_{D}\ \sum_{k=1}^{q} F(Y_k, C_k^*, D) + \lambda \sum_{j=1}^{n_d} \|d_j\|_2^2,$$

wherein the parameter λ is a real number greater than 0 and adjusts the influence of the regularization term.
CN201410326406.9A 2014-07-10 2014-07-10 Online video summary generation method based on deep learning Active CN104113789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410326406.9A CN104113789B (en) 2014-07-10 2014-07-10 Online video summary generation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410326406.9A CN104113789B (en) 2014-07-10 2014-07-10 Online video summary generation method based on deep learning

Publications (2)

Publication Number Publication Date
CN104113789A CN104113789A (en) 2014-10-22
CN104113789B true CN104113789B (en) 2017-04-12

Family

ID=51710398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410326406.9A Active CN104113789B (en) 2014-07-10 2014-07-10 Online video summary generation method based on deep learning

Country Status (1)

Country Link
CN (1) CN104113789B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3241185A4 (en) * 2014-12-30 2018-07-25 Nokia Technologies Oy Moving object detection in videos
CN104778659A (en) * 2015-04-15 2015-07-15 杭州电子科技大学 Single-frame image super-resolution reconstruction method on basis of deep learning
CN105279495B (en) * 2015-10-23 2019-06-04 天津大学 A video description method based on deep learning and text summarization
CN105930314B (en) * 2016-04-14 2019-02-05 清华大学 Text summary generation system and method based on encoding-decoding deep neural network
CN106331433B (en) * 2016-08-25 2020-04-24 上海交通大学 Video denoising method based on deep recurrent neural network
CN106502985B (en) * 2016-10-20 2020-01-31 清华大学 A neural network modeling method and device for generating titles
CN106778571B (en) * 2016-12-05 2020-03-27 天津大学 Digital video feature extraction method based on deep neural network
CN106686403B (en) * 2016-12-07 2019-03-08 腾讯科技(深圳)有限公司 A kind of video preview drawing generating method, device, server and system
CN106993240B (en) * 2017-03-14 2020-10-16 天津大学 Multi-video abstraction method based on sparse coding
CN107679031B (en) * 2017-09-04 2021-01-05 昆明理工大学 Advertisement and blog identification method based on stacking noise reduction self-coding machine
CN107729821B (en) * 2017-09-27 2020-08-11 浙江大学 A video generalization method based on one-dimensional sequence learning
CN107886109B (en) * 2017-10-13 2021-06-25 天津大学 A Video Summarization Method Based on Supervised Video Segmentation
CN107911755B (en) * 2017-11-10 2020-10-20 天津大学 Multi-video abstraction method based on sparse self-encoder
CN109803067A (en) * 2017-11-16 2019-05-24 富士通株式会社 Video concentration method, video enrichment facility and electronic equipment
CN108417206A (en) * 2018-02-27 2018-08-17 四川云淞源科技有限公司 High speed information processing method based on big data
CN108417204A (en) * 2018-02-27 2018-08-17 四川云淞源科技有限公司 Information security processing method based on big data
CN108388942A (en) * 2018-02-27 2018-08-10 四川云淞源科技有限公司 Information intelligent processing method based on big data
CN108419094B (en) * 2018-03-05 2021-01-29 腾讯科技(深圳)有限公司 Video processing method, video retrieval method, device, medium and server
CN110366050A (en) * 2018-04-10 2019-10-22 北京搜狗科技发展有限公司 Processing method, device, electronic equipment and the storage medium of video data
CN108848422B (en) * 2018-04-19 2020-06-02 清华大学 Video abstract generation method based on target detection
CN111046887A (en) * 2018-10-15 2020-04-21 华北电力大学(保定) A method for feature extraction of noisy images
CN109360436B (en) * 2018-11-02 2021-01-08 Oppo广东移动通信有限公司 Video generation method, terminal and storage medium
CN111246246A (en) * 2018-11-28 2020-06-05 华为技术有限公司 Video playing method and device
CN109635777B (en) * 2018-12-24 2022-09-13 广东理致技术有限公司 Video data editing and identifying method and device
CN109905778B (en) * 2019-01-03 2021-12-03 上海大学 Method for scalable compression of single unstructured video based on group sparse coding
CN110110646B (en) * 2019-04-30 2021-05-04 浙江理工大学 A deep learning-based method for extracting keyframes from gesture images
CN110225368B (en) * 2019-06-27 2020-07-10 腾讯科技(深圳)有限公司 Video positioning method and device and electronic equipment
CN110446067B (en) * 2019-08-30 2021-11-02 杭州电子科技大学 Video Enrichment Method Based on Tensor Decomposition
US11295084B2 (en) 2019-09-16 2022-04-05 International Business Machines Corporation Cognitively generating information from videos
CN111563423A (en) * 2020-04-17 2020-08-21 西北工业大学 Object detection method and system in UAV image based on deep denoising autoencoder
CN113626641B (en) * 2021-08-11 2023-09-01 南开大学 Method for generating video abstract based on neural network of multi-modal data and aesthetic principle
CN117725148B (en) * 2024-02-07 2024-06-25 湖南三湘银行股份有限公司 Question-answer word library updating method based on self-learning

Citations (7)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167284A (en) * 2011-12-19 2013-06-19 中国电信股份有限公司 Video streaming transmission method and system based on picture super-resolution
CN102930518A (en) * 2012-06-13 2013-02-13 上海汇纳网络信息科技有限公司 Improved sparse representation based image super-resolution method
CN103118220A (en) * 2012-11-16 2013-05-22 佳都新太科技股份有限公司 Keyframe pick-up algorithm based on multi-dimensional feature vectors
CN103295242A (en) * 2013-06-18 2013-09-11 南京信息工程大学 Multi-feature united sparse represented target tracking method
CN103413125A (en) * 2013-08-26 2013-11-27 中国科学院自动化研究所 Horror Video Recognition Method Based on Discriminative Example Selection Multiple Instance Learning
CN103531199A (en) * 2013-10-11 2014-01-22 福州大学 Ecological sound recognition method based on fast sparse decomposition and deep learning
CN103761531A (en) * 2014-01-20 2014-04-30 西安理工大学 Sparse-coding license plate character recognition method based on shape and contour features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sparse-coding-based natural feature extraction and denoising; Shang Li; Journal of System Simulation; 2005-07-31; pp. 1782-1787 *

Also Published As

Publication number Publication date
CN104113789A (en) 2014-10-22

Similar Documents

Publication Publication Date Title
CN104113789B (en) Online video summary generation method based on deep learning
Qin et al. Coverless image steganography: a survey
CN111565318A (en) Video compression method based on sparse samples
CN102750385B (en) Correlation-quality sequencing image retrieval method based on tag retrieval
CN102682298B (en) Video fingerprint method based on graph modeling
CN115131710B (en) Real-time action detection method based on multi-scale feature fusion attention
CN111506773A (en) Video duplicate removal method based on unsupervised depth twin network
CN106203492A (en) The system and method that a kind of image latent writing is analyzed
CN106778571B (en) Digital video feature extraction method based on deep neural network
CN113902925A (en) A method and system for semantic segmentation based on deep convolutional neural network
CN105095857A (en) Face data enhancement method based on key point disturbance technology
CN107169417A (en) Strengthened based on multinuclear and the RGBD images of conspicuousness fusion cooperate with conspicuousness detection method
CN108509939A (en) A kind of birds recognition methods based on deep learning
CN104156464A (en) Micro-video retrieval method and device based on micro-video feature database
CN109034953A (en) A Movie Recommendation Method
CN111382305B (en) Video deduplication method, video deduplication device, computer equipment and storage medium
CN102547477A (en) Video fingerprint method based on contourlet transformation model
CN107027051A (en) A kind of video key frame extracting method based on linear dynamic system
Liu et al. Ensemble of CNN and rich model for steganalysis
CN107527010A (en) A kind of method that video gene is extracted according to local feature and motion vector
Zhang et al. Global priors with anchored-stripe attention and multiscale convolution for remote sensing image compression
Li et al. Coverless Video Steganography Based on Frame Sequence Perceptual Distance Mapping.
CN109657098A (en) A kind of method for extracting video fingerprints and device
Sun et al. Task-Oriented Scene Graph-Based Semantic Communications with Adaptive Channel Coding
Tian et al. Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220809

Address after: Room 406, building 19, haichuangyuan, No. 998, Wenyi West Road, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: HANGZHOU HUICUI INTELLIGENT TECHNOLOGY CO.,LTD.

Address before: 310018 No. 2 street, Xiasha Higher Education Zone, Hangzhou, Zhejiang

Patentee before: HANGZHOU DIANZI University

TR01 Transfer of patent right