CN106778571A - Digital video feature extraction method based on deep neural network - Google Patents

Digital video feature extraction method based on deep neural network - Download PDF

Info

Publication number
CN106778571A
CN106778571A (Application CN201611104658.2A)
Authority
CN
China
Prior art keywords
video
training
neural network
deep neural
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611104658.2A
Other languages
Chinese (zh)
Other versions
CN106778571B (en)
Inventor
李岳楠
陈学票
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201611104658.2A priority Critical patent/CN106778571B/en
Publication of CN106778571A publication Critical patent/CN106778571A/en
Application granted granted Critical
Publication of CN106778571B publication Critical patent/CN106778571B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06V  IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00  Scenes; Scene-specific elements
    • G06V 20/40  Scenes; Scene-specific elements in video content
    • G06V 20/46  Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00  Computing arrangements based on biological models
    • G06N 3/02  Neural networks
    • G06N 3/08  Learning methods

Abstract

The invention discloses a digital video feature extraction method based on a deep neural network. The method comprises the following steps: training a denoising autoencoder network to reduce the dimensionality of the initial video descriptors, and cascading a conditional generative model with the encoder to form one basic feature extraction module; successively training several such feature extraction modules and stacking them bottom-up, in the order in which they were trained, to form a deep neural network; and training a post-processing network placed on top of the deep neural network to optimize the robustness and discriminability of the video descriptor. The method maps video features to a compact video descriptor through the deep neural network. The descriptor provides a concise description of the perceptual content of the video while offering good robustness and discriminability, enabling efficient and accurate video content recognition.

Description

Digital video feature extraction method based on deep neural network
Technical field
The present invention relates to the technical field of signal and information processing, and in particular to a digital video feature extraction method based on a deep neural network.
Background technology
Compared with image data, video data is characterized by a large data volume, temporal correlation between frames, and considerable redundancy. Video copyright protection, video retrieval and video data management usually require a unique and extremely compact descriptor to serve as a label of the video content. The most straightforward way to generate a video descriptor is to extract a descriptor independently from each representative frame and concatenate these descriptors to form the descriptor of the whole video.
Common methods include statistical methods [1], luminance-gradient methods [2] and color-correlation methods [3]. However, such methods cannot capture the temporal characteristics of visual information. To extract spatio-temporal video features, document [4] uses the luminance differences of adjacent blocks along the temporal and spatial directions as the video descriptor, and document [5] uses the trajectories of feature points as the video descriptor. In addition, three-dimensional signal transforms [6], tensor decomposition [7] and optical flow [8] have also been used to construct descriptors that reflect the spatio-temporal properties of video.
In the course of implementing the present invention, the inventors found that the prior art has at least the following drawbacks:
Existing feature extraction methods suffer from relatively high redundancy and sensitivity to temporal distortions. Moreover, they mostly rely on hand-crafted designs, and hand-crafted feature extraction methods can hardly capture the essential attributes of video information along the spatio-temporal directions.
Summary of the invention
The present invention provides a digital video feature extraction method based on a deep neural network. The method maps video features to a compact video descriptor through a deep neural network; the descriptor provides a concise description of the perceptual content of the video while offering good robustness and discriminability, enabling efficient and accurate video content recognition, as described in detail below:
A digital video feature extraction method based on a deep neural network, the method comprising the following steps:
training a denoising autoencoder network to reduce the dimensionality of the initial video descriptors, and cascading a conditional generative model with the encoder to form one basic feature extraction module;
successively training several feature extraction modules, and stacking the resulting modules bottom-up, in the order in which they were trained, to form a deep neural network;
training a post-processing network and placing it on top of the deep neural network to optimize the robustness and discriminability of the video descriptor.
The method further comprises:
pre-processing the input video and expressing the spatio-temporal relationships of the video content by means of a conditional generative model.
The step of pre-processing the input video and expressing the spatio-temporal relationships of the video content by means of a conditional generative model is specifically:
low-pass filtering the video for smoothing and down-sampling it, compressing each frame to the size required by the input layer of the neural network, and regularizing the down-sampled video so that the pixel mean of each frame is zero and the variance is 1;
feeding the video data into a conditional restricted Boltzmann machine (Conditional Restricted Boltzmann Machine, CRBM), setting the pixels of each frame of the pre-processed video as the neurons of the visible layer, and training the CRBM network.
The step of training a denoising autoencoder network to reduce the dimensionality of the initial video descriptors and cascading the conditional generative model with the encoder to form one basic feature extraction module is specifically:
applying distortions to each training video and performing the pre-processing operation, using the distorted videos as input to the CRBM to generate initial descriptors, selecting multiple pairs of initial descriptors of original and distorted videos as training data, and training a denoising autoencoder network;
stacking the encoder E(·) obtained by training on top of the CRBM to obtain the first feature extraction module.
The step of successively training several feature extraction modules and stacking the resulting modules bottom-up, in the order in which they were trained, to form a deep neural network is specifically:
using the output of the above feature extraction module as training data, training another pair of CRBM and encoder, and building the second feature extraction module from the resulting CRBM and encoder;
training further CRBM and encoder modules in turn, the training data of each module consisting of the output of the previous module;
stacking the modules bottom-up in the order in which they were trained to form the deep neural network.
The step of training a post-processing network, placing it on top of the deep neural network and using it to optimize the robustness and discriminability of the video descriptor is specifically:
generating descriptors for the training videos with the deep neural network composed of K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after training is complete, placing the post-processing network on top of the deep neural network composed of CRBMs and encoders.
The beneficial effects of the technical solution provided by the present invention are:
1. The present invention extracts video features through a deep neural network to generate the video descriptor; the CRBM (Conditional Restricted Boltzmann Machine) network can capture the essential spatio-temporal attributes of video information;
2. The autoencoder network reduces the dimensionality of the descriptor and improves its robustness, and the post-processing network globally optimizes the robustness and discriminability of the descriptor;
3. The present invention learns the optimal feature extraction scheme by training the model, without relying on hand-crafted feature extraction;
4. The procedure of the present invention is simple, easy to implement and of low computational complexity. Test results on a computer with a 3.2 GHz CPU and 32 GB of memory show that the method of the present invention takes on average only 1.52 seconds to process a 500-frame video sequence.
Brief description of the drawings
Fig. 1 is a flowchart of the digital video feature extraction method based on a deep neural network;
Fig. 2 is a schematic diagram of the structure of the conditional restricted Boltzmann machine;
Fig. 3 is a schematic diagram of the structure of the deep neural network for video feature extraction.
Specific embodiment
To make the objects, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.
Embodiment 1
In order to achieve a concise and robust description of video content, an embodiment of the present invention proposes a digital video feature extraction method based on a deep neural network. Referring to Fig. 1, the method comprises the following steps:
101: training a denoising autoencoder network to reduce the dimensionality of the initial video descriptors, and cascading a conditional generative model with the encoder to form one basic feature extraction module;
102: successively training several feature extraction modules, and stacking the resulting modules bottom-up, in the order in which they were trained, to form a deep neural network;
103: training a post-processing network and placing it on top of the deep neural network to optimize the robustness and discriminability of the video descriptor.
Before step 101, the method further comprises:
pre-processing the input video and expressing the spatio-temporal relationships of the video content by means of a conditional generative model.
The above step of pre-processing the input video and expressing the spatio-temporal relationships of the video content by means of a conditional generative model is specifically:
low-pass filtering the video for smoothing and down-sampling it, compressing each frame to the size required by the input layer of the neural network, and regularizing the down-sampled video so that the pixel mean of each frame is zero and the variance is 1;
feeding the video data into the CRBM, setting the pixels of each frame of the pre-processed video as the neurons of the visible layer, and training the CRBM network.
The step of training a denoising autoencoder network in step 101 to reduce the dimensionality of the initial video descriptors and cascading the conditional generative model with the encoder to form one basic feature extraction module is specifically:
applying distortions to each training video and performing the pre-processing operation, using the distorted videos as input to the CRBM to generate initial descriptors, selecting multiple pairs of initial descriptors of original and distorted videos as training data, and training a denoising autoencoder network;
stacking the encoder E(·) obtained by training on top of the CRBM to obtain the first feature extraction module.
The step of successively training several feature extraction modules in step 102 and stacking the resulting modules bottom-up, in the order in which they were trained, to form a deep neural network is specifically:
using the output of the above feature extraction module as training data, training another pair of CRBM and encoder, and building the second feature extraction module from the resulting CRBM and encoder;
training further CRBM and encoder modules in turn, the training data of each module consisting of the output of the previous module;
stacking the modules bottom-up in the order in which they were trained to form the deep neural network.
The step of training a post-processing network in step 103, placing it on top of the deep neural network and using it to optimize the robustness and discriminability of the video descriptor is specifically:
generating descriptors for the training videos with the deep neural network composed of K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after training is complete, placing the post-processing network on top of the deep neural network composed of CRBMs and encoders.
In summary, video features are mapped by the deep neural network to a compact video descriptor. The descriptor provides a concise description of the perceptual content of the video while offering good robustness and discriminability, enabling efficient and accurate video content recognition.
Embodiment 2
The scheme of embodiment 1 is described in detail below with reference to Figs. 2 and 3 and the calculation formulas; see the following description:
201: pre-processing the input video, expressing the spatio-temporal relationships between the video content by means of a conditional generative model, and generating the initial descriptor of the video;
Step 201 is specifically:
1) In the pre-processing stage, each frame of the input video is first smoothed spatially with a low-pass filter, the smoothed video is then down-sampled in time, and finally each frame is normalized to zero pixel mean and unit variance. The embodiment of the present invention places no particular restriction on the low-pass filter parameters.
2) The initial descriptor of the video is generated with a conditional restricted Boltzmann machine (Conditional Restricted Boltzmann Machine, CRBM) [9]. The CRBM can model the statistical correlations between video frames; its structure is shown in Fig. 2. Let the visible layer at the current time (i.e., frame t of the video) be denoted v_t, and the frame at time t-m be v_{t-m} (m ≥ 1). The hidden layer at the current time is h_t, the weight matrix between the visible layer and the hidden layer is W, the bias of the visible layer is a, the bias of the hidden layer is b, the weights from the visible layer at a previous time to the visible layer at the current time are A_k, and the weights from the visible layer at a previous time to the hidden layer at the current time are B_k.
Concrete operations are as follows:
1. A video of size V_1 × S_1 × F_1 (F_1 frames, each frame of size V_1 × S_1) is low-pass filtered for smoothing and down-sampled: each frame is compressed to size V_2 × S_2 to meet the size requirement of the neural network input layer, and the number of frames F_1 is compressed to F_2 (F_2 = F_1/N, replacing every N frames by their average). The down-sampled video of size V_2 × S_2 × F_2 is then regularized so that the pixel mean of each frame is zero and the variance is 1. In this example V_2 = 32, S_2 = 32, F_2 = 4; an illustrative sketch of this pre-processing is given after these operations.
2. The video data is fed into the CRBM; the visible layer corresponding to frame t of the CRBM is v_t ∈ R^1024, and in this embodiment the pixels of each frame of the pre-processed video are set as the neurons of the visible layer, so the number of visible-layer neurons is 1024.
The hidden layer at frame t is h_t; in this example the number of hidden-layer neurons is set to 300. In the CRBM network the visible-hidden weights are W ∈ R^(1024×300), the visible-layer bias is a ∈ R^1024, the hidden-layer bias is b ∈ R^300, the weights between visible layers at different times are A_k ∈ R^(300×300), and the weights between the visible layer and the hidden layer at different times are B_k ∈ R^(300×1024). The CRBM network can be trained by minimizing the following cost function:
L_CRBM = -Σ_t log p(v_t | v_{t-1}, ..., v_{t-m})   (1)
where L_CRBM is the cost function of the CRBM; p(v_t | v_{t-1}, ..., v_{t-m}) is the probability of the current frame v_t conditioned on the frames v_{t-1}, ..., v_{t-m} at times t-1, ..., t-m; and E(v_t, h_t) is the energy function from which this conditional probability is derived.
Here k = 1, ..., m is the time index, m is the order of the CRBM, v_{t-k} is the vector formed by the pixel values of frame t-k, and T denotes matrix transposition. The embodiment of the present invention does not limit the method used to minimize formula (1) or the value of m.
In this example the order of the CRBM is m = 3, the number of training videos is 500, and cost function (1) is minimized by the stochastic gradient descent algorithm with back-propagation.
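For illustration only, the following is a minimal NumPy/SciPy sketch of the pre-processing in operation 1 above; the Gaussian filter width and the bilinear interpolation order are assumptions not fixed by this description.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def preprocess(video, v2=32, s2=32, n_avg=4):
    """video: array of shape (F1, V1, S1); returns (F2, v2, s2) with F2 = F1 // n_avg."""
    f1, v1, s1 = video.shape
    frames = []
    for frame in video.astype(np.float64):
        smooth = gaussian_filter(frame, sigma=1.0)                    # spatial low-pass smoothing
        frames.append(zoom(smooth, (v2 / v1, s2 / s1), order=1))      # down-sample to v2 x s2
    frames = np.asarray(frames)
    f2 = f1 // n_avg
    frames = frames[:f2 * n_avg].reshape(f2, n_avg, v2, s2).mean(axis=1)  # average every n_avg frames
    mean = frames.mean(axis=(1, 2), keepdims=True)                    # per-frame regularization:
    std = frames.std(axis=(1, 2), keepdims=True) + 1e-8               # zero mean, unit variance
    return (frames - mean) / std
```

Likewise, a single contrastive-divergence (CD-1) update for the conditional RBM of [9] might look as follows; the Gaussian-visible simplification, the learning rate, and treating A_k and B_k as visible-to-visible and visible-to-hidden matrices are assumptions made for this sketch, not details taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
NV, NH, M = 1024, 300, 3                        # visible units, hidden units, CRBM order m
W = 0.01 * rng.standard_normal((NV, NH))        # visible-hidden weights
A = 0.01 * rng.standard_normal((M, NV, NV))     # past visible -> current visible
B = 0.01 * rng.standard_normal((M, NV, NH))     # past visible -> current hidden
a, b = np.zeros(NV), np.zeros(NH)               # static biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_t, v_past, lr=1e-3):
    """v_t: current frame (NV,); v_past: previous m frames (M, NV). Returns h_t, the frame descriptor."""
    global W, a, b
    a_dyn = a + sum(A[k] @ v_past[k] for k in range(M))    # dynamic visible bias
    b_dyn = b + sum(B[k].T @ v_past[k] for k in range(M))  # dynamic hidden bias
    h_prob = sigmoid(v_t @ W + b_dyn)                      # positive phase
    h_samp = (rng.random(NH) < h_prob).astype(float)
    v_rec = a_dyn + W @ h_samp                             # mean reconstruction (Gaussian visible units)
    h_rec = sigmoid(v_rec @ W + b_dyn)                     # negative phase
    W += lr * (np.outer(v_t, h_prob) - np.outer(v_rec, h_rec))
    a += lr * (v_t - v_rec)
    b += lr * (h_prob - h_rec)                             # updates of A and B omitted for brevity
    return h_prob
```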
202: training a denoising autoencoder network to reduce the dimensionality of the initial video descriptors, and cascading the conditional generative model with the encoder to form one basic feature extraction module;
Step 202 is specifically:
1) Distortions (compression, noise addition, rotation, etc.) are applied to each training video, the pre-processing operation is performed, and the distorted videos are used as input to the CRBM to generate initial descriptors. Multiple pairs of initial descriptors of original and distorted videos are selected as training data, and a denoising autoencoder network (Denoising Autoencoder, DAE) [10] is trained; it is used to reduce the dimensionality of the video descriptors generated by the aforementioned CRBM. Before training, the descriptors of the original videos and of the distorted videos (e.g., versions of the original videos after compression, noise addition and similar processing) are generated with the CRBM. Taking the n-th pair of original and distorted videos as an example, let a_n denote the descriptor of the original video and ã_n the descriptor of the distorted video. The goal of training the DAE is to recover a_n from ã_n.
Taking the n-th pair of training data as an example, let a_n ∈ R^(300×4) denote the descriptor of the original video and ã_n the descriptor of the distorted video. The cost function of the denoising autoencoder network is:
L_DAE = Σ_n ||a_n - D(E(ã_n))||² + λ_DAE Σ_{l,i,j} (W_{i,j}^(l))²   (2)
where L_DAE is the cost function of the denoising autoencoder network; λ_DAE is the weight-decay coefficient; W_{i,j}^(l) is the network weight connecting the i-th neuron of layer l to the j-th neuron of layer l+1; E(·) is the encoder and D(·) is the decoder.
Cost function (2) is minimized by stochastic gradient descent with back-propagation to obtain the optimal weights W_{i,j}^(l), which completes the training. The embodiment of the present invention does not limit the method used to minimize formula (2) or the value of λ_DAE.
In this example the input layer and the hidden layer of the denoising autoencoder network consist of 300 and 100 neurons respectively, and λ_DAE = 10^-5.
2) The encoder E(·) obtained by training is stacked on top of the CRBM to obtain the first feature extraction module, denoted {CRBM-E(·)}_1. This feature extraction module is a three-layer neural network with the structure 1024-300-100; a minimal training sketch of the denoising autoencoder is given below.
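As a concrete illustration of step 202, the following PyTorch sketch trains a 300-100 denoising autoencoder with the reconstruction-plus-weight-decay cost of formula (2); the sigmoid activations, the plain SGD optimizer and the learning rate are assumptions of this sketch rather than details stated above.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, d_in=300, d_hid=100):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.Sigmoid())  # encoder E(.)
        self.dec = nn.Sequential(nn.Linear(d_hid, d_in), nn.Sigmoid())  # decoder D(.)

    def forward(self, x):
        return self.dec(self.enc(x))

def train_dae(clean, noisy, lam=1e-5, epochs=50, lr=1e-3):
    """clean/noisy: (n_pairs, 300) descriptors of original/distorted videos (a_n and its distorted version)."""
    model = DenoisingAE()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        recon = model(noisy)
        loss = ((recon - clean) ** 2).sum()                       # reconstruction term of (2)
        loss = loss + lam * sum((p ** 2).sum()                    # weight-decay term of (2)
                                for p in model.parameters() if p.dim() > 1)
        loss.backward()
        opt.step()
    return model.enc   # the trained encoder E(.) is stacked on top of the CRBM
```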
203: successively training several feature extraction modules, and stacking the trained modules bottom-up, in the order in which they were trained, to form a deep neural network;
Step 203 is specifically:
Using the output of the above feature extraction module {CRBM-E(·)}_1 as training data, another pair of CRBM and encoder is trained according to the above steps, and the second feature extraction module, denoted {CRBM-E(·)}_2, is built from the resulting CRBM and encoder. This process is repeated to train further CRBM and encoder modules in turn, the training data of each module consisting of the output of the previous module. The modules are stacked bottom-up in the order in which they were trained, forming the deep neural network. The deep neural network composed of K modules can be written as {CRBM-E(·)}_1-{CRBM-E(·)}_2-…-{CRBM-E(·)}_K, as shown in Fig. 3. The embodiment of the present invention places no particular restriction on the number of modules K.
The present embodiment uses K = 2, i.e., two feature extraction modules are used for illustration. Using the output of the above feature extraction module {CRBM-E(·)}_1 as training data, another pair of CRBM and denoising encoder is trained according to the above steps, and the second feature extraction module {CRBM-E(·)}_2 is built from the resulting CRBM and encoder.
In this example the numbers of input-layer and hidden-layer neurons of the second CRBM are 100 and 80 respectively, and those of the denoising autoencoder are 80 and 50 respectively, so the structure of the second module is 100-80-50. Stacking the two modules bottom-up yields a neural network with the structure 1024-300-100-80-50.
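A purely structural view of this stacked 1024-300-100-80-50 network, ignoring the autoregressive connections of the CRBMs, could be written as follows; the sigmoid activations are an assumption, and in practice the weights would be copied from the modules trained above.

```python
import torch.nn as nn

# feed-forward skeleton of {CRBM-E(.)}_1 - {CRBM-E(.)}_2 (layer sizes 1024-300-100-80-50);
# the CRBM layers are shown as plain linear layers only to illustrate the dimensions
feature_net = nn.Sequential(
    nn.Linear(1024, 300), nn.Sigmoid(),   # hidden layer of CRBM 1
    nn.Linear(300, 100), nn.Sigmoid(),    # encoder E(.) of module 1
    nn.Linear(100, 80), nn.Sigmoid(),     # hidden layer of CRBM 2
    nn.Linear(80, 50), nn.Sigmoid(),      # encoder E(.) of module 2
)
```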
204: training a post-processing network and placing it on top of the deep neural network to optimize the robustness and discriminability of the video descriptor.
Step 204 is specifically:
1) Descriptors are generated for the training videos with the deep neural network composed of K CRBM-E(·) modules. Take the n-th pair of training data (V_{n,1}, V_{n,2}, y_n) as an example, where V_{n,1} and V_{n,2} are the descriptors of two training videos and y_n is the label (y_n = +1 indicates that the two training videos have the same visual content, y_n = -1 indicates that the two videos have different visual content).
Let φ(·) be the mapping defined by the post-processing network and let L denote the number of layers of the post-processing network (L > 1). The post-processing network is trained with cost function (3), defined over the descriptor pairs and their labels,
where W_{i,j}^(l) are the network weights, the constant λ_Post is the weight-decay coefficient, V_{n,1} is the descriptor of the first video in the n-th pair of training data, and V_{n,2} is the descriptor of the second video. Cost function (3) is minimized; after training is complete, this post-processing unit is placed on top of the deep neural network composed of CRBMs and encoders, as shown in Fig. 3. The embodiment of the present invention does not limit the minimization method or the values of L and λ_Post.
The deep neural network composed of the above two CRBM-E(·) modules is used to generate descriptors for the training videos; these descriptors constitute the training samples of the post-processing network.
The training set chosen in this example consists of n = 4000 pairs of videos with identical or different visual content; the pairs with identical visual content are generated by common distortions such as compression, noise addition and filtering.
In this example the number of layers of the post-processing network is L = 2, λ_Post = 10^-5, and the numbers of neurons in the two layers are 40 and 30 respectively. Cost function (3) is minimized by the back-propagation algorithm; after training is complete, the post-processing network is placed on top of the aforementioned deep network composed of CRBMs and encoders, yielding a feature extraction network with the structure 1024-300-100-80-50-40-30.
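Since cost function (3) is not reproduced in this text, the following sketch only illustrates one plausible realization of step 204: a two-layer 50-40-30 post-processing network φ(·) trained with a contrastive pair loss plus L2 weight decay over labelled descriptor pairs; the loss form, the margin value and the optimizer are assumptions, not the patent's actual formula.

```python
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(50, 40), nn.Sigmoid(),
                    nn.Linear(40, 30), nn.Sigmoid())   # post-processing network, L = 2 layers

def post_loss(v1, v2, y, lam=1e-5, margin=1.0):
    """v1, v2: (n, 50) descriptors from the stacked network; y: (n,) labels in {+1, -1}."""
    d = (phi(v1) - phi(v2)).pow(2).sum(dim=1)           # squared distance between network outputs
    same = (y > 0).float()
    pair_term = same * d + (1.0 - same) * torch.clamp(margin - d, min=0.0)
    decay = lam * sum((p ** 2).sum() for p in phi.parameters() if p.dim() > 1)
    return pair_term.sum() + decay

opt = torch.optim.SGD(phi.parameters(), lr=1e-3)
v1, v2 = torch.randn(8, 50), torch.randn(8, 50)         # stand-in descriptor pairs
y = torch.tensor([1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
opt.zero_grad()
post_loss(v1, v2, y).backward()
opt.step()
```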
In summary, video features are mapped by the deep neural network to a compact video descriptor. The descriptor provides a concise description of the perceptual content of the video while offering good robustness and discriminability, enabling efficient and accurate video content recognition.
Embodiment 3
The feasibility of the scheme of embodiments 1 and 2 is verified below with experimental data; see the following description:
600 videos are chosen as test videos, and the following distortions are applied to each video:
1) XviD lossy compression: the resolution of the original video is reduced to 320 × 240, the frame rate is reduced to 25 fps, and the bit rate is reduced to 256 kbps;
2) median filtering, with filter sizes from 10 pixels to 20 pixels;
3) additive Gaussian noise, with variance 0.1, 0.5 or 1;
4) rotation, with rotation angles of 2, 5 and 10 degrees;
5) histogram equalization, with 16, 32 or 64 gray levels;
6) frame dropping, with a dropping percentage of 25%;
7) picture scaling, with scaling factors of 0.2 and 4.
Applying the above distortions 1) to 7) in turn generates 9600 distorted videos in total.
Feature descriptors are generated for each distorted video and each original video with the deep neural network trained in embodiment 2. Each video is chosen in turn as the query video, a content identification experiment is carried out on the test library, and the precision P, the recall R and the F_1 index are computed. The F_1 index is calculated as follows:
F_1 = 2/(1/P + 1/R)
The test results show that the F_1 index is 0.980, close to the ideal value of 1. This indicates that the constructed deep network can learn video features with good robustness and discriminability, reflect the essential perceptual properties of video, and achieve high identification accuracy in the content identification experiment.
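For reference, a small helper that evaluates the F_1 index from the counts of a content identification run, assuming the standard definitions of precision and recall:

```python
def f1_index(true_pos, false_pos, false_neg):
    p = true_pos / (true_pos + false_pos)   # precision P
    r = true_pos / (true_pos + false_neg)   # recall R
    return 2.0 / (1.0 / p + 1.0 / r)        # F1 = 2/(1/P + 1/R) = 2PR/(P + R)
```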
Bibliography
[1] C. D. Roover, C. D. Vleeschouwer, F. Lefèbvre, and B. Macq, "Robust video hashing based on radial projections of key frames," IEEE Trans. Signal Process., vol. 53, no. 10, pp. 4020-4037, Oct. 2005.
[2] S. Lee and C. D. Yoo, "Robust video fingerprinting for content-based video identification," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 7, pp. 983-988, Jul. 2008.
[3] Y. Lei, W. Luo, Y. Wang and J. Huang, "Video sequence matching based on the invariance of color correlation," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 9, pp. 1332-1343, Sept. 2012.
[4] J. C. Oostveen, T. Kalker, and J. Haitsma, "Visual hashing of digital video: applications and techniques," in Proc. SPIE Applications of Digital Image Processing XXIV, July 2001, vol. 4472, pp. 121-131.
[5] S. Satoh, M. Takimoto, and J. Adachi, "Scene duplicate detection from videos based on trajectories of feature points," in Proc. Int. Workshop on Multimedia Information Retrieval, 2007, pp. 237-244.
[6] B. Coskun, B. Sankur, and N. Memon, "Spatio-temporal transform based video hashing," IEEE Trans. Multimedia, vol. 8, no. 6, pp. 1190-1208, Dec. 2006.
[7] M. Li and V. Monga, "Robust video hashing via multilinear subspace projections," IEEE Trans. Image Process., vol. 21, no. 10, pp. 4397-4409, Oct. 2012.
[8] M. Li and V. Monga, "Twofold video hashing with automatic synchronization," IEEE Trans. Inf. Forens. Sec., vol. 10, no. 8, pp. 1727-1738, Aug. 2015.
[9] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Modeling human motion using binary latent variables," in Proc. Advances in Neural Information Processing Systems, 2007, vol. 19.
[10] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371-3408, Dec. 2010.
Those skilled in the art will appreciate that the drawings are merely schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate any order of preference.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, improvement and the like made within the spirit and principle of the present invention shall be included within the scope of protection of the present invention.

Claims (6)

1. A digital video feature extraction method based on a deep neural network, characterized in that the method comprises the following steps:
training a denoising autoencoder network to reduce the dimensionality of the initial video descriptors, and cascading a conditional generative model with the encoder to form one basic feature extraction module;
successively training several feature extraction modules, and stacking the resulting modules bottom-up, in the order in which they were trained, to form a deep neural network;
training a post-processing network and placing it on top of the deep neural network to optimize the robustness and discriminability of the video descriptor.
2. The digital video feature extraction method based on a deep neural network according to claim 1, characterized in that the method further comprises:
pre-processing the input video and expressing the spatio-temporal relationships of the video content by means of a conditional generative model.
3. The digital video feature extraction method based on a deep neural network according to claim 2, characterized in that the step of pre-processing the input video and expressing the spatio-temporal relationships of the video content by means of a conditional generative model is specifically:
low-pass filtering the video for smoothing and down-sampling it, compressing each frame to the size required by the input layer of the neural network, and regularizing the down-sampled video so that the pixel mean of each frame is zero and the variance is 1;
feeding the video data into a conditional restricted Boltzmann machine, setting the pixels of each frame of the pre-processed video as the neurons of the visible layer, and training the CRBM network.
4. The digital video feature extraction method based on a deep neural network according to claim 1, characterized in that the step of training a denoising autoencoder network to reduce the dimensionality of the initial video descriptors and cascading the conditional generative model with the encoder to form one basic feature extraction module is specifically:
applying distortions to each training video and performing the pre-processing operation, using the distorted videos as input to the CRBM to generate initial descriptors, selecting multiple pairs of initial descriptors of original and distorted videos as training data, and training a denoising autoencoder network;
stacking the encoder E(·) obtained by training on top of the CRBM to obtain the first feature extraction module.
5. The digital video feature extraction method based on a deep neural network according to claim 1, characterized in that the step of successively training several feature extraction modules and stacking the resulting modules bottom-up, in the order in which they were trained, to form a deep neural network is specifically:
using the output of the above feature extraction module as training data, training another pair of CRBM and encoder, and building the second feature extraction module from the resulting CRBM and encoder;
training further CRBM and encoder modules in turn, the training data of each module consisting of the output of the previous module;
stacking the modules bottom-up in the order in which they were trained to form the deep neural network.
6. The digital video feature extraction method based on a deep neural network according to claim 1, characterized in that the step of training a post-processing network, placing it on top of the deep neural network and using it to optimize the robustness and discriminability of the video descriptor is specifically:
generating descriptors for the training videos with the deep neural network composed of K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after training is complete, placing the post-processing network on top of the deep neural network composed of CRBMs and encoders.
CN201611104658.2A 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network Active CN106778571B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611104658.2A CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611104658.2A CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Publications (2)

Publication Number Publication Date
CN106778571A true CN106778571A (en) 2017-05-31
CN106778571B CN106778571B (en) 2020-03-27

Family

ID=58878783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611104658.2A Active CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Country Status (1)

Country Link
CN (1) CN106778571B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563391A (en) * 2017-09-06 2018-01-09 天津大学 A kind of digital picture feature extracting method based on expert model
CN108021927A (en) * 2017-11-07 2018-05-11 天津大学 A kind of method for extracting video fingerprints based on slow change visual signature
CN108874665A (en) * 2018-05-29 2018-11-23 百度在线网络技术(北京)有限公司 A kind of test result method of calibration, device, equipment and medium
CN108900888A (en) * 2018-06-15 2018-11-27 优酷网络技术(北京)有限公司 Control method for playing back and device
CN109857906A (en) * 2019-01-10 2019-06-07 天津大学 More video summarization methods of unsupervised deep learning based on inquiry
CN111291634A (en) * 2020-01-17 2020-06-16 西北工业大学 Unmanned aerial vehicle image target detection method based on convolution limited Boltzmann machine
CN111488932A (en) * 2020-04-10 2020-08-04 中国科学院大学 Self-supervision video time-space characterization learning method based on frame rate perception

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521671A (en) * 2011-11-29 2012-06-27 华北电力大学 Ultrashort-term wind power prediction method
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning
CN104268594A (en) * 2014-09-24 2015-01-07 中安消技术有限公司 Method and device for detecting video abnormal events
CN104900063A (en) * 2015-06-19 2015-09-09 中国科学院自动化研究所 Short distance driving time prediction method
CN105163121A (en) * 2015-08-24 2015-12-16 西安电子科技大学 Large-compression-ratio satellite remote sensing image compression method based on deep self-encoding network
CN106096568A (en) * 2016-06-21 2016-11-09 同济大学 A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521671A (en) * 2011-11-29 2012-06-27 华北电力大学 Ultrashort-term wind power prediction method
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning
CN104268594A (en) * 2014-09-24 2015-01-07 中安消技术有限公司 Method and device for detecting video abnormal events
CN104900063A (en) * 2015-06-19 2015-09-09 中国科学院自动化研究所 Short distance driving time prediction method
CN105163121A (en) * 2015-08-24 2015-12-16 西安电子科技大学 Large-compression-ratio satellite remote sensing image compression method based on deep self-encoding network
CN106096568A (en) * 2016-06-21 2016-11-09 同济大学 A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ADAM PASZKE 等: "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation", 《ARXIV:1606.02147V1》 *
NOAH J. APTHORPE 等: "Automatic Neuron Detection in Calcium Imaging Data Using Convolutional Networks", 《ARXIV:1606.07372V1》 *
PASCAL VINCENT 等: "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion", 《JOURNAL OF MACHINE LEARNING RESEARCH》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563391A (en) * 2017-09-06 2018-01-09 天津大学 A kind of digital picture feature extracting method based on expert model
CN107563391B (en) * 2017-09-06 2020-12-15 天津大学 Digital image feature extraction method based on expert model
CN108021927A (en) * 2017-11-07 2018-05-11 天津大学 A kind of method for extracting video fingerprints based on slow change visual signature
CN108874665A (en) * 2018-05-29 2018-11-23 百度在线网络技术(北京)有限公司 A kind of test result method of calibration, device, equipment and medium
CN108900888A (en) * 2018-06-15 2018-11-27 优酷网络技术(北京)有限公司 Control method for playing back and device
CN109857906A (en) * 2019-01-10 2019-06-07 天津大学 More video summarization methods of unsupervised deep learning based on inquiry
CN109857906B (en) * 2019-01-10 2023-04-07 天津大学 Multi-video abstraction method based on query unsupervised deep learning
CN111291634A (en) * 2020-01-17 2020-06-16 西北工业大学 Unmanned aerial vehicle image target detection method based on convolution limited Boltzmann machine
CN111488932A (en) * 2020-04-10 2020-08-04 中国科学院大学 Self-supervision video time-space characterization learning method based on frame rate perception

Also Published As

Publication number Publication date
CN106778571B (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN106778571A (en) A kind of digital video feature extracting method based on deep neural network
Deng et al. Learning to predict crisp boundaries
Kim et al. Fully deep blind image quality predictor
Chen et al. Quaternion pseudo-Zernike moments combining both of RGB information and depth information for color image splicing detection
CN106096568A (en) A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network
He et al. Frame-wise detection of relocated I-frames in double compressed H. 264 videos based on convolutional neural network
CN113592736B (en) Semi-supervised image deblurring method based on fused attention mechanism
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
Ayoobkhan et al. Prediction-based Lossless Image Compression
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN112801068B (en) Video multi-target tracking and segmenting system and method
Selvaraj et al. Digital image steganalysis: A survey on paradigm shift from machine learning to deep learning based techniques
Shao et al. Generative image inpainting via edge structure and color aware fusion
WO2020043296A1 (en) Device and method for separating a picture into foreground and background using deep learning
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
Kohli et al. CNN based localisation of forged region in object‐based forgery for HD videos
CN114387641A (en) False video detection method and system based on multi-scale convolutional network and ViT
Naeem et al. T-VLAD: Temporal vector of locally aggregated descriptor for multiview human action recognition
CN111046213B (en) Knowledge base construction method based on image recognition
Majumder et al. A tale of a deep learning approach to image forgery detection
Thakur et al. Machine learning based saliency algorithm for image forgery classification and localization
CN112418127B (en) Video sequence coding and decoding method for video pedestrian re-identification
Xia et al. Abnormal event detection method in surveillance video based on temporal CNN and sparse optical flow
CN106570509B (en) A kind of dictionary learning and coding method for extracting digital picture feature
Bashir et al. Towards deep learning-based image steganalysis: Practices and open research issues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant