CN106778571B - Digital video feature extraction method based on deep neural network


Info

Publication number
CN106778571B
Authority
CN
China
Prior art keywords
video
training
neural network
deep neural
crbm
Prior art date
Legal status
Active
Application number
CN201611104658.2A
Other languages
Chinese (zh)
Other versions
CN106778571A (en)
Inventor
李岳楠
陈学票
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201611104658.2A
Publication of CN106778571A
Application granted
Publication of CN106778571B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses a digital video feature extraction method based on a deep neural network, which comprises the following steps: training a denoising coding network to realize dimensionality reduction of the initial descriptor of a video, and cascading a condition generation model with an encoder to form a group of basic feature extraction modules; continuing to train a plurality of groups of feature extraction modules, and stacking the obtained modules from bottom to top in training order to form a deep neural network; and training a post-processing network placed on top of the deep neural network to optimize the robustness and distinguishability of the video descriptors. The method compresses the video features into a short video descriptor through the deep neural network; the descriptor provides an abstract description of the perceptual content of the video, has good robustness and distinctiveness, and enables efficient and accurate video content identification.

Description

Digital video feature extraction method based on deep neural network
Technical Field
The invention relates to the technical field of signal and information processing, in particular to a digital video feature extraction method based on a deep neural network.
Background
Compared with image data, video data is characterized by large data volume, temporal correlation between frames, and high redundancy. Video copyright protection, video retrieval, and digital video management often require a unique and extremely compact descriptor as a content tag for a video. The simplest way to generate a video descriptor is to extract a descriptor from each representative frame independently and concatenate these descriptors to form the descriptor of the whole video.
Common methods include statistical methods [1], intensity gradient methods [2] and color correlation methods [3]. However, such methods do not characterize the temporal evolution of the visual information. In order to extract spatio-temporal features of video, document [4] uses the luminance differences of neighboring blocks in the temporal and spatial directions as video descriptors, and document [5] uses the trajectories of feature points as video descriptors. Furthermore, three-dimensional signal transforms [6], tensor decomposition [7] and optical flow [8] have all been used to construct descriptors that reflect the spatio-temporal properties of video.
In the process of implementing the invention, the inventor finds that at least the following disadvantages and shortcomings exist in the prior art:
existing feature extraction methods suffer from high redundancy and sensitivity to temporal distortions. Moreover, most rely on manual design, and hand-crafted feature extraction methods have difficulty capturing the essential spatio-temporal attributes of video information.
Disclosure of Invention
The invention provides a digital video feature extraction method based on a deep neural network. The method compresses video features into a short video descriptor through the deep neural network; the descriptor provides an abstract description of the perceptual content of the video, has good robustness and distinctiveness, and enables efficient and accurate video content identification, as described in detail below:
a digital video feature extraction method based on a deep neural network comprises the following steps:
training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules;
continuously training a plurality of groups of feature extraction modules, and stacking the obtained modules from bottom to top according to the training sequence to form a deep neural network;
and training a post-processing network, and placing the post-processing network on the top of the deep neural network to optimize the robustness and the distinguishability of the video descriptors.
Wherein the method further comprises:
preprocessing an input video, and expressing the space-time connection of video contents through a condition generation model.
The method for preprocessing the input video and expressing the spatio-temporal connection of the video content through the condition generation model comprises the following steps:
performing low-pass filtering smoothing and down-sampling on the video, compressing the size of each frame of image to meet the size requirement of an input layer of a neural network, and regularizing the down-sampled video to enable the average value of pixels of each frame to be zero and the variance to be 1;
inputting video data into a Conditional Restricted Boltzmann Machine (CRBM), setting each frame of pixels of the preprocessed video as neurons of the visible layer, and training the CRBM network.
The method comprises the following steps of training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules:
applying distortion to each training video and carrying out the preprocessing operation, using the distorted video as the input of a CRBM (Conditional Restricted Boltzmann Machine) to generate initial descriptors, selecting a plurality of groups of initial descriptors of original videos and distorted videos as training data, and training a denoising self-coding network;
the trained encoder E(·) is stacked on top of the CRBM, resulting in a first set of feature extraction modules.
The continuous training of the multiple groups of feature extraction modules is implemented by stacking the obtained modules from bottom to top according to the training sequence to form a deep neural network, and specifically comprises the following steps:
continuously training a pair of CRBM and an encoder by using the output of the characteristic extraction module as training data, and reestablishing a second group of characteristic extraction modules by using the obtained CRBM and the encoder;
training a plurality of CRBM and encoder modules in sequence, wherein the training data of each module consists of the output of the previous module;
and stacking the modules from bottom to top according to the training sequence to form a deep neural network.
Wherein, the training post-processing network is arranged at the top of the deep neural network, and the steps for optimizing the robustness and the distinguishability of the video descriptor specifically comprise:
generating descriptors for the training videos by using a deep neural network formed by K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after the training is completed, the post-processing network is placed at the top layer of a deep neural network formed by a CRBM and an encoder.
The technical scheme provided by the invention has the beneficial effects that:
1. the video features are extracted through a deep neural network to generate a video descriptor, and the CRBM (Conditional Restricted Boltzmann Machine) network can capture the essential spatio-temporal attributes of the video information;
2. the self-coding network can realize data reduction and robustness improvement of the descriptor, and the post-processing network can integrally optimize the robustness and the distinguishability of the descriptor;
3. the method learns an optimized feature extraction scheme from training data, without the need to manually design the feature extraction;
4. the method is simple, easy to implement, and of low computational complexity. Tests on a computer with a 3.2 GHz CPU and 32 GB of memory show that the method needs only 1.52 seconds on average to process a 500-frame video sequence.
Drawings
FIG. 1 is a flow chart of a method for extracting digital video features based on a deep neural network;
FIG. 2 is a schematic diagram of the Conditional Restricted Boltzmann Machine (CRBM) architecture;
fig. 3 is a schematic diagram of a deep neural network structure for video feature extraction.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
In order to realize brief and robust description of video content, an embodiment of the present invention provides a digital video feature extraction method based on a deep neural network, and referring to fig. 1, the method includes the following steps:
101: training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules;
102: continuously training a plurality of groups of feature extraction modules, and stacking the obtained modules from bottom to top according to the training sequence to form a deep neural network;
103: and training a post-processing network, and placing the post-processing network on the top of the deep neural network to optimize the robustness and the distinguishability of the video descriptors.
Wherein, before step 101, the method further comprises:
preprocessing an input video, and expressing the space-time connection of video contents through a condition generation model.
The method for preprocessing the input video and expressing the spatio-temporal connection of the video content through the condition generation model comprises the following steps:
performing low-pass filtering smoothing and down-sampling on the video, compressing the size of each frame of image to meet the size requirement of an input layer of a neural network, and regularizing the down-sampled video to enable the average value of pixels of each frame to be zero and the variance to be 1;
and inputting video data into the CRBM, setting each frame of pixel of the preprocessed video as a neuron of a visible layer, and training the CRBM network.
In step 101, training a denoising coding network to implement dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules specifically includes:
applying distortion to each training video and carrying out the preprocessing operation, using the distorted video as the input of a CRBM (Conditional Restricted Boltzmann Machine) to generate initial descriptors, selecting a plurality of groups of initial descriptors of original videos and distorted videos as training data, and training a denoising self-coding network;
the trained encoder E(·) is stacked on top of the CRBM, resulting in a first set of feature extraction modules.
Wherein, the continuous training of the multiple groups of feature extraction modules in step 102, and the bottom-up stacking of the obtained modules according to the training sequence to form the deep neural network specifically comprises:
continuously training a pair of CRBM and an encoder by using the output of the characteristic extraction module as training data, and reestablishing a second group of characteristic extraction modules by using the obtained CRBM and the encoder;
training a plurality of CRBM and encoder modules in sequence, wherein the training data of each module consists of the output of the previous module;
and stacking the modules from bottom to top according to the training sequence to form a deep neural network.
Wherein, the training post-processing network in step 103 is placed at the top of the deep neural network to optimize robustness and distinguishability of the video descriptor specifically as follows:
generating descriptors for the training videos by using a deep neural network formed by K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after the training is completed, the post-processing network is placed at the top layer of a deep neural network formed by a CRBM and an encoder.
In summary, the video features are extracted by the deep neural network into a short video descriptor; the descriptor provides an abstract description of the perceptual content of the video, has good robustness and distinctiveness, and enables efficient and accurate video content identification.
Example 2
The scheme of Example 1 is described in detail below with reference to Figs. 2 and 3 and the calculation formulas:
201: preprocessing an input video, expressing the spatio-temporal relation among video contents through a condition generation model, and generating an initial descriptor of the video;
wherein, the step 201 specifically includes:
1) in the preprocessing step, each frame of the video is input into a low-pass filter to be subjected to spatial smoothing, the smoothed video is subjected to down-sampling in time, and finally each frame of pixels is normalized to have a mean value of 0 and a variance of 1. The parameters of the low-pass filter are not particularly limited in the embodiments of the present invention.
2) A Conditional Restricted Boltzmann Machine (CRBM) [9] is used to generate an initial descriptor of the video. The CRBM can model the statistical correlation among video frames, as shown in fig. 2. Let the visible layer at the current time (i.e., the t-th frame of the video) be denoted v_t, and let the (t-m)-th frame be v_{t-m} (m ≥ 1). The hidden layer at the current time is h_t, the weight parameter between the visible layer and the hidden layer is W, the bias of the visible layer is a, the bias of the hidden layer is b, the weight parameter from the visible layer at a previous time to the current time is A_k, and the weight parameter from the visible layer at a previous time to the hidden layer at the current time is B_k.
The specific operation is as follows:
1. A video of size V1×S1×F1 (F1 frames, each frame of size V1×S1) is low-pass filtered, smoothed and down-sampled so that each frame is compressed to V2×S2, meeting the size requirement of the input layer of the neural network, and the number of frames F1 is compressed to F2 (F2 = F1/N, i.e., every N frames are replaced by their average). The down-sampled video of size V2×S2×F2 is then normalized so that the pixel mean of each frame is zero and the variance is 1. In this example, V2 = 32, S2 = 32, F2 = 4.
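As an illustration only (not part of the patent), this preprocessing step could be implemented in Python with NumPy/SciPy as in the sketch below; the Gaussian filter width `sigma` is an assumption, since the patent does not limit the low-pass filter parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def preprocess_video(video, v2=32, s2=32, f2=4, sigma=1.5):
    """Minimal preprocessing sketch: spatial low-pass filtering, spatial and
    temporal down-sampling to v2 x s2 x f2, and per-frame normalization.
    `video` is a float array of shape (F1, V1, S1)."""
    f1, v1, s1 = video.shape

    # 1. Spatial low-pass (Gaussian) smoothing of every frame (filter width is an assumption).
    smoothed = np.stack([gaussian_filter(frame, sigma=sigma) for frame in video])

    # 2. Spatial down-sampling of each frame to v2 x s2.
    small = np.stack([zoom(frame, (v2 / v1, s2 / s1), order=1) for frame in smoothed])

    # 3. Temporal down-sampling: replace every N consecutive frames by their average.
    n = f1 // f2
    blocks = small[: n * f2].reshape(f2, n, v2, s2)
    temporal = blocks.mean(axis=1)                      # shape (f2, v2, s2)

    # 4. Normalize each frame to zero mean and unit variance.
    mean = temporal.mean(axis=(1, 2), keepdims=True)
    std = temporal.std(axis=(1, 2), keepdims=True) + 1e-8
    return (temporal - mean) / std

# Example on a random stand-in video: 500 frames of 240x320 reduced to 4 frames of 32x32.
frames = preprocess_video(np.random.rand(500, 240, 320).astype(np.float32))
print(frames.shape)  # (4, 32, 32)
```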
2. The video data are input into the CRBM, with the visible layer corresponding to the t-th frame being v_t ∈ R^1024; in this embodiment, the pixels of each frame of the preprocessed video are set as the neurons of the visible layer, so the number of neurons in the visible layer is 1024.
The hidden layer for the t-th frame is h_t; this example sets the number of hidden-layer neurons to 300. In the CRBM network, the weight parameter between the visible layer and the hidden layer is W ∈ R^{1024×300}, the bias of the visible layer is a ∈ R^{1024}, the bias of the hidden layer is b ∈ R^{300}, and the inter-frame weight parameters are A_k ∈ R^{300×300} and B_k ∈ R^{300×1024}. Training of the CRBM network can be achieved by minimizing the cost function:
L_CRBM = -log P(v_t | v_{t-1}, ..., v_{t-m})        (1)
where L_CRBM is the cost function of the CRBM; P(v_t | v_{t-1}, ..., v_{t-m}) is the probability of the current frame v_t conditioned on the frames v_{t-1}, ..., v_{t-m} at times t-1, ..., t-m; and E(v_t, h_t) is the energy function:
E(v_t, h_t) = (1/2)(v_t - â_t)^T (v_t - â_t) - v_t^T W h_t - b̂_t^T h_t,  where  â_t = a + Σ_{k=1}^{m} A_k v_{t-k}  and  b̂_t = b + Σ_{k=1}^{m} B_k v_{t-k}
where k = 1, ..., m is the time-step index; m is the order of the CRBM; v_{t-k} is the vector formed by the pixel values of the (t-k)-th frame; and T denotes the transpose. The embodiment of the invention does not limit the method used to minimize formula (1) or the value of m.
In this example, the order of the CRBM is m = 3, the number of training videos is 500, and the cost function (1) is minimized with a back-propagation-based stochastic gradient descent algorithm.
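The patent only states that cost (1) is minimized with stochastic gradient descent and does not spell out the gradient computation. As an illustrative assumption, the NumPy sketch below performs one contrastive-divergence (CD-1) update for a CRBM with Gaussian visible units and dynamic biases in the style of reference [9]; the parameter shapes follow the past-visible-to-visible / past-visible-to-hidden description of A_k and B_k given above, so they are illustrative rather than the embodiment's exact dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, m = 1024, 300, 3            # embodiment: 32x32 frames, 300 hidden units, order 3

# Illustrative parameter shapes following the CRBM of reference [9] (assumptions):
W = 0.01 * rng.standard_normal((n_vis, n_hid))      # visible-hidden weights
a = np.zeros(n_vis)                                  # visible bias
b = np.zeros(n_hid)                                  # hidden bias
A = 0.01 * rng.standard_normal((m, n_vis, n_vis))    # past visible -> current visible
B = 0.01 * rng.standard_normal((m, n_vis, n_hid))    # past visible -> current hidden

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_t, v_past, lr=1e-3):
    """One contrastive-divergence (CD-1) step for a Gaussian-Bernoulli CRBM.
    v_t: current frame, shape (n_vis,); v_past: past m frames, shape (m, n_vis)."""
    # Dynamic biases conditioned on the past frames.
    a_hat = a + np.einsum('kij,ki->j', A, v_past)
    b_hat = b + np.einsum('kij,ki->j', B, v_past)

    # Positive phase: hidden activation given the data.
    h_prob = sigmoid(v_t @ W + b_hat)
    h_sample = (rng.random(n_hid) < h_prob).astype(float)

    # Negative phase: one Gibbs step (mean-field reconstruction of the visibles).
    v_recon = a_hat + h_sample @ W.T
    h_recon = sigmoid(v_recon @ W + b_hat)

    # CD-1 gradient approximation and parameter updates.
    W += lr * (np.outer(v_t, h_prob) - np.outer(v_recon, h_recon))
    da = v_t - v_recon            # approximate gradients for the dynamic biases
    db = h_prob - h_recon
    a[:] = a + lr * da
    b[:] = b + lr * db
    for k in range(m):
        A[k] += lr * np.outer(v_past[k], da)
        B[k] += lr * np.outer(v_past[k], db)

# Example: one update from random data standing in for preprocessed frames.
cd1_update(rng.standard_normal(n_vis), rng.standard_normal((m, n_vis)))
```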
202: training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules;
wherein, the step 202 specifically includes:
1) Distortion (compression, noise addition, rotation and the like) is applied to each training video, the preprocessing operation is carried out, and the distorted videos are used as the input of the CRBM to generate initial descriptors. Several groups of initial descriptors of original and distorted videos are selected as training data to train a Denoising Auto-Encoder (DAE) [10], which performs dimensionality reduction on the video descriptors generated by the aforementioned CRBM. Before training, the CRBM is used to generate descriptors of the original videos and of the distorted videos (e.g., compressed or noise-added versions of the originals). For the n-th pair of original and distorted videos, let a_n ∈ R^{300×4} denote the descriptor of the original video and ã_n the descriptor of the distorted video. The goal of training the DAE is to recover a_n from ã_n. The cost function of the denoising self-coding network is:

L_DAE = (1/N) Σ_{n=1}^{N} ||D(E(ã_n)) - a_n||² + λ_DAE Σ_l Σ_{i,j} (W_{i,j}^{(l)})²        (2)
where L_DAE is the cost function of the denoising self-coding network; λ_DAE is the weight-decay coefficient; W_{i,j}^{(l)} is the network weight connecting the i-th neuron of layer l to the j-th neuron of layer l+1; E(·) is the encoder; and D(·) is the decoder.
The optimal weights W_{i,j}^{(l)} are obtained by minimizing the cost function (2) with back-propagation-based stochastic gradient descent, which completes the training. The embodiment of the invention does not limit the minimization method or the value of λ_DAE.
In this example, the input layer and the hidden layer of the denoising self-coding network consist of 300 and 100 neurons, respectively, and λ_DAE = 10^{-5}.
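A minimal PyTorch sketch of such a denoising auto-encoder is given below; the sigmoid activations, the use of built-in L2 weight decay for the λ_DAE term, and the treatment of each 300-dimensional descriptor column as one training sample are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Sketch of the DAE: 300-dim input (one CRBM descriptor column), 100-dim code."""
    def __init__(self, n_in=300, n_hidden=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())   # E(.)
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())   # D(.)

    def forward(self, x):
        return self.decoder(self.encoder(x))

dae = DenoisingAutoEncoder()
# Built-in weight decay stands in for the lambda_DAE * sum(W^2) term of cost (2).
opt = torch.optim.SGD(dae.parameters(), lr=1e-2, weight_decay=1e-5)
mse = nn.MSELoss()

# a_clean: descriptors of original videos, a_noisy: descriptors of distorted videos.
# Random tensors stand in for real CRBM outputs here.
a_clean = torch.rand(256, 300)
a_noisy = a_clean + 0.1 * torch.randn(256, 300)

for epoch in range(100):
    opt.zero_grad()
    loss = mse(dae(a_noisy), a_clean)    # reconstruct the clean descriptor from the noisy one
    loss.backward()
    opt.step()
```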
2) The trained encoder E(·) is stacked on top of the CRBM to obtain the first feature extraction module, denoted {CRBM-E(·)}_1. This feature extraction module consists of a three-layer neural network with the structure 1024-300-100.
203: continuously training a plurality of groups of feature extraction modules, and stacking the modules obtained by training from bottom to top according to the training sequence to form a deep neural network;
wherein, the step 203 specifically comprises:
The output of the feature extraction module {CRBM-E(·)}_1 is used as training data, another pair of CRBM and encoder is trained according to the steps above, and the obtained CRBM and encoder are used to build a second feature extraction module, denoted {CRBM-E(·)}_2. The above process is repeated to train multiple CRBM and encoder modules in sequence, the training data of each module consisting of the output of the previous module. The modules are stacked from bottom to top according to the training order to form a deep neural network. The deep neural network consisting of K modules can be represented as {CRBM-E(·)}_1-{CRBM-E(·)}_2-...-{CRBM-E(·)}_K, as shown in fig. 3. The value of the number of modules K is not particularly limited in the embodiment of the present invention.
The present embodiment adopts K = 2, i.e., two sets of feature extraction modules are used for explanation. The output of the feature extraction module {CRBM-E(·)}_1 is used as training data, a pair of CRBM and denoising encoder is trained according to the steps above, and the obtained CRBM and encoder are used to build the second feature extraction module {CRBM-E(·)}_2.
In this example, the numbers of input-layer and hidden-layer neurons of the second CRBM are 100 and 80, respectively, and the numbers of input-layer and hidden-layer neurons of its denoising auto-encoder are 80 and 50, respectively, so the structure of the second module is 100-80-50. Stacking the two modules from bottom to top gives a neural network with the structure 1024-300-100-80-50.
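Purely as an illustration of the stacked structure, the sketch below chains two {CRBM-E(·)} modules (1024-300-100 and 100-80-50) in NumPy; the sigmoid activations and the omission of the CRBM's dynamic-bias terms in the forward pass are simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CRBMEncoderModule:
    """One {CRBM-E(.)} module: CRBM visible->hidden mapping followed by the DAE encoder."""
    def __init__(self, W_crbm, b_crbm, W_enc, b_enc):
        self.W_crbm, self.b_crbm = W_crbm, b_crbm   # trained CRBM parameters
        self.W_enc, self.b_enc = W_enc, b_enc       # trained encoder E(.) parameters

    def forward(self, v):
        h = sigmoid(v @ self.W_crbm + self.b_crbm)  # CRBM hidden activation (dynamic-bias terms omitted)
        return sigmoid(h @ self.W_enc + self.b_enc) # encoder output

rng = np.random.default_rng(0)
def rand(shape):  # random stand-ins for trained weights
    return 0.01 * rng.standard_normal(shape)

# Module 1: 1024 -> 300 -> 100, module 2: 100 -> 80 -> 50, as in the embodiment.
module1 = CRBMEncoderModule(rand((1024, 300)), rand(300), rand((300, 100)), rand(100))
module2 = CRBMEncoderModule(rand((100, 80)), rand(80), rand((80, 50)), rand(50))

v = rng.standard_normal(1024)                   # one preprocessed 32x32 frame
descriptor = module2.forward(module1.forward(v))
print(descriptor.shape)                         # (50,)
```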
204: and training a post-processing network, and placing the post-processing network on the top of the deep neural network to optimize the robustness and the distinguishability of the video descriptors.
Wherein, the step 204 specifically comprises:
1) Descriptors are generated for the training videos by the deep neural network formed by the K CRBM-E(·) modules. Taking the n-th pair of training data as an example, it is denoted (V_{n,1}, V_{n,2}, y_n), where V_{n,1} and V_{n,2} are descriptors of two training videos and y_n is a label (y_n = +1 indicates that the two training videos have the same visual content, y_n = -1 indicates that the two videos have different visual content).
Let φ(·) be the mapping defined by the post-processing network and L the number of layers of the post-processing network (L > 1). The cost function for training the post-processing network is as follows:
L_Post = (1/N) Σ_{n=1}^{N} y_n ||φ(V_{n,1}) - φ(V_{n,2})||² + λ_Post Σ_l Σ_{i,j} (W_{i,j}^{(l)})²        (3)
where W_{i,j}^{(l)} are the network weights and the constant λ_Post is the weight-decay coefficient; V_{n,1} is the descriptor of the first video in the n-th pair of training data, and V_{n,2} is the descriptor of the second video. The cost function (3) is minimized, and after training the post-processing network is placed on the top layer of the deep neural network formed by the CRBMs and encoders, as shown in FIG. 3. The embodiment of the invention does not limit the minimization method or the values of L and λ_Post.
Descriptors are generated for the training videos with the deep neural network formed by the 2 CRBM-E(·) modules, forming the samples for training the post-processing network.
The training set selected in this example consists of N = 4000 video pairs with the same or different visual content, where the pairs with the same visual content are generated by common distortions such as compression, noise addition and filtering.
In this example, the number of post-processing network layers is L = 2, λ_Post = 10^{-5}, and the two layers contain 40 and 30 neurons, respectively. The cost function (3) is minimized by a back-propagation algorithm; after training is completed, the post-processing network is placed on the top layer of the deep network formed by the CRBMs and encoders, giving a feature extraction network with the structure 1024-300-100-80-50-40-30.
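Since cost (3) is only reconstructed above, the PyTorch sketch below should be read as an assumption rather than the patent's exact procedure: it trains a two-layer 40-30 post-processing network on descriptor pairs with a signed pairwise distance loss plus L2 weight decay (λ_Post = 10^{-5}).

```python
import torch
import torch.nn as nn

# Two-layer post-processing network phi(.): 50 -> 40 -> 30, as in the embodiment.
phi = nn.Sequential(nn.Linear(50, 40), nn.Sigmoid(), nn.Linear(40, 30), nn.Sigmoid())
opt = torch.optim.SGD(phi.parameters(), lr=1e-2, weight_decay=1e-5)  # weight decay ~ lambda_Post term

def pair_loss(v1, v2, y):
    """Signed pairwise loss (assumed form of cost (3)): pull same-content pairs
    (y = +1) together and push different-content pairs (y = -1) apart."""
    d = ((phi(v1) - phi(v2)) ** 2).sum(dim=1)
    return (y * d).mean()

# Random stand-ins for the 4000 descriptor pairs produced by the 1024-...-50 network.
N = 4000
V1, V2 = torch.rand(N, 50), torch.rand(N, 50)
y = (torch.rand(N) < 0.5).float() * 2 - 1      # labels +1 / -1

for epoch in range(50):
    opt.zero_grad()
    loss = pair_loss(V1, V2, y)
    loss.backward()
    opt.step()
```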
In summary, the video features are extracted by the deep neural network into a short video descriptor; the descriptor provides an abstract description of the perceptual content of the video, has good robustness and distinctiveness, and enables efficient and accurate video content identification.
Example 3
The feasibility of the schemes of Examples 1 and 2 is verified with the following experimental data:
selecting 600 videos as test videos, and applying the following distortions to each video respectively:
1) XviD lossy compression: the resolution of the original video is reduced to 320 × 240, the frame rate to 25 fps, and the bit rate to 256 kbps;
2) median filtering, filter size from 10 pixels to 20 pixels;
3) additive Gaussian noise, with variance 0.1, 0.5 or 1;
4) rotation, rotation angle: 2, 5, 10 degrees;
5) histogram equalization, number of gray levels: 16, 32 or 64;
6) frame loss, frame loss percentage 25%;
7) scaling, scaling factors: 0.2, 4.
Processing each video with distortions 1) to 7) in sequence generates 9600 distorted video segments in total.
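For the distortions that do not require a video codec, a rough NumPy/SciPy sketch of noise addition, rotation, frame dropping and scaling is shown below; the parameters are illustrative, and compression and median filtering would normally be applied with external tools.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

rng = np.random.default_rng(0)

def add_gaussian_noise(video, var=0.5):
    return video + rng.normal(0.0, np.sqrt(var), size=video.shape)

def rotate_frames(video, angle=5):
    return np.stack([rotate(f, angle, reshape=False, order=1) for f in video])

def drop_frames(video, fraction=0.25):
    keep = rng.random(len(video)) >= fraction
    return video[keep]

def rescale_frames(video, factor=0.2):
    return np.stack([zoom(f, factor, order=1) for f in video])

# Example on a random stand-in video of 100 frames of 240x320 pixels.
video = rng.random((100, 240, 320))
distorted = drop_frames(rotate_frames(add_gaussian_noise(video, var=0.1), angle=2))
print(distorted.shape)
```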
Feature descriptors were generated for each distorted video and each original video using the deep neural network trained in Example 2. Each video is selected in turn as the query video, a content identification experiment is carried out on the test library, and the precision ratio P, the recall ratio R, and the F1 index are computed. The F1 index is calculated as follows:
F1=2/(1/P+1/R)
The test results show that the F1 index is 0.980, which is close to the ideal value of 1. The established deep network can learn video features with good robustness and distinctiveness, can reflect the essential visual attributes of the video, and achieves high identification accuracy in the content identification experiment.
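A small helper (illustrative only, with made-up counts in the example call) for computing the precision ratio P, recall ratio R and F1 index from raw identification counts:

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision P, recall R, and F1 = 2 / (1/P + 1/R) from raw counts."""
    p = true_pos / (true_pos + false_pos)
    r = true_pos / (true_pos + false_neg)
    f1 = 2.0 / (1.0 / p + 1.0 / r)
    return p, r, f1

# Illustrative counts (not the patent's data): P = R = 0.98 gives F1 = 0.98.
print(precision_recall_f1(9408, 192, 192))
```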
Reference to the literature
[1] C. D. Roover, C. D. Vleeschouwer, F. Lefèbvre, and B. Macq, "Robust video hashing based on radial projections of key frames," IEEE Trans. Signal Process., vol. 53, no. 10, pp. 4020-4037, Oct. 2005.
[2] S. Lee and C. D. Yoo, "Robust video fingerprinting for content-based video identification," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 7, pp. 983-988, Jul. 2008.
[3] Y. Lei, W. Luo, Y. Wang, and J. Huang, "Video sequence matching based on the invariance of color correlation," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 9, pp. 1332-1343, Sept. 2012.
[4] J. C. Oostveen, T. Kalker, and J. Haitsma, "Visual hashing of digital video: applications and techniques," in Proc. SPIE Applications of Digital Image Processing XXIV, July 2001, vol. 4472, pp. 121-131.
[5] S. Satoh, M. Takimoto, and J. Adachi, "Scene duplicate detection from videos based on trajectories of feature points," in Proc. Int. Workshop on Multimedia Information Retrieval, 2007, pp. 237-244.
[6] B. Coskun, B. Sankur, and N. Memon, "Spatio-temporal transform based video hashing," IEEE Trans. Multimedia, vol. 8, no. 6, pp. 1190-1208, Dec. 2006.
[7] M. Li and V. Monga, "Robust video hashing via multilinear subspace projections," IEEE Trans. Image Process., vol. 21, no. 10, pp. 4397-4409, Oct. 2012.
[8] M. Li and V. Monga, "Twofold video hashing with automatic synchronization," IEEE Trans. Inf. Forens. Sec., vol. 10, no. 8, pp. 1727-1738, Aug. 2015.
[9] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Modeling human motion using binary latent variables," in Proc. Advances in Neural Information Processing Systems, 2007, vol. 19.
[10] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371-3408, Dec. 2010.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (1)

1. A digital video feature extraction method based on a deep neural network is characterized by comprising the following steps:
training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules;
continuously training a plurality of groups of feature extraction modules, and stacking the obtained modules from bottom to top according to the training sequence to form a deep neural network;
training a post-processing network, and placing the post-processing network on the top of the deep neural network to optimize the robustness and the distinguishability of the video descriptor;
wherein the method further comprises:
preprocessing an input video, and expressing the spatiotemporal relation of video contents through a condition generation model;
further, the step of preprocessing the input video and expressing the spatio-temporal relation of the video content through the condition generating model specifically comprises:
performing low-pass filtering smoothing and down-sampling on the video, compressing the size of each frame of image to meet the size requirement of an input layer of a neural network, and regularizing the down-sampled video to enable the average value of pixels of each frame to be zero and the variance to be 1;
inputting video data into a Conditional Restricted Boltzmann Machine (CRBM), setting each frame of pixels of the preprocessed video as neurons of the visible layer, and training the CRBM network;
the method comprises the following steps of training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules:
applying distortion to each training video and carrying out the preprocessing operation, using the distorted video as the input of a CRBM (Conditional Restricted Boltzmann Machine) to generate initial descriptors, selecting a plurality of groups of initial descriptors of original videos and distorted videos as training data, and training a denoising self-coding network;
stacking the trained encoder E(·) on the CRBM to obtain a first group of feature extraction modules;
further, the step of continuously training the plurality of groups of feature extraction modules, stacking the obtained modules from bottom to top according to the training sequence to form the deep neural network specifically comprises:
continuously training a pair of CRBM and an encoder by using the output of the characteristic extraction module as training data, and reestablishing a second group of characteristic extraction modules by using the obtained CRBM and the encoder;
training a plurality of CRBM and encoder modules in sequence, wherein the training data of each module consists of the output of the previous module;
stacking the modules from bottom to top according to the training sequence to form a deep neural network;
wherein, the training post-processing network is arranged at the top of the deep neural network, and the steps for optimizing the robustness and the distinguishability of the video descriptor specifically comprise:
generating descriptors for the training videos by using a deep neural network formed by K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after the training is completed, the post-processing network is placed at the top layer of the deep neural network formed by the CRBM and the encoder.
CN201611104658.2A 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network Active CN106778571B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611104658.2A CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611104658.2A CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Publications (2)

Publication Number Publication Date
CN106778571A CN106778571A (en) 2017-05-31
CN106778571B (en) 2020-03-27

Family

ID=58878783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611104658.2A Active CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Country Status (1)

Country Link
CN (1) CN106778571B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563391B (en) * 2017-09-06 2020-12-15 天津大学 Digital image feature extraction method based on expert model
CN108021927A (en) * 2017-11-07 2018-05-11 天津大学 A kind of method for extracting video fingerprints based on slow change visual signature
CN108874665A (en) * 2018-05-29 2018-11-23 百度在线网络技术(北京)有限公司 A kind of test result method of calibration, device, equipment and medium
CN108900888A (en) * 2018-06-15 2018-11-27 优酷网络技术(北京)有限公司 Control method for playing back and device
CN109857906B (en) * 2019-01-10 2023-04-07 天津大学 Multi-video abstraction method based on query unsupervised deep learning
CN111291634B (en) * 2020-01-17 2023-07-18 西北工业大学 Unmanned aerial vehicle image target detection method based on convolution-limited Boltzmann machine
CN111488932B (en) * 2020-04-10 2021-03-16 中国科学院大学 Self-supervision video time-space characterization learning method based on frame rate perception


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521671A (en) * 2011-11-29 2012-06-27 华北电力大学 Ultrashort-term wind power prediction method
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning
CN104268594A (en) * 2014-09-24 2015-01-07 中安消技术有限公司 Method and device for detecting video abnormal events
CN104900063A (en) * 2015-06-19 2015-09-09 中国科学院自动化研究所 Short distance driving time prediction method
CN105163121A (en) * 2015-08-24 2015-12-16 西安电子科技大学 Large-compression-ratio satellite remote sensing image compression method based on deep self-encoding network
CN106096568A (en) * 2016-06-21 2016-11-09 同济大学 A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Noah J. Apthorpe et al., "Automatic Neuron Detection in Calcium Imaging Data Using Convolutional Networks," arXiv:1606.07372v1, 23 Jun. 2016, pp. 1-9. *
Adam Paszke et al., "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation," arXiv:1606.02147v1, 7 Jun. 2016, pp. 1-10. *
Pascal Vincent et al., "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion," Journal of Machine Learning Research, vol. 11, 10 Dec. 2010, pp. 3371-3408. *

Also Published As

Publication number Publication date
CN106778571A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106778571B (en) Digital video feature extraction method based on deep neural network
Du et al. Perceptual hashing for image authentication: A survey
Kumar et al. Object-based image retrieval using the u-net-based neural network
CN111325169B (en) Deep video fingerprint algorithm based on capsule network
Naikal et al. Towards an efficient distributed object recognition system in wireless smart camera networks
CN113221641A (en) Video pedestrian re-identification method based on generation of confrontation network and attention mechanism
Ayoobkhan et al. Prediction-based Lossless Image Compression
Wang et al. Semantic perceptual image compression with a laplacian pyramid of convolutional networks
Akbari et al. Joint sparse learning with nonlocal and local image priors for image error concealment
Shen et al. Codedvision: Towards joint image understanding and compression via end-to-end learning
Jindal et al. Applicability of fractional transforms in image processing-review, technical challenges and future trends
CN109615576B (en) Single-frame image super-resolution reconstruction method based on cascade regression basis learning
CN108021927A (en) A kind of method for extracting video fingerprints based on slow change visual signature
Mei et al. Learn a compression for objection detection-vae with a bridge
Zhang et al. A parallel and serial denoising network
Zeng et al. U-net-based multispectral image generation from an rgb image
Li et al. Robust content fingerprinting algorithm based on invariant and hierarchical generative model
CN106570509B (en) A kind of dictionary learning and coding method for extracting digital picture feature
CN107563391B (en) Digital image feature extraction method based on expert model
Gan et al. A two-branch convolution residual network for image compressive sensing
Han et al. Unsupervised hierarchical convolutional sparse auto-encoder for high spatial resolution imagery scene classification
Patil et al. Image hashing by SDQ-CSLBP
Galteri et al. Reading text in the wild from compressed images
Gu et al. A Two-Stream Network with Image-to-Class Deep Metric for Few-Shot Classification
Gan et al. Video Surveillance Object Forgery Detection using PDCL Network with Residual-based Steganalysis Feature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant