CN106778571B - Digital video feature extraction method based on deep neural network


Info

Publication number
CN106778571B
Authority
CN
China
Prior art keywords
video
training
neural network
deep neural
crbm
Prior art date
Legal status
Active
Application number
CN201611104658.2A
Other languages
Chinese (zh)
Other versions
CN106778571A (en)
Inventor
李岳楠
陈学票
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201611104658.2A
Publication of CN106778571A
Application granted
Publication of CN106778571B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses a digital video feature extraction method based on a deep neural network, which comprises the following steps: training a denoising coding network to realize dimensionality reduction of the initial descriptor of a video, and cascading a condition generation model with an encoder to form a group of basic feature extraction modules; continuing to train a plurality of groups of feature extraction modules, and stacking the obtained modules from bottom to top in training order to form a deep neural network; and training a post-processing network placed on top of the deep neural network to optimize the robustness and distinguishability of the video descriptors. The method compresses the video features into a short video descriptor through the deep neural network; the descriptor provides an abstract description of the perceptual content of the video, has good robustness and distinctiveness, and enables efficient and accurate video content identification.

Description

Digital video feature extraction method based on deep neural network
Technical Field
The invention relates to the technical field of signal and information processing, in particular to a digital video feature extraction method based on a deep neural network.
Background
Compared with image data, video data is characterized by large data volume, temporal correlation between frames, and high redundancy. Video copyright protection, video retrieval, and digital video management often require a unique and extremely compact descriptor as a content tag for a video. The simplest way to generate a video descriptor is to extract a descriptor from each representative frame independently and concatenate these descriptors to form the descriptor of the whole video.
Common methods include statistical methods [1], intensity gradient methods [2] and color correlation methods [3]. However, such methods do not characterize the temporal evolution of the visual information. In order to extract spatio-temporal features of video, document [4] uses the luminance differences of neighboring blocks in the temporal and spatial directions as video descriptors, and document [5] uses the trajectories of feature points as video descriptors. Furthermore, three-dimensional signal transforms [6], tensor decomposition [7] and optical flow [8] have all been used to construct descriptors that reflect the spatio-temporal properties of video.
In the process of implementing the invention, the inventor finds that at least the following disadvantages and shortcomings exist in the prior art:
existing feature extraction methods suffer from high redundancy and sensitivity to temporal distortions. Moreover, most rely on manual design, and hand-crafted feature extraction methods have difficulty capturing the essential spatio-temporal attributes of video information.
Disclosure of Invention
The invention provides a digital video feature extraction method based on a deep neural network. The method compresses video features into a short video descriptor through the deep neural network; the descriptor provides an abstract description of the perceptual content of the video, has good robustness and distinctiveness, and enables efficient and accurate video content identification, as described in detail below:
a digital video feature extraction method based on a deep neural network comprises the following steps:
training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules;
continuously training a plurality of groups of feature extraction modules, and stacking the obtained modules from bottom to top according to the training sequence to form a deep neural network;
and training a post-processing network, and placing the post-processing network on the top of the deep neural network to optimize the robustness and the distinguishability of the video descriptors.
Wherein the method further comprises:
preprocessing an input video, and expressing the space-time connection of video contents through a condition generation model.
The method for preprocessing the input video and expressing the spatio-temporal connection of the video content through the condition generation model comprises the following steps:
performing low-pass filtering smoothing and down-sampling on the video, compressing the size of each frame of image to meet the size requirement of an input layer of a neural network, and regularizing the down-sampled video to enable the average value of pixels of each frame to be zero and the variance to be 1;
inputting video data into a Conditional Restricted Boltzmann Machine (CRBM), setting each frame of pixels of the preprocessed video as neurons of the visible layer, and training the CRBM network.
The method comprises the following steps of training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules:
applying distortion to each training video and carrying out the preprocessing operation, using the distorted video as the input of a CRBM (Conditional Restricted Boltzmann Machine) to generate initial descriptors, selecting a plurality of groups of initial descriptors of original videos and distorted videos as training data, and training a denoising self-coding network;
the trained encoder E(·) is stacked on top of the CRBM, resulting in a first set of feature extraction modules.
The continuous training of the multiple groups of feature extraction modules is implemented by stacking the obtained modules from bottom to top according to the training sequence to form a deep neural network, and specifically comprises the following steps:
continuously training a pair of CRBM and an encoder by using the output of the characteristic extraction module as training data, and reestablishing a second group of characteristic extraction modules by using the obtained CRBM and the encoder;
training a plurality of CRBM and encoder modules in sequence, wherein the training data of each module consists of the output of the previous module;
and stacking the modules from bottom to top according to the training sequence to form a deep neural network.
Wherein, the training post-processing network is arranged at the top of the deep neural network, and the steps for optimizing the robustness and the distinguishability of the video descriptor specifically comprise:
generating descriptors for the training videos by using a deep neural network formed by K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after the training is completed, the post-processing network is placed at the top layer of a deep neural network formed by a CRBM and an encoder.
The technical scheme provided by the invention has the beneficial effects that:
1. the video features are extracted through a deep neural network to generate a video descriptor, and the CRBM (Conditional Restricted Boltzmann Machine) network can capture the essential spatio-temporal attributes of the video information;
2. the self-coding network can realize data reduction and robustness improvement of the descriptor, and the post-processing network can integrally optimize the robustness and the distinguishability of the descriptor;
3. the method learns an optimized feature extraction scheme from training data, without the need to manually design the feature extraction;
4. the method is simple, easy to implement, and of low computational complexity. Tests on a computer with a 3.2 GHz CPU and 32 GB of memory show that the method needs only 1.52 seconds on average to process a 500-frame video sequence.
Drawings
FIG. 1 is a flow chart of a method for extracting digital video features based on a deep neural network;
FIG. 2 is a schematic diagram of the Conditional Restricted Boltzmann Machine (CRBM) architecture;
fig. 3 is a schematic diagram of a deep neural network structure for video feature extraction.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
In order to realize brief and robust description of video content, an embodiment of the present invention provides a digital video feature extraction method based on a deep neural network, and referring to fig. 1, the method includes the following steps:
101: training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules;
102: continuously training a plurality of groups of feature extraction modules, and stacking the obtained modules from bottom to top according to the training sequence to form a deep neural network;
103: and training a post-processing network, and placing the post-processing network on the top of the deep neural network to optimize the robustness and the distinguishability of the video descriptors.
Wherein, before step 101, the method further comprises:
preprocessing an input video, and expressing the space-time connection of video contents through a condition generation model.
The method for preprocessing the input video and expressing the spatio-temporal connection of the video content through the condition generation model comprises the following steps:
performing low-pass filtering smoothing and down-sampling on the video, compressing the size of each frame of image to meet the size requirement of an input layer of a neural network, and regularizing the down-sampled video to enable the average value of pixels of each frame to be zero and the variance to be 1;
and inputting video data into the CRBM, setting each frame of pixel of the preprocessed video as a neuron of a visible layer, and training the CRBM network.
In step 101, training a denoising coding network to implement dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules specifically includes:
applying distortion to each training video and carrying out the preprocessing operation, using the distorted video as the input of a CRBM (Conditional Restricted Boltzmann Machine) to generate initial descriptors, selecting a plurality of groups of initial descriptors of original videos and distorted videos as training data, and training a denoising self-coding network;
the trained encoder E(·) is stacked on top of the CRBM, resulting in a first set of feature extraction modules.
Wherein, the continuous training of the multiple groups of feature extraction modules in step 102, and the bottom-up stacking of the obtained modules according to the training sequence to form the deep neural network specifically comprises:
continuously training a pair of CRBM and an encoder by using the output of the characteristic extraction module as training data, and reestablishing a second group of characteristic extraction modules by using the obtained CRBM and the encoder;
training a plurality of CRBM and encoder modules in sequence, wherein the training data of each module consists of the output of the previous module;
and stacking the modules from bottom to top according to the training sequence to form a deep neural network.
Wherein, the training post-processing network in step 103 is placed at the top of the deep neural network to optimize robustness and distinguishability of the video descriptor specifically as follows:
generating descriptors for the training videos by using a deep neural network formed by K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after the training is completed, the post-processing network is placed at the top layer of a deep neural network formed by a CRBM and an encoder.
In summary, the video features are extracted by the deep neural network into a short video descriptor; the descriptor provides an abstract description of the perceptual content of the video, has good robustness and distinctiveness, and enables efficient and accurate video content identification.
Example 2
The scheme of Example 1 is described in detail below with reference to Figs. 2 and 3 and the calculation formulas:
201: preprocessing an input video, expressing the spatio-temporal relation among video contents through a condition generation model, and generating an initial descriptor of the video;
wherein, the step 201 specifically includes:
1) in the preprocessing step, each frame of the video is input into a low-pass filter to be subjected to spatial smoothing, the smoothed video is subjected to down-sampling in time, and finally each frame of pixels is normalized to have a mean value of 0 and a variance of 1. The parameters of the low-pass filter are not particularly limited in the embodiments of the present invention.
2) A Conditional Restricted Boltzmann Machine (CRBM) [9] is used to generate an initial descriptor of the video. The CRBM can model the statistical correlation among video frames, as shown in fig. 2. Let the visible layer at the current time (i.e., the t-th frame of the video) be denoted v_t, and let the (t-m)-th frame be v_{t-m} (m ≥ 1). The hidden layer at the current time is h_t, the weight parameter between the visible layer and the hidden layer is W, the bias of the visible layer is a, the bias of the hidden layer is b, the weight parameter from the visible layer at a previous time to the current time is A_k, and the weight parameter from the visible layer at a previous time to the hidden layer at the current time is B_k.
The specific operation is as follows:
1. A video of size V1×S1×F1 (F1 frames, each frame of size V1×S1) is low-pass filtered, smoothed and down-sampled so that each frame is compressed to V2×S2, meeting the size requirement of the input layer of the neural network, and the number of frames F1 is compressed to F2 (F2 = F1/N, i.e., every N frames are replaced by their average). The down-sampled video of size V2×S2×F2 is then normalized so that the pixel mean of each frame is zero and the variance is 1. In this example, V2 = 32, S2 = 32, F2 = 4.
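As an illustration only (not part of the patent), this preprocessing step could be implemented in Python with NumPy/SciPy as in the sketch below; the Gaussian filter width `sigma` is an assumption, since the patent does not limit the low-pass filter parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def preprocess_video(video, v2=32, s2=32, f2=4, sigma=1.5):
    """Minimal preprocessing sketch: spatial low-pass filtering, spatial and
    temporal down-sampling to v2 x s2 x f2, and per-frame normalization.
    `video` is a float array of shape (F1, V1, S1)."""
    f1, v1, s1 = video.shape

    # 1. Spatial low-pass (Gaussian) smoothing of every frame (filter width is an assumption).
    smoothed = np.stack([gaussian_filter(frame, sigma=sigma) for frame in video])

    # 2. Spatial down-sampling of each frame to v2 x s2.
    small = np.stack([zoom(frame, (v2 / v1, s2 / s1), order=1) for frame in smoothed])

    # 3. Temporal down-sampling: replace every N consecutive frames by their average.
    n = f1 // f2
    blocks = small[: n * f2].reshape(f2, n, v2, s2)
    temporal = blocks.mean(axis=1)                      # shape (f2, v2, s2)

    # 4. Normalize each frame to zero mean and unit variance.
    mean = temporal.mean(axis=(1, 2), keepdims=True)
    std = temporal.std(axis=(1, 2), keepdims=True) + 1e-8
    return (temporal - mean) / std

# Example on a random stand-in video: 500 frames of 240x320 reduced to 4 frames of 32x32.
frames = preprocess_video(np.random.rand(500, 240, 320).astype(np.float32))
print(frames.shape)  # (4, 32, 32)
```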
2. The video data are input into the CRBM, with the visible layer corresponding to the t-th frame being v_t ∈ R^1024; in this embodiment, the pixels of each frame of the preprocessed video are set as the neurons of the visible layer, so the number of neurons in the visible layer is 1024.
The hidden layer for the t-th frame is h_t; this example sets the number of hidden-layer neurons to 300. In the CRBM network, the weight parameter between the visible layer and the hidden layer is W ∈ R^{1024×300}, the bias of the visible layer is a ∈ R^{1024}, the bias of the hidden layer is b ∈ R^{300}, and the inter-frame weight parameters are A_k ∈ R^{300×300} and B_k ∈ R^{300×1024}. Training of the CRBM network can be achieved by minimizing the cost function:
L_CRBM = -log P(v_t | v_{t-1}, ..., v_{t-m})        (1)
where L_CRBM is the cost function of the CRBM; P(v_t | v_{t-1}, ..., v_{t-m}) is the probability of the current frame v_t conditioned on the frames v_{t-1}, ..., v_{t-m} at times t-1, ..., t-m; and E(v_t, h_t) is the energy function:
E(v_t, h_t) = (1/2)(v_t - â_t)^T (v_t - â_t) - v_t^T W h_t - b̂_t^T h_t,  where  â_t = a + Σ_{k=1}^{m} A_k v_{t-k}  and  b̂_t = b + Σ_{k=1}^{m} B_k v_{t-k}
where k = 1, ..., m is the time-step index; m is the order of the CRBM; v_{t-k} is the vector formed by the pixel values of the (t-k)-th frame; and T denotes the transpose. The embodiment of the invention does not limit the method used to minimize formula (1) or the value of m.
In this example, the order of the CRBM is m = 3, the number of training videos is 500, and the cost function (1) is minimized with a back-propagation-based stochastic gradient descent algorithm.
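The patent only states that cost (1) is minimized with stochastic gradient descent and does not spell out the gradient computation. As an illustrative assumption, the NumPy sketch below performs one contrastive-divergence (CD-1) update for a CRBM with Gaussian visible units and dynamic biases in the style of reference [9]; the parameter shapes follow the past-visible-to-visible / past-visible-to-hidden description of A_k and B_k given above, so they are illustrative rather than the embodiment's exact dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, m = 1024, 300, 3            # embodiment: 32x32 frames, 300 hidden units, order 3

# Illustrative parameter shapes following the CRBM of reference [9] (assumptions):
W = 0.01 * rng.standard_normal((n_vis, n_hid))      # visible-hidden weights
a = np.zeros(n_vis)                                  # visible bias
b = np.zeros(n_hid)                                  # hidden bias
A = 0.01 * rng.standard_normal((m, n_vis, n_vis))    # past visible -> current visible
B = 0.01 * rng.standard_normal((m, n_vis, n_hid))    # past visible -> current hidden

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_t, v_past, lr=1e-3):
    """One contrastive-divergence (CD-1) step for a Gaussian-Bernoulli CRBM.
    v_t: current frame, shape (n_vis,); v_past: past m frames, shape (m, n_vis)."""
    # Dynamic biases conditioned on the past frames.
    a_hat = a + np.einsum('kij,ki->j', A, v_past)
    b_hat = b + np.einsum('kij,ki->j', B, v_past)

    # Positive phase: hidden activation given the data.
    h_prob = sigmoid(v_t @ W + b_hat)
    h_sample = (rng.random(n_hid) < h_prob).astype(float)

    # Negative phase: one Gibbs step (mean-field reconstruction of the visibles).
    v_recon = a_hat + h_sample @ W.T
    h_recon = sigmoid(v_recon @ W + b_hat)

    # CD-1 gradient approximation and parameter updates.
    W += lr * (np.outer(v_t, h_prob) - np.outer(v_recon, h_recon))
    da = v_t - v_recon            # approximate gradients for the dynamic biases
    db = h_prob - h_recon
    a[:] = a + lr * da
    b[:] = b + lr * db
    for k in range(m):
        A[k] += lr * np.outer(v_past[k], da)
        B[k] += lr * np.outer(v_past[k], db)

# Example: one update from random data standing in for preprocessed frames.
cd1_update(rng.standard_normal(n_vis), rng.standard_normal((m, n_vis)))
```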
202: training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules;
wherein, the step 202 specifically includes:
1) Distortion (compression, noise addition, rotation and the like) is applied to each training video, the preprocessing operation is carried out, and the distorted videos are used as the input of the CRBM to generate initial descriptors. Several groups of initial descriptors of original and distorted videos are selected as training data to train a Denoising Auto-Encoder (DAE) [10], which performs dimensionality reduction on the video descriptors generated by the aforementioned CRBM. Before training, the CRBM is used to generate descriptors of the original videos and of the distorted videos (e.g., compressed or noise-added versions of the originals). For the n-th pair of original and distorted videos, let a_n ∈ R^{300×4} denote the descriptor of the original video and ã_n the descriptor of the distorted video. The goal of training the DAE is to recover a_n from ã_n. The cost function of the denoising self-coding network is:

L_DAE = (1/N) Σ_{n=1}^{N} ||D(E(ã_n)) - a_n||² + λ_DAE Σ_l Σ_{i,j} (W_{i,j}^{(l)})²        (2)
where L_DAE is the cost function of the denoising self-coding network; λ_DAE is the weight-decay coefficient; W_{i,j}^{(l)} is the network weight connecting the i-th neuron of layer l to the j-th neuron of layer l+1; E(·) is the encoder; and D(·) is the decoder.
The optimal weights W_{i,j}^{(l)} are obtained by minimizing the cost function (2) with back-propagation-based stochastic gradient descent, which completes the training. The embodiment of the invention does not limit the minimization method or the value of λ_DAE.
In this example, the input layer and the hidden layer of the denoising self-coding network consist of 300 and 100 neurons, respectively, and λ_DAE = 10^{-5}.
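A minimal PyTorch sketch of such a denoising auto-encoder is given below; the sigmoid activations, the use of built-in L2 weight decay for the λ_DAE term, and the treatment of each 300-dimensional descriptor column as one training sample are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Sketch of the DAE: 300-dim input (one CRBM descriptor column), 100-dim code."""
    def __init__(self, n_in=300, n_hidden=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())   # E(.)
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())   # D(.)

    def forward(self, x):
        return self.decoder(self.encoder(x))

dae = DenoisingAutoEncoder()
# Built-in weight decay stands in for the lambda_DAE * sum(W^2) term of cost (2).
opt = torch.optim.SGD(dae.parameters(), lr=1e-2, weight_decay=1e-5)
mse = nn.MSELoss()

# a_clean: descriptors of original videos, a_noisy: descriptors of distorted videos.
# Random tensors stand in for real CRBM outputs here.
a_clean = torch.rand(256, 300)
a_noisy = a_clean + 0.1 * torch.randn(256, 300)

for epoch in range(100):
    opt.zero_grad()
    loss = mse(dae(a_noisy), a_clean)    # reconstruct the clean descriptor from the noisy one
    loss.backward()
    opt.step()
```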
2) The trained encoder E(·) is stacked on top of the CRBM to obtain the first feature extraction module, denoted {CRBM-E(·)}_1. This feature extraction module consists of a three-layer neural network with the structure 1024-300-100.
203: continuously training a plurality of groups of feature extraction modules, and stacking the modules obtained by training from bottom to top according to the training sequence to form a deep neural network;
wherein, the step 203 specifically comprises:
The output of the feature extraction module {CRBM-E(·)}_1 is used as training data, another pair of CRBM and encoder is trained according to the steps above, and the obtained CRBM and encoder are used to build a second feature extraction module, denoted {CRBM-E(·)}_2. The above process is repeated to train multiple CRBM and encoder modules in sequence, the training data of each module consisting of the output of the previous module. The modules are stacked from bottom to top according to the training order to form a deep neural network. The deep neural network consisting of K modules can be represented as {CRBM-E(·)}_1-{CRBM-E(·)}_2-...-{CRBM-E(·)}_K, as shown in fig. 3. The value of the number of modules K is not particularly limited in the embodiment of the present invention.
The present embodiment adopts K = 2, i.e., two sets of feature extraction modules are used for explanation. The output of the feature extraction module {CRBM-E(·)}_1 is used as training data, a pair of CRBM and denoising encoder is trained according to the steps above, and the obtained CRBM and encoder are used to build the second feature extraction module {CRBM-E(·)}_2.
In this example, the numbers of input-layer and hidden-layer neurons of the second CRBM are 100 and 80, respectively, and the numbers of input-layer and hidden-layer neurons of its denoising auto-encoder are 80 and 50, respectively, so the structure of the second module is 100-80-50. Stacking the two modules from bottom to top gives a neural network with the structure 1024-300-100-80-50.
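Purely as an illustration of the stacked structure, the sketch below chains two {CRBM-E(·)} modules (1024-300-100 and 100-80-50) in NumPy; the sigmoid activations and the omission of the CRBM's dynamic-bias terms in the forward pass are simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CRBMEncoderModule:
    """One {CRBM-E(.)} module: CRBM visible->hidden mapping followed by the DAE encoder."""
    def __init__(self, W_crbm, b_crbm, W_enc, b_enc):
        self.W_crbm, self.b_crbm = W_crbm, b_crbm   # trained CRBM parameters
        self.W_enc, self.b_enc = W_enc, b_enc       # trained encoder E(.) parameters

    def forward(self, v):
        h = sigmoid(v @ self.W_crbm + self.b_crbm)  # CRBM hidden activation (dynamic-bias terms omitted)
        return sigmoid(h @ self.W_enc + self.b_enc) # encoder output

rng = np.random.default_rng(0)
def rand(shape):  # random stand-ins for trained weights
    return 0.01 * rng.standard_normal(shape)

# Module 1: 1024 -> 300 -> 100, module 2: 100 -> 80 -> 50, as in the embodiment.
module1 = CRBMEncoderModule(rand((1024, 300)), rand(300), rand((300, 100)), rand(100))
module2 = CRBMEncoderModule(rand((100, 80)), rand(80), rand((80, 50)), rand(50))

v = rng.standard_normal(1024)                   # one preprocessed 32x32 frame
descriptor = module2.forward(module1.forward(v))
print(descriptor.shape)                         # (50,)
```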
204: and training a post-processing network, and placing the post-processing network on the top of the deep neural network to optimize the robustness and the distinguishability of the video descriptors.
Wherein, the step 204 specifically comprises:
1) Descriptors are generated for the training videos by the deep neural network formed by the K CRBM-E(·) modules. Taking the n-th pair of training data as an example, it is denoted (V_{n,1}, V_{n,2}, y_n), where V_{n,1} and V_{n,2} are descriptors of two training videos and y_n is a label (y_n = +1 indicates that the two training videos have the same visual content, y_n = -1 indicates that the two videos have different visual content).
Let φ(·) be the mapping defined by the post-processing network and L the number of layers of the post-processing network (L > 1). The cost function for training the post-processing network is as follows:
L_Post = (1/N) Σ_{n=1}^{N} y_n ||φ(V_{n,1}) - φ(V_{n,2})||² + λ_Post Σ_l Σ_{i,j} (W_{i,j}^{(l)})²        (3)
where W_{i,j}^{(l)} are the network weights and the constant λ_Post is the weight-decay coefficient; V_{n,1} is the descriptor of the first video in the n-th pair of training data, and V_{n,2} is the descriptor of the second video. The cost function (3) is minimized, and after training the post-processing network is placed on the top layer of the deep neural network formed by the CRBMs and encoders, as shown in FIG. 3. The embodiment of the invention does not limit the minimization method or the values of L and λ_Post.
Descriptors are generated for the training videos with the deep neural network formed by the 2 CRBM-E(·) modules, forming the samples for training the post-processing network.
The training set selected in this example consists of N = 4000 video pairs with the same or different visual content, where the pairs with the same visual content are generated by common distortions such as compression, noise addition and filtering.
In this example, the number of post-processing network layers is L = 2, λ_Post = 10^{-5}, and the two layers contain 40 and 30 neurons, respectively. The cost function (3) is minimized by a back-propagation algorithm; after training is completed, the post-processing network is placed on the top layer of the deep network formed by the CRBMs and encoders, giving a feature extraction network with the structure 1024-300-100-80-50-40-30.
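Since cost (3) is only reconstructed above, the PyTorch sketch below should be read as an assumption rather than the patent's exact procedure: it trains a two-layer 40-30 post-processing network on descriptor pairs with a signed pairwise distance loss plus L2 weight decay (λ_Post = 10^{-5}).

```python
import torch
import torch.nn as nn

# Two-layer post-processing network phi(.): 50 -> 40 -> 30, as in the embodiment.
phi = nn.Sequential(nn.Linear(50, 40), nn.Sigmoid(), nn.Linear(40, 30), nn.Sigmoid())
opt = torch.optim.SGD(phi.parameters(), lr=1e-2, weight_decay=1e-5)  # weight decay ~ lambda_Post term

def pair_loss(v1, v2, y):
    """Signed pairwise loss (assumed form of cost (3)): pull same-content pairs
    (y = +1) together and push different-content pairs (y = -1) apart."""
    d = ((phi(v1) - phi(v2)) ** 2).sum(dim=1)
    return (y * d).mean()

# Random stand-ins for the 4000 descriptor pairs produced by the 1024-...-50 network.
N = 4000
V1, V2 = torch.rand(N, 50), torch.rand(N, 50)
y = (torch.rand(N) < 0.5).float() * 2 - 1      # labels +1 / -1

for epoch in range(50):
    opt.zero_grad()
    loss = pair_loss(V1, V2, y)
    loss.backward()
    opt.step()
```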
In summary, the video features are extracted by the deep neural network into a short video descriptor; the descriptor provides an abstract description of the perceptual content of the video, has good robustness and distinctiveness, and enables efficient and accurate video content identification.
Example 3
The feasibility of the schemes of Examples 1 and 2 is verified with the following experimental data:
selecting 600 videos as test videos, and applying the following distortions to each video respectively:
1) XviD lossy compression: the resolution of the original video is reduced to 320 × 240, the frame rate to 25 fps, and the bit rate to 256 kbps;
2) median filtering, filter size from 10 pixels to 20 pixels;
3) additive Gaussian noise, with variance 0.1, 0.5 or 1;
4) rotation, rotation angle: 2, 5, 10 degrees;
5) histogram equalization, number of gray levels: 16, 32 or 64;
6) frame loss, frame loss percentage 25%;
7) scaling, scaling factors: 0.2, 4.
Processing each video with distortions 1) to 7) in sequence generates 9600 distorted video segments in total.
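For the distortions that do not require a video codec, a rough NumPy/SciPy sketch of noise addition, rotation, frame dropping and scaling is shown below; the parameters are illustrative, and compression and median filtering would normally be applied with external tools.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

rng = np.random.default_rng(0)

def add_gaussian_noise(video, var=0.5):
    return video + rng.normal(0.0, np.sqrt(var), size=video.shape)

def rotate_frames(video, angle=5):
    return np.stack([rotate(f, angle, reshape=False, order=1) for f in video])

def drop_frames(video, fraction=0.25):
    keep = rng.random(len(video)) >= fraction
    return video[keep]

def rescale_frames(video, factor=0.2):
    return np.stack([zoom(f, factor, order=1) for f in video])

# Example on a random stand-in video of 100 frames of 240x320 pixels.
video = rng.random((100, 240, 320))
distorted = drop_frames(rotate_frames(add_gaussian_noise(video, var=0.1), angle=2))
print(distorted.shape)
```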
Feature descriptors were generated for each distorted video and each original video using the deep neural network trained in Example 2. Each video is selected in turn as the query video, a content identification experiment is carried out on the test library, and the precision ratio P, the recall ratio R, and the F1 index are computed. The F1 index is calculated as follows:
F1=2/(1/P+1/R)
The test results show that the F1 index is 0.980, which is close to the ideal value of 1. The established deep network can learn video features with good robustness and distinctiveness, can reflect the essential visual attributes of the video, and achieves high identification accuracy in the content identification experiment.
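A small helper (illustrative only, with made-up counts in the example call) for computing the precision ratio P, recall ratio R and F1 index from raw identification counts:

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision P, recall R, and F1 = 2 / (1/P + 1/R) from raw counts."""
    p = true_pos / (true_pos + false_pos)
    r = true_pos / (true_pos + false_neg)
    f1 = 2.0 / (1.0 / p + 1.0 / r)
    return p, r, f1

# Illustrative counts (not the patent's data): P = R = 0.98 gives F1 = 0.98.
print(precision_recall_f1(9408, 192, 192))
```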
Reference to the literature
[1] C. D. Roover, C. D. Vleeschouwer, F. Lefèbvre, and B. Macq, "Robust video hashing based on radial projections of key frames," IEEE Trans. Signal Process., vol. 53, no. 10, pp. 4020-4037, Oct. 2005.
[2] S. Lee and C. D. Yoo, "Robust video fingerprinting for content-based video identification," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 7, pp. 983-988, Jul. 2008.
[3] Y. Lei, W. Luo, Y. Wang, and J. Huang, "Video sequence matching based on the invariance of color correlation," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 9, pp. 1332-1343, Sept. 2012.
[4] J. C. Oostveen, T. Kalker, and J. Haitsma, "Visual hashing of digital video: applications and techniques," in Proc. SPIE Applications of Digital Image Processing XXIV, July 2001, vol. 4472, pp. 121-131.
[5] S. Satoh, M. Takimoto, and J. Adachi, "Scene duplicate detection from videos based on trajectories of feature points," in Proc. Int. Workshop on Multimedia Information Retrieval, 2007, pp. 237-244.
[6] B. Coskun, B. Sankur, and N. Memon, "Spatio-temporal transform based video hashing," IEEE Trans. Multimedia, vol. 8, no. 6, pp. 1190-1208, Dec. 2006.
[7] M. Li and V. Monga, "Robust video hashing via multilinear subspace projections," IEEE Trans. Image Process., vol. 21, no. 10, pp. 4397-4409, Oct. 2012.
[8] M. Li and V. Monga, "Twofold video hashing with automatic synchronization," IEEE Trans. Inf. Forens. Sec., vol. 10, no. 8, pp. 1727-1738, Aug. 2015.
[9] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Modeling human motion using binary latent variables," in Proc. Advances in Neural Information Processing Systems, 2007, vol. 19.
[10] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371-3408, Dec. 2010.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (1)

1. A digital video feature extraction method based on a deep neural network is characterized by comprising the following steps:
training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules;
continuously training a plurality of groups of feature extraction modules, and stacking the obtained modules from bottom to top according to the training sequence to form a deep neural network;
training a post-processing network, and placing the post-processing network on the top of the deep neural network to optimize the robustness and the distinguishability of the video descriptor;
wherein the method further comprises:
preprocessing an input video, and expressing the spatiotemporal relation of video contents through a condition generation model;
further, the step of preprocessing the input video and expressing the spatio-temporal relation of the video content through the condition generating model specifically comprises:
performing low-pass filtering smoothing and down-sampling on the video, compressing the size of each frame of image to meet the size requirement of an input layer of a neural network, and regularizing the down-sampled video to enable the average value of pixels of each frame to be zero and the variance to be 1;
inputting video data into a Conditional Restricted Boltzmann Machine (CRBM), setting each frame of pixels of the preprocessed video as neurons of the visible layer, and training the CRBM network;
the method comprises the following steps of training a denoising coding network to realize dimensionality reduction of an initial descriptor of a video, and cascading a condition generation model and a coder to form a group of basic feature extraction modules:
applying distortion to each training video and carrying out the preprocessing operation, using the distorted video as the input of a CRBM (Conditional Restricted Boltzmann Machine) to generate initial descriptors, selecting a plurality of groups of initial descriptors of original videos and distorted videos as training data, and training a denoising self-coding network;
stacking the trained encoder E(·) on the CRBM to obtain a first group of feature extraction modules;
further, the step of continuously training the plurality of groups of feature extraction modules, stacking the obtained modules from bottom to top according to the training sequence to form the deep neural network specifically comprises:
continuously training a pair of CRBM and an encoder by using the output of the characteristic extraction module as training data, and reestablishing a second group of characteristic extraction modules by using the obtained CRBM and the encoder;
training a plurality of CRBM and encoder modules in sequence, wherein the training data of each module consists of the output of the previous module;
stacking the modules from bottom to top according to the training sequence to form a deep neural network;
wherein, the training post-processing network is arranged at the top of the deep neural network, and the steps for optimizing the robustness and the distinguishability of the video descriptor specifically comprise:
generating descriptors for the training videos by using a deep neural network formed by K CRBM-E(·) modules, and training the post-processing network by minimizing its cost function;
after the training is completed, the post-processing network is placed at the top layer of the deep neural network formed by the CRBM and the encoder.
CN201611104658.2A 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network Active CN106778571B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611104658.2A CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611104658.2A CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Publications (2)

Publication Number Publication Date
CN106778571A CN106778571A (en) 2017-05-31
CN106778571B (en) 2020-03-27

Family

ID=58878783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611104658.2A Active CN106778571B (en) 2016-12-05 2016-12-05 Digital video feature extraction method based on deep neural network

Country Status (1)

Country Link
CN (1) CN106778571B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563391B (en) * 2017-09-06 2020-12-15 天津大学 Digital image feature extraction method based on expert model
CN108021927A (en) * 2017-11-07 2018-05-11 天津大学 A kind of method for extracting video fingerprints based on slow change visual signature
CN108874665A (en) * 2018-05-29 2018-11-23 百度在线网络技术(北京)有限公司 A kind of test result method of calibration, device, equipment and medium
CN108900888A (en) * 2018-06-15 2018-11-27 优酷网络技术(北京)有限公司 Control method for playing back and device
CN109857906B (en) * 2019-01-10 2023-04-07 天津大学 Multi-video abstraction method based on query unsupervised deep learning
CN111291634B (en) * 2020-01-17 2023-07-18 西北工业大学 Unmanned aerial vehicle image target detection method based on convolution-limited Boltzmann machine
CN111488932B (en) * 2020-04-10 2021-03-16 中国科学院大学 Self-supervision video time-space characterization learning method based on frame rate perception


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521671A (en) * 2011-11-29 2012-06-27 华北电力大学 Ultrashort-term wind power prediction method
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning
CN104268594A (en) * 2014-09-24 2015-01-07 中安消技术有限公司 Method and device for detecting video abnormal events
CN104900063A (en) * 2015-06-19 2015-09-09 中国科学院自动化研究所 Short distance driving time prediction method
CN105163121A (en) * 2015-08-24 2015-12-16 西安电子科技大学 Large-compression-ratio satellite remote sensing image compression method based on deep self-encoding network
CN106096568A (en) * 2016-06-21 2016-11-09 同济大学 A kind of pedestrian's recognition methods again based on CNN and convolution LSTM network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Noah J. Apthorpe et al., "Automatic Neuron Detection in Calcium Imaging Data Using Convolutional Networks," arXiv:1606.07372v1, 23 Jun. 2016, pp. 1-9. *
Adam Paszke et al., "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation," arXiv:1606.02147v1, 7 Jun. 2016, pp. 1-10. *
Pascal Vincent et al., "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion," Journal of Machine Learning Research, vol. 11, 10 Dec. 2010, pp. 3371-3408. *

Also Published As

Publication number Publication date
CN106778571A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106778571B (en) Digital video feature extraction method based on deep neural network
Du et al. Perceptual hashing for image authentication: A survey
Kumar et al. Object-based image retrieval using the u-net-based neural network
CN111325169B (en) Deep video fingerprint algorithm based on capsule network
Naikal et al. Towards an efficient distributed object recognition system in wireless smart camera networks
CN113221641A (en) Video pedestrian re-identification method based on generation of confrontation network and attention mechanism
Ayoobkhan et al. Prediction-based Lossless Image Compression
Wang et al. Semantic perceptual image compression with a laplacian pyramid of convolutional networks
Akbari et al. Joint sparse learning with nonlocal and local image priors for image error concealment
Shen et al. Codedvision: Towards joint image understanding and compression via end-to-end learning
Jindal et al. Applicability of fractional transforms in image processing-review, technical challenges and future trends
CN109615576B (en) Single-frame image super-resolution reconstruction method based on cascade regression basis learning
CN108021927A (en) A kind of method for extracting video fingerprints based on slow change visual signature
Mei et al. Learn a compression for objection detection-vae with a bridge
Zhang et al. A parallel and serial denoising network
Zeng et al. U-net-based multispectral image generation from an rgb image
Li et al. Robust content fingerprinting algorithm based on invariant and hierarchical generative model
CN106570509B (en) A kind of dictionary learning and coding method for extracting digital picture feature
CN107563391B (en) Digital image feature extraction method based on expert model
Gan et al. A two-branch convolution residual network for image compressive sensing
Han et al. Unsupervised hierarchical convolutional sparse auto-encoder for high spatial resolution imagery scene classification
Patil et al. Image hashing by SDQ-CSLBP
Galteri et al. Reading text in the wild from compressed images
Gu et al. A Two-Stream Network with Image-to-Class Deep Metric for Few-Shot Classification
Gan et al. Video Surveillance Object Forgery Detection using PDCL Network with Residual-based Steganalysis Feature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant