CN108021927A - A video fingerprint extraction method based on slowly varying visual features - Google Patents
A video fingerprint extraction method based on slowly varying visual features
- Publication number: CN108021927A
- Application number: CN201711087291.2A
- Authority
- CN
- China
- Prior art keywords
- video
- network
- LSTM
- spatial features
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/214 - Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F16/783 - Information retrieval of video data; retrieval using metadata automatically derived from the content
- G06F18/21 - Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
Abstract
The invention discloses a video fingerprint extraction method based on slowly varying visual features, comprising the following steps: generate a randomly distorted version of each image in a training set, train a spatial feature extraction network on the original and distorted images, and obtain the network parameters; normalize the training videos in space and time so that each has a fixed frame size and a fixed frame count; use the feature sequences produced by the spatial feature extraction network as training data to train an LSTM network; cascade the trained spatial feature extraction network with the LSTM to extract the video fingerprint. The method mimics the principles of human visual perception and extracts video fingerprints that are robust and efficient. It is applicable to fields such as video copy detection and video retrieval.
Description
Technical field
The present invention relates to the field of video copy detection, and in particular to a video fingerprint extraction method based on slowly varying visual features.
Background technology
With the growth of video sharing websites and the mobile Internet, the amount of video data on the network has increased sharply, bringing problems such as copyright infringement and the spread of illegal content. Because the data volume is huge, illegally copied videos cannot be found by manual search. To address this problem, a number of video copy detection methods have been proposed in recent years. Given a known source video, video copy detection searches massive data for copied versions. The video fingerprint algorithm is the key technology of copy detection: it summarizes the main content of a video as a brief signature, analogous to a human fingerprint. Video copy detection can then decide whether two videos are homologous by comparing their fingerprints.
The concept of the video fingerprint was proposed by Indyk et al. of Stanford University[1] in 1999. In recent years the demand for video copy detection technology has driven research on video fingerprint algorithms, and feature extraction is the key issue in their design, directly determining the robustness and discriminability of the fingerprint. By feature extraction model, video fingerprint algorithms fall into two main classes: cascaded frame-level spatial fingerprints and spatio-temporal joint fingerprints.
Methods of the first class compute a fingerprint for each video frame independently and concatenate them into the fingerprint of the whole video. The radial hashing algorithm (RASH) proposed by De Roover et al.[2] is representative of this class; RASH computes pixel-value statistics along the radial directions of each frame as the fingerprint. Exploiting the robustness of the centroid of gradient orientations, Lee et al.[3] proposed an algorithm that uses the centroid of gradient orientations (CGO) as the feature for fingerprint computation. In addition, some algorithms design video fingerprints based on regions of interest[4] and sparse features[5].
Methods of the second class compute the video fingerprint from the spatio-temporal relationships in the video data. For example, the structural graphical models (SGM) proposed by Li et al.[6] generate video fingerprints by efficient dimensionality reduction in the time domain and in space. Nie et al.[7] represented video as a tensor and proposed a video fingerprint algorithm based on tensor decomposition. Esmaeili et al.[8] proposed a video fingerprint algorithm based on the three-dimensional discrete cosine transform (3D-DCT), which uses the energy-compaction property of the DCT to condense the key visual information of a video into a brief fingerprint.
In the course of implementing the present invention, the inventors found that the prior art has at least the following shortcomings: most conventional video fingerprint algorithms rely on hand-designed feature extraction and can typically capture only one aspect of the visual features of a video. Since the information expressed by image and video data is highly complex, hand-designed models can hardly capture all of it, particularly abstract characteristics.
Summary of the invention
The present invention provides a video fingerprint extraction method based on slowly varying visual features. The invention learns video fingerprints from videos and introduces the slowness analysis principle, so that the fingerprint contains the temporally relevant information in the frame sequence while achieving good robustness, discriminability and efficiency; accordingly it can be applied to fields such as video retrieval and copy detection, as described below.
A video fingerprint extraction method based on slowly varying visual features comprises the following steps:
generating a randomly distorted version of each image in a training set, training spatial features on the original and distorted images, and obtaining the network parameters;
normalizing the training videos in space and time so that each has a fixed frame size and a fixed frame count;
using the feature sequences produced by the spatial feature extraction network as training data to train an LSTM network;
cascading the trained spatial feature extraction network with the LSTM to extract the video fingerprint.
The step of training the spatial features on the original and distorted images and obtaining the network parameters is specifically:
the spatial feature extraction network is composed of fully connected layers;
images with different content are treated as different classes, while an original image and its distorted versions are treated as the same class; the spatial feature extraction network is trained so that the within-class distance is small and the between-class distance is large.
The step of normalizing the training videos in space and time to a fixed frame size and frame count is specifically:
each frame of the video data is normalized and resized to a fixed size, and the frame count of each video is changed to a fixed number by sampling.
Further, the step of training the LSTM network with the feature sequences produced by the spatial feature extraction network as training data is specifically:
each frame of each preprocessed video is fed into the trained spatial feature extraction network, yielding as many features as there are frames;
the LSTM output is required to reconstruct the input frames; the cost function contains two terms: the reconstruction error of the LSTM, and the difference between every two adjacent hidden states of the LSTM; the LSTM network parameters are trained by optimizing this cost function.
Further, the step of cascading the trained spatial feature extraction network with the LSTM and extracting the video fingerprint is specifically:
each preprocessed input video is fed into the spatial feature extraction network to extract the features of each frame;
the resulting feature sequence is fed into the LSTM to extract the fingerprint of the video.
The beneficial effects of the technical solution provided by the invention are:
1. Learning the video fingerprint with a neural network avoids the limitations of traditional hand-designed algorithms in capturing video features;
2. Combining a deep neural network with an LSTM (long short-term memory network) to extract the spatio-temporal features of the video simultaneously ensures the robustness of the fingerprint against spatial distortions (such as filtering and added noise) and temporal distortions (such as frame loss);
3. The procedure is simple and easy to implement, and the computational complexity of fingerprint learning is low.
Brief description of the drawings
Fig. 1 is the flow chart of a video fingerprint extraction method based on slowly varying visual features;
Fig. 2 is the flow chart of extracting video fingerprints with the spatio-temporal deep neural network;
Fig. 3 is the schematic diagram of the spatial feature extraction network structure;
Fig. 4 is the structural diagram of the LSTM.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.
Embodiment 1
To achieve a robust and efficient learning method for video fingerprints, the embodiment of the present invention proposes a video fingerprint extraction method based on slowly varying visual features, described below with reference to Fig. 1 and Fig. 2:
101: Generate a randomly distorted version of each image in the training set, train the spatial features on the original and distorted images, and obtain the network parameters.
This step is specifically:
1) Apply a random distortion to each image to obtain the corresponding distorted image. The distortion types can be set according to actual needs; this embodiment does not limit them.
2) Normalize all images (both originals and distortions) to n × n with mean 0 and variance 1. Treat different images as different classes and each image and its distorted versions as the same class, and train the spatial features, obtaining the network parameters, by optimizing the following cost function:
J = E{d | l = +1} − E{d | l = −1}   (1)
where E{d | l = +1} is the expected distance d between the features the network produces for same-class images (class label l = +1), i.e. the within-class distance, and E{d | l = −1} is the expected distance d between the features for different-class images (l = −1), i.e. the between-class distance.
Through this training, the feature distance between same-class images becomes small and the feature distance between different-class images becomes large.
This embodiment does not limit the concrete structure or training method of the spatial feature extraction network.
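Cost function (1) can be sketched as follows. This is an illustrative assumption, not the patent's implementation: the feature extraction network is replaced by precomputed feature vectors, and the cost is simply the mean within-class distance minus the mean between-class distance.

```python
import numpy as np

def pairwise_cost(f_same, f_diff):
    """Contrastive cost J = E{d | l=+1} - E{d | l=-1} (Eq. 1):
    mean feature distance over same-class pairs minus mean feature
    distance over different-class pairs.  Minimising J pulls an image
    and its distorted version together and pushes different images
    apart."""
    d_same = np.linalg.norm(f_same[0] - f_same[1], axis=1)  # within-class distances
    d_diff = np.linalg.norm(f_diff[0] - f_diff[1], axis=1)  # between-class distances
    return d_same.mean() - d_diff.mean()

# toy example: an image paired with its lightly distorted version,
# versus the same image paired with an unrelated one
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
same = (a, a + 0.01 * rng.normal(size=(4, 8)))  # "original vs. distorted"
diff = (a, rng.normal(size=(4, 8)))             # unrelated images
J = pairwise_cost(same, diff)
```

A negative J here reflects the desired state after training: same-class pairs are closer in feature space than different-class pairs.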
102: Normalize the training videos in space and time so that each has a fixed frame size and a fixed frame count.
This step is specifically: resize each frame of the video data to n × n and normalize it to mean 0 and variance 1, then change the frame count of each video to a fixed number m by sampling. The values of n and m and the sampling method can be set according to the needs of the application; this embodiment does not limit them.
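The normalization of step 102 can be sketched as follows. The concrete values n = 32 and m = 10, the strided resize, and the global standardization are illustrative assumptions; the patent leaves n, m and the sampling method open.

```python
import numpy as np

def preprocess(video, n=32, m=10):
    """Normalise a video to a fixed m x n x n block: uniformly sample m
    frames in time, crudely resize each frame by strided sampling (a
    stand-in for proper interpolation), and standardise the pixels to
    zero mean and unit variance."""
    t = video.shape[0]
    idx = np.linspace(0, t - 1, m).round().astype(int)  # temporal sampling
    frames = video[idx]
    sy = max(frames.shape[1] // n, 1)
    sx = max(frames.shape[2] // n, 1)
    frames = frames[:, ::sy, ::sx][:, :n, :n]           # naive spatial resize
    frames = frames.astype(np.float64)
    return (frames - frames.mean()) / (frames.std() + 1e-8)

# a hypothetical 37-frame, 64x64 clip becomes a 10 x 32 x 32 block
clip = np.random.default_rng(1).integers(0, 256, size=(37, 64, 64))
out = preprocess(clip)
```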
103: Use the feature sequences produced by the spatial feature extraction network as training data to train the LSTM network.
This step is specifically:
1) Feed each frame of a preprocessed video into the spatial feature extraction network, obtaining for each video a sequence of m features (x_1, x_2, ..., x_m).
2) Train the LSTM on the feature sequences generated with the network from (1), optimizing a cost function of the form
J = Σ_t ||x_t − y_t||₂² + λ Σ_t ||h_t − h_{t−1}||₂²   (2)
to obtain the LSTM network parameters. Here x_t is the t-th feature in (x_1, x_2, ..., x_m); h_t is the hidden state after x_t passes through the LSTM unit; y_t is the t-th output, computed from h_t as y_t = σ(W_y h_t + b_y), where W_y and b_y are the weight matrix and bias vector of the LSTM network and σ(·) is the activation function; λ is a constant; and ||·||₂ denotes the 2-norm. The term Σ_t ||h_t − h_{t−1}||₂² measures the similarity between hidden states at adjacent time steps; minimizing it makes the LSTM learn the slowly varying visual features of the video along the time direction, which helps improve the robustness of the fingerprint.
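A minimal sketch of the cost in step 103 follows, assuming squared 2-norms for both terms (the patent's exact formula image is not reproduced in this text). `slow_feature_cost` simply evaluates the reconstruction term plus the slowness penalty on given sequences; actual training would backpropagate through an LSTM.

```python
import numpy as np

def slow_feature_cost(x, y, h, lam=0.5):
    """Cost of the form J = sum_t ||x_t - y_t||_2^2
    + lam * sum_t ||h_t - h_{t-1}||_2^2: a frame-reconstruction term
    plus a 'slowness' penalty on adjacent hidden states."""
    recon = np.sum((x - y) ** 2)          # reconstruction error of the outputs
    slow = np.sum((h[1:] - h[:-1]) ** 2)  # adjacent hidden-state differences
    return recon + lam * slow

rng = np.random.default_rng(2)
x = rng.normal(size=(10, 400))              # input feature sequence
h = rng.normal(size=(10, 128))              # hypothetical hidden states
y = x + 0.1 * rng.normal(size=(10, 400))    # imperfect reconstructions
J = slow_feature_cost(x, y, h)
```

Perfect reconstruction with constant hidden states gives zero cost, which is the state the slowness penalty pushes toward.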
104: Cascade the trained spatial feature extraction network with the LSTM and extract the video fingerprint.
This step is specifically: feed the preprocessed video into the spatial feature extraction network to extract the feature of each frame, obtaining the feature sequence (x_1, x_2, ..., x_m); then feed each feature of (x_1, x_2, ..., x_m) into the LSTM in order, and take the final hidden state h_m as the video fingerprint.
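The cascade of step 104 can be sketched as below. This is a stand-in, not the patent's trained networks: a single random projection plays the role of the spatial feature extraction network and a plain tanh recurrence plays the role of the LSTM, but the structure is the same, with per-frame features fed sequentially into a recurrent layer whose final hidden state is the fingerprint.

```python
import numpy as np

def extract_fingerprint(frames, W_feat, W_in, W_rec):
    """Map each frame to a spatial feature, run the feature sequence
    through a recurrent layer, and return the final hidden state h_m
    as the video fingerprint."""
    h = np.zeros(W_rec.shape[0])
    for frame in frames:
        x = np.tanh(W_feat @ frame.ravel())  # spatial feature of one frame
        h = np.tanh(W_in @ x + W_rec @ h)    # recurrent update
    return h                                 # fingerprint = last hidden state

rng = np.random.default_rng(3)
W_feat = rng.normal(scale=0.05, size=(400, 32 * 32))  # 1024 -> 400 features
W_in = rng.normal(scale=0.05, size=(128, 400))        # feature -> hidden
W_rec = rng.normal(scale=0.02, size=(128, 128))       # hidden -> hidden
video = rng.normal(size=(10, 32, 32))
fp = extract_fingerprint(video, W_feat, W_in, W_rec)
# a mildly distorted copy of the same video
fp2 = extract_fingerprint(video + 0.01 * rng.normal(size=video.shape),
                          W_feat, W_in, W_rec)
```

Even with random weights, a lightly distorted copy yields a fingerprint close to the original's; training sharpens this robustness.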
In conclusion, through steps 101 to 104 the embodiment of the present invention learns video fingerprints from videos. The fingerprint combines the spatial and temporal features of the video and has good robustness and discriminability, so it can be applied to fields such as video retrieval and copy detection.
Embodiment 2
The scheme of Embodiment 1 is now described in detail through a practical training process with concrete calculation formulas, introducing the video fingerprint learning method provided by the embodiment of the present invention:
201: Image preprocessing.
15000 images are randomly selected from ImageNet as training data for the spatial feature extraction network. The images are normalized to the standard size 512 × 512 and passed through a mean filter.
202: Generate randomly distorted versions.
A random distortion or transformation is applied to each image; the distortion types include JPEG lossy compression, Gaussian noise, rotation, median filtering, histogram equalization, gamma correction, added speckle noise, and circular filtering. This yields 30000 images in total.
203: Train the spatial feature extraction network.
The spatial feature extraction network used in this embodiment is a three-layer fully connected network with 1024-800-400 nodes per layer. Each image (32 × 32) is flattened into a vector of length 1024; the vector generated from an original image is denoted I_i, and the vector generated from the corresponding distorted image is denoted Ĩ_i. The features that I_i and Ĩ_i obtain through the spatial feature extraction network are f(I_i) and f(Ĩ_i) respectively. This example trains the spatial feature extraction network by gradient descent with a batch size of 1000: each iteration randomly selects 1000 original images (I_1, I_2, ..., I_1000) and the corresponding distorted images as training data and optimizes the batch form of cost function (1).
204: Normalize the training videos in space and time to a fixed frame size and frame count.
This embodiment uses 400 video sequences downloaded from YouTube as training data for the LSTM. The video data are normalized in space and time; after processing, each video is fixed to 32 × 32 × 10, and the pixels of each frame are normalized to mean 0 and variance 1.
205: Extract the feature sequences of the videos with the spatial feature extraction network.
Video feature sequences are extracted from the 400 videos processed in step 204: each frame (32 × 32) of each video is fed into the network obtained in step 203, yielding 400 feature sequences of size 400 × 10.
206: Train the LSTM network.
In this embodiment the LSTM has 400 input nodes and 128 hidden nodes. The feature sequence of each video is written (x_1, x_2, ..., x_10); feeding the t-th feature x_t into the LSTM network yields the hidden state h_t, and the output is defined as y_t = sig(W_y h_t + b_y), where W_y and b_y are the weight matrix and bias vector and sig(·) is the sigmoid function; y_t has length 400. The weight coefficient is chosen as λ = 0.5, and the network parameters of the LSTM are obtained by optimizing cost function (2). This example trains the LSTM network by stochastic gradient descent.
207: Extract the video fingerprint.
Any input video is preprocessed as in step 204 to obtain a 32 × 32 × 10 video block. These 10 frames are fed into the spatial feature extraction network trained in step 203, yielding a feature sequence of size 400 × 10. This sequence is fed into the LSTM network obtained in step 206, and the final hidden state h_10, of length 128, is taken as the video fingerprint.
In conclusion the embodiment of the present invention combines the space of video from video learning video finger print, the video finger print
Feature and temporal aspect;And introduce it is slow become analysis principle, make video finger print include video change in time it is slow important
Information, while there is good robustness and distinction, it can be applied to the fields such as video frequency searching or copy detection accordingly.
Embodiment 3
The feasibility of the scheme in Embodiments 1 and 2 is verified with concrete experimental data, described in detail below:
Building on Embodiment 2, 600 video sequences downloaded from YouTube and another 201 video sequences from TRECVID, 801 videos in total, are chosen as test videos; they are pairwise different and do not overlap with the training data. Nine common content-preserving distortions are applied to each video, with several severity levels per distortion, as shown in the following table:
Table 1: Distortion types and parameter settings (table content not reproduced in this text)
After these distortions each original video yields 17 copy versions, so the test library contains 14418 originals and copies in total. Each video is normalized to 32 × 32 × 20 by the method of step 204, split into two 32 × 32 × 10 video sequences, and two video features of length 128 are obtained by the method of step 207; concatenating the two features gives a video fingerprint of length 256. For each distorted video, the original version is queried by video fingerprint, and the correctness of the query results is recorded.
The precision P is the fraction of returned results that are true positives; the recall R is the fraction of actual positives that are returned. The F1 index is computed as
F1 = 2 / (1/P + 1/R) = 2 × P × R / (P + R)   (5)
The result is 0.995, close to the ideal value 1, showing that the learned video fingerprint is highly robust.
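Eq. (5) is the harmonic mean of precision and recall and can be checked in a few lines:

```python
def f1_score(p, r):
    """F1 index from Eq. (5): harmonic mean of precision P and recall R,
    F1 = 2 / (1/P + 1/R) = 2PR / (P + R)."""
    return 2 * p * r / (p + r)

# the reported F1 of 0.995 requires both P and R to be near 1;
# an unbalanced pair is penalised toward the smaller value
f_high = f1_score(0.995, 0.995)
f_skew = f1_score(1.0, 0.5)
```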
In conclusion, the embodiment of the present invention learns video fingerprints from videos. The fingerprint combines the spatial and temporal features of the video and captures its slowly varying visual features along time; the tests demonstrate its robustness and discriminability, so it can be applied to fields such as video retrieval and copy detection.
References
[1] Indyk P, Iyengar G, Shivakumar N. Finding pirated video sequences on the internet[R]. Technical report, Stanford University, 1999.
[2] De Roover C, De Vleeschouwer C, Lefebvre F, et al. Robust video hashing based on radial projections of key frames[J]. IEEE Transactions on Signal Processing, 2005, 53(10): 4020-4037.
[3] Lee S, Yoo C D. Robust video fingerprinting for content-based video identification[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2008, 18(7): 983-988.
[4] Li J, Guo X, Yu Y, et al. A robust and low-complexity video fingerprint for multimedia security[C]// Wireless Personal Multimedia Communications (WPMC), 2014 International Symposium on. IEEE, 2014: 97-102.
[5] Wu B, Krishnan S S, Zhang N, et al. Compact and robust video fingerprinting using sparse represented features[C]// Multimedia and Expo (ICME), 2016 IEEE International Conference on. IEEE, 2016: 1-6.
[6] Li M, Monga V. Compact video fingerprinting via structural graphical models[J]. IEEE Transactions on Information Forensics and Security, 2013, 8(11): 1709-1721.
[7] Nie X, Yin Y, Sun J, et al. Comprehensive feature-based robust video fingerprinting using tensor model[J]. IEEE Transactions on Multimedia, 2017, 19(4): 785-796.
[8] Esmaeili M M, Fatourechi M, Ward R K. A robust and fast video copy detection system using content-based fingerprinting[J]. IEEE Transactions on Information Forensics and Security, 2011, 6(1): 213-226.
Those skilled in the art will appreciate that the drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments are for description only and do not indicate their relative merit.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.
Claims (5)
1. A video fingerprint extraction method based on slowly varying visual features, characterised in that the method comprises the following steps: generating a randomly distorted version of each image in a training set, training spatial features on the original and distorted images, and obtaining the network parameters; normalizing the training videos in space and time so that each has a fixed frame size and a fixed frame count; using the feature sequences produced by the spatial feature extraction network as training data to train an LSTM network; and cascading the trained spatial feature extraction network with the LSTM to extract the video fingerprint.
2. The method according to claim 1, characterised in that the step of training the spatial features on the original and distorted images and obtaining the network parameters is specifically: the spatial feature extraction network is composed of fully connected layers; images with different content are treated as different classes while an original image and its distorted versions are treated as the same class; and the spatial feature extraction network is trained so that the within-class distance is small and the between-class distance is large.
3. The method according to claim 1, characterised in that the step of normalizing the training videos in space and time to a fixed frame size and frame count is specifically: each frame of the video data is normalized and resized to a fixed size, and the frame count of each video is changed to a fixed number by sampling.
4. The method according to claim 1, characterised in that the step of training the LSTM network with the feature sequences produced by the spatial feature extraction network as training data is specifically: each frame of each preprocessed video is fed into the trained spatial feature extraction network, yielding as many features as there are frames; the LSTM output is required to reconstruct the input frames, with a cost function containing two terms, the reconstruction error of the LSTM and the difference between every two adjacent hidden states; and the LSTM network parameters are trained by optimizing the cost function.
5. The method according to claim 1, characterised in that the step of cascading the trained spatial feature extraction network with the LSTM and extracting the video fingerprint is specifically: each preprocessed input video is fed into the spatial feature extraction network to extract the features of each frame; and the resulting feature sequence is fed into the LSTM to extract the fingerprint of the video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711087291.2A CN108021927A (en) | 2017-11-07 | 2017-11-07 | A kind of method for extracting video fingerprints based on slow change visual signature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711087291.2A CN108021927A (en) | 2017-11-07 | 2017-11-07 | A kind of method for extracting video fingerprints based on slow change visual signature |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108021927A true CN108021927A (en) | 2018-05-11 |
Family
ID=62080607
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711087291.2A Pending CN108021927A (en) | 2017-11-07 | 2017-11-07 | A kind of method for extracting video fingerprints based on slow change visual signature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108021927A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826363A (en) * | 2018-08-09 | 2020-02-21 | 阿里巴巴集团控股有限公司 | Picture fingerprint generation method and device |
CN111143619A (en) * | 2019-12-27 | 2020-05-12 | 咪咕文化科技有限公司 | Video fingerprint generation method, video fingerprint retrieval method, electronic device and medium |
CN111860053A (en) * | 2019-04-28 | 2020-10-30 | 北京灵汐科技有限公司 | Multimedia data identification method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104077590A (en) * | 2014-06-30 | 2014-10-01 | 安科智慧城市技术(中国)有限公司 | Video fingerprint extraction method and system |
CN105279489A (en) * | 2015-10-13 | 2016-01-27 | 成都纽捷那科技有限公司 | Video fingerprint extraction method based on sparse coding |
CN106778571A (en) * | 2016-12-05 | 2017-05-31 | 天津大学 | A kind of digital video feature extracting method based on deep neural network |
CN106886768A (en) * | 2017-03-02 | 2017-06-23 | 杭州当虹科技有限公司 | A kind of video fingerprinting algorithms based on deep learning |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104077590A (en) * | 2014-06-30 | 2014-10-01 | 安科智慧城市技术(中国)有限公司 | Video fingerprint extraction method and system |
CN105279489A (en) * | 2015-10-13 | 2016-01-27 | 成都纽捷那科技有限公司 | Video fingerprint extraction method based on sparse coding |
CN106778571A (en) * | 2016-12-05 | 2017-05-31 | 天津大学 | A kind of digital video feature extracting method based on deep neural network |
CN106886768A (en) * | 2017-03-02 | 2017-06-23 | 杭州当虹科技有限公司 | A kind of video fingerprinting algorithms based on deep learning |
Non-Patent Citations (1)
Title |
---|
Wang Dongdong et al., "Video fingerprint algorithm based on spatio-temporal deep neural networks", http://kns.cnki.net/kcms/detail/31.1690.TN.20170825.1539.018.html *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826363A (en) * | 2018-08-09 | 2020-02-21 | 阿里巴巴集团控股有限公司 | Picture fingerprint generation method and device |
CN111860053A (en) * | 2019-04-28 | 2020-10-30 | 北京灵汐科技有限公司 | Multimedia data identification method and device |
WO2020220926A1 (en) * | 2019-04-28 | 2020-11-05 | 北京灵汐科技有限公司 | Multimedia data identification method and device |
CN111860053B (en) * | 2019-04-28 | 2023-11-24 | 北京灵汐科技有限公司 | Multimedia data identification method and device |
CN111143619A (en) * | 2019-12-27 | 2020-05-12 | 咪咕文化科技有限公司 | Video fingerprint generation method, video fingerprint retrieval method, electronic device and medium |
CN111143619B (en) * | 2019-12-27 | 2023-08-15 | 咪咕文化科技有限公司 | Video fingerprint generation method, search method, electronic device and medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180511 |