CN111291223A - Quadruplet convolutional neural network video fingerprint algorithm - Google Patents

Quadruplet convolutional neural network video fingerprint algorithm

Info

Publication number
CN111291223A
Authority
CN
China
Prior art keywords
video
quadruple
neural network
convolutional neural
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010072025.8A
Other languages
Chinese (zh)
Other versions
CN111291223B (en)
Inventor
李新伟
郭辰
杨艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN202010072025.8A priority Critical patent/CN111291223B/en
Publication of CN111291223A publication Critical patent/CN111291223A/en
Application granted granted Critical
Publication of CN111291223B publication Critical patent/CN111291223B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a quadruplet convolutional neural network video fingerprint algorithm. The method establishes a projection excitation network, builds a quadruplet convolutional neural network video fingerprint algorithm on top of it, constructs quadruplet video sequences from selected video data, inputs them into the quadruplet convolutional neural network, and carries out training and performance testing of the network. The method maps raw video data to discrete binary codes end to end, which simplifies algorithm complexity. During training, the network parameters are optimized jointly by a quadruplet loss and a quantization error loss: on the one hand, the quadruplet loss reduces intra-class variance and increases inter-class variance; on the other hand, the quantization error loss reduces the loss of semantically similar information when real-valued features are binarized. Precision and recall in video copy detection are significantly improved, and the resulting video fingerprints remain compact while retaining strong robustness and uniqueness.

Description

Quadruplet convolutional neural network video fingerprint algorithm
Technical Field
The invention relates to the technical field of multimedia information security, and in particular to a quadruplet convolutional neural network video fingerprint algorithm.
Background
With the popularity of the Internet, computing has undergone a networking revolution, and multimedia information technologies have sprung up in its wake, making multimedia data fast and convenient to use and spread. While massive video data enriches daily life and spreads knowledge, illegal content carried within it directly harms the interests of copyright owners during transmission and seriously hinders healthy social development. To strengthen the regulation of digital media, the state has in recent years issued relevant laws and regulations to effectively protect video copyright and monitor video content. In addition, the introduction and development of video copy detection technology likewise helps manage and limit the distribution of illegal videos.
Compared with images, video data has higher dimensionality and more complex information content. To reduce the computer memory occupied by the data and to accelerate retrieval, video fingerprinting has gradually become an important part of the video copy detection field. Video fingerprints, also called video hashes, are binary sequences obtained by extracting features from raw video data and encoding and quantizing them into a compact representation, so that a very small amount of data can represent a large amount of raw data.
In recent years, deep learning, as an emerging machine learning method, has proven able to model raw data in computer vision tasks such as image classification and face recognition by virtue of its powerful feature extraction capability, achieving success beyond conventional methods. The key problems in video fingerprinting are how to extract robust and unique video features and how to efficiently encode the extracted real-valued features. Researchers have therefore tried to learn deep semantic features with good generalization capability directly from video data using various neural networks, such as CNN, LSTM and RNN, triggering a new wave of deep learning research in the field of video copy detection. The document Wang L., Bao Y. and Li H., "Compact CNN Based Video Representation for Efficient Video Copy Detection", International Conference on Multimedia Modeling, Springer International Publishing, pp. 576-587, 2017, first extracts features from densely sampled video frames with a VGGNet and then reduces the feature dimension by principal component analysis (PCA) and sparse coding, further improving retrieval. The document Yue N.L. and Xue P.C., "Robust and compact video descriptor by deep neural network", IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, trains a condition-generating model and a nonlinear encoder respectively, finally obtaining a robust video description. These methods are based on two-dimensional convolutional networks, which can learn only the static spatial features of frames while ignoring the temporal correlation between consecutive frames. To learn video spatio-temporal features jointly, the document Li J., Zhang H. and Wan W., "Two-class 3D-CNN classifiers combination for video copy detection", Multimedia Tools and Applications, 2018, pp. 1-13, proposes a parallel three-dimensional convolutional neural network to model video spatio-temporal information; compared with extracting features with a two-dimensional convolutional network, three-dimensional convolution can capture motion information along the temporal dimension of the video, so the overall performance is better. Because the volume of video data grows rapidly, representing videos directly with the learned high-dimensional spatio-temporal features consumes a great deal of computer memory, so researchers have proposed combining deep learning with hashing and quantizing the high-dimensional real-valued features learned by a neural network into low-dimensional discrete fingerprint codes. One approach extracts video spatial and temporal features with a convolutional neural network and a long short-term memory network respectively, merges frame-level features into video-level features over the time sequence, and finally obtains a binary sequence through quantization with a traditional hash algorithm. Another integrates deep feature extraction and quantization coding into a unified framework, proposing a binary LSTM unit for the encoder RNN to generate the binary code of the video while the decoder RNN reconstructs the original video frames in forward and reverse order.
Overall, the prior art suffers from the following problems. First, most deep-learning-based video copy detection techniques use a neural network only for feature learning, while the quantization of the real-valued feature sequence still follows traditional methods, so the overall complexity of the algorithm is high and it scales poorly to large data sets. Second, although some existing end-to-end deep video fingerprint algorithms integrate quantization coding into the training of the feature extraction network, whether in a supervised mode with single samples and labels or in an unsupervised mode built from pairs or triplets of samples, the model cannot fully learn video features that are both robust and unique from a limited number of samples, which directly affects the quality of the final fingerprint codes. Third, for feature extraction with a three-dimensional convolutional neural network, a feedforward network with few layers and a simple structure is insufficient to mine semantically similar information from complex video structures.
Therefore, there is a need to provide an improved solution to the above-mentioned deficiencies of the prior art.
Disclosure of Invention
The invention provides an end-to-end deep neural network video fingerprint algorithm that reduces the information loss in the quantization of real-valued features. Its precision and recall in video copy detection are significantly improved, and the resulting video fingerprints remain compact while retaining strong robustness and uniqueness.
In order to achieve the above purpose, the invention provides the following technical scheme:
The invention provides a quadruplet convolutional neural network video fingerprint algorithm, which comprises the following steps:
S1, establishing a projection excitation network, and building the quadruplet convolutional neural network video fingerprint algorithm on it, which specifically comprises the following steps:
S11, inputting a feature map of size D×H×W×C into the projection excitation network;
S12, performing global average pooling along the three dimensions D, H and W of each feature channel respectively to obtain three projection vectors;
S13, performing an excitation operation consisting of two 1×1 convolution layers, the first followed by a ReLU function and the second by a Sigmoid function for activation, so that a weight is generated for each feature channel;
S14, weighting the feature map of S11 channel by channel with the weights generated by the excitation operation to select features (see the sketch following this list);
S15, fusing the projection excitation network into a 50-layer three-dimensional residual network to construct the quadruplet convolutional neural network, specifically:
S151, the first network layer comprises a convolution layer with kernel size 7×7×7 and a maximum pooling layer with kernel size 3×3×3;
S152, the second to fifth network layers comprise convolution layers formed by stacking 3, 4, 6 and 3 bottleneck structures respectively, the projection excitation network being nested after the second 1×1 convolution in each bottleneck structure so as to capture the correlation information between feature channels within each residual unit, and the sixth network layer comprising a 1×4 average pooling layer;
S2, selecting video data and preprocessing it;
S3, constructing quadruplet video sequences;
S4, inputting the quadruplet video sequences into the quadruplet convolutional neural network and training it;
and S5, performing a performance test on the trained quadruplet convolutional neural network.
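For illustration, the following is a minimal PyTorch sketch of the projection-excitation operation of S11-S14; the channels-first (N, C, D, H, W) tensor layout, the class name and the channel-reduction ratio of 2 in the excitation step are assumptions, since the patent does not specify these implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectExcite(nn.Module):
    """Projection-excitation block (sketch). Input: (N, C, D, H, W)."""
    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels // reduction, kernel_size=1)
        self.conv2 = nn.Conv3d(channels // reduction, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, d, h, w = x.shape
        # Projection: global average pooling along each feature-channel axis,
        # keeping one spatial dimension at a time (three projection vectors).
        p_d = F.adaptive_avg_pool3d(x, (d, 1, 1))   # (N, C, D, 1, 1)
        p_h = F.adaptive_avg_pool3d(x, (1, h, 1))   # (N, C, 1, H, 1)
        p_w = F.adaptive_avg_pool3d(x, (1, 1, w))   # (N, C, 1, 1, W)
        # Fuse the three projections by broadcast addition (the "add" step).
        z = p_d + p_h + p_w                          # (N, C, D, H, W)
        # Excitation: two 1x1 convolutions, ReLU then Sigmoid.
        z = F.relu(self.conv1(z))
        z = torch.sigmoid(self.conv2(z))
        # Recalibrate the input feature map channel by channel.
        return x * z
```

Applied to a feature map, the block returns a recalibrated map of the same shape, which is what S15 nests inside each residual unit.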
According to the above quadruplet convolutional neural network video fingerprint algorithm, preferably, S2 specifically comprises:
S201, selecting video sequences from behavior recognition data sets and a video content recognition data set;
S202, clipping the video sequences of S201 and then normalizing them;
and S203, dividing the normalized videos of S202 into a training set and a test set.
According to the above quadruplet convolutional neural network video fingerprint algorithm, preferably, S3 specifically comprises arbitrarily selecting three video sequences with different content from the training set of S203, applying a distortion transformation to one of them to obtain a copy video pair, and forming a quadruplet video sequence together with the other two videos.
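As a concrete illustration, a small Python sketch of this quadruplet construction follows; the `training_set` and `distortions` objects are hypothetical placeholders for the data and transformations described in S2 and S3.

```python
import random

def build_quadruplet(training_set, distortions):
    """Build one quadruplet (v_a, v_p, v_n1, v_n2):
    v_a and v_p form a copy pair; v_n1 and v_n2 are unrelated videos."""
    v_a, v_n1, v_n2 = random.sample(training_set, 3)  # three different contents
    distort = random.choice(distortions)  # e.g. frame loss, rotation, scaling
    v_p = distort(v_a)                    # simulated copy of the anchor
    return v_a, v_p, v_n1, v_n2
```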
According to the above quadruplet convolutional neural network video fingerprint algorithm, preferably, S4 specifically comprises:
S401, uniformly sampling each quadruplet video sequence at equal intervals to obtain input frame images, then applying horizontal random flipping for data enhancement;
S402, setting the values of the relevant parameters in the objective function;
S403, initializing the parameters of the quadruplet convolutional neural network;
S404, training the quadruplet convolutional neural network with a stochastic gradient descent method;
and S405, training the quadruplet convolutional neural network and recording the training process.
According to the above quadruplet convolutional neural network video fingerprint algorithm, preferably, the performance test of the trained quadruplet convolutional neural network specifically comprises the following steps:
S501, applying simulated distortion transformations to all video sequences of the test set of S203 to generate copy video sequences;
S502, extracting the fingerprint codes of all video sequences in the test set and of the corresponding copy video sequences with the trained quadruplet convolutional neural network, and cross-computing the Hamming distance between the fingerprint code of each video sequence in the test set and that of each copy video sequence;
S503, setting a threshold and comparing the Hamming distances of S502 with it, two video sequences being judged to form a copy relationship when their Hamming distance is smaller than the set threshold, and a non-copy relationship otherwise;
S504, evaluating the performance of the algorithm with the true positive rate and the false positive rate, computing a set of true positive rate and false positive rate data under each threshold, and drawing the operating characteristic curve;
and S505, carrying out comparison experiments with different parameters of the same distortion transformation and comparison experiments between different algorithms respectively to verify the performance of the quadruplet convolutional neural network video fingerprint algorithm.
According to the above quadruplet convolutional neural network video fingerprint algorithm, preferably, the quadruplet video sequences are denoted v_a, v_p, v_n1 and v_n2, where v_a and v_p have a copy relationship, v_a and v_n1 and v_a and v_n2 have no copy relationship, and the content of v_n1 and v_n2 is completely different.
According to the above quadruplet convolutional neural network video fingerprint algorithm, preferably, the quadruplet video sequence is input into the quadruplet convolutional neural network to extract spatio-temporal features, high-dimensional features are obtained through convolution and pooling operations, and a k-dimensional real-valued sequence is obtained through a fully connected mapping, the k-dimensional real-valued sequence being:
(f(v_a; Θ), f(v_p; Θ), f(v_n1; Θ), f(v_n2; Θ));
where f(v; Θ) ∈ R^(k×1) and Θ denotes the parameters of the quadruplet convolutional neural network;
the k-dimensional real-valued sequence is normalized to obtain (f_e(v_a), f_e(v_p), f_e(v_n1), f_e(v_n2));
where f_e(v) = f(v; Θ) / ||f(v; Θ)||_2.
According to the above quadruplet convolutional neural network video fingerprint algorithm, preferably, the normalized k-dimensional real-valued feature sequences are quantized with the sign function sgn(·) to generate discrete fingerprint codes, the discrete fingerprint codes being:
(H(v_a), H(v_p), H(v_n1), H(v_n2));
where H(v) = [h_1, h_2, …, h_k]^T ∈ {-1, 1}^k and h_i = sgn(f_e(v)_i), that is, h_i = 1 if f_e(v)_i ≥ 0 and h_i = -1 if f_e(v)_i < 0.
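In code, the normalization and sign quantization above reduce to a few lines; a PyTorch sketch follows (the L2 form of the normalization matches the formula above, and the helper names are assumptions):

```python
import torch

def normalize(features: torch.Tensor) -> torch.Tensor:
    """f_e(v): L2-normalize a batch of k-dimensional features, shape (N, k)."""
    return features / features.norm(dim=1, keepdim=True).clamp_min(1e-12)

def quantize(f_e: torch.Tensor) -> torch.Tensor:
    """H(v): map normalized features to ±1 codes; h_i = 1 iff f_e(v)_i >= 0."""
    return torch.where(f_e >= 0, torch.ones_like(f_e), -torch.ones_like(f_e))
```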
According to the above quadruplet convolutional neural network video fingerprint algorithm, preferably, the objective function is:
J = Σ_(i=1..N) [ ln(1 + exp(||f_e(v_a^i) - f_e(v_p^i)||_2^2 - ||f_e(v_a^i) - f_e(v_n1^i)||_2^2 + α_1)) + ln(1 + exp(||f_e(v_a^i) - f_e(v_p^i)||_2^2 - ||f_e(v_n1^i) - f_e(v_n2^i)||_2^2 + α_2)) ] + λ Σ_(i=1..N) Σ_(v ∈ {v_a^i, v_p^i, v_n1^i, v_n2^i}) ||H(v) - f_e(v)||_2^2 + μ ||Θ||_1;
where the adaptive thresholds are
α_1 = (ω_1 / N) Σ_(i=1..N) ( ||f_e(v_a^i) - f_e(v_n1^i)||_2^2 - ||f_e(v_a^i) - f_e(v_p^i)||_2^2 );
α_2 = (ω_2 / N) Σ_(i=1..N) ( ||f_e(v_n1^i) - f_e(v_n2^i)||_2^2 - ||f_e(v_a^i) - f_e(v_p^i)||_2^2 );
N is the training batch size, ω_1 and ω_2 are the threshold coefficients, ||Θ||_1 is the sum of the absolute values of all model parameters, and λ and μ are the hyperparameters balancing the quantization error loss and the L1 regularization term respectively.
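The following PyTorch sketch implements a loss of this shape, reusing `quantize` from the sketch above. Because the exact expressions appear only as equation images in the original filing, the adaptive thresholds and the term weighting below follow the standard quadruplet-loss formulation and should be read as assumptions.

```python
import torch
import torch.nn.functional as F

def fingerprint_loss(fa, fp, fn1, fn2, params,
                     omega1=1.0, omega2=0.5, lam=0.01, mu=0.001):
    """Quadruplet loss + quantization error + L1 regularization (sketch).
    fa..fn2: normalized features of shape (N, k); params: model parameters."""
    d_ap  = (fa - fp).pow(2).sum(dim=1)    # ||f_e(v_a) - f_e(v_p)||^2
    d_an1 = (fa - fn1).pow(2).sum(dim=1)   # ||f_e(v_a) - f_e(v_n1)||^2
    d_nn  = (fn1 - fn2).pow(2).sum(dim=1)  # ||f_e(v_n1) - f_e(v_n2)||^2
    # Adaptive thresholds: batch means scaled by the threshold coefficients.
    alpha1 = omega1 * (d_an1 - d_ap).mean().detach()
    alpha2 = omega2 * (d_nn - d_ap).mean().detach()
    # ln(1 + exp(.)) as the smooth replacement for max(0, .).
    quad = (F.softplus(d_ap - d_an1 + alpha1) +
            F.softplus(d_ap - d_nn + alpha2)).sum()
    # Quantization error between real-valued features and their ±1 codes.
    quant = sum((f - quantize(f)).pow(2).sum() for f in (fa, fp, fn1, fn2))
    l1 = sum(p.abs().sum() for p in params)  # L1 regularization term
    return quad + lam * quant + mu * l1
```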
Preferably, in the above quadruplet convolutional neural network video fingerprint algorithm, the loss of the quadruplet video sequence (f_e(v_a), f_e(v_p), f_e(v_n1), f_e(v_n2)) is computed through the objective function; the loss computation comprises the gradient of the quadruplet loss with respect to the normalized features and the gradient of the quantization error loss;
the gradients of the quadruplet loss with respect to the normalized features are:
∂L_quad/∂f_e(v_a) = 2s_1(f_e(v_n1) - f_e(v_p)) + 2s_2(f_e(v_a) - f_e(v_p));
∂L_quad/∂f_e(v_p) = -2(s_1 + s_2)(f_e(v_a) - f_e(v_p));
∂L_quad/∂f_e(v_n1) = 2s_1(f_e(v_a) - f_e(v_n1)) - 2s_2(f_e(v_n1) - f_e(v_n2));
∂L_quad/∂f_e(v_n2) = 2s_2(f_e(v_n1) - f_e(v_n2));
where
s_1 = exp(z_1)/(1 + exp(z_1)), with z_1 = ||f_e(v_a) - f_e(v_p)||_2^2 - ||f_e(v_a) - f_e(v_n1)||_2^2 + α_1;
s_2 = exp(z_2)/(1 + exp(z_2)), with z_2 = ||f_e(v_a) - f_e(v_p)||_2^2 - ||f_e(v_n1) - f_e(v_n2)||_2^2 + α_2;
the gradients of the quantization error loss are:
∂L_quant/∂f_e(v_a) = 2(f_e(v_a) - H(v_a));
∂L_quant/∂f_e(v_p) = 2(f_e(v_p) - H(v_p));
∂L_quant/∂f_e(v_n1) = 2(f_e(v_n1) - H(v_n1));
∂L_quant/∂f_e(v_n2) = 2(f_e(v_n2) - H(v_n2)).
Compared with the closest prior art, the technical scheme provided by the invention has the following beneficial effects:
The invention provides a quadruplet convolutional neural network video fingerprint algorithm. The quadruplet model structure consists of four parameter-sharing sub-networks, each a three-dimensional residual projection excitation network: the three-dimensional residual network learns deep spatio-temporal features of the video, while the projection excitation module generates weights for the feature channels to learn the dependencies among them. Using the three-dimensional projection excitation residual network as the feature extractor for the video copy detection task and obtaining the real-valued fingerprint sequence through a fully connected layer, the mapping from raw video data to discrete binary codes is realized end to end, which simplifies algorithm complexity. During training, the quadruplet loss and the quantization error loss jointly optimize the network parameters: on the one hand, the quadruplet loss reduces intra-class variance and increases inter-class variance, so that robust and unique video features are learned simultaneously; on the other hand, the quantization error loss reduces the loss of semantically similar information when real-valued features are binarized, ensuring the quality of the video fingerprints. The invention combines deep learning with hash coding, retrieves video copies quickly and effectively, and can be used for copyright protection and illegal content monitoring of Internet digital video; precision and recall in video copy detection are significantly improved, and the resulting video fingerprints remain compact while retaining strong robustness and uniqueness.
Drawings
FIG. 1 is a schematic flow chart of a video fingerprint algorithm of a quadruplet convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a principle of a projection excitation network in an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a quadruplet convolutional neural network according to an embodiment of the present invention;
FIG. 4 is a graph of the training loss variation in an embodiment of the present invention;
FIG. 5 is a graph of the training accuracy variation in an embodiment of the present invention;
FIG. 6 is a frame loss comparison graph under the same distortion transformation in an embodiment of the present invention;
FIG. 7 is a frame rate reduction comparison graph under the same distortion transformation in an embodiment of the present invention;
FIG. 8 is a rotation comparison graph under the same distortion transformation in an embodiment of the present invention;
FIG. 9 is a translation plus frame rate reduction comparison graph under the same distortion transformation in an embodiment of the present invention;
FIG. 10 is a logo insertion comparison graph under the same distortion transformation in an embodiment of the present invention;
FIG. 11 is a median filtering plus frame loss comparison graph under the same distortion transformation in an embodiment of the present invention;
FIG. 12 is a salt-and-pepper noise plus frame loss comparison graph under the same distortion transformation in an embodiment of the present invention;
FIG. 13 is a scaling comparison graph under the same distortion transformation in an embodiment of the present invention;
FIG. 14 is a frame loss comparison graph under different distortion transformations in an embodiment of the present invention;
FIG. 15 is a frame rate reduction comparison graph under different distortion transformations in an embodiment of the present invention;
FIG. 16 is a rotation comparison graph under different distortion transformations in an embodiment of the present invention;
FIG. 17 is a translation plus frame rate reduction comparison graph under different distortion transformations in an embodiment of the present invention;
FIG. 18 is a logo insertion comparison graph under different distortion transformations in an embodiment of the present invention;
FIG. 19 is a median filtering plus frame loss comparison graph under different distortion transformations in an embodiment of the present invention;
FIG. 20 is a salt-and-pepper noise plus frame loss comparison graph under different distortion transformations in an embodiment of the present invention;
FIG. 21 is a scaling comparison graph under different distortion transformations in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
As shown in FIG. 1, the present invention provides a quadruplet convolutional neural network video fingerprint algorithm whose overall framework consists of four parts: an input end, the quadruplet convolutional neural network, a fingerprint code generation layer and a loss layer. The input end receives a quadruplet video sequence, in which anchor denotes the source video, positive a copy video, negative1 a first non-copy video and negative2 a second non-copy video. The quadruplet convolutional neural network comprises four weight-sharing sub-networks, each a novel three-dimensional residual network with nested projection excitation blocks. The fingerprint code generation layer first applies an average pooling operation to the feedforward network output, maps the resulting 2048-dimensional feature vector to a low-bit real-valued sequence, and then performs binary coding to obtain the discrete fingerprint code. In the loss layer, the objective function optimized by the model contains an improved quadruplet loss and a quantization error loss, with an L1 regularization term added.
The invention is an end-to-end structure: a convolutional network acquires the spatio-temporal information of the raw input video data, the acquired high-dimensional features are reduced to a real-valued sequence of the target length through full connection, and the quantized video fingerprint is finally obtained. Model optimization is based on the relations of the three positive-negative sample pairs formed by each quadruplet: the quadruplet loss reduces intra-class variance and increases inter-class variance, so that the robustness features of copy video pairs and the uniqueness features of non-copy video pairs are learned jointly; the quantization error term effectively reduces the loss of similarity information when the real-valued sequence is binarized; and L1 regularization helps avoid overfitting during training. In the optimization process, the quadruplet convolutional neural network generates compact fingerprint codes satisfying robustness and uniqueness by minimizing the objective function, while the generated video fingerprints in turn guide the learning of the network; the two promote and feed back to each other.
As shown in fig. 2-3, the specific construction steps of the quadruple convolutional neural network in the quadruple convolutional neural network video fingerprint algorithm in the present invention are as follows:
S1, establishing a projection excitation network by drawing on the projection excitation module from the image segmentation field, and constructing the quadruplet convolutional neural network from it, specifically as follows:
S11, a feature map of size D × H × W × C is input.
S12, after the feature map is input, a projection operation is performed: global average pooling is carried out along the three dimension directions D, H and W of each feature channel to obtain three projection vectors, where ⊕ denotes the addition operation that fuses the spatial information of the different dimension directions.
S13, after the projection operation, an excitation operation is performed, consisting of two 1 × 1 convolution layers; the first convolution layer is followed by a ReLU activation and the second by a Sigmoid activation, and the activated result generates a weight for each feature channel so as to learn the interdependence among the channels.
S14, the weights output by the excitation operation are applied to the initial feature map channel by channel, selecting features for the current task by promoting useful features and suppressing useless ones, where ⊙ denotes the dot product operation.
S15, fusing the projection excitation network into a 50-layer three-dimensional residual error network, and constructing a quadruple convolution neural network by taking the 50-layer three-dimensional residual error network as a basic structure through the three-dimensional projection excitation residual error network model, wherein the quadruple convolution neural network specifically comprises the following steps:
s151, the first network layer Conv1 is input with a convolution layer with a core size of 7 × 7 × 7 and a maximum pooling layer with a core size of 3 × 3 × 3.
S152, a second network layer Conv2_ x, a third network layer Conv3_ x, a fourth network layer Conv4_ x and a fifth network layer Conv5_ x respectively represent convolutional layers formed by overlapping 3, 4, 6 and 3 bottleneck structures, a projection excitation block is nested after convolution of the second 1 x 1 in each bottleneck structure, the purpose of capturing correlation information between feature channels in each residual error unit is achieved, and the sixth layer is an average pooling layer with the kernel size of 1 x 4.
And S153, outputting the video fingerprint length obtained by 16 bit numbers, so that the high-dimensional features extracted by the three-dimensional feature projection excitation residual error network are mapped into a 16-dimensional real value sequence, namely, neurons of a full connection layer in the fingerprint code generation layer comprise 16 nodes.
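A hedged PyTorch sketch of one such bottleneck with the nested projection excitation block (reusing the `ProjectExcite` sketch above) and of the 16-node fingerprint head follows; the strides, channel widths and batch-norm placement are conventional 3D ResNet-50 choices rather than details disclosed by the patent.

```python
import torch
import torch.nn as nn

class PEBottleneck(nn.Module):
    """3D bottleneck (1x1x1 -> 3x3x3 -> 1x1x1) with a projection-excitation
    block nested after the second 1x1x1 convolution (sketch)."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, mid_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm3d(mid_ch)
        self.conv2 = nn.Conv3d(mid_ch, mid_ch, 3, stride=stride, padding=1,
                               bias=False)
        self.bn2 = nn.BatchNorm3d(mid_ch)
        self.conv3 = nn.Conv3d(mid_ch, out_ch, 1, bias=False)  # second 1x1x1
        self.bn3 = nn.BatchNorm3d(out_ch)
        self.pe = ProjectExcite(out_ch)   # nested projection-excitation block
        self.shortcut = (nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm3d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.pe(self.bn3(self.conv3(out)))  # recalibrate the channels
        return self.relu(out + self.shortcut(x))

# Fingerprint head (S153): global average pooling of the 2048-d feature map,
# followed by a fully connected layer with 16 nodes.
fingerprint_head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(2048, 16))
```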
The quadruplet convolutional neural network video fingerprint algorithm flow specifically comprises the following steps:
1) Constructing a quadruplet video sequence, denoted v_a, v_p, v_n1 and v_n2, where v_a and v_p have a copy relationship, v_a and v_n1 and v_a and v_n2 have no copy relationship, and the content of v_n1 and v_n2 is completely different.
2) Inputting v_a, v_p, v_n1 and v_n2 into the quadruplet convolutional neural network to extract spatio-temporal features; high-dimensional features are obtained through a series of convolution and pooling operations and are mapped through full connection to a k-dimensional real-valued sequence (f(v_a; Θ), f(v_p; Θ), f(v_n1; Θ), f(v_n2; Θ)), that is, a lower-dimensional real-valued feature obtained by dimension-reduction compression of the high-dimensional features, where f(v; Θ) ∈ R^(k×1) and Θ denotes the model parameters.
3) Normalizing the k-dimensional real-valued sequence to obtain (f_e(v_a), f_e(v_p), f_e(v_n1), f_e(v_n2)), where f_e(v) = f(v; Θ) / ||f(v; Θ)||_2.
4) Quantizing with the sign function sgn(·) to generate the discrete fingerprint codes (H(v_a), H(v_p), H(v_n1), H(v_n2)), where H(v) = [h_1, h_2, …, h_k]^T ∈ {-1, 1}^k and h_i = sgn(f_e(v)_i), that is, h_i = 1 if f_e(v)_i ≥ 0 and h_i = -1 if f_e(v)_i < 0.
5) Designing the optimization objective function of the quadruplet convolutional neural network. The objective function is the optimization target of the network, and the network parameters are continuously adjusted toward the optimal state according to it. The designed objective function is:
J = Σ_(i=1..N) [ ln(1 + exp(||f_e(v_a^i) - f_e(v_p^i)||_2^2 - ||f_e(v_a^i) - f_e(v_n1^i)||_2^2 + α_1)) + ln(1 + exp(||f_e(v_a^i) - f_e(v_p^i)||_2^2 - ||f_e(v_n1^i) - f_e(v_n2^i)||_2^2 + α_2)) ] + λ Σ_(i=1..N) Σ_(v ∈ {v_a^i, v_p^i, v_n1^i, v_n2^i}) ||H(v) - f_e(v)||_2^2 + μ ||Θ||_1;
where the adaptive thresholds are
α_1 = (ω_1 / N) Σ_(i=1..N) ( ||f_e(v_a^i) - f_e(v_n1^i)||_2^2 - ||f_e(v_a^i) - f_e(v_p^i)||_2^2 );
α_2 = (ω_2 / N) Σ_(i=1..N) ( ||f_e(v_n1^i) - f_e(v_n2^i)||_2^2 - ||f_e(v_a^i) - f_e(v_p^i)||_2^2 );
N is the training batch size, ω_1 and ω_2 are the threshold coefficients, ||Θ||_1 is the sum of the absolute values of all model parameters, and λ and μ are the hyperparameters balancing the quantization error loss and the L1 regularization term respectively; the loss is calculated according to the objective function and each parameter of the quadruplet convolutional neural network is updated through back propagation.
6) Computing the gradient of the novel quadruplet loss in the objective function. The novel quadruplet loss proposed in the embodiment of the invention replaces the non-differentiable max(0, ·) function in the original quadruplet loss function (QuadrupletLoss) with the infinitely differentiable smooth continuous function ln(1 + exp(·)), which makes it convenient to update the parameters with a gradient descent method. The gradients of the novel quadruplet loss with respect to f_e(v_a), f_e(v_p), f_e(v_n1) and f_e(v_n2) are:
∂L_quad/∂f_e(v_a) = 2s_1(f_e(v_n1) - f_e(v_p)) + 2s_2(f_e(v_a) - f_e(v_p));
∂L_quad/∂f_e(v_p) = -2(s_1 + s_2)(f_e(v_a) - f_e(v_p));
∂L_quad/∂f_e(v_n1) = 2s_1(f_e(v_a) - f_e(v_n1)) - 2s_2(f_e(v_n1) - f_e(v_n2));
∂L_quad/∂f_e(v_n2) = 2s_2(f_e(v_n1) - f_e(v_n2));
where
s_1 = exp(z_1)/(1 + exp(z_1)), with z_1 = ||f_e(v_a) - f_e(v_p)||_2^2 - ||f_e(v_a) - f_e(v_n1)||_2^2 + α_1;
s_2 = exp(z_2)/(1 + exp(z_2)), with z_2 = ||f_e(v_a) - f_e(v_p)||_2^2 - ||f_e(v_n1) - f_e(v_n2)||_2^2 + α_2.
7) Computing the gradient of the quantization error loss in the objective function; the objective function of the whole quadruplet convolutional neural network comprises the quadruplet loss, the quantization error loss and the regularization term. The gradients of the quantization error loss with respect to f_e(v_a), f_e(v_p), f_e(v_n1) and f_e(v_n2) are:
∂L_quant/∂f_e(v_a) = 2(f_e(v_a) - H(v_a));
∂L_quant/∂f_e(v_p) = 2(f_e(v_p) - H(v_p));
∂L_quant/∂f_e(v_n1) = 2(f_e(v_n1) - H(v_n1));
∂L_quant/∂f_e(v_n2) = 2(f_e(v_n2) - H(v_n2)).
the video distortion copy can be effectively detected through the depth video fingerprint algorithm, the generated video fingerprint obviously reduces the fingerprint coding length while ensuring the robustness and uniqueness, and the matching efficiency is fundamentally improved.
In order that the invention may be better understood, it will now be further illustrated by the following examples.
S2, selecting experimental data and preprocessing the experimental data, wherein the preprocessing comprises the following steps:
s201, selecting 4986 video sequences which meet the requirements that the resolution is 320 multiplied by 240, the total number of frames is not less than 100 and the difference of video contents is large from the behavior identification data set UCF-101, the HMDB-51 and the video content identification data set FCVID.
S202, intercepting the first 100 frames of the selected 4986 video sequences to obtain the normalized video with the size of 100 multiplied by 320 multiplied by 240.
S203, dividing all the normalized video sequences into 3986 training sets and 1000 testing sets, wherein the videos in the training sets and the videos in the testing sets are not overlapped with each other.
S3, randomly selecting three video sequences with different contents from the training set, carrying out distortion change processing on one of the video sequences to obtain a group of analog copy video pairs, and forming a quadruple video sequence with the other two video sequences.
S4, training the quadruple convolutional neural network, specifically comprising:
S401, 16 frames are uniformly sampled at equal intervals from each video sequence; the four corners and the center position of the video frames are cropped at five spatial scales and resized to obtain color input clips of size 16 × 112 × 112; in addition, horizontal random flipping is applied for data enhancement.
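A simplified PyTorch sketch of this preprocessing follows; for brevity it performs only the center crop at full scale (the embodiment also crops the four corners at several spatial scales), and the helper name and uint8 input layout are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def preprocess_clip(video: torch.Tensor, train: bool = True) -> torch.Tensor:
    """video: (T, H, W, 3) uint8 tensor -> (3, 16, 112, 112) float clip."""
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=16).long()  # equal-interval sampling
    clip = video[idx].float() / 255.0                # (16, H, W, 3)
    h, w = clip.shape[1], clip.shape[2]
    s = min(h, w)                                    # center crop, one of the
    top, left = (h - s) // 2, (w - s) // 2           # five crop positions
    clip = clip[:, top:top + s, left:left + s, :]
    clip = clip.permute(3, 0, 1, 2).unsqueeze(0)     # (1, 3, 16, s, s)
    clip = F.interpolate(clip, size=(16, 112, 112), mode='trilinear',
                         align_corners=False).squeeze(0)
    if train and random.random() < 0.5:              # horizontal random flip
        clip = torch.flip(clip, dims=[3])
    return clip
```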
S402, setting the value of relevant parameters omega in the objective function1And ω 21 and 0.5, respectively, and λ and μ are 0.01 and 0.001, respectively.
S403, initializing the model parameters before training: the parameters of a 3D ResNet-50 network pre-trained on the Kinetics data set are transferred into the quadruplet convolutional neural network and the remaining parameters are randomly initialized. The Kinetics data set is a public video data set for human behavior recognition and a recognized benchmark data set; its videos come from YouTube and are divided into 600 categories, with at least 600 videos per category, each lasting about 10 seconds.
S404, training the model by adopting a random gradient descent method, setting the initial learning rate to be 0.01, setting the momentum to be 0.9, setting the weight attenuation parameter to be 0.001, selecting 10000 quadruple video sequences for each epoch, setting the batch size to be 10 according to the size of a computer video memory, and setting the learning rate adjustment strategy to be 0.1 time of the original learning rate reduction of 20000 times per iteration.
S405, respectively training the quadruple video fingerprint algorithm of the nested projection excitation network and the video fingerprint algorithm based on the non-local 3D residual error network under the same data set according to the same training strategy and parameter setting, recording loss and accuracy change conditions in the training process, drawing curves, and drawing results as shown in fig. 4 and 5.
S5, verifying the trained quadruplet convolutional neural network, wherein the verifying step specifically comprises the following steps:
S501, applying 8 kinds of simulated distortion transformation to all video sequences in the test set to generate 34 copy videos per sequence: 3 frame-loss videos (randomly discarding 30, 40 and 50 frames), 5 frame-rate-reduction videos (FPS reduced by 2, 4, 6, 8 and 10 per second), 4 rotation videos (rotating 5° and 10° clockwise and 5° and 10° counterclockwise), 3 frame translation plus frame-rate-reduction videos (translating along the coordinate axes by (-40, 40), (-60, 60) and (-80, 80), each with FPS reduced by 10 per second), 5 logo insertion videos (insertion position coordinates (40, 60), (60, 80), (80, 100), (100, 120 and 120)), 5 median filtering plus frame-loss videos (templates 9 × 9, 11 × 11, 13 × 13, 15 × 15 and 17 × 17, each with 50 frames randomly discarded), 5 salt-and-pepper noise plus frame-loss videos (densities 0.02, 0.04, 0.06, 0.08 and 0.10, each with 50 frames randomly discarded), and 4 scaled videos (factors 0.5, 0.75, 1.5 and 2.0).
S502, extracting fingerprint codes of all video sequences in the test set and fingerprint codes of copy video sequences corresponding to the fingerprint codes according to the trained quadruplet convolutional neural network, and calculating the Hamming distance between the fingerprint code of each video sequence in the test set and the fingerprint code of each copy video sequence in a crossed manner.
S503, setting a threshold value, wherein the threshold value is changed in all Hamming distance ranges, when the Hamming distance between two fingerprint codes is smaller than the set threshold value, judging that two video sequences form a copy relation, otherwise, judging that two videos form a non-copy relation.
S504, evaluating the performance of the video fingerprint algorithm with the true positive rate and the false positive rate.
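Sweeping the threshold over all possible Hamming distances yields the false positive rate and true positive rate pairs used to draw the operating characteristic curve; a plain-Python sketch with hypothetical inputs:

```python
def roc_points(copy_dists, noncopy_dists, k=16):
    """copy_dists / noncopy_dists: Hamming distances of genuine copy pairs and
    of non-copy pairs; returns (FPR, TPR) points for thresholds 0..k+1."""
    points = []
    for t in range(k + 2):
        tpr = sum(d < t for d in copy_dists) / len(copy_dists)
        fpr = sum(d < t for d in noncopy_dists) / len(noncopy_dists)
        points.append((fpr, tpr))
    return points
```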
S505, carrying out two groups of experiments respectively to verify the performance of the video fingerprint algorithm provided by the invention.
The first group of experiments compares different parameters of the same distortion transformation; the results are shown in FIG. 6 to FIG. 13. The curve in FIG. 6 shows that the performance of the algorithm trends downward as the number of lost frames increases. The curves in FIG. 7 indicate that changes in frame rate have little effect on performance. The curves in FIG. 8 indicate that the larger the rotation angle, whether clockwise or counterclockwise, the worse the performance, and that the robustness to counterclockwise rotation is better than to clockwise rotation. The curve in FIG. 9 shows that performance initially trends downward as the frame translation distance grows, then rises in a step-like manner after the translation reaches a certain distance. The curves in FIG. 10 show that the closer the inserted logo is to the center of the video frame, the worse the anti-interference ability of the algorithm, while the closer the logo is to the edge, the stronger it is. The curves in FIG. 11 show that robustness trends downward as the median filtering strength increases. The curves in FIG. 12 show that robustness trends downward as the salt-and-pepper noise intensity increases. The curves in FIG. 13 show that performance trends downward as the video scaling factor increases, with enlarging affecting robustness significantly more than shrinking.
The second group of experiments compares different algorithms. Four traditional video fingerprint algorithms, the structural graph model (SGM), temporally informative representative images (TIRI), centroid of gradient orientations (CGO) and radial hash (RASH), together with a video fingerprint algorithm based on a non-local 3D residual network (NL_Triplet), are selected for performance comparison with the proposed algorithm (PE_Quadruplet); all fingerprint code lengths are 16 bits. The NL_Triplet algorithm and the PE_Quadruplet algorithm are trained and tested on the same data set with consistent parameter settings. The experimental results are shown in FIG. 14 to FIG. 21. Under the frame-loss attack in FIG. 14, the proposed algorithm is clearly superior to the TIRI, CGO and RASH algorithms and performs well overall compared with the SGM algorithm and the deep learning algorithm NL_Triplet. FIG. 15 shows that under the frame-rate-reduction attack, the proposed algorithm is slightly inferior to the SGM and NL_Triplet algorithms, but its overall performance remains at a high level. FIG. 16 shows that under the rotation attack, the proposed algorithm is the best of all compared algorithms, reflecting its strong robustness to rotation distortion. FIG. 17 shows that under the combined attack of frame translation and frame rate reduction, the proposed algorithm performs significantly better than the compared algorithms, demonstrating strong anti-interference ability against combined geometric and temporal distortion. FIG. 18 shows that under the logo insertion attack, the proposed algorithm again performs best of all algorithms, indicating good robustness to local distortion. FIG. 19 shows that under the combined attack of median filtering and frame loss, the proposed algorithm outperforms the traditional comparison algorithms and is comparable to the deep algorithm NL_Triplet, indicating excellent robustness to signal-processing spatial distortion. FIG. 20 shows that under the combined attack of salt-and-pepper noise and frame loss, the proposed algorithm again shows clear superiority over the traditional algorithms, once more reflecting its strong robustness to signal-processing spatial distortion. FIG. 21 shows that under the scaling attack, the proposed algorithm performs best among all six algorithms, fully demonstrating its strong robustness to geometric distortion.
In conclusion, the invention provides a quadruplet convolutional neural network video fingerprint algorithm. The quadruplet model structure consists of four parameter-sharing sub-networks, each a three-dimensional residual projection excitation network: the three-dimensional residual network learns deep spatio-temporal features of the video, while the projection excitation module generates weights for the feature channels to learn the dependencies among them. Using the three-dimensional projection excitation residual network as the feature extractor for the video copy detection task and obtaining the real-valued fingerprint sequence through a fully connected layer, the mapping from raw video data to discrete binary codes is realized end to end, which simplifies algorithm complexity. During training, the quadruplet loss and the quantization error loss jointly optimize the network parameters: on the one hand, the quadruplet loss reduces intra-class variance and increases inter-class variance, so that robust and unique video features are learned simultaneously; on the other hand, the quantization error loss reduces the loss of semantically similar information when real-valued features are binarized, ensuring the quality of the video fingerprints. The method combines deep learning with hash coding, retrieves video copies quickly and effectively, and can be used for copyright protection and illegal content monitoring of Internet digital video; precision and recall in video copy detection are significantly improved, and the resulting video fingerprints remain compact while retaining strong robustness and uniqueness.
The above description is only exemplary of the invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the invention is intended to be covered by the appended claims.

Claims (10)

1. A quadruplet convolutional neural network video fingerprint algorithm, comprising the following steps:
S1, establishing a projection excitation network, and building the quadruplet convolutional neural network video fingerprint algorithm on it, which specifically comprises the following steps:
S11, inputting a feature map of size D×H×W×C into the projection excitation network;
S12, performing global average pooling along the three dimensions D, H and W of each feature channel respectively to obtain three projection vectors;
S13, performing an excitation operation consisting of two 1×1 convolution layers, the first followed by a ReLU function and the second by a Sigmoid function for activation, so that a weight is generated for each feature channel;
S14, weighting the feature map of S11 channel by channel with the weights generated by the excitation operation to select features;
S15, fusing the projection excitation network into a 50-layer three-dimensional residual network to construct the quadruplet convolutional neural network, specifically:
S151, the first network layer comprises a convolution layer with kernel size 7×7×7 and a maximum pooling layer with kernel size 3×3×3;
S152, the second to fifth network layers comprise convolution layers formed by stacking 3, 4, 6 and 3 bottleneck structures respectively, the projection excitation network being nested after the second 1×1 convolution in each bottleneck structure so as to capture the correlation information between feature channels within each residual unit, and the sixth network layer comprising a 1×4 average pooling layer;
S2, selecting video data and preprocessing it;
S3, constructing quadruplet video sequences;
S4, inputting the quadruplet video sequences into the quadruplet convolutional neural network and training it;
and S5, performing a performance test on the trained quadruplet convolutional neural network.
2. The quadruplet convolutional neural network video fingerprint algorithm according to claim 1, wherein S2 specifically comprises:
S201, selecting video sequences from behavior recognition data sets and a video content recognition data set;
S202, clipping the video sequences of S201 and then normalizing them;
and S203, dividing the normalized videos of S202 into a training set and a test set.
3. The quadruplet convolutional neural network video fingerprint algorithm according to claim 2, wherein S3 specifically comprises arbitrarily selecting three video sequences with different content from the training set of S203, performing a distortion transformation on one of them to obtain a copy video pair, and forming a quadruplet video sequence together with the other two videos.
4. The quadruplet convolutional neural network video fingerprint algorithm according to claim 3, wherein S4 specifically comprises:
S401, uniformly sampling each quadruplet video sequence at equal intervals to obtain input frame images, then applying horizontal random flipping for data enhancement;
S402, setting the values of the relevant parameters in the objective function;
S403, initializing the parameters of the quadruplet convolutional neural network;
S404, training the quadruplet convolutional neural network with a stochastic gradient descent method;
and S405, training the quadruplet convolutional neural network and recording the training process.
5. The quadruplet convolutional neural network video fingerprint algorithm according to claim 4, wherein the performance test of the trained quadruplet convolutional neural network specifically comprises:
S501, applying simulated distortion transformations to all video sequences of the test set of S203 to generate copy video sequences;
S502, extracting the fingerprint codes of all video sequences in the test set and of the corresponding copy video sequences with the trained quadruplet convolutional neural network, and cross-computing the Hamming distance between the fingerprint code of each video sequence in the test set and that of each copy video sequence;
S503, setting a threshold and comparing the Hamming distances of S502 with it, two video sequences being judged to form a copy relationship when their Hamming distance is smaller than the set threshold, and a non-copy relationship otherwise;
S504, evaluating the performance of the algorithm with the true positive rate and the false positive rate, computing a set of true positive rate and false positive rate data under each threshold, and drawing the operating characteristic curve;
and S505, carrying out comparison experiments with different parameters of the same distortion transformation and comparison experiments between different algorithms respectively to verify the performance of the quadruplet convolutional neural network video fingerprint algorithm.
6. The quadruplet convolutional neural network video fingerprint algorithm according to claim 3, wherein the quadruplet video sequences are denoted v_a, v_p, v_n1 and v_n2, where v_a and v_p have a copy relationship, v_a and v_n1 and v_a and v_n2 have no copy relationship, and the content of v_n1 and v_n2 is completely different.
7. The quadruplet convolutional neural network video fingerprint algorithm according to claim 6, wherein the quadruplet video sequence is input into the quadruplet convolutional neural network to extract spatio-temporal features, high-dimensional features are obtained through convolution and pooling operations, and a k-dimensional real-valued sequence is obtained through a fully connected mapping, the k-dimensional real-valued sequence being:
(f(v_a; Θ), f(v_p; Θ), f(v_n1; Θ), f(v_n2; Θ));
where f(v; Θ) ∈ R^(k×1) and Θ denotes the parameters of the quadruplet convolutional neural network;
the k-dimensional real-valued sequence is normalized to obtain (f_e(v_a), f_e(v_p), f_e(v_n1), f_e(v_n2));
where f_e(v) = f(v; Θ) / ||f(v; Θ)||_2.
8. The quadruplet convolutional neural network video fingerprint algorithm according to claim 7, wherein the normalized k-dimensional real-valued feature sequences are quantized with the sign function sgn(·) to generate the discrete fingerprint codes:
(H(v_a), H(v_p), H(v_n1), H(v_n2));
where H(v) = [h_1, h_2, …, h_k]^T ∈ {-1, 1}^k and h_i = sgn(f_e(v)_i), that is, h_i = 1 if f_e(v)_i ≥ 0 and h_i = -1 if f_e(v)_i < 0.
9. The quadruplet convolutional neural network video fingerprint algorithm according to claim 4, wherein the objective function is:
J = Σ_(i=1..N) [ ln(1 + exp(||f_e(v_a^i) - f_e(v_p^i)||_2^2 - ||f_e(v_a^i) - f_e(v_n1^i)||_2^2 + α_1)) + ln(1 + exp(||f_e(v_a^i) - f_e(v_p^i)||_2^2 - ||f_e(v_n1^i) - f_e(v_n2^i)||_2^2 + α_2)) ] + λ Σ_(i=1..N) Σ_(v ∈ {v_a^i, v_p^i, v_n1^i, v_n2^i}) ||H(v) - f_e(v)||_2^2 + μ ||Θ||_1;
where the adaptive thresholds are
α_1 = (ω_1 / N) Σ_(i=1..N) ( ||f_e(v_a^i) - f_e(v_n1^i)||_2^2 - ||f_e(v_a^i) - f_e(v_p^i)||_2^2 );
α_2 = (ω_2 / N) Σ_(i=1..N) ( ||f_e(v_n1^i) - f_e(v_n2^i)||_2^2 - ||f_e(v_a^i) - f_e(v_p^i)||_2^2 );
N is the training batch size, ω_1 and ω_2 are the threshold coefficients, ||Θ||_1 is the sum of the absolute values of all model parameters, and λ and μ are the hyperparameters balancing the quantization error loss and the L1 regularization term respectively.
10. The quadruplet convolutional neural network video fingerprint algorithm according to claim 9, wherein the loss of the quadruplet video sequence (f_e(v_a), f_e(v_p), f_e(v_n1), f_e(v_n2)) is computed through the objective function; the loss computation comprises the gradient of the quadruplet loss with respect to the normalized features and the gradient of the quantization error loss;
the gradients of the quadruplet loss with respect to the normalized features are:
∂L_quad/∂f_e(v_a) = 2s_1(f_e(v_n1) - f_e(v_p)) + 2s_2(f_e(v_a) - f_e(v_p));
∂L_quad/∂f_e(v_p) = -2(s_1 + s_2)(f_e(v_a) - f_e(v_p));
∂L_quad/∂f_e(v_n1) = 2s_1(f_e(v_a) - f_e(v_n1)) - 2s_2(f_e(v_n1) - f_e(v_n2));
∂L_quad/∂f_e(v_n2) = 2s_2(f_e(v_n1) - f_e(v_n2));
where
s_1 = exp(z_1)/(1 + exp(z_1)), with z_1 = ||f_e(v_a) - f_e(v_p)||_2^2 - ||f_e(v_a) - f_e(v_n1)||_2^2 + α_1;
s_2 = exp(z_2)/(1 + exp(z_2)), with z_2 = ||f_e(v_a) - f_e(v_p)||_2^2 - ||f_e(v_n1) - f_e(v_n2)||_2^2 + α_2;
the gradients of the quantization error loss are:
∂L_quant/∂f_e(v_a) = 2(f_e(v_a) - H(v_a));
∂L_quant/∂f_e(v_p) = 2(f_e(v_p) - H(v_p));
∂L_quant/∂f_e(v_n1) = 2(f_e(v_n1) - H(v_n1));
∂L_quant/∂f_e(v_n2) = 2(f_e(v_n2) - H(v_n2)).
CN202010072025.8A 2020-01-21 2020-01-21 Quadruplet convolutional neural network video fingerprint method Active CN111291223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010072025.8A CN111291223B (en) 2020-01-21 2020-01-21 Quadruplet convolutional neural network video fingerprint method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010072025.8A CN111291223B (en) 2020-01-21 2020-01-21 Quadruplet convolutional neural network video fingerprint method

Publications (2)

Publication Number Publication Date
CN111291223A true CN111291223A (en) 2020-06-16
CN111291223B CN111291223B (en) 2023-01-24

Family

ID=71026653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010072025.8A Active CN111291223B (en) 2020-01-21 2020-01-21 Four-embryo convolution neural network video fingerprint method

Country Status (1)

Country Link
CN (1) CN111291223B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985165A (en) * 2018-06-12 2018-12-11 东南大学 A kind of video copy detection system and method based on convolution and Recognition with Recurrent Neural Network
CN110070041A (en) * 2019-04-23 2019-07-30 江西理工大学 A kind of video actions recognition methods of time-space compression excitation residual error multiplication network
GB201908574D0 (en) * 2019-06-14 2019-07-31 Vision Semantics Ltd Optimised machine learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHIYANG XU ET AL.: "Content-based video fingerprinting method for fast key generation and retrieval", IEEE *
WANG Dongdong et al.: "Video fingerprint algorithm based on spatio-temporal deep neural network", Laser & Optoelectronics Progress *
GUO Chen et al.: "Video fingerprint algorithm based on non-local 3D residual network", Computer Engineering and Applications *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813996A (en) * 2020-07-22 2020-10-23 四川长虹电器股份有限公司 Video searching method based on sampling parallelism of single frame and continuous multi-frame

Also Published As

Publication number Publication date
CN111291223B (en) 2023-01-24

Similar Documents

Publication Publication Date Title
CN111581405B (en) Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning
Zhang et al. Depth-wise separable convolutions and multi-level pooling for an efficient spatial CNN-based steganalysis
Cui et al. Identifying materials of photographic images and photorealistic computer generated graphics based on deep CNNs.
CN111723220B (en) Image retrieval method and device based on attention mechanism and Hash and storage medium
Seow et al. A comprehensive overview of Deepfake: Generation, detection, datasets, and opportunities
CN112434553B (en) Video identification method and system based on deep dictionary learning
Zhao et al. Disentangled representation learning and residual GAN for age-invariant face verification
Sungheetha et al. A novel CapsNet based image reconstruction and regression analysis
CN111079514A (en) Face recognition method based on CLBP and convolutional neural network
Ding et al. Noise-resistant network: a deep-learning method for face recognition under noise
Tabares-Soto et al. Digital media steganalysis
Chaudhuri Deep learning models for face recognition: A comparative analysis
Zhong et al. Complementing representation deficiency in few-shot image classification: A meta-learning approach
Xia et al. Domain fingerprints for no-reference image quality assessment
Qin et al. Label enhancement-based multiscale transformer for palm-vein recognition
CN110083734B (en) Semi-supervised image retrieval method based on self-coding network and robust kernel hash
Wang et al. Marginalized denoising dictionary learning with locality constraint
CN111291223B (en) Quadruplet convolutional neural network video fingerprint method
CN116383470B (en) Image searching method with privacy protection function
CN113160032A (en) Unsupervised multi-mode image conversion method based on generation countermeasure network
Ma et al. Enhancing the security of image steganography via multiple adversarial networks and channel attention modules
Xie et al. Evading generated-image detectors: A deep dithering approach
Yu et al. ICD-Face: Intra-class Compactness Distillation for Face Recognition
Du et al. Robust image hashing based on multi-view dimension reduction
CN113239917B (en) Robust face recognition method based on singular value decomposition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant