CN110738128A - repeated video detection method based on deep learning - Google Patents

Info

Publication number
CN110738128A
CN110738128A
Authority
CN
China
Prior art keywords
video
feature
layer
features
library
Prior art date
Legal status
Pending
Application number
CN201910888907.9A
Other languages
Chinese (zh)
Inventor
宋晓康
陈锦言
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University
Priority to CN201910888907.9A
Publication of CN110738128A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention discloses a repetitive video detection method based on deep learning, which comprises: extracting features from existing videos with a neural network and establishing a video feature library; extracting the features of the video to be detected; calculating the Euclidean distance between the video's features and the features in the library as a similarity measure; and marking the video as a repetitive video when the distance is smaller than a set threshold.

Description

Repeated video detection method based on deep learning
Technical Field
The invention belongs to the technical fields of computer vision, digital image processing and deep learning, and particularly relates to a repeated video detection method based on deep learning techniques.
Background
With the advent of the internet era, producing and distributing videos has become ever more convenient and video data is growing on a large scale, driven for example by the widespread use of short-video applications; shooting videos is gradually becoming a way for many people to share their lives, but a large number of repeated videos are inevitably produced at the same time. Video content has economic value, and practices such as profiting from pirated videos are common today: copyright-protected videos are modified and uploaded to video websites without the authorization of the video producers, creating copyright problems, damaging the producers' interests and exposing the websites to legal risk. Repeated videos also increase the bandwidth and storage costs of video websites.
Common repeated videos mainly arise from format conversion, the addition of subtitles and watermarks, compression, rotation, cropping and the like. Traditional file-hash detection produces the same hash value for video files with identical content and judges whether two files are the same video by comparing their hash values; however, this only detects videos whose content is completely unchanged, since a modified video file yields a very different hash value, so it cannot be used for repeated video detection. Digital image processing techniques are therefore needed to analyze the video content automatically and judge video similarity, for example segmentation- and graph-based sequence matching[1] or SURF-based hashing[2], but such approaches are less robust.
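As a concrete illustration of why whole-file hashing breaks down, the sketch below (an illustrative example, not part of the patent; the file names in the comments are hypothetical) computes an MD5 digest over a video file: any re-encoding, watermarking or trimming changes the byte stream and hence the digest, even though the visual content is nearly identical.

```python
# Illustrative only: whole-file hashing detects byte-identical copies,
# but any modification to the file changes the digest completely.
import hashlib

def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# file_md5("original.mp4") == file_md5("exact_copy.mp4")   -> True
# file_md5("original.mp4") == file_md5("watermarked.mp4")  -> False
```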
In recent years, the widespread use of convolutional neural networks for computer vision tasks has driven gains in accuracy across a range of vision tasks, and the visual features extracted by convolutional neural networks are generally more robust; however, few methods apply neural networks to repeated video detection.
References
[1] Liu H, Lu H, Xue X. A Segmentation and Graph-Based Video Sequence Matching Method for Video Copy Detection[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(8): 1706-1718.
[2] Yang G, Chen N, Jiang Q. A robust hashing algorithm based on SURF for video copy detection[J]. Computers & Security, 2012, 31(1): 33-39.
[3] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. Computer Science, 2014.
Disclosure of Invention
The invention first extracts features from video frames with a neural network, using the intermediate-layer outputs of the network as the feature representation of an image; it then fuses the features of all video frames into the feature representation of the video, and finally measures the similarity of different videos by the distance between their video features.
In order to solve the technical problem, the repetitive video detection method based on deep learning provided by the invention uses a neural network to extract features from existing videos and establish a video feature library, then extracts features from the video to be detected, calculates the Euclidean distance between the video's features and the features in the library as a similarity measure, and marks the video as a repetitive video when the distance is smaller than a set threshold.
The repeated video detection method comprises the following steps:
Step 1: acquire video frames from the existing video set to obtain the set of all video frames;
Step 2: extract features from the video frames using the intermediate layers of a convolutional neural network; the convolutional neural network uses the vgg16 network structure;
firstly, for each video, a video frame set S is acquired, and each frame in the set is scaled to a 3-channel image of size 224 × 224 as the input to the neural network; the intermediate-layer outputs of the network are used as video features, and for the vgg16 network structure the feature maps of the conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2 and conv5_3 layers are taken, 11 layers in total, all of which are convolutional-layer outputs; these layers use 3 × 3 convolution kernels with zero padding, and the kernels slide with a stride of 1 pixel;
the intermediate-layer feature outputs are combined to finally obtain a unique 4096-dimensional feature vector for each video;
Step 3: extract features from the video library V to obtain the video feature library Fv;
Step 4: in the retrieval stage, extract features from the video v to be retrieved;
Step 5: compare the features of the video v to be retrieved with the features in the video feature library, and if the condition is met, determine the video to be a repeated video, with the comparison and condition set as follows: at retrieval time, the distance d between different video features is calculated by a distance formula, the distance between videos i and j being d; a threshold t is set, and once the distance d is obtained, the video is judged to be a repeated video when d is smaller than t, and otherwise it is not.
Further, in step 1, the video frames are acquired from the existing video set, and the set of all video frames is obtained as follows:
V = (S^(1), S^(2), ..., S^(n)), S = (P^(1), P^(2), ..., P^(n)); where V is the set of frame sets of all videos, S is the set of frames of a single video, S^(n) is the frame set of the nth video, and P^(n) is the nth frame of a video.
In step 2, the intermediate-layer feature outputs are combined to finally obtain a unique 4096-dimensional feature vector for each video, as follows. The dimensions of the feature map output by each layer are:
dim F^(k) = W^(k) × W^(k) × C^(k), k = 1, 2, ..., 11   (1)
Equation (1) states that the feature map output by the kth layer has dimensions W^(k) × W^(k) × C^(k), where W^(k) × W^(k) is the spatial size of the kth-layer feature map and C^(k) is its number of channels;
the spatial dimensions of each feature map are then compressed:
FM^(k) = max(F^(k)), k = 1, 2, ..., 11   (2)
Equation (2) takes, for the kth-layer feature map F^(k), the maximum value of each channel, giving a C^(k)-dimensional vector representation FM^(k) whose length is the channel count C^(k);
concatenating the feature representations of all layers yields the feature representation FP_n of the entire video frame; the numbers of output channels of the selected convolutional layers are 128, 128, 256, 256, 256, 512, 512, 512, 512, 512 and 512, respectively, and the final feature dimension is the sum of the dimensions of these layers:
128+128+256+256+256+512+512+512+512+512+512=4096   (3)
That is, for each video frame P^(n) the extracted vector has size 4096, and the frames S = (P^(1), P^(2), ..., P^(n)) of a video give n 4096-dimensional vectors; these n vectors are averaged to obtain a 4096-dimensional vector T, which is then normalized to obtain the feature F(V^(n)) of the whole video V^(n);
The normalization formula is:
Tv = (T - μ) / σ   (4)
In formula (4), μ is the mean of the vector T, σ is its variance, and Tv is the final video feature vector; finally, a unique 4096-dimensional vector Tv_n is obtained for each video, representing the features of the nth video.
In step 3, features are extracted from the video library V in the manner of step 2 to obtain the video feature library Fv; in step 4, features are extracted from the video v to be retrieved in the manner of step 2.
In step 5, the distance between videos i and j is calculated by the following formula:
d(Tv_i, Tv_j) = sqrt( Σ_{k=1..4096} (Tv_i^(k) - Tv_j^(k))^2 )   (5)
compared with the prior art, the invention has the beneficial effects that:
in the repeated video detection method, the video characteristics to be detected are obtained without manually judging whether the video is repeated with the existing video library, and the video characteristics extracted by the deep neural network are directly compared with the existing video characteristics to judge whether the video is repeated. The method is different from the traditional method of extracting a plurality of frames from the video, and a feature library is established by selecting a frame sequence. Instead, a single profile is generated for each video by merging the intermediate layer outputs of the neural networks. The scheme provided by the embodiment of the invention utilizes a deep learning method to detect the repeated video, the video can be better represented through the features extracted by the deep neural network, and compared with the traditional image feature extraction operator, the accuracy of detecting the repeated video is higher.
Drawings
FIG. 1 is a flow chart of a method for detecting repeated video based on deep learning according to the present invention;
FIG. 2 is a schematic diagram of video frame feature extraction in the present invention.
Detailed Description
The invention is further illustrated in the following description with reference to the figures and the examples, which are not intended to limit the invention in any way.
The invention provides a repeated video detection method based on deep learning, which mainly comprises: extracting features from existing videos with a neural network and establishing a video feature library; extracting the features of the video to be detected; calculating the Euclidean distance between the video's features and the features in the library as a similarity measure; and marking the video as repeated when the distance is smaller than a set threshold. By extracting feature outputs from different levels of the neural network, the invention obtains feature representations ranging from low-level features to high-level semantics, and combining features from different levels yields a more accurate video feature representation.
As shown in FIG. 1, the repeated video detection method includes the following steps:
Step 1: acquire video frames from the existing video set to obtain the set of all video frames, where V denotes the set of frame sets of all videos and S the frame set of a single video:
V = (S^(1), S^(2), ..., S^(n)), where S^(n) is the frame set of the nth video;
S = (P^(1), P^(2), ..., P^(n)), where P^(n) is the nth frame of the video.
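A minimal sketch of this step follows, using OpenCV (which the embodiment names later). The function name video_frames and the one-frame-per-second sampling rate are assumptions; the patent does not specify how densely frames are sampled.

```python
# Sketch of step 1: sample frames P(1)..P(n) from one video with OpenCV.
# The one-frame-per-second rate is an assumption; the patent does not fix it.
import cv2

def video_frames(path, frames_per_second=1.0):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is unreadable
    step = max(int(round(fps / frames_per_second)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # Scale to the 224 x 224 3-channel input of step 2; convert
            # OpenCV's BGR order to RGB for the Keras VGG16 preprocessing.
            frames.append(cv2.cvtColor(cv2.resize(frame, (224, 224)),
                                       cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()
    return frames
```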
Step 2: using convolutional neural network vgg16[3]The middle layer extracts features from the video frame, vgg16 has a network structure as shown in the following table:
Layer            Output Shape       Param
input            (224, 224, 3)      0
conv1_1          (224, 224, 64)     1792
conv1_2          (224, 224, 64)     36928
pool1            (112, 112, 64)     0
conv2_1          (112, 112, 128)    73856
conv2_2          (112, 112, 128)    147584
pool2            (56, 56, 128)      0
conv3_1          (56, 56, 256)      295168
conv3_2          (56, 56, 256)      590080
conv3_3          (56, 56, 256)      590080
pool3            (28, 28, 256)      0
conv4_1          (28, 28, 512)      1180160
conv4_2          (28, 28, 512)      2359808
conv4_3          (28, 28, 512)      2359808
pool4            (14, 14, 512)      0
conv5_1          (14, 14, 512)      2359808
conv5_2          (14, 14, 512)      2359808
conv5_3          (14, 14, 512)      2359808
pool5            (7, 7, 512)        0
fc1              4096               102764544
fc2              4096               16781312
fc3 (softmax)    1000               4097000
Layer denotes the different layers of the neural network, Output Shape the output dimensions of each layer, and Param the number of parameters of each layer.
Using the intermediate-layer outputs of the neural network as video features, we take the feature maps of the conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2 and conv5_3 layers of vgg16, 11 layers in total, all of which are convolutional-layer outputs; these layers use 3 × 3 convolution kernels with zero padding, and the kernels slide with a stride of 1 pixel.
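A minimal sketch of this extractor using the Keras implementation of vgg16 shipped with TensorFlow (the framework the embodiment names) is shown below. The Keras layer names block2_conv1 ... block5_conv3 correspond to the patent's conv2_1 ... conv5_3; the use of ImageNet weights is an assumption, since the patent does not state how the network is trained.

```python
# Sketch: one Keras model whose 11 outputs are the feature maps
# F(1)..F(11) of equation (1) below.
import tensorflow as tf

LAYERS = ["block2_conv1", "block2_conv2",
          "block3_conv1", "block3_conv2", "block3_conv3",
          "block4_conv1", "block4_conv2", "block4_conv3",
          "block5_conv1", "block5_conv2", "block5_conv3"]

base = tf.keras.applications.VGG16(weights="imagenet",   # assumed weights
                                   include_top=False,
                                   input_shape=(224, 224, 3))
extractor = tf.keras.Model(inputs=base.input,
                           outputs=[base.get_layer(n).output for n in LAYERS])
```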
The feature map dimensions output by each layer are as follows:
dim F^(k) = W^(k) × W^(k) × C^(k), k = 1, 2, ..., 11   (1)
The above formula states that the feature map output by the kth layer has dimensions W^(k) × W^(k) × C^(k), where W^(k) × W^(k) is the spatial size of the kth-layer feature map and C^(k) is its number of channels.
Next, the spatial dimensions of each feature map are compressed:
FM^(k) = max(F^(k)), k = 1, 2, ..., 11   (2)
The above formula takes, for the kth-layer feature map F^(k), the maximum value of each channel, giving a C^(k)-dimensional vector representation FM^(k) whose length is the channel count C^(k).
Then the feature representations of all layers are concatenated, as shown in FIG. 2, yielding the feature representation FP_n of the entire video frame. As can be seen from the table in step 2, the numbers of output channels of the selected convolutional layers are: 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512. The final feature dimension is thus the sum of the dimensions of these layers:
128+128+256+256+256+512+512+512+512+512+512=4096   (3)
That is, for each video frame P^(n) the extracted vector has size 4096, and the frames S = (P^(1), P^(2), ..., P^(n)) of a video give n 4096-dimensional vectors; these n vectors are averaged to obtain a 4096-dimensional vector T, which is then normalized to obtain the feature F(V^(n)) of the whole video V^(n).
The normalization formula is:
Tv = (T - μ) / σ   (4)
Here μ is the mean of the vector T, σ is its variance, and Tv is the final video feature vector, with dimension 4096 for the whole set of video frames; finally, a unique 4096-dimensional vector Tv_n is obtained for each video, representing the features of the nth video.
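Equations (2)-(4) can be sketched as follows, reusing the extractor defined above. This is a non-authoritative sketch: σ is taken as the standard deviation (the text above says variance), and the VGG16 preprocessing call is an assumption.

```python
# Sketch of equations (2)-(4): channel-wise max pooling per layer,
# concatenation to 4096 dimensions, averaging over frames, normalization.
import numpy as np

def video_feature(frames):
    batch = tf.keras.applications.vgg16.preprocess_input(
        np.stack(frames).astype("float32"))
    outputs = extractor(batch)                            # 11 feature maps per frame
    per_frame = np.stack([
        np.concatenate([out[i].numpy().max(axis=(0, 1))   # FM(k), eq. (2)
                        for out in outputs])              # concatenation, eq. (3)
        for i in range(len(frames))])
    t = per_frame.mean(axis=0)                            # 4096-d vector T
    return (t - t.mean()) / t.std()                       # Tv = (T - mu) / sigma, eq. (4)
```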
Step 3: extract features from the video library V in the manner of step 2 to obtain the video feature library Fv.
Step 4: in the retrieval stage, for the video v to be retrieved, extract the feature Tv in the manner of step 2.
Step 5: compare Tv against the features in the video feature library; if the following condition is met, the video is a repeated video. The comparison and condition are set as follows:
At retrieval time, the distance between the features of videos i and j is d(Tv_i, Tv_j), calculated as:
d(Tv_i, Tv_j) = sqrt( Σ_{k=1..4096} (Tv_i^(k) - Tv_j^(k))^2 )   (5)
and setting a threshold value t, judging as a repeated video when d is smaller than t after the distance d is obtained, otherwise, judging as the repeated video.
In summary, the repeated video detection method based on deep learning of the present invention can be summarized as including a feature library establishment stage and a discrimination stage.
1. Feature library establishment stage: features are extracted from the existing videos through the neural network, and the intermediate-layer outputs of the network are combined into the feature representation of each video, so that every video corresponds to one 4096-dimensional feature file (FIG. 2 shows the feature extraction principle for a single video frame); these feature files are stored in a database to obtain the video feature library.
2. Discrimination stage: features are extracted from the video under examination by the same method, and the Euclidean distances to the existing video features in the library are calculated; if a distance is less than the preset threshold t (set to 0.3 in this embodiment), the video is judged to be repeated, otherwise it is not.
The network described above uses the TensorFlow deep learning framework; the video frame sequence is acquired via OpenCV and scaled to the input size required by the neural network. In this example, the feature representation of the entire video is obtained by merging the feature maps of the vgg16 intermediate layers.
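Putting the sketches above together, a hypothetical end-to-end run (file names are placeholders, not from the patent) would look like:

```python
# Hypothetical usage: build the feature library Fv, then test a new video.
library = {path: video_feature(video_frames(path))
           for path in ["existing_a.mp4", "existing_b.mp4"]}   # placeholders

query = video_feature(video_frames("uploaded.mp4"))
duplicates = [path for path, tv in library.items()
              if is_repeated(query, tv, t=0.3)]
print("repeated with:", duplicates)
```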
While the present invention has been described with reference to the accompanying drawings, it is not limited to the above embodiments, which are illustrative rather than restrictive; those skilled in the art may make various modifications without departing from the spirit of the invention, and such modifications fall within the protection scope of the claims.

Claims (6)

  1. A repetitive video detection method based on deep learning, characterized in that a neural network is used to extract features from existing videos and establish a video feature library; the features of the video to be detected are then extracted, the Euclidean distance between the video's features and the features in the library is calculated as a similarity measure, and the video is marked as a repetitive video when the distance is smaller than a set threshold.
  2. The method for detecting repeated video based on deep learning of claim 1, comprising the following steps:
    step 1: acquiring video frames from the existing video set to obtain the set of all video frames;
    step 2: extracting features from the video frames using the intermediate layers of a convolutional neural network, the convolutional neural network having the vgg16 network structure;
    firstly, for each video, a video frame set S is acquired, and each frame in the set is scaled to a 3-channel image of size 224 × 224 as the input to the neural network; the intermediate-layer outputs of the network are used as video features, and for the vgg16 network structure the feature maps of the conv2_1, conv2_2, conv3_1, conv3_2, conv3_3, conv4_1, conv4_2, conv4_3, conv5_1, conv5_2 and conv5_3 layers are taken, 11 layers in total, all of which are convolutional-layer outputs; these layers use 3 × 3 convolution kernels with zero padding, and the kernels slide with a stride of 1 pixel;
    combining the intermediate-layer feature outputs to finally obtain a unique 4096-dimensional feature vector for each video;
    step 3: extracting features from the video library V to obtain a video feature library Fv;
    step 4: in the retrieval stage, extracting features from the video v to be retrieved;
    step 5: comparing the features of the video v to be retrieved with the features in the video feature library, and if the condition is met, determining the video to be a repeated video, with the comparison and condition set as follows: at retrieval time, the distance d between different video features is calculated by a distance formula, the distance between videos i and j being d; a threshold t is set, and once the distance d is obtained, the video is judged to be a repeated video when d is smaller than t, and otherwise it is not.
  3. The method according to claim 2, wherein the video frames are acquired from the existing video set, and the set of all video frames is obtained as follows:
    V = (S^(1), S^(2), ..., S^(n)), S = (P^(1), P^(2), ..., P^(n));
    where V is the set of frame sets of all videos, S is the set of frames of a single video, S^(n) is the frame set of the nth video, and P^(n) is the nth frame of a video.
  4. The method for detecting repeated videos based on deep learning of claim 2, wherein in step 2, the process of combining the intermediate-layer feature outputs to finally obtain a unique 4096-dimensional feature vector for each video is:
    the feature map dimensions output by each layer are as follows:
    dim F^(k) = W^(k) × W^(k) × C^(k), k = 1, 2, ..., 11   (1)
    equation (1) states that the feature map output by the kth layer has dimensions W^(k) × W^(k) × C^(k), where W^(k) × W^(k) is the spatial size of the kth-layer feature map and C^(k) is its number of channels;
    the spatial dimensions of each feature map are then compressed:
    FM^(k) = max(F^(k)), k = 1, 2, ..., 11   (2)
    equation (2) takes, for the kth-layer feature map F^(k), the maximum value of each channel, giving a C^(k)-dimensional vector representation FM^(k) whose length is the channel count C^(k);
    concatenating the feature representations of all layers yields the feature representation FP_n of the entire video frame; the numbers of output channels of the selected convolutional layers are 128, 128, 256, 256, 256, 512, 512, 512, 512, 512 and 512, respectively, and the final feature dimension is the sum of the dimensions of these layers:
    128+128+256+256+256+512+512+512+512+512+512=4096   (3)
    that is, for each video frame P^(n) the extracted vector has size 4096, and the frames S = (P^(1), P^(2), ..., P^(n)) of a video give n 4096-dimensional vectors; these n vectors are averaged to obtain a 4096-dimensional vector T, which is then normalized to obtain the feature F(V^(n)) of the whole video V^(n);
    the normalization formula is:
    Tv = (T - μ) / σ   (4)
    in formula (4), μ is the mean of the vector T, σ is its variance, and Tv is the final video feature vector; finally, a unique 4096-dimensional vector Tv_n is obtained for each video, representing the features of the nth video.
  5. The repeated video detection method based on deep learning of claim 4, wherein in step 3, features are extracted from the video library V in the manner of step 2 to obtain a video feature library Fv; and in step 4, features are extracted from the video v to be retrieved in the manner of step 2.
  6. The method for detecting repeated video based on deep learning of claim 5, wherein in step 5, the distance between videos i and j is calculated by the following formula:
    d(Tv_i, Tv_j) = sqrt( Σ_{k=1..4096} (Tv_i^(k) - Tv_j^(k))^2 )
CN201910888907.9A 2019-09-19 2019-09-19 repeated video detection method based on deep learning Pending CN110738128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910888907.9A CN110738128A (en) 2019-09-19 2019-09-19 repeated video detection method based on deep learning

Publications (1)

Publication Number Publication Date
CN110738128A 2020-01-31

Family

ID=69268280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910888907.9A Pending CN110738128A (en) 2019-09-19 2019-09-19 repeated video detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN110738128A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701480A (en) * 2016-02-26 2016-06-22 江苏科海智能系统有限公司 Video semantic analysis method
CN105913456A (en) * 2016-04-12 2016-08-31 西安电子科技大学 Video significance detecting method based on area segmentation
CN106446015A (en) * 2016-08-29 2017-02-22 北京工业大学 Video content access prediction and recommendation method based on user behavior preference
CN106991373A (en) * 2017-03-02 2017-07-28 中国人民解放军国防科学技术大学 A kind of copy video detecting method based on deep learning and graph theory
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN108764019A (en) * 2018-04-03 2018-11-06 天津大学 A kind of Video Events detection method based on multi-source deep learning
CN108848422A (en) * 2018-04-19 2018-11-20 清华大学 A kind of video abstraction generating method based on target detection
CN109341703A (en) * 2018-09-18 2019-02-15 北京航空航天大学 A kind of complete period uses the vision SLAM algorithm of CNNs feature detection
CN109766823A (en) * 2019-01-07 2019-05-17 浙江大学 A kind of high-definition remote sensing ship detecting method based on deep layer convolutional neural networks
CN109815364A (en) * 2019-01-18 2019-05-28 上海极链网络科技有限公司 A kind of massive video feature extraction, storage and search method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
汪冬冬 (Wang Dongdong): "Research on slowly varying visual feature learning algorithms based on deep models" (in Chinese) *
赵义萱 (Zhao Yixuan): "Near-duplicate video detection based on short videos" (in Chinese) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506773A (en) * 2020-03-24 2020-08-07 中国科学院大学 Video duplicate removal method based on unsupervised depth twin network
CN111723692A (en) * 2020-06-03 2020-09-29 西安交通大学 Near-repetitive video detection method based on label features of convolutional neural network semantic classification
CN111723692B (en) * 2020-06-03 2022-08-09 西安交通大学 Near-repetitive video detection method based on label features of convolutional neural network semantic classification
CN112399236A (en) * 2020-10-09 2021-02-23 北京达佳互联信息技术有限公司 Video duplicate checking method and device and electronic equipment
WO2022086767A1 (en) * 2020-10-22 2022-04-28 Micron Technology, Inc. Accelerated video processing for feature recognition via an artificial neural network configured in a data storage device
US11741710B2 (en) 2020-10-22 2023-08-29 Micron Technology, Inc. Accelerated video processing for feature recognition via an artificial neural network configured in a data storage device
CN112528856A (en) * 2020-12-10 2021-03-19 天津大学 Repeated video detection method based on characteristic frame
CN112528856B (en) * 2020-12-10 2022-04-15 天津大学 Repeated video detection method based on characteristic frame
US11599856B1 (en) 2022-01-24 2023-03-07 My Job Matcher, Inc. Apparatuses and methods for parsing and comparing video resume duplications

Similar Documents

Publication Publication Date Title
CN110738128A (en) repeated video detection method based on deep learning
US8587668B2 (en) Method and apparatus for detecting near duplicate videos using perceptual video signatures
WO2021129435A1 (en) Method for training video definition evaluation model, video recommendation method, and related device
Lu et al. Selection of image features for steganalysis based on the Fisher criterion
Li et al. Steganalysis over large-scale social networks with high-order joint features and clustering ensembles
US8203554B2 (en) Method and apparatus for identifying visual content foregrounds
CN111325271B (en) Image classification method and device
Li et al. Sparse representation-based image quality index with adaptive sub-dictionaries
US20090263014A1 (en) Content fingerprinting for video and/or image
CN104661037B (en) The detection method and system that compression image quantization table is distorted
CN110457996B (en) Video moving object tampering evidence obtaining method based on VGG-11 convolutional neural network
Abdulrahman et al. Color image stegananalysis using correlations between RGB channels
Xie et al. Bag-of-words feature representation for blind image quality assessment with local quantized pattern
CN106503112B (en) Video retrieval method and device
Kang et al. Color Image Steganalysis Based on Residuals of Channel Differences.
CN111723692B (en) Near-repetitive video detection method based on label features of convolutional neural network semantic classification
CN107545570A (en) A kind of reconstructed image quality evaluation method of half reference chart
CN102547477B (en) Video fingerprint method based on contourlet transformation model
Li et al. Quality evaluation for image retargeting with instance semantics
CN110188625B (en) Video fine structuring method based on multi-feature fusion
Yang et al. No‐reference image quality assessment via structural information fluctuation
Nie et al. Robust video hashing based on representative-dispersive frames
CN110705499B (en) Crowd counting method based on transfer learning
CN102881008A (en) Circular loop statistic characteristic-based anti-rotation image Hash method
CN114005069A (en) Video feature extraction and retrieval method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20200131