CN111709945B - Video copy detection method based on depth local features - Google Patents

Video copy detection method based on depth local features

Info

Publication number
CN111709945B
CN111709945B (application CN202010691138.6A)
Authority
CN
China
Prior art keywords
video
feature
fusion
layer
extracting
Prior art date
Legal status: Active (assumed; not a legal conclusion)
Application number
CN202010691138.6A
Other languages
Chinese (zh)
Other versions
CN111709945A (en)
Inventor
贾宇
张家亮
董文杰
曹亮
Current Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
Original Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Wanglian Anrui Network Technology Co., Ltd.
Priority to CN202010691138.6A
Publication of CN111709945A
Application granted
Publication of CN111709945B


Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/40 — Extraction of image or video features
    • G06T 2207/10016 — Image acquisition modality: video; image sequence
    • G06T 2207/20081 — Special algorithmic details: training; learning
    • G06T 2207/20084 — Special algorithmic details: artificial neural networks [ANN]
    • Y02T 10/40 — Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video copy detection method based on deep local features, comprising the following steps: (1) extracting frame images from the video data, then constructing an image pyramid at different scales; (2) constructing a deep convolutional neural network model, extracting feature maps from the input image pyramid, and fusing them to obtain a fusion feature map; (3) training the deep convolutional neural network model by metric learning; (4) extracting a fusion feature map from an image pyramid with the trained model; (5) extracting key points from the fusion feature map by non-maximum suppression, and extracting the corresponding local features at those key points; (6) performing video copy detection based on the local features. The method extracts features faster and yields more discriminative local features than traditional algorithms, so it can accurately detect copied videos under a variety of complex transformations and is highly robust.

Description

Video copy detection method based on depth local features
Technical Field
The invention relates to the technical field of multimedia information processing, and in particular to a video copy detection method based on deep local features.
Background
In today's mobile internet era, the complexity of multimedia video data, the emergence of all kinds of video editing software, and the wide variety of sources make it ever harder to prevent the unchecked spread of tampered video data. Network supervision departments that want to supervise online multimedia video data effectively cannot rely on manual review and user reports alone.
Current solutions use traditional image processing or global feature extraction. Traditional image-processing algorithms are inefficient and inaccurate, while global features handle ordinarily edited video well but cannot be expected to cope with video that has undergone complex transformations. Both approaches therefore fall short for today's internet multimedia video.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems, a video copy detection method based on deep local features is provided.
The technical solution adopted by the invention is as follows:
A video copy detection method based on deep local features comprises the following steps:
(1) extracting frame images from the video data, then constructing an image pyramid at different scales;
(2) constructing a deep convolutional neural network model, extracting feature maps from the input image pyramid, and fusing the feature maps to obtain a fusion feature map;
(3) training the deep convolutional neural network model by metric learning;
(4) extracting a fusion feature map from an image pyramid using the trained deep convolutional neural network model;
(5) extracting key points from the fusion feature map by non-maximum suppression, and extracting the corresponding local features at those key points;
(6) performing video copy detection based on the local features.
Further, the deep convolutional neural network model is a fully convolutional model comprising n-1 convolutional layers and one fusion convolutional layer, wherein:
convolutional layers n-i through n-1 extract feature maps from the input image pyramid;
the fusion convolutional layer fuses the feature maps extracted by layers n-i through n-1 to obtain a fusion feature map, where 2 ≤ i ≤ n-1 and both i and n are integers.
Further, convolutional layers n-i through n-1 each have 128 convolution channels.
Further, layer n-1 uses a 1×1 convolution kernel to convolve the feature map down to a size of 1×1, and the feature map output by this layer serves as the global feature for model training.
Further, step (6) comprises the following sub-steps:
(6.1) obtaining the local features of the library videos through steps (1)-(5);
(6.2) obtaining the local features of the video to be detected through steps (1)-(5);
(6.3) performing random-consistency spatial verification between the local features of the video to be detected and those of the library videos, and filtering out spurious matching points;
(6.4) computing the similarity from the remaining matching points;
(6.5) ranking the similarity results to obtain the source video data.
Preferably, the similarity is computed as a vector inner product.
Preferably, the frame images extracted from the video data in step (1) are key frame images.
In summary, by adopting the above technical solution, the invention has the following beneficial effects:
The invention extracts a fusion feature map with a deep convolutional neural network model and obtains key points by non-maximum suppression, extracting efficient local features that comprehensively describe each video frame image. Compared with traditional local feature extraction algorithms, extraction is faster and the local features are more discriminative, so copied videos under a variety of complex transformations can be detected accurately and robustly, giving network supervision departments a practical technical solution for policing the large volume of tampered multimedia video spread across the internet.
Drawings
To explain the technical solutions of the embodiments more clearly, the drawings needed in the embodiments are briefly described below. The following drawings illustrate only some embodiments of the invention and should not be regarded as limiting its scope; a person skilled in the art may obtain other related drawings from them without inventive effort.
Fig. 1 is a block flow diagram of a video copy detection method based on depth local features according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a deep convolutional neural network model in accordance with an embodiment of the present invention.
FIG. 3 is a schematic diagram of key point and local feature extraction of the present invention.
Fig. 4 is a diagram showing the effect of video copy detection according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and embodiments, so that its objects, technical solutions, and advantages are clearer. The particular embodiments described here are illustrative only and do not limit the invention: they are some, not all, of its possible embodiments. The components of the embodiments, as generally described and illustrated in the figures, could be arranged and designed in many different configurations. The detailed description below therefore represents selected embodiments rather than limiting the claimed scope; all other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of the invention.
The technologies involved in the invention are described first:
Convolutional neural networks (CNNs) are feedforward neural networks with a deep structure that employ convolution operations, and are among the representative algorithms of deep learning.
Metric learning is a core algorithm in tasks such as fine-grained classification, retrieval, and face recognition; through training, it can learn subtle distinctions between images.
The features and capabilities of the invention are described in further detail below with reference to the embodiment.
As shown in Fig. 1, the video copy detection method based on deep local features provided by this embodiment comprises the following steps:
S1. Extract frame images from the video data and construct an image pyramid at different scales.
Video data is a collection of images over time, so a video can be processed by extracting frame images; but because sampling frames on the time axis produces much redundant information, it is preferable to extract key frame images from the video data. Key frame extraction exploits the correlation between video frames and keeps only one representative of each group of similar frames, which reduces redundancy and improves the visual expressiveness of the extracted frames. For example, key frame extraction judges characteristics such as color, texture, and structure from the format and content of each video frame, filters out similar pictures, and ensures that only one frame is kept per scene; this is prior art and is not repeated here.
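By way of illustration, the following Python sketch shows one common way a step of this kind is implemented; the fixed sampling interval (a crude stand-in for true key-frame selection) and the pyramid scales are assumptions, not values taken from the patent.

```python
# A minimal sketch of step S1, assuming OpenCV is available.
import cv2


def extract_frames(video_path, every_n=30):
    """Sample one frame every `every_n` frames as a stand-in for key-frame selection."""
    frames, cap = [], cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames


def build_pyramid(image, scales=(1.0, 0.75, 0.5, 0.25)):
    """Construct an image pyramid by resizing the frame to several scales."""
    h, w = image.shape[:2]
    return [cv2.resize(image, (int(w * s), int(h * s))) for s in scales]
```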
S2. Construct a deep convolutional neural network model, extract feature maps from the input image pyramid, and fuse them to obtain a fusion feature map.
As shown in Fig. 2, the deep convolutional neural network model is a fully convolutional model comprising n-1 convolutional layers and one fusion convolutional layer, with no pooling layers, so that the original image information is preserved as far as possible; wherein:
convolutional layers n-i through n-1 extract feature maps from the input image pyramid;
the fusion convolutional layer fuses the feature maps extracted by layers n-i through n-1 to obtain a fusion feature map, where 2 ≤ i ≤ n-1 and both i and n are integers. That is, the fusion convolutional layer fuses the feature maps of the last several convolutional layers.
In some embodiments, convolutional layers n-i through n-1 each have 128 channels, so that the dimensionality of the subsequently extracted local features stays at 128; the feature maps extracted by these layers are normalized to a common scale, which strengthens the information in the fusion feature map.
In some embodiments, layer n-1 uses a 1×1 convolution kernel to convolve the feature map down to a size of 1×1, and the feature map output by this layer serves as the global feature for model training. A sketch of such a model follows.
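A minimal PyTorch sketch of a fully convolutional model with a fusion convolutional layer in the spirit of Fig. 2 might look as follows; the layer counts, strides, bilinear rescaling, and the pooled stand-in for the 1×1 global-feature output are all illustrative assumptions rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionFCN(nn.Module):
    def __init__(self, n_fused=3):
        super().__init__()
        # Early layers: ordinary convolutions, no pooling (original information kept).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Layers n-i .. n-1: each keeps 128 channels so local features stay 128-d.
        self.tail = nn.ModuleList(
            nn.Sequential(nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(n_fused)
        )
        # Fusion layer: 1x1 conv over the concatenated maps -> fusion feature map.
        self.fuse = nn.Conv2d(128 * n_fused, 128, 1)

    def forward(self, x):
        x = self.stem(x)
        maps = []
        for layer in self.tail:
            x = layer(x)
            maps.append(x)
        # Normalize all tail feature maps to one spatial scale before fusing.
        size = maps[-1].shape[-2:]
        maps = [F.interpolate(m, size=size, mode="bilinear", align_corners=False)
                for m in maps]
        fused = self.fuse(torch.cat(maps, dim=1))             # fusion feature map
        # Pooled stand-in for the patent's 1x1-output layer used as global feature.
        global_feat = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return fused, global_feat
```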
S3. Train the deep convolutional neural network model by metric learning.
Metric learning is adopted so that the model learns the subtle differences between images, improving detection accuracy. Specifically, the ArcFace loss, which incorporates angular information, is used; unlike the traditional triplet loss, it lets the model converge more easily and learn richer information.
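For reference, a compact sketch of the ArcFace loss named above, following the published ArcFace formulation; the scale and margin values are common defaults, not values from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArcFaceLoss(nn.Module):
    def __init__(self, feat_dim=128, num_classes=1000, scale=64.0, margin=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale, self.margin = scale, margin

    def forward(self, features, labels):
        # Cosine similarity between L2-normalized features and class centers.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target-class logit.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```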
S4. Extract the fusion feature map from the image pyramid with the trained deep convolutional neural network model.
S5. As shown in Fig. 3, extract key points from the fusion feature map by non-maximum suppression, and extract the corresponding local features at those key points (a sketch follows).
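One way to read step S5 is that key points are the local maxima of the fusion feature map's activation strength; the sketch below makes that assumption, and its window size and score threshold are illustrative.

```python
import torch
import torch.nn.functional as F


def extract_local_features(fused, window=3, threshold=0.1):
    """fused: (1, C, H, W) fusion feature map -> (K, C) local features and (K, 2) positions."""
    score = fused.norm(dim=1, keepdim=True)            # per-location activation strength
    peak = F.max_pool2d(score, window, stride=1, padding=window // 2)
    keep = (score == peak) & (score > threshold)       # non-maximum suppression
    ys, xs = keep[0, 0].nonzero(as_tuple=True)
    feats = fused[0, :, ys, xs].t()                    # one 128-d feature per key point
    return F.normalize(feats, dim=1), torch.stack([ys, xs], dim=1)
```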
s6, video copy detection is carried out according to the local characteristics:
s61, obtaining local characteristics of the library video through the steps S1-S5, wherein the local characteristics can be understood as a local characteristic library of the library video which is pre-configured and used for detecting the video to be detected subsequently;
s62, the video to be detected is subjected to steps S1-S5 to obtain local characteristics of the video; if the library video is to construct a pyramid for the key frame image and acquire local features, the video to be detected also needs to construct a pyramid for the key frame image and acquire local features;
s63, carrying out random consistency space verification (RANSAC) on the local features of the video to be detected and the local features of the library video, and filtering out irrelevant matching points;
s64, calculating the similarity according to the residual matching points by adopting a vector inner product mode;
s65, sorting the similarity calculation results to obtain source video data results, as shown in FIG. 4.
As can be seen from the above, the invention has the following beneficial effects:
The invention extracts a fusion feature map with a deep convolutional neural network model and obtains key points by non-maximum suppression, extracting efficient local features that comprehensively describe each video frame image. Compared with traditional local feature extraction algorithms, extraction is faster and the local features are more discriminative, so copied videos under a variety of complex transformations can be detected accurately and robustly, giving network supervision departments a practical technical solution for policing the large volume of tampered multimedia video spread across the internet.
The foregoing description of preferred embodiments is not intended to limit the invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention falls within its scope of protection.

Claims (6)

1. A video copy detection method based on deep local features, characterized by comprising the following steps:
(1) extracting frame images from the video data, then constructing an image pyramid at different scales;
(2) constructing a deep convolutional neural network model, extracting feature maps from the input image pyramid, and fusing the feature maps to obtain a fusion feature map;
(3) training the deep convolutional neural network model by metric learning;
(4) extracting a fusion feature map from an image pyramid using the trained deep convolutional neural network model;
(5) extracting key points from the fusion feature map by non-maximum suppression, and extracting the corresponding local features at those key points;
(6) performing video copy detection based on the local features;
wherein step (6) comprises the following sub-steps:
(6.1) obtaining the local features of the library videos through steps (1)-(5);
(6.2) obtaining the local features of the video to be detected through steps (1)-(5);
(6.3) performing random-consistency spatial verification between the local features of the video to be detected and those of the library videos, and filtering out spurious matching points;
(6.4) computing the similarity from the remaining matching points;
(6.5) ranking the similarity results to obtain the source video data.
2. The video copy detection method based on deep local features of claim 1, wherein the deep convolutional neural network model is a fully convolutional model comprising n-1 convolutional layers and one fusion convolutional layer; wherein
convolutional layers n-i through n-1 extract feature maps from the input image pyramid; and
the fusion convolutional layer fuses the feature maps extracted by layers n-i through n-1 to obtain a fusion feature map, where 2 ≤ i ≤ n-1 and both i and n are integers.
3. The video copy detection method based on deep local features of claim 2, wherein convolutional layers n-i through n-1 each have 128 convolution channels.
4. The video copy detection method based on deep local features of claim 2, wherein layer n-1 uses a 1×1 convolution kernel to convolve the feature map down to a size of 1×1, and the feature map output by this layer serves as the global feature for model training.
5. The video copy detection method based on deep local features of claim 1, wherein the similarity is computed as a vector inner product.
6. The method of any one of claims 1-5, wherein the frame images extracted from the video data in step (1) are key frame images.
CN202010691138.6A 2020-07-17 2020-07-17 Video copy detection method based on depth local features Active CN111709945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010691138.6A CN111709945B (en) 2020-07-17 2020-07-17 Video copy detection method based on depth local features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010691138.6A CN111709945B (en) 2020-07-17 2020-07-17 Video copy detection method based on depth local features

Publications (2)

Publication Number Publication Date
CN111709945A CN111709945A (en) 2020-09-25
CN111709945B true CN111709945B (en) 2023-06-30

Family

ID=72546636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010691138.6A Active CN111709945B (en) 2020-07-17 2020-07-17 Video copy detection method based on depth local features

Country Status (1)

Country Link
CN (1) CN111709945B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI776668B (en) * 2021-09-07 2022-09-01 台達電子工業股份有限公司 Image processing method and image processing system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845499A (en) * 2017-01-19 2017-06-13 清华大学 A kind of image object detection method semantic based on natural language
CN111275044A (en) * 2020-02-21 2020-06-12 西北工业大学 Weak supervision target detection method based on sample selection and self-adaptive hard case mining

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376003B (en) * 2013-08-13 2019-07-05 深圳市腾讯计算机系统有限公司 A kind of video retrieval method and device
CN108229488B (en) * 2016-12-27 2021-01-01 北京市商汤科技开发有限公司 Method and device for detecting key points of object and electronic equipment
CN106991373A (en) * 2017-03-02 2017-07-28 中国人民解放军国防科学技术大学 A kind of copy video detecting method based on deep learning and graph theory
CN108197566B (en) * 2017-12-29 2022-03-25 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN113569797B (en) * 2018-11-16 2024-05-21 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN110781350B (en) * 2019-09-26 2022-07-22 武汉大学 Pedestrian retrieval method and system oriented to full-picture monitoring scene
CN111126412B (en) * 2019-11-22 2023-04-18 复旦大学 Image key point detection method based on characteristic pyramid network
CN111241338B (en) * 2020-01-08 2023-09-15 深圳市网联安瑞网络科技有限公司 Depth feature fusion video copy detection method based on attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845499A (en) * 2017-01-19 2017-06-13 清华大学 A kind of image object detection method semantic based on natural language
CN111275044A (en) * 2020-02-21 2020-06-12 西北工业大学 Weak supervision target detection method based on sample selection and self-adaptive hard case mining

Also Published As

Publication number Publication date
CN111709945A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Ji et al. Semi-supervised adversarial monocular depth estimation
WO2022000420A1 (en) Human body action recognition method, human body action recognition system, and device
Cai et al. FCSR-GAN: Joint face completion and super-resolution via multi-task learning
CN112541864A (en) Image restoration method based on multi-scale generation type confrontation network model
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111241338B (en) Depth feature fusion video copy detection method based on attention mechanism
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
Joseph et al. C4synth: Cross-caption cycle-consistent text-to-image synthesis
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN112084952B (en) Video point location tracking method based on self-supervision training
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
CN116168329A (en) Video motion detection method, equipment and medium based on key frame screening pixel block
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN116935486A (en) Sign language identification method and system based on skeleton node and image mode fusion
CN111709945B (en) Video copy detection method based on depth local features
Li et al. A discriminative self‐attention cycle GAN for face super‐resolution and recognition
Zheng et al. Pose flow learning from person images for pose guided synthesis
Huang et al. Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention
CN114998814B (en) Target video generation method and device, computer equipment and storage medium
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
LU101933B1 (en) Human action recognition method, human action recognition system and equipment
CN111047571B (en) Image salient target detection method with self-adaptive selection training process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220517

Address after: 518000 22nd floor, building C, Shenzhen International Innovation Center (Futian science and Technology Plaza), No. 1006, Shennan Avenue, Xintian community, Huafu street, Futian District, Shenzhen, Guangdong Province

Applicant after: Shenzhen wanglian Anrui Network Technology Co.,Ltd.

Address before: Floor 4-8, unit 5, building 1, 333 Yunhua Road, high tech Zone, Chengdu, Sichuan 610041

Applicant before: CHENGDU 30KAITIAN COMMUNICATION INDUSTRY Co.,Ltd.

GR01 Patent grant