CN111709945B - Video copy detection method based on depth local features - Google Patents

Video copy detection method based on depth local features

Info

Publication number
CN111709945B
CN111709945B (application CN202010691138.6A)
Authority
CN
China
Prior art keywords
video
feature
fusion
layer
extracting
Prior art date
Legal status: Active (assumed; not a legal conclusion)
Application number
CN202010691138.6A
Other languages
Chinese (zh)
Other versions
CN111709945A (en)
Inventor
贾宇
张家亮
董文杰
曹亮
Current Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
Original Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Wanglian Anrui Network Technology Co., Ltd.
Priority to CN202010691138.6A
Publication of CN111709945A
Application granted
Publication of CN111709945B


Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/40 — Extraction of image or video features
    • G06T 2207/10016 — Image acquisition modality: video; image sequence
    • G06T 2207/20081 — Special algorithmic details: training; learning
    • G06T 2207/20084 — Special algorithmic details: artificial neural networks [ANN]
    • Y02T 10/40 — Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video copy detection method based on deep local features, comprising the following steps: (1) extracting frame images from the video data, then constructing an image pyramid at different scales; (2) constructing a deep convolutional neural network model, extracting feature maps from the input image pyramid, and fusing them to obtain a fusion feature map; (3) training the deep convolutional neural network model by metric learning; (4) extracting a fusion feature map from an image pyramid with the trained model; (5) extracting key points from the fusion feature map by non-maximum suppression, and extracting the corresponding local features at those key points; (6) performing video copy detection based on the local features. The method extracts features faster and yields more discriminative local features than traditional algorithms, so it can accurately detect copied videos under a variety of complex transformations and is highly robust.

Description

Video copy detection method based on depth local features
Technical Field
The invention relates to the technical field of multimedia information processing, and in particular to a video copy detection method based on deep local features.
Background
In today's mobile internet era, the complexity of multimedia video data, the emergence of all kinds of video editing software, and the wide variety of sources make it ever harder to prevent the unchecked spread of tampered video data. Network supervision departments that want to supervise online multimedia video data effectively cannot rely on manual review and user reports alone.
Current solutions use traditional image processing or global feature extraction. Traditional image-processing algorithms are inefficient and inaccurate, while global features handle ordinarily edited video well but cannot be expected to cope with video that has undergone complex transformations. Both approaches therefore fall short for today's internet multimedia video.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems, a video copy detection method based on deep local features is provided.
The technical solution adopted by the invention is as follows:
A video copy detection method based on deep local features comprises the following steps:
(1) extracting frame images from the video data, then constructing an image pyramid at different scales;
(2) constructing a deep convolutional neural network model, extracting feature maps from the input image pyramid, and fusing the feature maps to obtain a fusion feature map;
(3) training the deep convolutional neural network model by metric learning;
(4) extracting a fusion feature map from an image pyramid using the trained deep convolutional neural network model;
(5) extracting key points from the fusion feature map by non-maximum suppression, and extracting the corresponding local features at those key points;
(6) performing video copy detection based on the local features.
Further, the deep convolutional neural network model is a fully convolutional model comprising n-1 convolutional layers and one fusion convolutional layer, wherein:
convolutional layers n-i through n-1 extract feature maps from the input image pyramid;
the fusion convolutional layer fuses the feature maps extracted by layers n-i through n-1 to obtain a fusion feature map, where 2 ≤ i ≤ n-1 and both i and n are integers.
Further, convolutional layers n-i through n-1 each have 128 convolution channels.
Further, layer n-1 uses a 1×1 convolution kernel to convolve the feature map down to a size of 1×1, and the feature map output by this layer serves as the global feature for model training.
Further, step (6) comprises the following sub-steps:
(6.1) obtaining the local features of the library videos through steps (1)-(5);
(6.2) obtaining the local features of the video to be detected through steps (1)-(5);
(6.3) performing random-consistency spatial verification between the local features of the video to be detected and those of the library videos, and filtering out spurious matching points;
(6.4) computing the similarity from the remaining matching points;
(6.5) ranking the similarity results to obtain the source video data.
Preferably, the similarity is computed as a vector inner product.
Preferably, the frame images extracted from the video data in step (1) are key frame images.
In summary, by adopting the above technical solution, the invention has the following beneficial effects:
The invention extracts a fusion feature map with a deep convolutional neural network model and obtains key points by non-maximum suppression, extracting efficient local features that comprehensively describe each video frame image. Compared with traditional local feature extraction algorithms, extraction is faster and the local features are more discriminative, so copied videos under a variety of complex transformations can be detected accurately and robustly, giving network supervision departments a practical technical solution for policing the large volume of tampered multimedia video spread across the internet.
Drawings
To explain the technical solutions of the embodiments more clearly, the drawings needed in the embodiments are briefly described below. The following drawings illustrate only some embodiments of the invention and should not be regarded as limiting its scope; a person skilled in the art may obtain other related drawings from them without inventive effort.
Fig. 1 is a block flow diagram of a video copy detection method based on depth local features according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a deep convolutional neural network model in accordance with an embodiment of the present invention.
FIG. 3 is a schematic diagram of key point and local feature extraction of the present invention.
Fig. 4 is a diagram showing the effect of video copy detection according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and embodiments, so that its objects, technical solutions, and advantages are clearer. The particular embodiments described here are illustrative only and do not limit the invention: they are some, not all, of its possible embodiments. The components of the embodiments, as generally described and illustrated in the figures, could be arranged and designed in many different configurations. The detailed description below therefore represents selected embodiments rather than limiting the claimed scope; all other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of the invention.
The technologies involved in the invention are described first:
Convolutional neural networks (CNNs) are feedforward neural networks with a deep structure that employ convolution operations, and are among the representative algorithms of deep learning.
Metric learning is a core algorithm in tasks such as fine-grained classification, retrieval, and face recognition; through training, it can learn subtle distinctions between images.
The features and capabilities of the invention are described in further detail below with reference to the embodiment.
As shown in Fig. 1, the video copy detection method based on deep local features provided by this embodiment comprises the following steps:
S1. Extract frame images from the video data and construct an image pyramid at different scales.
Video data is a collection of images over time, so a video can be processed by extracting frame images; but because sampling frames on the time axis produces much redundant information, it is preferable to extract key frame images from the video data. Key frame extraction exploits the correlation between video frames and keeps only one representative of each group of similar frames, which reduces redundancy and improves the visual expressiveness of the extracted frames. For example, key frame extraction judges characteristics such as color, texture, and structure from the format and content of each video frame, filters out similar pictures, and ensures that only one frame is kept per scene; this is prior art and is not repeated here.
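By way of illustration, the following Python sketch shows one common way a step of this kind is implemented; the fixed sampling interval (a crude stand-in for true key-frame selection) and the pyramid scales are assumptions, not values taken from the patent.

```python
# A minimal sketch of step S1, assuming OpenCV is available.
import cv2


def extract_frames(video_path, every_n=30):
    """Sample one frame every `every_n` frames as a stand-in for key-frame selection."""
    frames, cap = [], cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames


def build_pyramid(image, scales=(1.0, 0.75, 0.5, 0.25)):
    """Construct an image pyramid by resizing the frame to several scales."""
    h, w = image.shape[:2]
    return [cv2.resize(image, (int(w * s), int(h * s))) for s in scales]
```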
S2. Construct a deep convolutional neural network model, extract feature maps from the input image pyramid, and fuse them to obtain a fusion feature map.
As shown in Fig. 2, the deep convolutional neural network model is a fully convolutional model comprising n-1 convolutional layers and one fusion convolutional layer, with no pooling layers, so that the original image information is preserved as far as possible; wherein:
convolutional layers n-i through n-1 extract feature maps from the input image pyramid;
the fusion convolutional layer fuses the feature maps extracted by layers n-i through n-1 to obtain a fusion feature map, where 2 ≤ i ≤ n-1 and both i and n are integers. That is, the fusion convolutional layer fuses the feature maps of the last several convolutional layers.
In some embodiments, convolutional layers n-i through n-1 each have 128 channels, so that the dimensionality of the subsequently extracted local features stays at 128; the feature maps extracted by these layers are normalized to a common scale, which strengthens the information in the fusion feature map.
In some embodiments, layer n-1 uses a 1×1 convolution kernel to convolve the feature map down to a size of 1×1, and the feature map output by this layer serves as the global feature for model training. A sketch of such a model follows.
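A minimal PyTorch sketch of a fully convolutional model with a fusion convolutional layer in the spirit of Fig. 2 might look as follows; the layer counts, strides, bilinear rescaling, and the pooled stand-in for the 1×1 global-feature output are all illustrative assumptions rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionFCN(nn.Module):
    def __init__(self, n_fused=3):
        super().__init__()
        # Early layers: ordinary convolutions, no pooling (original information kept).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Layers n-i .. n-1: each keeps 128 channels so local features stay 128-d.
        self.tail = nn.ModuleList(
            nn.Sequential(nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(n_fused)
        )
        # Fusion layer: 1x1 conv over the concatenated maps -> fusion feature map.
        self.fuse = nn.Conv2d(128 * n_fused, 128, 1)

    def forward(self, x):
        x = self.stem(x)
        maps = []
        for layer in self.tail:
            x = layer(x)
            maps.append(x)
        # Normalize all tail feature maps to one spatial scale before fusing.
        size = maps[-1].shape[-2:]
        maps = [F.interpolate(m, size=size, mode="bilinear", align_corners=False)
                for m in maps]
        fused = self.fuse(torch.cat(maps, dim=1))             # fusion feature map
        # Pooled stand-in for the patent's 1x1-output layer used as global feature.
        global_feat = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return fused, global_feat
```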
S3. Train the deep convolutional neural network model by metric learning.
Metric learning is adopted so that the model learns the subtle differences between images, improving detection accuracy. Specifically, the ArcFace loss, which incorporates angular information, is used; unlike the traditional triplet loss, it lets the model converge more easily and learn richer information.
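For reference, a compact sketch of the ArcFace loss named above, following the published ArcFace formulation; the scale and margin values are common defaults, not values from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArcFaceLoss(nn.Module):
    def __init__(self, feat_dim=128, num_classes=1000, scale=64.0, margin=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale, self.margin = scale, margin

    def forward(self, features, labels):
        # Cosine similarity between L2-normalized features and class centers.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the target-class logit.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```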
S4. Extract the fusion feature map from the image pyramid with the trained deep convolutional neural network model.
S5. As shown in Fig. 3, extract key points from the fusion feature map by non-maximum suppression, and extract the corresponding local features at those key points (a sketch follows).
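One way to read step S5 is that key points are the local maxima of the fusion feature map's activation strength; the sketch below makes that assumption, and its window size and score threshold are illustrative.

```python
import torch
import torch.nn.functional as F


def extract_local_features(fused, window=3, threshold=0.1):
    """fused: (1, C, H, W) fusion feature map -> (K, C) local features and (K, 2) positions."""
    score = fused.norm(dim=1, keepdim=True)            # per-location activation strength
    peak = F.max_pool2d(score, window, stride=1, padding=window // 2)
    keep = (score == peak) & (score > threshold)       # non-maximum suppression
    ys, xs = keep[0, 0].nonzero(as_tuple=True)
    feats = fused[0, :, ys, xs].t()                    # one 128-d feature per key point
    return F.normalize(feats, dim=1), torch.stack([ys, xs], dim=1)
```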
s6, video copy detection is carried out according to the local characteristics:
s61, obtaining local characteristics of the library video through the steps S1-S5, wherein the local characteristics can be understood as a local characteristic library of the library video which is pre-configured and used for detecting the video to be detected subsequently;
s62, the video to be detected is subjected to steps S1-S5 to obtain local characteristics of the video; if the library video is to construct a pyramid for the key frame image and acquire local features, the video to be detected also needs to construct a pyramid for the key frame image and acquire local features;
s63, carrying out random consistency space verification (RANSAC) on the local features of the video to be detected and the local features of the library video, and filtering out irrelevant matching points;
s64, calculating the similarity according to the residual matching points by adopting a vector inner product mode;
s65, sorting the similarity calculation results to obtain source video data results, as shown in FIG. 4.
As can be seen from the above, the invention has the following beneficial effects:
The invention extracts a fusion feature map with a deep convolutional neural network model and obtains key points by non-maximum suppression, extracting efficient local features that comprehensively describe each video frame image. Compared with traditional local feature extraction algorithms, extraction is faster and the local features are more discriminative, so copied videos under a variety of complex transformations can be detected accurately and robustly, giving network supervision departments a practical technical solution for policing the large volume of tampered multimedia video spread across the internet.
The foregoing description of preferred embodiments is not intended to limit the invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention falls within its scope of protection.

Claims (6)

1. A video copy detection method based on deep local features, characterized by comprising the following steps:
(1) extracting frame images from the video data, then constructing an image pyramid at different scales;
(2) constructing a deep convolutional neural network model, extracting feature maps from the input image pyramid, and fusing the feature maps to obtain a fusion feature map;
(3) training the deep convolutional neural network model by metric learning;
(4) extracting a fusion feature map from an image pyramid using the trained deep convolutional neural network model;
(5) extracting key points from the fusion feature map by non-maximum suppression, and extracting the corresponding local features at those key points;
(6) performing video copy detection based on the local features;
wherein step (6) comprises the following sub-steps:
(6.1) obtaining the local features of the library videos through steps (1)-(5);
(6.2) obtaining the local features of the video to be detected through steps (1)-(5);
(6.3) performing random-consistency spatial verification between the local features of the video to be detected and those of the library videos, and filtering out spurious matching points;
(6.4) computing the similarity from the remaining matching points;
(6.5) ranking the similarity results to obtain the source video data.
2. The video copy detection method based on deep local features of claim 1, wherein the deep convolutional neural network model is a fully convolutional model comprising n-1 convolutional layers and one fusion convolutional layer; wherein
convolutional layers n-i through n-1 extract feature maps from the input image pyramid; and
the fusion convolutional layer fuses the feature maps extracted by layers n-i through n-1 to obtain a fusion feature map, where 2 ≤ i ≤ n-1 and both i and n are integers.
3. The video copy detection method based on deep local features of claim 2, wherein convolutional layers n-i through n-1 each have 128 convolution channels.
4. The video copy detection method based on deep local features of claim 2, wherein layer n-1 uses a 1×1 convolution kernel to convolve the feature map down to a size of 1×1, and the feature map output by this layer serves as the global feature for model training.
5. The video copy detection method based on deep local features of claim 1, wherein the similarity is computed as a vector inner product.
6. The method of any one of claims 1-5, wherein the frame images extracted from the video data in step (1) are key frame images.
CN202010691138.6A 2020-07-17 2020-07-17 Video copy detection method based on depth local features Active CN111709945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010691138.6A CN111709945B (en) 2020-07-17 2020-07-17 Video copy detection method based on depth local features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010691138.6A CN111709945B (en) 2020-07-17 2020-07-17 Video copy detection method based on depth local features

Publications (2)

Publication Number Publication Date
CN111709945A CN111709945A (en) 2020-09-25
CN111709945B true CN111709945B (en) 2023-06-30

Family

ID=72546636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010691138.6A Active CN111709945B (en) 2020-07-17 2020-07-17 Video copy detection method based on depth local features

Country Status (1)

Country Link
CN (1) CN111709945B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI776668B (en) * 2021-09-07 2022-09-01 台達電子工業股份有限公司 Image processing method and image processing system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845499A (en) * 2017-01-19 2017-06-13 清华大学 A kind of image object detection method semantic based on natural language
CN111275044A (en) * 2020-02-21 2020-06-12 西北工业大学 Weak supervision target detection method based on sample selection and self-adaptive hard case mining

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376003B (en) * 2013-08-13 2019-07-05 深圳市腾讯计算机系统有限公司 A kind of video retrieval method and device
CN108229488B (en) * 2016-12-27 2021-01-01 北京市商汤科技开发有限公司 Method and device for detecting key points of object and electronic equipment
CN106991373A (en) * 2017-03-02 2017-07-28 中国人民解放军国防科学技术大学 A kind of copy video detecting method based on deep learning and graph theory
CN108197566B (en) * 2017-12-29 2022-03-25 成都三零凯天通信实业有限公司 Monitoring video behavior detection method based on multi-path neural network
CN113569797B (en) * 2018-11-16 2024-05-21 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN110781350B (en) * 2019-09-26 2022-07-22 武汉大学 Pedestrian retrieval method and system oriented to full-picture monitoring scene
CN111126412B (en) * 2019-11-22 2023-04-18 复旦大学 Image key point detection method based on characteristic pyramid network
CN111241338B (en) * 2020-01-08 2023-09-15 深圳市网联安瑞网络科技有限公司 Depth feature fusion video copy detection method based on attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845499A (en) * 2017-01-19 2017-06-13 清华大学 A kind of image object detection method semantic based on natural language
CN111275044A (en) * 2020-02-21 2020-06-12 西北工业大学 Weak supervision target detection method based on sample selection and self-adaptive hard case mining

Also Published As

Publication number Publication date
CN111709945A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Ji et al. Semi-supervised adversarial monocular depth estimation
WO2022000420A1 (en) Human body action recognition method, human body action recognition system, and device
Cai et al. FCSR-GAN: Joint face completion and super-resolution via multi-task learning
CN112541864A (en) Image restoration method based on multi-scale generation type confrontation network model
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111241338B (en) Depth feature fusion video copy detection method based on attention mechanism
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
Joseph et al. C4synth: Cross-caption cycle-consistent text-to-image synthesis
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN112084952B (en) Video point location tracking method based on self-supervision training
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
CN116168329A (en) Video motion detection method, equipment and medium based on key frame screening pixel block
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN116935486A (en) Sign language identification method and system based on skeleton node and image mode fusion
CN111709945B (en) Video copy detection method based on depth local features
Li et al. A discriminative self‐attention cycle GAN for face super‐resolution and recognition
Zheng et al. Pose flow learning from person images for pose guided synthesis
Huang et al. Temporally-aggregating multiple-discontinuous-image saliency prediction with transformer-based attention
CN114998814B (en) Target video generation method and device, computer equipment and storage medium
CN116977200A (en) Processing method and device of video denoising model, computer equipment and storage medium
LU101933B1 (en) Human action recognition method, human action recognition system and equipment
CN111047571B (en) Image salient target detection method with self-adaptive selection training process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220517

Address after: 518000 22nd floor, building C, Shenzhen International Innovation Center (Futian science and Technology Plaza), No. 1006, Shennan Avenue, Xintian community, Huafu street, Futian District, Shenzhen, Guangdong Province

Applicant after: Shenzhen wanglian Anrui Network Technology Co.,Ltd.

Address before: Floor 4-8, unit 5, building 1, 333 Yunhua Road, high tech Zone, Chengdu, Sichuan 610041

Applicant before: CHENGDU 30KAITIAN COMMUNICATION INDUSTRY Co.,Ltd.

GR01 Patent grant