CN112215908B - Compressed domain-oriented video content comparison system, optimization method and comparison method - Google Patents


Info

Publication number
CN112215908B
CN112215908B
Authority
CN
China
Prior art keywords
module
video
compressed domain
comparison
video content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011086137.5A
Other languages
Chinese (zh)
Other versions
CN112215908A (en)
Inventor
李扬曦
缪亚男
袁庆升
胡卫明
李兵
刘雨帆
胡赛军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Original Assignee
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science, National Computer Network and Information Security Management Center filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202011086137.5A priority Critical patent/CN112215908B/en
Publication of CN112215908A publication Critical patent/CN112215908A/en
Application granted granted Critical
Publication of CN112215908B publication Critical patent/CN112215908B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 — Image coding
    • G06T 9/002 — Image coding using neural networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/25 — Fusion techniques
    • G06F 18/253 — Fusion techniques of extracted features
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/084 — Backpropagation, e.g. using gradient descent
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 — Image coding
    • G06T 9/008 — Vector quantisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention belongs to the field of computer vision, and particularly relates to a compressed domain-oriented video content comparison system, an optimization method and a comparison method, aiming at solving the problem of the low efficiency of video content comparison performed with fully decoded video information. The comparison system of the invention comprises: a feature learning module configured to obtain feature maps of multiple modalities from several kinds of compressed domain information of the input video; a multi-modal compressed domain information fusion module configured to fuse the multi-modal feature maps output by the feature learning module into a fusion feature vector of the input video; a second module configured to obtain the L1 distance between the fusion feature vectors of two input videos; and a classifier, a binary classification network configured to classify the comparison result based on the L1 distance output by the second module. The invention can effectively extract high-level semantic information of video content and ensures both high speed and high performance of video content comparison.

Description

Compressed domain-oriented video content comparison system, optimization method and comparison method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a compressed domain-oriented video content comparison system, an optimization method and a comparison method.
Background
In content-based video understanding systems, large volumes of video typically need to be processed. At present, more than 99% of internet video traffic is encoded with standards such as H.264 and H.265. Encoding reduces video volume greatly, by factors of tens to hundreds, but it also converts the image information in the video into indirect information, which can be restored to the image frames composing the video only by decoding. Most existing algorithms and systems for video recognition, comparison and retrieval need to decode the video into image frames and then process and analyse the frames of the image sequence. However, video decoding is extremely computation- and time-consuming, which greatly limits the practicality and flexibility of application systems, especially video retrieval and comparison systems and scenarios requiring real-time processing.
Therefore, schemes that understand, compare and identify video content in the compressed domain, under partial-decoding conditions, are an urgent research problem. Different from conventional video processing methods, compressed domain-oriented video comparison must operate directly on compressed data that is not decoded, or is decoded as little as possible, omitting the extra decompression and recompression steps and thereby greatly reducing the overall processing time of the system. Taking the video content comparison task as a representative, how to exploit the efficient but coarse nature of video compressed domain information and design a suitable network structure, so as to complete video content understanding efficiently, is the technical problem to be solved.
Disclosure of Invention
In order to solve the above-mentioned problem in the prior art, namely that comparing video contents using fully decoded video information is inefficient, a first aspect of the present invention provides a compressed domain-oriented video content comparison system, which comprises a first module, a second module and a classifier connected in sequence;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of fusion feature vectors of the two input videos;
the classifier is a binary classification network configured to classify the comparison result into two classes based on the L1 distance output by the second module.
In some preferred embodiments, the feature learning module is constructed based on a weight-shared twin convolutional neural network.
In some preferred embodiments, the method for the second module to obtain the L1 distance is:
the element-wise difference of the fusion feature vectors of the two input videos is taken to obtain the corresponding L1 distance.
In a second aspect of the present invention, a method for optimizing a compressed domain-oriented video content comparison system is provided, where the method is used for optimizing the compressed domain-oriented video content comparison system, and includes:
training the first module based on a preset training sample to obtain an optimized first module;
constructing a new comparison system based on the optimized first module, the optimized second module and the classifier;
and fixing the parameters of the optimized first module based on a preset training sample, and training the classifier in the new comparison system to obtain the optimized comparison system.
In some preferred embodiments, the "training of the first module" is performed using a loss function L of

$$L = \frac{1}{2N}\sum_{n=1}^{N}\left[\, Y D_n^2 + (1-Y)\max(m - D_n,\ 0)^2 \,\right]$$

wherein N is the number of sample pairs, $D_n$ is the Euclidean distance between the fusion feature vectors of the two videos in the nth sample pair, Y is the label indicating whether the two samples match, and m is a preset margin threshold.
In some preferred embodiments, the loss function used to "train the classifiers in the new comparison system" is the cross-entropy loss of the classifications.
In some preferred embodiments, the training sample is obtained by:
based on an offline video database, video clipping is carried out on copied video segments existing in different videos according to a label file, similar video segment pairs clipped according to the label file are used as positive samples, 1 video is randomly selected from other remaining video segments, and pairs formed by the videos and the original videos are used as negative samples.
The third aspect of the present invention provides a video content comparison method for a compressed domain, where the comparison method includes:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
wherein,
the obtaining method of the optimized comparison system comprises the following steps: and optimizing the video content comparison system facing the compressed domain based on the optimization method of the video content comparison system facing the compressed domain.
In a fourth aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, and the programs are adapted to be loaded and executed by a processor to implement the optimization method for the compressed domain-oriented video content comparison system or the compressed domain-oriented video content comparison method.
In a fifth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the optimization method of the video content comparison system facing the compressed domain or the video content comparison method facing the compressed domain.
The invention has the beneficial effects that:
1. The invention makes full use of the compressed domain information of the video and designs a deep twin neural network, which can effectively extract high-level semantic information of the video content and guarantees both high speed and high performance of video content comparison. By using compressed domain information instead of fully decoded video information, the amount of computation of the video content understanding task is greatly reduced.
2. The invention designs a multi-modal fusion scheme for compressed domain information, so that the information of different compressed domain modalities is effectively fused, constructing a high-level video semantic representation that combines video spatio-temporal information. The deep twin neural network effectively uses several kinds of coarse compressed domain information, improving the accuracy of video content comparison.
3. The method uses the property of the contrastive loss in the deep twin neural network, namely that the feature distance of a positive sample pair is made as small as possible and that of a negative sample pair as large as possible, so that the network learns an effect similar to an SVM (support vector machine) large-margin classifier; the learned video features are more discriminative and the network performance is more robust.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a block diagram of a compressed domain-oriented video content alignment system according to an embodiment of the present invention;
FIG. 2 is an algorithmic framework schematic of a deep twin neural network;
fig. 3 is a flowchart illustrating an optimization method of a compressed domain-oriented video content comparison system according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
The invention relates to a video content comparison system facing a compressed domain, which comprises a first module, a second module and a classifier which are connected in sequence as shown in figure 1;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of fusion feature vectors of the two input videos;
the classifier is a two-classification network and is configured to perform two classifications of the comparison result based on the L1 distance output by the second module.
For the purpose of clearly illustrating the invention, reference will be made to the following detailed description of the various parts of the invention taken in conjunction with the accompanying drawings.
The video content comparison system facing the compressed domain comprises a first module, a second module and a classifier which are connected in sequence.
This embodiment further comprises a video compressed domain information extraction module: before comparing video contents, video compressed domain information must be extracted from each video of the pair to be compared. The videos are partially decoded, and the video compressed domain information is extracted, including I frames, motion vectors and residuals.
The video compressed domain information extraction module in this embodiment uses the core video codec framework of FFmpeg and operates on H.264 bitstreams. For example, when decoding an I frame, the bitstream goes through steps such as entropy decoding, inverse quantization and inverse transformation. For a stream in which motion vectors are present in macroblock prediction, before entropy decoding it is necessary to first determine the prediction mode or motion vector (MV) of a macroblock and the coded block pattern (CBP), and then perform entropy decoding on luminance and chrominance separately. The FFmpeg-based source code is written in C++; unnecessary decoding information and steps are skipped while the key decoding steps are retained, completing the efficient extraction of compressed domain information. In addition, because the whole network in this embodiment is trained end-to-end, mixed compilation of C++ and Python must be completed in engineering, so that compressed domain information extracted from FFmpeg with C++ can exchange data directly during training with the PyTorch framework.
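The embodiment's extractor is custom C++ on FFmpeg internals; as a lighter-weight illustration, the stock ffmpeg command-line tool can already surface two of the three kinds of compressed domain information (I frames via the `select` filter, motion vectors via `-flags2 +export_mvs` and the `codecview` filter). The sketch below only builds the command lists and does not execute them; file names and output patterns are placeholders.

```python
def iframe_extract_cmd(src, out_pattern="iframe_%04d.png"):
    # Keep only intra-coded frames via the select filter; the comma inside
    # the expression is escaped so it is not read as a filtergraph separator.
    # -vsync vfr avoids duplicating frames to fill the original timestamps.
    return ["ffmpeg", "-i", src,
            "-vf", "select=eq(pict_type\\,I)",
            "-vsync", "vfr", out_pattern]

def motion_vector_cmd(src, out="mv_overlay.mp4"):
    # -flags2 +export_mvs asks the decoder to export motion vectors,
    # which the codecview filter then draws over the frames.
    return ["ffmpeg", "-flags2", "+export_mvs", "-i", src,
            "-vf", "codecview=mv=pf+bf+bb", out]
```

These commands still run the full decoder; the point of the embodiment's C++ path is precisely to skip the stages these commands pay for.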
1. First module
The first module comprises a feature learning module and a multi-modal compressed domain information fusion module; it forms a video similarity discrimination network based on a deep twin network and obtains an expressive video feature vector from the video compressed domain information of the input video pair.
The feature learning module is configured to obtain feature maps of multiple modalities from several kinds of compressed domain information of the input video. As shown in fig. 2, the module is a weight-sharing twin convolutional neural network: it takes the compressed domain information of a pair of videos as input, and learns each kind of compressed domain information, such as I frames and motion vectors, with a multi-stream convolutional neural network forming one branch of the twin network. Specifically, a ResNet-34 backbone is used for I frames and a ResNet-18 backbone for motion vectors, and the feature map output by layer4 of the ResNet structure is taken as the output of the learning module. The feature learning network structure of this design includes, but is not limited to, the above configuration.
The multi-modal compressed domain information fusion module is configured to fuse the multi-modal feature maps output by the feature learning module into a fusion feature vector of the input video. The module's inputs are the feature maps output by the compressed domain information learning networks of the different modalities. To fuse the multi-modal compressed domain features, the feature maps are stacked, and a convolution layer with kernel size 1 learns the weights of the different modalities while keeping the number of channels unchanged. The 1x1 convolution layer uses kaiming initialization, and during training its learning rate is set to twice the initial learning rate of the network, achieving faster convergence and effective fusion of the multi-modal information. Assuming the two kinds of compressed domain information are I frames and motion vectors, the feature learning module outputs FeatureMap1 and FeatureMap2, both of size (N, C, T, W, H); these are concatenated along the channel dimension to obtain FeatureMap3 of size (N, 2C, T, W, H); the channels are then re-weighted by the conv1x1 convolution to obtain the final FeatureMap of size (N, C, T, W, H); finally, a Flatten (flattening) operation yields the single fused video-level feature vector, completing the fusion of the multi-modal information.
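A minimal PyTorch sketch of this branch-then-fuse structure is given below. For brevity it uses tiny 2D convolutions and omits the temporal axis T; the real embodiment uses ResNet-34/ResNet-18 backbones and (N, C, T, W, H) maps, so every layer and size here is a stand-in assumption. The weight sharing of the twin structure is realised by applying the same module instance to both videos of a pair.

```python
import torch
import torch.nn as nn

class TwinFusionNet(nn.Module):
    """Sketch: per-modality branches (stand-ins for the ResNet backbones),
    channel-wise concatenation, 1x1-conv fusion with kaiming init, flatten."""

    def __init__(self, channels=8):
        super().__init__()
        self.iframe_branch = nn.Conv2d(3, channels, 3, padding=1)  # RGB I frame
        self.mv_branch = nn.Conv2d(2, channels, 3, padding=1)      # (dx, dy) motion field
        # 1x1 conv learns per-channel fusion weights and restores the
        # channel count from 2C back to C, as described in the text.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        nn.init.kaiming_normal_(self.fuse.weight)

    def forward(self, iframe, mv):
        f1 = self.iframe_branch(iframe)           # (N, C, H, W)
        f2 = self.mv_branch(mv)                   # (N, C, H, W)
        stacked = torch.cat([f1, f2], dim=1)      # (N, 2C, H, W)
        fused = self.fuse(stacked)                # (N, C, H, W)
        return torch.flatten(fused, start_dim=1)  # fused video-level vector
```

Calling the same `TwinFusionNet` on both videos of a pair gives the two fusion feature vectors that the second module differences.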
2. Second module
A second module configured to obtain an L1 distance of a fused feature vector of the two input videos.
The method for the second module to obtain the L1 distance is as follows: the element-wise difference of the fusion feature vectors of the two input videos is taken to obtain the corresponding L1 distance.
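The element-wise operation can be written in a few lines; the resulting vector of absolute differences (whose sum is the scalar L1 distance) is what is passed on to the classifier. Plain Python lists stand in for tensors here.

```python
def l1_feature(v1, v2):
    """Element-wise absolute difference of two fusion feature vectors.

    Summing the result gives the scalar L1 distance; the vector itself
    is what the binary classifier consumes."""
    assert len(v1) == len(v2), "fusion feature vectors must have equal length"
    return [abs(a - b) for a, b in zip(v1, v2)]
```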
3. Classifier
The classifier is a binary classification network configured to classify the comparison result into two classes based on the L1 distance output by the second module. The binary classification network in this embodiment is a fully connected layer whose two output neurons correspond to "similar" and "dissimilar", so that it can be determined whether a video is a copied video.
A second embodiment of the present invention provides a method for optimizing a video content comparison system oriented to a compressed domain, which is used for optimizing the video content comparison system oriented to the compressed domain.
Training samples need to be constructed before optimization; this embodiment adopts an offline sampling method, which effectively provides the large number of positive and negative sample pairs required by training. The public dataset VCDB is sampled offline: the copied video segments present in different videos are clipped according to the annotation file; a pair of similar video segments clipped according to the annotation file is taken as a positive sample; and 1 video is randomly selected from the remaining video segments and paired with the original video as a negative sample. This procedure is repeated to complete the construction of the dataset.
The optimization method of the embodiment, as shown in fig. 3, includes the following steps:
and S100, training the first module based on a preset training sample to obtain an optimized first module.
The training samples are fed in batches into the feature learning module of the first module, and feature maps of the different kinds of compressed domain information are obtained by forward propagation; the feature maps are then sent to the multi-modal compressed domain information fusion module of the first module to obtain the single feature vector of the video, and the network is trained by back propagation using the contrastive loss. The contrastive loss is defined as follows:
$$L = \frac{1}{2N}\sum_{n=1}^{N}\left[\, Y D_n^2 + (1-Y)\max(m - D_n,\ 0)^2 \,\right]$$

wherein

$$D_n = \left\| X_1 - X_2 \right\|_2 = \left( \sum_{p=1}^{P} \left( X_1^p - X_2^p \right)^2 \right)^{1/2}$$

represents the Euclidean distance between the fusion feature vectors $X_1$ and $X_2$ of the two videos in the nth sample pair, P represents the feature dimension of the fusion feature vector, Y is the label indicating whether the two samples match (Y = 1 means the two samples are similar or matching, Y = 0 means they do not match), m is a preset margin threshold, N is the number of sample pairs, and W is the length of the fusion feature vector output by the first module, chosen here as 512.
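The contrastive loss above (in the standard Hadsell-style form, with the 1/2 factor an assumption since the original equation image is unreadable) can be checked numerically with a few lines of plain Python:

```python
import math

def contrastive_loss(pairs, margin=1.0):
    """Contrastive loss over a batch of (x1, x2, y) samples, y=1 similar.

    Similar pairs are pulled together (penalty D_n^2); dissimilar pairs
    are pushed beyond the margin m (penalty max(m - D_n, 0)^2)."""
    total = 0.0
    for x1, x2, y in pairs:
        # Euclidean distance D_n between the two fusion feature vectors.
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))
```

For example, an identical positive pair costs 0, a coincident negative pair costs m²/2, and a negative pair already separated beyond the margin also costs 0, which is exactly the large-margin behaviour the "beneficial effects" section describes.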
And S200, constructing a new comparison system based on the optimized first module, the optimized second module and the classifier.
The parameters of the first module are fixed to the optimized values obtained in step S100, and a new comparison system is constructed together with the second module and the classifier.
And S300, fixing the parameters of the optimized first module based on a preset training sample, and training a classifier in the new comparison system to obtain the optimized comparison system.
During the training process of this step, the whole network is back-propagated using the classification cross-entropy loss. The training end condition is set as a number of iterations and/or a preset convergence criterion; the forward propagation and back propagation described above are repeated until the set number of iterations is reached or the network converges, and training then stops.
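The stage-2 step (freeze the optimised first module, train only the classifier with cross-entropy) can be sketched in PyTorch as follows. The linear layer standing in for the fusion network, the sizes and the hyperparameters are illustrative assumptions, not the patent's configuration.

```python
import torch
import torch.nn as nn

# Stand-ins: a frozen "first module" producing 512-d fusion vectors,
# and a fully connected classifier with two outputs (similar / dissimilar).
first_module = nn.Linear(16, 512)
classifier = nn.Linear(512, 2)

for p in first_module.parameters():
    p.requires_grad = False  # fix the optimised first-module parameters

optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One illustrative iteration over a batch of 4 video pairs.
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
labels = torch.tensor([1, 0, 1, 0])
l1 = (first_module(x1) - first_module(x2)).abs()  # element-wise L1 feature
loss = loss_fn(classifier(l1), labels)
optimizer.zero_grad()
loss.backward()   # gradients reach only the classifier
optimizer.step()
```

Because only `classifier.parameters()` are handed to the optimizer and the first module's parameters have `requires_grad = False`, the staged scheme of S100/S300 is respected.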
It should be noted that the training process of the present embodiment adopts a staged training method, and adopts two loss functions, including the contrast loss of step S100 and the cross-entropy loss of step S300.
This embodiment may also be organized as a system, for example an optimization system for the compressed domain-oriented video content comparison system, comprising: a first training module, an intermediate system construction module and a second training module.
The first training module is configured to perform training of the first module based on a preset training sample to obtain an optimized first module;
the intermediate system construction module is configured to construct a new comparison system based on the optimized first module, the optimized second module and the classifier;
and the second training module is configured to fix the parameters of the optimized first module based on a preset training sample, and train the classifier in the new comparison system to obtain the optimized comparison system.
As shown in fig. 2, after feature extraction by the feature learning module and multi-modal information fusion by the multi-modal compressed domain information fusion module are performed on the two videos respectively, the fused feature vectors are compared to obtain their difference, which is then classified by the fully connected layer to obtain the determination result. When training the first module, the contrastive loss is computed from the feature vectors after multi-modal information fusion and optimized by back propagation; when optimizing the classifier, the parameters of the first module are kept unchanged, and back propagation is performed through the cross-entropy loss on the classification results of the training samples.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and the related description of the optimization method described above may refer to the corresponding description in the foregoing system embodiment, and the specific working process and the related description of the optimization system described above may refer to the corresponding description in the foregoing optimization method embodiment, which are not repeated herein.
A video content comparison method for a compressed domain according to a third embodiment of the present invention includes:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
the obtaining method of the optimized comparison system comprises the following steps: and optimizing the video content comparison system facing the compressed domain based on the optimization method of the video content comparison system facing the compressed domain.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and the related description of the above-described video content comparison method for a compressed domain may refer to the corresponding description in the embodiments of the video content comparison system for a compressed domain and the optimization method for a video content comparison system for a compressed domain, and are not described herein again.
A storage device according to a fourth embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system or the compressed domain-oriented video content comparison method.
A processing apparatus according to a fifth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the optimization method of the video content comparison system facing the compressed domain or the video content comparison method facing the compressed domain.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing embodiments, and are not described herein again.
It should be noted that, the video content comparison system for the compressed domain provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the foregoing functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the foregoing functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings. However, it will be readily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of the related technical features may be made by those skilled in the art without departing from the principle of the present invention, and the technical solutions after such changes or substitutions fall within the protection scope of the present invention.

Claims (7)

1. A compressed domain-oriented video content comparison system, characterized by comprising a first module, a second module and a classifier which are connected in sequence;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to respectively acquire feature maps of multiple modalities based on multiple kinds of compressed domain information of an input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the feature maps of the multiple modalities output by the feature learning module, to obtain a fusion feature vector of the input video;
the second module is configured to obtain the L1 distance between the fusion feature vectors of the two input videos;
the classifier is a binary classification network and is configured to perform binary classification of the comparison result based on the L1 distance output by the second module;
the optimization method of the comparison system comprises the following steps:
training the first module based on a preset training sample to obtain an optimized first module;
wherein the first module is trained with the following loss function L:
L = (1/(2N)) · Σ_{n=1}^{N} [ Y·D_n² + (1 − Y)·max(m − D_n, 0)² ]
wherein N is the number of samples, D_n is the Euclidean distance between the fusion feature vectors of the two videos in the n-th sample pair, Y is a label indicating whether the two samples match, and m is a preset threshold (margin);
constructing a new comparison system based on the optimized first module, the second module and the classifier;
based on a preset training sample, fixing the parameters of the optimized first module and training the classifier in the new comparison system, to obtain an optimized comparison system; wherein the loss function adopted for training the classifier in the new comparison system is the classification cross-entropy loss.
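As an illustrative sketch of the claim-1 pipeline and its first training stage: the toy dimensions, the averaging fusion operator, and the random feature weights below are assumptions made for the sketch, not the patented architecture; only the L1 distance and the contrastive loss form follow the symbols of the claim.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_MODALITIES, IN_DIM, FEAT_DIM = 3, 16, 8   # assumed toy sizes
W = rng.normal(size=(NUM_MODALITIES, IN_DIM, FEAT_DIM))  # weights shared by both branches

def fusion_feature(video_modalities):
    """First module: one feature map per compressed-domain modality, then fusion."""
    feats = [np.tanh(m @ W[i]) for i, m in enumerate(video_modalities)]
    return np.mean(feats, axis=0)  # fusion operator assumed to be averaging

def l1_distance(f1, f2):
    """Second module: element-wise absolute difference of the fusion vectors."""
    return np.abs(f1 - f2)

def contrastive_loss(pairs, m):
    """Stage-1 loss for the first module.
    pairs: (D_n, Y) with D_n the Euclidean distance between the two fusion
    vectors and Y = 1 iff the pair matches; m is the preset margin threshold."""
    return sum(y * d**2 + (1 - y) * max(m - d, 0.0)**2
               for d, y in pairs) / (2 * len(pairs))

video_a = [rng.normal(size=IN_DIM) for _ in range(NUM_MODALITIES)]
video_b = [rng.normal(size=IN_DIM) for _ in range(NUM_MODALITIES)]
fa, fb = fusion_feature(video_a), fusion_feature(video_b)
print(l1_distance(fa, fb).shape)                      # (8,)
print(contrastive_loss([(0.0, 1), (2.0, 0)], m=1.0))  # 0.0: both pairs already ideal
```

A matching pair (Y = 1) is penalized by its squared distance, while a non-matching pair (Y = 0) is penalized only when its distance falls inside the margin m, which is what pulls copies together and pushes unrelated videos apart in the fusion feature space.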
2. The compressed domain-oriented video content comparison system according to claim 1, wherein the feature learning module is constructed based on a weight-sharing twin convolutional neural network.
3. The system according to claim 1, wherein the second module obtains the L1 distance by:
performing an element-wise absolute difference on the fusion feature vectors of the two input videos to obtain the corresponding L1 distance.
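As a concrete numeric illustration of this element-wise operation (the toy vectors stand in for real fusion feature vectors):

```python
# L1 distance of two fusion feature vectors, taken element by element.
def l1_distance(v1, v2):
    return [abs(a - b) for a, b in zip(v1, v2)]

print(l1_distance([2.0, 9.0, 5.0], [1.0, 4.0, 8.0]))  # [1.0, 5.0, 3.0]
```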
4. The system according to claim 1, wherein the training samples are obtained by:
based on an offline video database, clipping the copied video segments present in different videos according to a label file; taking the similar video segment pairs clipped according to the label file as positive samples; and, for each original video, randomly selecting 1 video from the other remaining video segments and taking the pair formed by that video and the original video as a negative sample.
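A minimal sketch of this pair construction, assuming the label file has been parsed into (original, copy) tuples; the data layout and segment names are hypothetical.

```python
import random

def build_pairs(annotations, all_segments, seed=0):
    """Build training pairs from an offline video database.

    annotations: (original_segment, copied_segment) tuples clipped according
    to the label file; each yields a positive pair (label 1). For each
    original, one segment drawn at random from the remaining segments
    forms a negative pair (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for original, copy in annotations:
        pairs.append((original, copy, 1))  # similar clip pair -> positive
        candidates = [s for s in all_segments if s not in (original, copy)]
        pairs.append((original, rng.choice(candidates), 0))  # random -> negative
    return pairs

segments = ["seg_a", "seg_a_copy", "seg_b", "seg_c", "seg_d"]
print(build_pairs([("seg_a", "seg_a_copy")], segments))
```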
5. A compressed domain-oriented video content comparison method, characterized in that the comparison method comprises:
acquiring a video pair to be compared;
respectively carrying out partial decoding on the two videos in the video pair to be compared, and extracting video compressed domain information;
obtaining a comparison result through the compressed domain-oriented video content comparison system of any one of claims 1-4.
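The end-to-end flow of the method can be summarized as follows; the decoding function and the comparison-system call are placeholders, since partial entropy decoding of a real codec is outside the scope of this sketch.

```python
def partial_decode(video_path):
    """Placeholder: partially decode the bitstream and return compressed
    domain information (e.g. motion vectors, residuals) without full pixel
    reconstruction. A real implementation would use a codec library."""
    return {"motion_vectors": [], "residuals": [], "source": video_path}

def compare(video_a, video_b, comparison_system):
    # Steps 1-2: acquire the video pair and extract compressed domain info.
    info_a = partial_decode(video_a)
    info_b = partial_decode(video_b)
    # Step 3: feed both into the trained comparison system of claims 1-4.
    return comparison_system(info_a, info_b)

# Dummy stand-in system: identical sources count as a match.
same = compare("a.mp4", "a.mp4", lambda x, y: x["source"] == y["source"])
print(same)  # True
```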
6. A storage device having a plurality of programs stored therein, characterized in that the programs are adapted to be loaded and executed by a processor to implement the compressed domain-oriented video content comparison method of claim 5.
7. A processing device, comprising a processor adapted to execute various programs, and a storage device adapted to store a plurality of programs; characterized in that the programs are adapted to be loaded and executed by the processor to implement the compressed domain-oriented video content comparison method of claim 5.
CN202011086137.5A 2020-10-12 2020-10-12 Compressed domain-oriented video content comparison system, optimization method and comparison method Active CN112215908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011086137.5A CN112215908B (en) 2020-10-12 2020-10-12 Compressed domain-oriented video content comparison system, optimization method and comparison method


Publications (2)

Publication Number Publication Date
CN112215908A CN112215908A (en) 2021-01-12
CN112215908B true CN112215908B (en) 2022-12-02

Family

ID=74052819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011086137.5A Active CN112215908B (en) 2020-10-12 2020-10-12 Compressed domain-oriented video content comparison system, optimization method and comparison method

Country Status (1)

Country Link
CN (1) CN112215908B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990273B (en) * 2021-02-18 2021-12-21 中国科学院自动化研究所 Compressed domain-oriented video sensitive character recognition method, system and equipment
CN114445918A (en) * 2022-02-21 2022-05-06 支付宝(杭州)信息技术有限公司 Living body detection method, device and equipment
CN114666571B (en) * 2022-03-07 2024-06-14 中国科学院自动化研究所 Video sensitive content detection method and system

Citations (1)

Publication number Priority date Publication date Assignee Title
WO2019242222A1 (en) * 2018-06-21 2019-12-26 北京字节跳动网络技术有限公司 Method and device for use in generating information

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US6763069B1 (en) * 2000-07-06 2004-07-13 Mitsubishi Electric Research Laboratories, Inc Extraction of high-level features from low-level features of multimedia content
CN110163079A (en) * 2019-03-25 2019-08-23 腾讯科技(深圳)有限公司 Video detecting method and device, computer-readable medium and electronic equipment
CN110175266B (en) * 2019-05-28 2020-10-30 复旦大学 Cross-modal retrieval method for multi-segment video
CN111046766A (en) * 2019-12-02 2020-04-21 武汉烽火众智数字技术有限责任公司 Behavior recognition method and device and computer storage medium
CN111242173B (en) * 2019-12-31 2021-03-02 四川大学 RGBD salient object detection method based on twin network
CN111401267B (en) * 2020-03-19 2023-06-13 山东大学 Video pedestrian re-identification method and system based on self-learning local feature characterization
CN111626178B (en) * 2020-05-24 2020-12-01 中南民族大学 Compressed domain video motion recognition method and system based on new spatio-temporal feature stream

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
WO2019242222A1 (en) * 2018-06-21 2019-12-26 北京字节跳动网络技术有限公司 Method and device for use in generating information

Non-Patent Citations (1)

Title
Special video classification via multi-modal feature fusion and multi-task learning; Wu Xiaoyu et al.; Optics and Precision Engineering; 2020-05-13 (No. 05); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant