CN112215908A - Compressed domain-oriented video content comparison system, optimization method and comparison method - Google Patents
Info
- Publication number
- CN112215908A (application CN202011086137.5A)
- Authority
- CN
- China
- Prior art keywords
- module
- video
- compressed domain
- video content
- comparison
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/008—Vector quantisation
Abstract
The invention belongs to the field of computer vision, and specifically relates to a compressed-domain-oriented video content comparison system, an optimization method and a comparison method, aiming to solve the low efficiency of comparing video content with fully decoded information. The comparison system of the invention comprises: a feature learning module, used to obtain feature maps of multiple modalities from several kinds of compressed-domain information of an input video; a multi-modal compressed-domain information fusion module, used to fuse the multi-modal feature maps output by the feature learning module into a fusion feature vector of the input video; a second module configured to obtain the L1 distance between the fusion feature vectors of two input videos; and a classifier, a binary classification network configured to classify the comparison result based on the L1 distance output by the second module. The invention can effectively extract high-level semantic information of the video content and ensures both high speed and high performance of video content comparison.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a video content comparison system, an optimization method and a comparison method for a compressed domain.
Background
In content-based video understanding systems, a large amount of video typically needs to be processed. At present, more than 99% of internet video traffic is encoded with standards such as H.264 and H.265. Encoding reduces the volume of the video by a factor of tens to hundreds, but it also converts the image information in the video into an indirect representation that can be restored to the image frames of the video only by decoding. Most existing algorithms or systems for video recognition, comparison, retrieval, etc. must first decode the video into image frames and then process and analyze the resulting image sequence. However, video decoding is very computation- and time-consuming, which greatly limits the practicality and flexibility of application systems, especially video retrieval and comparison systems and settings that require real-time processing.
Therefore, research oriented to the compressed domain, i.e., schemes for understanding, comparing and identifying video content under partial-decoding conditions, addresses an urgent problem. Unlike conventional video processing methods, a compressed-domain-oriented video comparison method must operate directly on compressed data that is not decoded, or decoded as little as possible, omitting the extra decompression and recompression steps and thereby greatly reducing the overall processing time of the system. Taking the video content comparison task as a representative, the technical problem to be solved is how to exploit the efficient but coarse nature of video compressed-domain information and design a suitable network structure so that video content understanding can be completed efficiently.
Disclosure of Invention
In order to solve the above-mentioned problem in the prior art, namely the low efficiency of comparing video content using fully decoded video information, a first aspect of the present invention provides a compressed-domain-oriented video content comparison system, which comprises a first module, a second module and a classifier connected in sequence;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of a fusion feature vector of two input videos;
the classifier is a binary classification network configured to perform binary classification of the comparison result based on the L1 distance output by the second module.
In some preferred embodiments, the feature learning module is constructed based on a weight-shared twin convolutional neural network.
In some preferred embodiments, the second module obtains the L1 distance as follows: the element-wise absolute difference of the fusion feature vectors of the two input videos is computed to obtain the corresponding L1 distance vector.
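As a minimal illustration (plain Python, not the patent's implementation), the element-wise L1 distance can be sketched as:

```python
def l1_distance(u, v):
    """Element-wise absolute difference of two equal-length fusion feature
    vectors; the resulting vector (not its scalar sum) is what the
    classifier receives."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return [abs(a - b) for a, b in zip(u, v)]
```

For example, `l1_distance([1.0, 2.0], [0.5, 3.0])` yields `[0.5, 1.0]`.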
In a second aspect of the present invention, a method for optimizing a compressed domain-oriented video content comparison system is provided, where the method is used for optimizing the compressed domain-oriented video content comparison system, and includes:
training the first module based on a preset training sample to obtain an optimized first module;
constructing a new comparison system based on the optimized first module, the second module and the classifier;
and fixing the parameters of the optimized first module based on a preset training sample, and training a classifier in the new comparison system to obtain the optimized comparison system.
In some preferred embodiments, the "training of the first module" is performed using the contrastive loss function

L = (1/2N) · Σ_{n=1}^{N} [ Y·D_n² + (1 − Y)·max(m − D_n, 0)² ]

wherein N is the number of samples, D_n is the Euclidean distance between the fusion feature vectors of the two videos in the n-th sample pair, Y is the label indicating whether the two samples match, and m is a preset margin threshold.
In some preferred embodiments, the loss function used in training the classifier in the new comparison system is the cross-entropy loss of the classification.
In some preferred embodiments, the training sample is obtained by:
based on an offline video database, video clipping is carried out on copied video segments existing in different videos according to a label file, similar video segment pairs clipped according to the label file are used as positive samples, 1 video is randomly selected from other remaining video segments, and pairs formed by the videos and the original videos are used as negative samples.
The third aspect of the present invention provides a video content comparison method for a compressed domain, where the comparison method includes:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
wherein:
the obtaining method of the optimized comparison system comprises the following steps: and optimizing the video content comparison system facing the compressed domain based on the optimization method of the video content comparison system facing the compressed domain.
In a fourth aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, and the programs are adapted to be loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system or the compressed domain-oriented video content comparison method.
In a fifth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the optimization method of the video content comparison system facing the compressed domain or the video content comparison method facing the compressed domain.
The invention has the beneficial effects that:
1. according to the invention, the compressed domain information of the video is fully used, the deep twin neural network is designed, the high-level semantic information of the video content can be effectively extracted, and the high speed and the high performance of the comparison of the video content are ensured. By using the compressed domain information instead of the information of the video full decoding, the calculation amount of the video content understanding task is greatly reduced.
2. The invention designs a multi-modal fusion scheme for compressed-domain information, so that the different modalities of the compressed domain are fused effectively and a representation of high-level video semantics combining spatio-temporal information is constructed. The deep twin neural network makes effective use of several kinds of coarse compressed-domain information, improving the precision of video content comparison.
3. The invention exploits the property of the contrastive loss in the deep twin neural network, namely that the feature distance of a positive sample pair is made as small as possible and that of a negative pair as large as possible, so that the network learns an effect similar to a large-margin SVM (support vector machine) classifier; the learned video features are more discriminative and the network performance is more robust.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a block diagram of a compressed-domain-oriented video content comparison system according to an embodiment of the present invention;
FIG. 2 is an algorithmic framework schematic of a deep twin neural network;
fig. 3 is a flowchart illustrating an optimization method of a compressed domain-oriented video content comparison system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention relates to a video content comparison system facing a compressed domain, which comprises a first module, a second module and a classifier which are connected in sequence as shown in figure 1;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of a fusion feature vector of two input videos;
the classifier is a binary classification network configured to perform binary classification of the comparison result based on the L1 distance output by the second module.
For the purpose of more clearly illustrating the present invention, reference is now made to the following detailed description taken in conjunction with the accompanying drawings.
The video content comparison system facing the compressed domain comprises a first module, a second module and a classifier which are connected in sequence.
This embodiment further includes a video compressed-domain information extraction module: before comparing video contents, the compressed-domain information of each video in the pair to be compared must be extracted. The videos are partially decoded and the compressed-domain information, including I frames, motion vectors and residuals, is extracted.
The video compressed-domain information extraction module in this embodiment uses the core video codec framework of FFmpeg with the codec flow of an H.264 stream; for example, when decoding an I frame, entropy decoding, inverse quantization and inverse transformation are performed on the bitstream. For a stream in which motion vectors are present in macroblock prediction, before entropy decoding is performed, the prediction mode or motion vector (MV) of each macroblock and its coded block pattern (CBP) must first be determined, and entropy decoding is then performed on luminance and chrominance separately. The FFmpeg-based source code is reworked in C++ so that the key decoding steps are retained while unnecessary decoding information and processing is skipped, achieving efficient extraction of the compressed-domain information. In addition, because the whole network in this embodiment is trained end to end, a mixed C++/Python build is needed in engineering so that the compressed-domain information extracted from FFmpeg with C++ can exchange data directly with training under the PyTorch framework.
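The patent's extractor modifies FFmpeg's decoder in C++; the stock FFmpeg CLI does not expose motion vectors or residuals, but the I-frame part can be approximated with the standard `select` filter. A hedged sketch (file names are placeholders):

```python
def iframe_extract_cmd(src, out_pattern="iframes_%04d.png"):
    """Build an FFmpeg command line that keeps only the I-frames of `src`.

    select='eq(pict_type,I)' passes intra-coded frames through the filter
    graph; -vsync vfr writes one output image per selected frame.
    """
    return [
        "ffmpeg", "-i", src,
        "-vf", "select='eq(pict_type,I)'",
        "-vsync", "vfr",
        out_pattern,
    ]
```

Motion vectors and residuals live deeper in the decode loop and, as described above, are reached by modifying the decoder sources rather than through CLI options.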
1. First module
The first module comprises the feature learning module and the multi-modal compressed-domain information fusion module; together they form a video similarity discrimination network based on a deep twin network, used to obtain the representative feature vector of each video from the compressed-domain information of the input video pair.
The feature learning module is configured to obtain feature maps of multiple modalities from several kinds of compressed-domain information of the input video. As shown in fig. 2, the module is a weight-sharing twin convolutional neural network: it takes the compressed-domain information of a pair of videos as input and learns each kind of compressed-domain information, such as I frames and motion vectors, with a multi-stream convolutional neural network serving as one branch of the twin network. Specifically, a ResNet-34 backbone is used for I frames and a ResNet-18 backbone for motion vectors, and the feature map output by layer4 of the ResNet structure is taken as the output of the learning module. The feature learning network includes, but is not limited to, this configuration.
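A minimal PyTorch sketch of the weight-sharing twin structure follows. A tiny stand-in CNN replaces the ResNet-34/ResNet-18 backbones so the key point, one set of weights applied to both videos of the pair, stays visible; it is an illustration, not the patent's network:

```python
import torch
import torch.nn as nn

class TwinBranch(nn.Module):
    """Weight-shared twin ('siamese') feature learner, minimal sketch.

    The patent uses ResNet-34 for I frames and ResNet-18 for motion
    vectors, truncated after layer4; a tiny stand-in CNN is used here.
    """
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, video_a, video_b):
        # The SAME backbone (same parameters) processes both inputs:
        return self.backbone(video_a), self.backbone(video_b)
```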
The multi-modal compressed-domain information fusion module is configured to fuse the multi-modal feature maps output by the feature learning module into a fusion feature vector of the input video. Its inputs are the feature maps output by the compressed-domain learning networks of the different modalities. Fusion proceeds by first stacking the feature maps and then learning weights for the different modalities with a convolution layer of kernel size 1 that keeps the number of channels unchanged; the 1x1 convolution layer uses Kaiming initialization, and during training its learning rate is set to twice the initial learning rate of the network, which speeds up convergence and fuses the multi-modal information effectively. Assuming the two kinds of compressed-domain information are I frames and motion vectors, the feature learning module outputs feature map 1 (FeatureMap1) and feature map 2 (FeatureMap2), each of size (N, C, T, W, H). These are concatenated along the channel dimension to give feature map 3 (FeatureMap3) of size (N, 2C, T, W, H), which is then re-weighted across channels by the conv1x1 to give the final feature map (FeatureMap) of size (N, C, T, W, H). A Flatten (flattening) operation then yields the single fused video-level feature vector, completing the fusion of the multi-modal information.
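The stack-then-reweight fusion described above can be sketched in PyTorch as follows, using a 1x1x1 3-D convolution since the feature maps are five-dimensional (N, C, T, W, H); the channel counts are illustrative:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Fuse per-modality feature maps: stack on channels, re-weight with a
    kernel-size-1 convolution, then flatten to one video-level vector.

    Kaiming initialization follows the description above; the per-layer
    learning-rate doubling belongs to optimizer setup and is omitted here.
    """
    def __init__(self, channels):
        super().__init__()
        # 2C -> C: learns a weighting between the two modalities
        self.conv1x1 = nn.Conv3d(2 * channels, channels, kernel_size=1)
        nn.init.kaiming_normal_(self.conv1x1.weight)

    def forward(self, feat_iframe, feat_mv):
        stacked = torch.cat([feat_iframe, feat_mv], dim=1)  # (N, 2C, T, W, H)
        fused = self.conv1x1(stacked)                       # (N, C, T, W, H)
        return fused.flatten(start_dim=1)                   # (N, C*T*W*H)
```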
2. Second module
A second module configured to obtain an L1 distance of a fused feature vector of two input videos.
The second module obtains the L1 distance as follows: the element-wise absolute difference of the fusion feature vectors of the two input videos is computed to obtain the corresponding L1 distance vector.
3. Classifier
The classifier is a binary classification network configured to perform binary classification of the comparison result based on the L1 distance output by the second module. In this embodiment the binary classification network is a fully connected layer whose two output neurons correspond to "similar" and "dissimilar", from which it can be decided whether a video is a copied video.
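A hedged PyTorch sketch of such a classifier, assuming the 512-length fusion vector mentioned in the training description:

```python
import torch
import torch.nn as nn

# A single fully connected layer maps the element-wise L1 distance vector
# (length 512 per the training description) to two logits interpreted as
# "similar" vs "dissimilar".
classifier = nn.Linear(512, 2)

def decide(l1_vec):
    """Return 1 where the pair is judged similar (a copy), else 0."""
    return classifier(l1_vec).argmax(dim=1)
```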
A second embodiment of the present invention provides an optimization method for a compressed domain-oriented video content comparison system, which is used for optimizing the compressed domain-oriented video content comparison system.
Training samples must be constructed before optimization. This embodiment adopts an offline sampling method, which effectively yields the large number of positive and negative sample pairs required for training. The public data set VCDB is sampled offline: copied video segments that appear in different videos are clipped according to the annotation file; each pair of similar video segments clipped according to the annotation file is used as a positive sample, and for each original video, 1 video is randomly selected from the remaining video segments to form a negative sample pair with it. Repeating this procedure completes the construction of the data set.
The optimization method of the embodiment, as shown in fig. 3, includes the following steps:
and S100, training the first module based on a preset training sample to obtain an optimized first module.
The obtained training samples are fed in batches into the feature learning module of the first module, and feature maps of the different kinds of compressed-domain information are obtained by forward propagation; the feature maps are then sent to the multi-modal compressed-domain information fusion module of the first module to obtain the single feature vector of each video, and the network is trained by back-propagation with the contrastive loss. The contrastive loss is defined as

L = (1/2N) · Σ_{n=1}^{N} [ Y·D_n² + (1 − Y)·max(m − D_n, 0)² ]

wherein D_n = ||X_1 − X_2||_2 = sqrt( Σ_{i=1}^{P} (X_1,i − X_2,i)² ) is the Euclidean distance between the fusion feature vectors X_1 and X_2 of the two videos in the n-th sample pair, P is the feature dimension of the fusion feature vector, Y is the label indicating whether the two samples match (Y = 1 means the two samples are similar or matched, Y = 0 means they do not match), m is the preset margin threshold, N is the number of samples, and W is the length of the fusion feature vector output by the first module, set to 512 here.
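Under the definitions above, the contrastive loss can be sketched in PyTorch as follows (the margin value is illustrative):

```python
import torch

def contrastive_loss(x1, x2, y, m=1.0):
    """Contrastive loss matching the formula above.

    x1, x2: (N, P) fusion feature vectors of the paired videos
    y:      (N,) float labels, 1 = matching pair, 0 = non-matching
    m:      margin threshold (value here is arbitrary)
    """
    d = torch.norm(x1 - x2, p=2, dim=1)                                # D_n
    per_pair = y * d.pow(2) + (1 - y) * torch.clamp(m - d, min=0).pow(2)
    return per_pair.mean() / 2                                         # (1/2N) * sum
```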
And S200, constructing a new comparison system based on the optimized first module, the second module and the classifier.
Based on the optimized parameters obtained in step S100, the parameters of the first module are fixed, and the new comparison system is constructed together with the second module and the classifier.
And S300, fixing the parameters of the optimized first module based on a preset training sample, and training a classifier in the new comparison system to obtain the optimized comparison system.
During the training in this step, the whole network is back-propagated using the cross-entropy loss of the classification. The training end condition is set as a number of iterations and/or a preset convergence criterion; the forward and backward propagation described above are repeated for the set number of iterations until the network converges, at which point training stops.
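The second training stage, with the first module frozen and only the classifier updated under the cross-entropy loss, can be sketched in PyTorch as follows; module, loader and hyperparameter names are illustrative:

```python
import torch
import torch.nn as nn

def train_classifier_stage(first_module, classifier, loader, epochs=1, lr=1e-3):
    """Stage-two training sketch: freeze the optimized first module and
    update only the classifier with the cross-entropy loss.

    `loader` yields (inputs_a, inputs_b, label) batches.
    """
    for p in first_module.parameters():
        p.requires_grad = False            # fix the stage-one weights
    first_module.eval()
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for a, b, label in loader:
            with torch.no_grad():          # first module is not updated
                va, vb = first_module(a), first_module(b)
            logits = classifier((va - vb).abs())   # element-wise L1 distance
            loss = ce(logits, label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```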
It should be noted that the training process of this embodiment adopts a staged training scheme with two loss functions: the contrastive loss of step S100 and the cross-entropy loss of step S300.
This embodiment may also be implemented as a system, namely an optimization system of the compressed-domain-oriented video content comparison system, which comprises: a first training module, an intermediate system construction module and a second training module.
The first training module is configured to perform training of the first module based on a preset training sample to obtain an optimized first module;
the intermediate system construction module is configured to construct a new comparison system based on the optimized first module, the second module and the classifier;
and the second training module is configured to fix the parameters of the optimized first module based on a preset training sample, and train the classifier in the new comparison system to obtain the optimized comparison system.
As shown in fig. 2, after the two videos are each passed through the feature learning module for feature extraction and through the multi-modal compressed-domain information fusion module for multi-modal information fusion, the fused feature vectors are differenced element by element to obtain the difference between the two fusion feature vectors, which is then classified by the fully connected layer to obtain the decision result. When the first module is trained, the contrastive loss is computed on the feature vectors after multi-modal fusion and then back-propagated for optimization; when the classifier is optimized, the parameters of the first module are kept unchanged, and back-propagation uses the cross-entropy loss on the classification results of the training samples.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and the related description of the optimization method described above may refer to the corresponding description in the foregoing system embodiment, and the specific working process and the related description of the optimization system described above may refer to the corresponding description in the foregoing optimization method embodiment, which are not repeated herein.
A video content comparison method for a compressed domain according to a third embodiment of the present invention includes:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
the obtaining method of the optimized comparison system comprises the following steps: and optimizing the video content comparison system facing the compressed domain based on the optimization method of the video content comparison system facing the compressed domain.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and the related description of the above-described video content comparison method for a compressed domain may refer to the corresponding description in the embodiments of the video content comparison system for a compressed domain and the optimization method for a video content comparison system for a compressed domain, and are not described herein again.
A storage device according to a fourth embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system or the compressed domain-oriented video content comparison method.
A processing apparatus according to a fifth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the optimization method of the video content comparison system facing the compressed domain or the video content comparison method facing the compressed domain.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing embodiments, and are not described herein again.
It should be noted that, the video content comparison system for the compressed domain provided in the foregoing embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings; however, those skilled in the art will readily understand that the scope of the present invention is obviously not limited to these specific embodiments. Those skilled in the art may make equivalent changes or substitutions to the related technical features without departing from the principle of the invention, and the technical solutions after such changes or substitutions will fall within the protection scope of the invention.
Claims (10)
1. A compressed domain-oriented video content comparison system, characterized by comprising a first module, a second module and a classifier which are connected in sequence;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of a fusion feature vector of two input videos;
the classifier is a two-class network and is configured to perform binary classification of the comparison result based on the L1 distance output by the second module.
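As an illustration outside the claim language, the claimed pipeline — shared-weight feature learning over multiple compressed-domain modalities, fusion, element-wise L1 distance, and a two-class decision — can be sketched numerically. All dimensions, the tanh activations, and the concatenation-based fusion scheme are assumptions for this sketch, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the learned parameters (shapes are arbitrary).
W_feat = rng.standard_normal((4, 8))   # shared ("twin") feature weights
W_fuse = rng.standard_normal((16, 8))  # fusion over two 8-dim modality maps
w_clf = rng.standard_normal(8)         # two-class classifier weights

def feature_maps(modalities):
    """Feature learning module: one feature map per compressed-domain
    modality (e.g. I-frame data, motion vectors), with shared weights."""
    return [np.tanh(m @ W_feat) for m in modalities]

def fuse(maps):
    """Fusion module: concatenate the modality maps and project them
    to a single fusion feature vector."""
    return np.tanh(np.concatenate(maps) @ W_fuse)

def compare(video_a, video_b):
    """Second module + classifier: element-wise L1 distance between the
    two fusion vectors, then a sigmoid two-class decision."""
    l1 = np.abs(fuse(feature_maps(video_a)) - fuse(feature_maps(video_b)))
    return 1.0 / (1.0 + np.exp(-(l1 @ w_clf)))

# Two compressed-domain modalities of dimension 4 per "video".
va = [rng.standard_normal(4), rng.standard_normal(4)]
score = compare(va, va)  # identical inputs: L1 is all zeros, sigmoid(0) = 0.5
```

With identical inputs the L1 vector is zero, so the untrained classifier outputs exactly 0.5; the training of claims 4 to 6 is what turns this score into a meaningful match decision.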
2. The system according to claim 1, wherein the feature learning module is constructed based on a weight-sharing twin (Siamese) convolutional neural network.
3. The system according to claim 1, wherein the second module obtains the L1 distance by:
performing an element-wise difference on the fusion feature vectors of the two input videos to obtain the corresponding L1 distance.
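A minimal numeric reading of claim 3, under the assumption that "element-based difference" means the element-wise absolute difference: the distance kept is a vector of per-element differences (summing it would give the scalar L1 norm):

```python
import numpy as np

f1 = np.array([0.2, -0.5, 1.0])  # fusion feature vector of video 1
f2 = np.array([0.1, -0.1, 1.0])  # fusion feature vector of video 2

l1_vector = np.abs(f1 - f2)  # per-element |f1 - f2|, fed to the classifier
l1_scalar = l1_vector.sum()  # summing yields the scalar L1 norm
```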
4. An optimization method for a compressed domain-oriented video content comparison system, used for optimizing the compressed domain-oriented video content comparison system according to any one of claims 1 to 3, the method comprising:
training the first module based on a preset training sample to obtain an optimized first module;
constructing a new comparison system based on the optimized first module, the optimized second module and the classifier;
and fixing the parameters of the optimized first module based on a preset training sample, and training a classifier in the new comparison system to obtain the optimized comparison system.
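The second training stage of claim 4 — freeze the optimized first module, then train only the classifier — can be sketched with a toy frozen embedding and plain logistic regression on the element-wise distances. The tanh stand-in for the first module, the synthetic data, and the bare gradient-descent optimizer are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def frozen_fusion(x):
    # stand-in for the optimized first module; its parameters stay fixed
    return np.tanh(x)

# Toy pairs: positives are near-duplicates, negatives are unrelated.
pairs = []
for _ in range(20):
    a = rng.standard_normal(6)
    pairs.append((a, a + 0.01 * rng.standard_normal(6), 1))
for _ in range(20):
    pairs.append((rng.standard_normal(6), rng.standard_normal(6), 0))

def l1_features(a, b):
    return np.abs(frozen_fusion(a) - frozen_fusion(b))

# Train ONLY the classifier (logistic regression, cross-entropy loss).
w, bias, lr = np.zeros(6), 0.0, 0.5
for _ in range(300):
    for a, b, y in pairs:
        x = l1_features(a, b)
        p = 1.0 / (1.0 + np.exp(-(x @ w + bias)))
        grad = p - y              # d(cross-entropy)/d(logit)
        w -= lr * grad * x
        bias -= lr * grad

acc = np.mean([(l1_features(a, b) @ w + bias > 0) == y for a, b, y in pairs])
```

Because the first module is frozen, only `w` and `bias` move; on this easily separable toy data the classifier reaches near-perfect training accuracy.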
5. The method as claimed in claim 4, wherein the training of the first module is performed using a loss function L,
wherein N is the number of samples, D_n is the Euclidean distance between the fusion feature vectors of the two videos in the n-th sample pair, Y is a label indicating whether the two samples match, and m is a preset threshold.
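The quantities claim 5 defines (N sample pairs, Euclidean distance D_n, match label Y, margin m) are exactly those of the standard contrastive loss, so the formula for L is presumably:

```latex
L = \frac{1}{2N} \sum_{n=1}^{N} \left[ Y \, D_n^{2} + (1 - Y) \, \max\left(m - D_n,\; 0\right)^{2} \right]
```

Matched pairs (Y = 1) are pulled together by the D_n^2 term; unmatched pairs (Y = 0) are pushed apart until their distance exceeds the margin m.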
6. The method as claimed in claim 5, wherein the loss function used in training the classifier in the new comparison system is the binary cross-entropy loss.
7. The method for optimizing a compressed domain-oriented video content comparison system according to any one of claims 4-6, wherein the training samples are obtained by:
based on an offline video database, clipping the copied video segments present in different videos according to an annotation file; using the pairs of similar video segments clipped according to the annotation file as positive samples; and, for each original video, randomly selecting one video from the remaining video segments to form a negative-sample pair with it.
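The sampling scheme of claim 7, sketched with a hypothetical annotation format — all field names and file names below are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical annotation file: each entry says a copied segment appears
# in two different videos between times t0 and t1.
annotations = [
    {"src": "vid_a.mp4", "dst": "vid_b.mp4", "t0": 3.0, "t1": 9.5},
    {"src": "vid_c.mp4", "dst": "vid_d.mp4", "t0": 0.0, "t1": 4.2},
]
other_segments = ["vid_e.mp4", "vid_f.mp4", "vid_g.mp4"]  # unrelated videos

def build_pairs(annotations, other_segments):
    """Positive: the two clipped copies of the same content.
    Negative: the original clip vs. one randomly chosen unrelated segment."""
    positives, negatives = [], []
    for ann in annotations:
        clip_src = (ann["src"], ann["t0"], ann["t1"])
        clip_dst = (ann["dst"], ann["t0"], ann["t1"])
        positives.append((clip_src, clip_dst, 1))
        negatives.append((clip_src, random.choice(other_segments), 0))
    return positives + negatives

pairs = build_pairs(annotations, other_segments)
```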
8. A compressed domain-oriented video content comparison method, characterized in that the comparison method comprises the following steps:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
wherein:
the optimized comparison system is obtained by optimizing the compressed domain-oriented video content comparison system of any one of claims 1 to 3 with the optimization method for a compressed domain-oriented video content comparison system according to any one of claims 4 to 7.
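End to end, the three steps of claim 8 look as follows. `partial_decode` is a placeholder: in practice the bitstream would be partially decoded to expose compressed-domain signals (e.g. motion vectors, residuals) without full pixel reconstruction, and the final callable stands in for the optimized comparison system:

```python
def partial_decode(video_path):
    """Placeholder for partial decoding: returns per-modality
    compressed-domain arrays instead of fully decoded frames."""
    return {"iframe_residuals": [0.1, 0.2], "motion_vectors": [0.0, -0.3]}

def compare_videos(path_a, path_b, comparison_system):
    info_a = partial_decode(path_a)   # steps 1-2: acquire + partially decode
    info_b = partial_decode(path_b)
    return comparison_system(info_a, info_b)  # step 3: optimized system

# Toy "optimized system": declares a match when all modalities coincide.
is_match = compare_videos("query.mp4", "candidate.mp4",
                          lambda x, y: x == y)
```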
9. A storage device, having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system according to any one of claims 4 to 7 or the compressed domain-oriented video content comparison method according to claim 8.
10. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system according to any one of claims 4 to 7 or the compressed domain-oriented video content comparison method according to claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011086137.5A CN112215908B (en) | 2020-10-12 | 2020-10-12 | Compressed domain-oriented video content comparison system, optimization method and comparison method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112215908A (en) | 2021-01-12 |
CN112215908B CN112215908B (en) | 2022-12-02 |
Family
ID=74052819
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011086137.5A Active CN112215908B (en) | 2020-10-12 | 2020-10-12 | Compressed domain-oriented video content comparison system, optimization method and comparison method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112215908B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112990273A (en) * | 2021-02-18 | 2021-06-18 | 中国科学院自动化研究所 | Compressed domain-oriented video sensitive character recognition method, system and equipment |
CN114445918A (en) * | 2022-02-21 | 2022-05-06 | 支付宝(杭州)信息技术有限公司 | Living body detection method, device and equipment |
CN114666571A (en) * | 2022-03-07 | 2022-06-24 | 中国科学院自动化研究所 | Video sensitive content detection method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6763069B1 (en) * | 2000-07-06 | 2004-07-13 | Mitsubishi Electric Research Laboratories, Inc | Extraction of high-level features from low-level features of multimedia content |
CN110163079A (en) * | 2019-03-25 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Video detecting method and device, computer-readable medium and electronic equipment |
CN110175266A (en) * | 2019-05-28 | 2019-08-27 | 复旦大学 | A method of it is retrieved for multistage video cross-module state |
WO2019242222A1 (en) * | 2018-06-21 | 2019-12-26 | 北京字节跳动网络技术有限公司 | Method and device for use in generating information |
CN111046766A (en) * | 2019-12-02 | 2020-04-21 | 武汉烽火众智数字技术有限责任公司 | Behavior recognition method and device and computer storage medium |
CN111242173A (en) * | 2019-12-31 | 2020-06-05 | 四川大学 | RGBD salient object detection method based on twin network |
CN111401267A (en) * | 2020-03-19 | 2020-07-10 | 山东大学 | Video pedestrian re-identification method and system based on self-learning local feature characterization |
CN111626178A (en) * | 2020-05-24 | 2020-09-04 | 中南民族大学 | Compressed domain video motion recognition method and system based on new spatio-temporal feature stream |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6763069B1 (en) * | 2000-07-06 | 2004-07-13 | Mitsubishi Electric Research Laboratories, Inc | Extraction of high-level features from low-level features of multimedia content |
WO2019242222A1 (en) * | 2018-06-21 | 2019-12-26 | 北京字节跳动网络技术有限公司 | Method and device for use in generating information |
CN110163079A (en) * | 2019-03-25 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Video detecting method and device, computer-readable medium and electronic equipment |
CN110175266A (en) * | 2019-05-28 | 2019-08-27 | 复旦大学 | A method of it is retrieved for multistage video cross-module state |
CN111046766A (en) * | 2019-12-02 | 2020-04-21 | 武汉烽火众智数字技术有限责任公司 | Behavior recognition method and device and computer storage medium |
CN111242173A (en) * | 2019-12-31 | 2020-06-05 | 四川大学 | RGBD salient object detection method based on twin network |
CN111401267A (en) * | 2020-03-19 | 2020-07-10 | 山东大学 | Video pedestrian re-identification method and system based on self-learning local feature characterization |
CN111626178A (en) * | 2020-05-24 | 2020-09-04 | 中南民族大学 | Compressed domain video motion recognition method and system based on new spatio-temporal feature stream |
Non-Patent Citations (1)
Title |
---|
吴晓雨 (Wu Xiaoyu) et al., "Special video classification with multimodal feature fusion and multi-task learning" (多模态特征融合与多任务学习的特种视频分类), 《光学精密工程》 (Optics and Precision Engineering) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112990273A (en) * | 2021-02-18 | 2021-06-18 | 中国科学院自动化研究所 | Compressed domain-oriented video sensitive character recognition method, system and equipment |
CN112990273B (en) * | 2021-02-18 | 2021-12-21 | 中国科学院自动化研究所 | Compressed domain-oriented video sensitive character recognition method, system and equipment |
CN114445918A (en) * | 2022-02-21 | 2022-05-06 | 支付宝(杭州)信息技术有限公司 | Living body detection method, device and equipment |
CN114666571A (en) * | 2022-03-07 | 2022-06-24 | 中国科学院自动化研究所 | Video sensitive content detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN112215908B (en) | 2022-12-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112215908B (en) | Compressed domain-oriented video content comparison system, optimization method and comparison method | |
CN111382555B (en) | Data processing method, medium, device and computing equipment | |
US8787692B1 (en) | Image compression using exemplar dictionary based on hierarchical clustering | |
CN109614517B (en) | Video classification method, device, equipment and storage medium | |
TWI744827B (en) | Methods and apparatuses for compressing parameters of neural networks | |
KR102299958B1 (en) | Systems and methods for image compression at multiple, different bitrates | |
WO2021205065A1 (en) | Training a data coding system comprising a feature extractor neural network | |
TWI806199B (en) | Method for signaling of feature map information, device and computer program | |
CN112668559A (en) | Multi-mode information fusion short video emotion judgment device and method | |
US10972749B2 (en) | Systems and methods for reconstructing frames | |
CN111970509B (en) | Video image processing method, device and system | |
CN112990273B (en) | Compressed domain-oriented video sensitive character recognition method, system and equipment | |
WO2021205066A1 (en) | Training a data coding system for use with machines | |
CN116978011B (en) | Image semantic communication method and system for intelligent target recognition | |
CN115481283A (en) | Audio and video feature extraction method and device, electronic equipment and computer readable storage medium | |
CN113628116B (en) | Training method and device for image processing network, computer equipment and storage medium | |
KR102315077B1 (en) | System and method for the detection of multiple compression of image and video | |
US20220377342A1 (en) | Video encoding and video decoding | |
CN117014693A (en) | Video processing method, device, equipment and storage medium | |
CN116074574A (en) | Video processing method, device, equipment and storage medium | |
CN116778376B (en) | Content security detection model training method, detection method and device | |
US7747093B2 (en) | Method and apparatus for predicting the size of a compressed signal | |
Grycuk et al. | Neural video compression based on SURF scene change detection algorithm | |
KR20240090245A (en) | Scalable video coding system and method for machines | |
WO2023222313A1 (en) | A method, an apparatus and a computer program product for machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||