CN112215908B - Compressed domain-oriented video content comparison system, optimization method and comparison method - Google Patents


Info

Publication number
CN112215908B
CN112215908B
Authority
CN
China
Prior art keywords
module
video
compressed domain
comparison
video content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011086137.5A
Other languages
Chinese (zh)
Other versions
CN112215908A (en)
Inventor
李扬曦
缪亚男
袁庆升
胡卫明
李兵
刘雨帆
胡赛军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Original Assignee
Institute of Automation of Chinese Academy of Science
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science, National Computer Network and Information Security Management Center filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202011086137.5A priority Critical patent/CN112215908B/en
Publication of CN112215908A publication Critical patent/CN112215908A/en
Application granted granted Critical
Publication of CN112215908B publication Critical patent/CN112215908B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 — Image coding
    • G06T 9/002 — Image coding using neural networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/25 — Fusion techniques
    • G06F 18/253 — Fusion techniques of extracted features
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/084 — Backpropagation, e.g. using gradient descent
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 — Image coding
    • G06T 9/008 — Vector quantisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention belongs to the field of computer vision, and particularly relates to a compressed domain-oriented video content comparison system, an optimization method and a comparison method, aiming at solving the problem of the low efficiency of video content comparison performed with fully decoded video information. The comparison system of the invention comprises: a feature learning module configured to obtain feature maps of multiple modalities from several kinds of compressed domain information of the input video; a multi-modal compressed domain information fusion module configured to fuse the multi-modal feature maps output by the feature learning module into a fusion feature vector of the input video; a second module configured to obtain the L1 distance between the fusion feature vectors of two input videos; and a classifier, a binary classification network configured to classify the comparison result based on the L1 distance output by the second module. The invention can effectively extract high-level semantic information of video content and ensures both high speed and high performance of video content comparison.

Description

Compressed domain-oriented video content comparison system, optimization method and comparison method
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a compressed domain-oriented video content comparison system, an optimization method and a comparison method.
Background
In content-based video understanding systems, large volumes of video typically need to be processed. At present, more than 99% of internet video traffic is encoded with standards such as H.264 and H.265. Encoding reduces video volume greatly, by factors of tens to hundreds, but it also converts the image information in the video into indirect information, which can be restored to the image frames composing the video only by decoding. Most existing algorithms and systems for video recognition, comparison and retrieval need to decode the video into image frames and then process and analyse the frames of the image sequence. However, video decoding is extremely computation- and time-consuming, which greatly limits the practicality and flexibility of application systems, especially video retrieval and comparison systems and scenarios requiring real-time processing.
Therefore, schemes that understand, compare and identify video content in the compressed domain, under partial-decoding conditions, are an urgent research problem. Different from conventional video processing methods, compressed domain-oriented video comparison must operate directly on compressed data that is not decoded, or is decoded as little as possible, omitting the extra decompression and recompression steps and thereby greatly reducing the overall processing time of the system. Taking the video content comparison task as a representative, how to exploit the efficient but coarse nature of video compressed domain information and design a suitable network structure, so as to complete video content understanding efficiently, is the technical problem to be solved.
Disclosure of Invention
In order to solve the above-mentioned problem in the prior art, namely that comparing video contents using fully decoded video information is inefficient, a first aspect of the present invention provides a compressed domain-oriented video content comparison system, which comprises a first module, a second module and a classifier connected in sequence;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of fusion feature vectors of the two input videos;
the classifier is a binary classification network configured to classify the comparison result into two classes based on the L1 distance output by the second module.
In some preferred embodiments, the feature learning module is constructed based on a weight-shared twin convolutional neural network.
In some preferred embodiments, the method for the second module to obtain the L1 distance is:
the element-wise difference of the fusion feature vectors of the two input videos is taken to obtain the corresponding L1 distance.
In a second aspect of the present invention, a method for optimizing a compressed domain-oriented video content comparison system is provided, where the method is used for optimizing the compressed domain-oriented video content comparison system, and includes:
training the first module based on a preset training sample to obtain an optimized first module;
constructing a new comparison system based on the optimized first module, the optimized second module and the classifier;
and fixing the parameters of the optimized first module based on a preset training sample, and training the classifier in the new comparison system to obtain the optimized comparison system.
In some preferred embodiments, the "training of the first module" is performed using a loss function L of

$$L = \frac{1}{2N}\sum_{n=1}^{N}\left[\, Y D_n^2 + (1-Y)\max(m - D_n,\ 0)^2 \,\right]$$

wherein N is the number of sample pairs, $D_n$ is the Euclidean distance between the fusion feature vectors of the two videos in the nth sample pair, Y is the label indicating whether the two samples match, and m is a preset margin threshold.
In some preferred embodiments, the loss function used to "train the classifiers in the new comparison system" is the cross-entropy loss of the classifications.
In some preferred embodiments, the training sample is obtained by:
based on an offline video database, video clipping is carried out on copied video segments existing in different videos according to a label file, similar video segment pairs clipped according to the label file are used as positive samples, 1 video is randomly selected from other remaining video segments, and pairs formed by the videos and the original videos are used as negative samples.
The third aspect of the present invention provides a video content comparison method for a compressed domain, where the comparison method includes:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
wherein,
the obtaining method of the optimized comparison system comprises the following steps: and optimizing the video content comparison system facing the compressed domain based on the optimization method of the video content comparison system facing the compressed domain.
In a fourth aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, and the programs are adapted to be loaded and executed by a processor to implement the optimization method for the compressed domain-oriented video content comparison system or the compressed domain-oriented video content comparison method.
In a fifth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the optimization method of the video content comparison system facing the compressed domain or the video content comparison method facing the compressed domain.
The invention has the beneficial effects that:
1. The invention makes full use of the compressed domain information of the video and designs a deep twin neural network, which can effectively extract high-level semantic information of the video content and guarantees both high speed and high performance of video content comparison. By using compressed domain information instead of fully decoded video information, the amount of computation of the video content understanding task is greatly reduced.
2. The invention designs a multi-modal fusion scheme for compressed domain information, so that the information of different compressed domain modalities is effectively fused, constructing a high-level video semantic representation that combines video spatio-temporal information. The deep twin neural network effectively uses several kinds of coarse compressed domain information, improving the accuracy of video content comparison.
3. The method uses the property of the contrastive loss in the deep twin neural network, namely that the feature distance of a positive sample pair is made as small as possible and that of a negative sample pair as large as possible, so that the network learns an effect similar to an SVM (support vector machine) large-margin classifier; the learned video features are more discriminative and the network performance is more robust.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a block diagram of a compressed domain-oriented video content alignment system according to an embodiment of the present invention;
FIG. 2 is an algorithmic framework schematic of a deep twin neural network;
fig. 3 is a flowchart illustrating an optimization method of a compressed domain-oriented video content comparison system according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
The invention relates to a video content comparison system facing a compressed domain, which comprises a first module, a second module and a classifier which are connected in sequence as shown in figure 1;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of fusion feature vectors of the two input videos;
the classifier is a two-classification network and is configured to perform two classifications of the comparison result based on the L1 distance output by the second module.
For the purpose of clearly illustrating the invention, reference will be made to the following detailed description of the various parts of the invention taken in conjunction with the accompanying drawings.
The video content comparison system facing the compressed domain comprises a first module, a second module and a classifier which are connected in sequence.
This embodiment further comprises a video compressed domain information extraction module: before comparing video contents, video compressed domain information must be extracted from each video of the pair to be compared. The videos are partially decoded, and the video compressed domain information is extracted, including I frames, motion vectors and residuals.
The video compressed domain information extraction module in this embodiment uses the core video codec framework of FFmpeg and operates on H.264 bitstreams. For example, when decoding an I frame, the bitstream goes through steps such as entropy decoding, inverse quantization and inverse transformation. For a stream in which motion vectors are present in macroblock prediction, before entropy decoding it is necessary to first determine the prediction mode or motion vector (MV) of a macroblock and the coded block pattern (CBP), and then perform entropy decoding on luminance and chrominance separately. The FFmpeg-based source code is written in C++; unnecessary decoding information and steps are skipped while the key decoding steps are retained, completing the efficient extraction of compressed domain information. In addition, because the whole network in this embodiment is trained end-to-end, mixed compilation of C++ and Python must be completed in engineering, so that compressed domain information extracted from FFmpeg with C++ can exchange data directly during training with the PyTorch framework.
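The embodiment's extractor is custom C++ on FFmpeg internals; as a lighter-weight illustration, the stock ffmpeg command-line tool can already surface two of the three kinds of compressed domain information (I frames via the `select` filter, motion vectors via `-flags2 +export_mvs` and the `codecview` filter). The sketch below only builds the command lists and does not execute them; file names and output patterns are placeholders.

```python
def iframe_extract_cmd(src, out_pattern="iframe_%04d.png"):
    # Keep only intra-coded frames via the select filter; the comma inside
    # the expression is escaped so it is not read as a filtergraph separator.
    # -vsync vfr avoids duplicating frames to fill the original timestamps.
    return ["ffmpeg", "-i", src,
            "-vf", "select=eq(pict_type\\,I)",
            "-vsync", "vfr", out_pattern]

def motion_vector_cmd(src, out="mv_overlay.mp4"):
    # -flags2 +export_mvs asks the decoder to export motion vectors,
    # which the codecview filter then draws over the frames.
    return ["ffmpeg", "-flags2", "+export_mvs", "-i", src,
            "-vf", "codecview=mv=pf+bf+bb", out]
```

These commands still run the full decoder; the point of the embodiment's C++ path is precisely to skip the stages these commands pay for.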
1. First module
The first module comprises a feature learning module and a multi-modal compressed domain information fusion module; it forms a video similarity discrimination network based on a deep twin network and obtains an expressive video feature vector from the video compressed domain information of the input video pair.
The feature learning module is configured to obtain feature maps of multiple modalities from several kinds of compressed domain information of the input video. As shown in fig. 2, the module is a weight-sharing twin convolutional neural network: it takes the compressed domain information of a pair of videos as input, and learns each kind of compressed domain information, such as I frames and motion vectors, with a multi-stream convolutional neural network forming one branch of the twin network. Specifically, a ResNet-34 backbone is used for I frames and a ResNet-18 backbone for motion vectors, and the feature map output by layer4 of the ResNet structure is taken as the output of the learning module. The feature learning network structure of this design includes, but is not limited to, the above configuration.
The multi-modal compressed domain information fusion module is configured to fuse the multi-modal feature maps output by the feature learning module into a fusion feature vector of the input video. The module's inputs are the feature maps output by the compressed domain information learning networks of the different modalities. To fuse the multi-modal compressed domain features, the feature maps are stacked, and a convolution layer with kernel size 1 learns the weights of the different modalities while keeping the number of channels unchanged. The 1x1 convolution layer uses kaiming initialization, and during training its learning rate is set to twice the initial learning rate of the network, achieving faster convergence and effective fusion of the multi-modal information. Assuming the two kinds of compressed domain information are I frames and motion vectors, the feature learning module outputs FeatureMap1 and FeatureMap2, both of size (N, C, T, W, H); these are concatenated along the channel dimension to obtain FeatureMap3 of size (N, 2C, T, W, H); the channels are then re-weighted by the conv1x1 convolution to obtain the final FeatureMap of size (N, C, T, W, H); finally, a Flatten (flattening) operation yields the single fused video-level feature vector, completing the fusion of the multi-modal information.
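A minimal PyTorch sketch of this branch-then-fuse structure is given below. For brevity it uses tiny 2D convolutions and omits the temporal axis T; the real embodiment uses ResNet-34/ResNet-18 backbones and (N, C, T, W, H) maps, so every layer and size here is a stand-in assumption. The weight sharing of the twin structure is realised by applying the same module instance to both videos of a pair.

```python
import torch
import torch.nn as nn

class TwinFusionNet(nn.Module):
    """Sketch: per-modality branches (stand-ins for the ResNet backbones),
    channel-wise concatenation, 1x1-conv fusion with kaiming init, flatten."""

    def __init__(self, channels=8):
        super().__init__()
        self.iframe_branch = nn.Conv2d(3, channels, 3, padding=1)  # RGB I frame
        self.mv_branch = nn.Conv2d(2, channels, 3, padding=1)      # (dx, dy) motion field
        # 1x1 conv learns per-channel fusion weights and restores the
        # channel count from 2C back to C, as described in the text.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        nn.init.kaiming_normal_(self.fuse.weight)

    def forward(self, iframe, mv):
        f1 = self.iframe_branch(iframe)           # (N, C, H, W)
        f2 = self.mv_branch(mv)                   # (N, C, H, W)
        stacked = torch.cat([f1, f2], dim=1)      # (N, 2C, H, W)
        fused = self.fuse(stacked)                # (N, C, H, W)
        return torch.flatten(fused, start_dim=1)  # fused video-level vector
```

Calling the same `TwinFusionNet` on both videos of a pair gives the two fusion feature vectors that the second module differences.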
2. Second module
A second module configured to obtain an L1 distance of a fused feature vector of the two input videos.
The method for the second module to obtain the L1 distance is as follows: the element-wise difference of the fusion feature vectors of the two input videos is taken to obtain the corresponding L1 distance.
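The element-wise operation can be written in a few lines; the resulting vector of absolute differences (whose sum is the scalar L1 distance) is what is passed on to the classifier. Plain Python lists stand in for tensors here.

```python
def l1_feature(v1, v2):
    """Element-wise absolute difference of two fusion feature vectors.

    Summing the result gives the scalar L1 distance; the vector itself
    is what the binary classifier consumes."""
    assert len(v1) == len(v2), "fusion feature vectors must have equal length"
    return [abs(a - b) for a, b in zip(v1, v2)]
```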
3. Classifier
The classifier is a binary classification network configured to classify the comparison result into two classes based on the L1 distance output by the second module. The binary classification network in this embodiment is a fully connected layer whose two output neurons correspond to "similar" and "dissimilar", so that it can be determined whether a video is a copied video.
A second embodiment of the present invention provides a method for optimizing a video content comparison system oriented to a compressed domain, which is used for optimizing the video content comparison system oriented to the compressed domain.
Training samples need to be constructed before optimization; this embodiment adopts an offline sampling method, which effectively provides the large number of positive and negative sample pairs required by training. The public dataset VCDB is sampled offline: the copied video segments present in different videos are clipped according to the annotation file; a pair of similar video segments clipped according to the annotation file is taken as a positive sample; and 1 video is randomly selected from the remaining video segments and paired with the original video as a negative sample. This procedure is repeated to complete the construction of the dataset.
The optimization method of the embodiment, as shown in fig. 3, includes the following steps:
and S100, training the first module based on a preset training sample to obtain an optimized first module.
The training samples are fed in batches into the feature learning module of the first module, and feature maps of the different kinds of compressed domain information are obtained by forward propagation; the feature maps are then sent to the multi-modal compressed domain information fusion module of the first module to obtain the single feature vector of the video, and the network is trained by back propagation using the contrastive loss. The contrastive loss is defined as follows:
$$L = \frac{1}{2N}\sum_{n=1}^{N}\left[\, Y D_n^2 + (1-Y)\max(m - D_n,\ 0)^2 \,\right]$$

wherein

$$D_n = \left\| X_1 - X_2 \right\|_2 = \left( \sum_{p=1}^{P} \left( X_1^p - X_2^p \right)^2 \right)^{1/2}$$

represents the Euclidean distance between the fusion feature vectors $X_1$ and $X_2$ of the two videos in the nth sample pair, P represents the feature dimension of the fusion feature vector, Y is the label indicating whether the two samples match (Y = 1 means the two samples are similar or matching, Y = 0 means they do not match), m is a preset margin threshold, N is the number of sample pairs, and W is the length of the fusion feature vector output by the first module, chosen here as 512.
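The contrastive loss above (in the standard Hadsell-style form, with the 1/2 factor an assumption since the original equation image is unreadable) can be checked numerically with a few lines of plain Python:

```python
import math

def contrastive_loss(pairs, margin=1.0):
    """Contrastive loss over a batch of (x1, x2, y) samples, y=1 similar.

    Similar pairs are pulled together (penalty D_n^2); dissimilar pairs
    are pushed beyond the margin m (penalty max(m - D_n, 0)^2)."""
    total = 0.0
    for x1, x2, y in pairs:
        # Euclidean distance D_n between the two fusion feature vectors.
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))
```

For example, an identical positive pair costs 0, a coincident negative pair costs m²/2, and a negative pair already separated beyond the margin also costs 0, which is exactly the large-margin behaviour the "beneficial effects" section describes.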
And S200, constructing a new comparison system based on the optimized first module, the optimized second module and the classifier.
The parameters of the first module are fixed to the optimized values obtained in step S100, and a new comparison system is constructed together with the second module and the classifier.
And S300, fixing the parameters of the optimized first module based on a preset training sample, and training a classifier in the new comparison system to obtain the optimized comparison system.
During the training process of this step, the whole network is back-propagated using the classification cross-entropy loss. The training end condition is set as a number of iterations and/or a preset convergence criterion; the forward propagation and back propagation described above are repeated until the set number of iterations is reached or the network converges, and training then stops.
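The stage-2 step (freeze the optimised first module, train only the classifier with cross-entropy) can be sketched in PyTorch as follows. The linear layer standing in for the fusion network, the sizes and the hyperparameters are illustrative assumptions, not the patent's configuration.

```python
import torch
import torch.nn as nn

# Stand-ins: a frozen "first module" producing 512-d fusion vectors,
# and a fully connected classifier with two outputs (similar / dissimilar).
first_module = nn.Linear(16, 512)
classifier = nn.Linear(512, 2)

for p in first_module.parameters():
    p.requires_grad = False  # fix the optimised first-module parameters

optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One illustrative iteration over a batch of 4 video pairs.
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
labels = torch.tensor([1, 0, 1, 0])
l1 = (first_module(x1) - first_module(x2)).abs()  # element-wise L1 feature
loss = loss_fn(classifier(l1), labels)
optimizer.zero_grad()
loss.backward()   # gradients reach only the classifier
optimizer.step()
```

Because only `classifier.parameters()` are handed to the optimizer and the first module's parameters have `requires_grad = False`, the staged scheme of S100/S300 is respected.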
It should be noted that the training process of the present embodiment adopts a staged training method, and adopts two loss functions, including the contrast loss of step S100 and the cross-entropy loss of step S300.
This embodiment may also be organized as a system, for example an optimization system for the compressed domain-oriented video content comparison system, comprising: a first training module, an intermediate system construction module and a second training module.
The first training module is configured to perform training of the first module based on a preset training sample to obtain an optimized first module;
the intermediate system construction module is configured to construct a new comparison system based on the optimized first module, the optimized second module and the classifier;
and the second training module is configured to fix the parameters of the optimized first module based on a preset training sample, and train the classifier in the new comparison system to obtain the optimized comparison system.
As shown in fig. 2, after feature extraction by the feature learning module and multi-modal information fusion by the multi-modal compressed domain information fusion module are performed on the two videos respectively, the fused feature vectors are compared to obtain their difference, which is then classified by the fully connected layer to obtain the determination result. When training the first module, the contrastive loss is computed from the feature vectors after multi-modal information fusion and optimized by back propagation; when optimizing the classifier, the parameters of the first module are kept unchanged, and back propagation is performed through the cross-entropy loss on the classification results of the training samples.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and the related description of the optimization method described above may refer to the corresponding description in the foregoing system embodiment, and the specific working process and the related description of the optimization system described above may refer to the corresponding description in the foregoing optimization method embodiment, which are not repeated herein.
A video content comparison method for a compressed domain according to a third embodiment of the present invention includes:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
the obtaining method of the optimized comparison system comprises the following steps: and optimizing the video content comparison system facing the compressed domain based on the optimization method of the video content comparison system facing the compressed domain.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and the related description of the above-described video content comparison method for a compressed domain may refer to the corresponding description in the embodiments of the video content comparison system for a compressed domain and the optimization method for a video content comparison system for a compressed domain, and are not described herein again.
A storage device according to a fourth embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system or the compressed domain-oriented video content comparison method.
A processing apparatus according to a fifth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the optimization method of the video content comparison system facing the compressed domain or the video content comparison method facing the compressed domain.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing embodiments, and are not described herein again.
It should be noted that, the video content comparison system for the compressed domain provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the foregoing functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the foregoing functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings. However, it will be readily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of the related technical features may be made by those skilled in the art without departing from the principle of the present invention, and the technical solutions after such changes or substitutions fall within the protection scope of the present invention.

Claims (7)

1. A compressed domain-oriented video content comparison system, characterized by comprising a first module, a second module and a classifier which are connected in sequence;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to respectively acquire feature maps of multiple modalities based on multiple kinds of compressed domain information of an input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the feature maps of the multiple modalities output by the feature learning module, to obtain a fusion feature vector of the input video;
the second module is configured to obtain the L1 distance between the fusion feature vectors of the two input videos;
the classifier is a binary classification network and is configured to perform binary classification of the comparison result based on the L1 distance output by the second module;
the optimization method of the comparison system comprises the following steps:
training the first module based on a preset training sample to obtain an optimized first module;
wherein the first module is trained with the following loss function L:
L = (1/(2N)) · Σ_{n=1}^{N} [ Y·D_n² + (1 − Y)·max(m − D_n, 0)² ]
wherein N is the number of samples, D_n is the Euclidean distance between the fusion feature vectors of the two videos in the n-th sample pair, Y is a label indicating whether the two samples match, and m is a preset threshold (margin);
constructing a new comparison system based on the optimized first module, the second module and the classifier;
based on a preset training sample, fixing the parameters of the optimized first module and training the classifier in the new comparison system, to obtain an optimized comparison system; wherein the loss function adopted for training the classifier in the new comparison system is the classification cross-entropy loss.
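As an illustrative sketch of the claim-1 pipeline and its first training stage: the toy dimensions, the averaging fusion operator, and the random feature weights below are assumptions made for the sketch, not the patented architecture; only the L1 distance and the contrastive loss form follow the symbols of the claim.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_MODALITIES, IN_DIM, FEAT_DIM = 3, 16, 8   # assumed toy sizes
W = rng.normal(size=(NUM_MODALITIES, IN_DIM, FEAT_DIM))  # weights shared by both branches

def fusion_feature(video_modalities):
    """First module: one feature map per compressed-domain modality, then fusion."""
    feats = [np.tanh(m @ W[i]) for i, m in enumerate(video_modalities)]
    return np.mean(feats, axis=0)  # fusion operator assumed to be averaging

def l1_distance(f1, f2):
    """Second module: element-wise absolute difference of the fusion vectors."""
    return np.abs(f1 - f2)

def contrastive_loss(pairs, m):
    """Stage-1 loss for the first module.
    pairs: (D_n, Y) with D_n the Euclidean distance between the two fusion
    vectors and Y = 1 iff the pair matches; m is the preset margin threshold."""
    return sum(y * d**2 + (1 - y) * max(m - d, 0.0)**2
               for d, y in pairs) / (2 * len(pairs))

video_a = [rng.normal(size=IN_DIM) for _ in range(NUM_MODALITIES)]
video_b = [rng.normal(size=IN_DIM) for _ in range(NUM_MODALITIES)]
fa, fb = fusion_feature(video_a), fusion_feature(video_b)
print(l1_distance(fa, fb).shape)                      # (8,)
print(contrastive_loss([(0.0, 1), (2.0, 0)], m=1.0))  # 0.0: both pairs already ideal
```

A matching pair (Y = 1) is penalized by its squared distance, while a non-matching pair (Y = 0) is penalized only when its distance falls inside the margin m, which is what pulls copies together and pushes unrelated videos apart in the fusion feature space.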
2. The compressed domain-oriented video content comparison system according to claim 1, wherein the feature learning module is constructed based on a weight-sharing twin convolutional neural network.
3. The system according to claim 1, wherein the second module obtains the L1 distance by:
performing an element-wise absolute difference on the fusion feature vectors of the two input videos to obtain the corresponding L1 distance.
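As a concrete numeric illustration of this element-wise operation (the toy vectors stand in for real fusion feature vectors):

```python
# L1 distance of two fusion feature vectors, taken element by element.
def l1_distance(v1, v2):
    return [abs(a - b) for a, b in zip(v1, v2)]

print(l1_distance([2.0, 9.0, 5.0], [1.0, 4.0, 8.0]))  # [1.0, 5.0, 3.0]
```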
4. The system according to claim 1, wherein the training samples are obtained by:
based on an offline video database, clipping the copied video segments present in different videos according to a label file; taking the similar video segment pairs clipped according to the label file as positive samples; and, for each original video, randomly selecting 1 video from the other remaining video segments and taking the pair formed by that video and the original video as a negative sample.
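A minimal sketch of this pair construction, assuming the label file has been parsed into (original, copy) tuples; the data layout and segment names are hypothetical.

```python
import random

def build_pairs(annotations, all_segments, seed=0):
    """Build training pairs from an offline video database.

    annotations: (original_segment, copied_segment) tuples clipped according
    to the label file; each yields a positive pair (label 1). For each
    original, one segment drawn at random from the remaining segments
    forms a negative pair (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for original, copy in annotations:
        pairs.append((original, copy, 1))  # similar clip pair -> positive
        candidates = [s for s in all_segments if s not in (original, copy)]
        pairs.append((original, rng.choice(candidates), 0))  # random -> negative
    return pairs

segments = ["seg_a", "seg_a_copy", "seg_b", "seg_c", "seg_d"]
print(build_pairs([("seg_a", "seg_a_copy")], segments))
```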
5. A compressed domain-oriented video content comparison method, characterized in that the comparison method comprises:
acquiring a video pair to be compared;
respectively carrying out partial decoding on the two videos in the video pair to be compared, and extracting video compressed domain information;
obtaining a comparison result through the compressed domain-oriented video content comparison system of any one of claims 1-4.
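The end-to-end flow of the method can be summarized as follows; the decoding function and the comparison-system call are placeholders, since partial entropy decoding of a real codec is outside the scope of this sketch.

```python
def partial_decode(video_path):
    """Placeholder: partially decode the bitstream and return compressed
    domain information (e.g. motion vectors, residuals) without full pixel
    reconstruction. A real implementation would use a codec library."""
    return {"motion_vectors": [], "residuals": [], "source": video_path}

def compare(video_a, video_b, comparison_system):
    # Steps 1-2: acquire the video pair and extract compressed domain info.
    info_a = partial_decode(video_a)
    info_b = partial_decode(video_b)
    # Step 3: feed both into the trained comparison system of claims 1-4.
    return comparison_system(info_a, info_b)

# Dummy stand-in system: identical sources count as a match.
same = compare("a.mp4", "a.mp4", lambda x, y: x["source"] == y["source"])
print(same)  # True
```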
6. A storage device having a plurality of programs stored therein, characterized in that the programs are adapted to be loaded and executed by a processor to implement the compressed domain-oriented video content comparison method of claim 5.
7. A processing device, comprising a processor adapted to execute various programs, and a storage device adapted to store a plurality of programs; characterized in that the programs are adapted to be loaded and executed by the processor to implement the compressed domain-oriented video content comparison method of claim 5.
CN202011086137.5A 2020-10-12 2020-10-12 Compressed domain-oriented video content comparison system, optimization method and comparison method Active CN112215908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011086137.5A CN112215908B (en) 2020-10-12 2020-10-12 Compressed domain-oriented video content comparison system, optimization method and comparison method


Publications (2)

Publication Number Publication Date
CN112215908A CN112215908A (en) 2021-01-12
CN112215908B true CN112215908B (en) 2022-12-02

Family

ID=74052819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011086137.5A Active CN112215908B (en) 2020-10-12 2020-10-12 Compressed domain-oriented video content comparison system, optimization method and comparison method

Country Status (1)

Country Link
CN (1) CN112215908B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990273B (en) * 2021-02-18 2021-12-21 中国科学院自动化研究所 Compressed domain-oriented video sensitive character recognition method, system and equipment
CN114445918A (en) * 2022-02-21 2022-05-06 支付宝(杭州)信息技术有限公司 Living body detection method, device and equipment
CN114666571B (en) * 2022-03-07 2024-06-14 中国科学院自动化研究所 Video sensitive content detection method and system

Citations (1)

Publication number Priority date Publication date Assignee Title
WO2019242222A1 (en) * 2018-06-21 2019-12-26 北京字节跳动网络技术有限公司 Method and device for use in generating information

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US6763069B1 (en) * 2000-07-06 2004-07-13 Mitsubishi Electric Research Laboratories, Inc Extraction of high-level features from low-level features of multimedia content
CN110163079A (en) * 2019-03-25 2019-08-23 腾讯科技(深圳)有限公司 Video detecting method and device, computer-readable medium and electronic equipment
CN110175266B (en) * 2019-05-28 2020-10-30 复旦大学 Cross-modal retrieval method for multi-segment video
CN111046766A (en) * 2019-12-02 2020-04-21 武汉烽火众智数字技术有限责任公司 Behavior recognition method and device and computer storage medium
CN111242173B (en) * 2019-12-31 2021-03-02 四川大学 RGBD salient object detection method based on twin network
CN111401267B (en) * 2020-03-19 2023-06-13 山东大学 Video pedestrian re-identification method and system based on self-learning local feature characterization
CN111626178B (en) * 2020-05-24 2020-12-01 中南民族大学 Compressed domain video motion recognition method and system based on new spatio-temporal feature stream

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
WO2019242222A1 (en) * 2018-06-21 2019-12-26 北京字节跳动网络技术有限公司 Method and device for use in generating information

Non-Patent Citations (1)

Title
Special video classification via multi-modal feature fusion and multi-task learning; Wu Xiaoyu et al.; Optics and Precision Engineering; 2020-05-13 (No. 05); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant