CN112215908A - Compressed domain-oriented video content comparison system, optimization method and comparison method - Google Patents
Info
- Publication number
- CN112215908A (application CN202011086137.5A)
- Authority
- CN
- China
- Prior art keywords
- module
- video
- compressed domain
- video content
- comparison
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/008—Vector quantisation
Abstract
The invention belongs to the field of computer vision, and specifically relates to a compressed-domain-oriented video content comparison system, an optimization method and a comparison method, aiming to solve the low efficiency of comparing video content with fully decoded information. The comparison system of the invention comprises: a feature learning module, used to obtain feature maps of multiple modalities from several kinds of compressed-domain information of an input video; a multi-modal compressed-domain information fusion module, used to fuse the multi-modal feature maps output by the feature learning module into a fusion feature vector of the input video; a second module configured to obtain the L1 distance between the fusion feature vectors of two input videos; and a classifier, a binary classification network configured to classify the comparison result based on the L1 distance output by the second module. The invention can effectively extract high-level semantic information of the video content and ensures both high speed and high performance of video content comparison.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a video content comparison system, an optimization method and a comparison method for a compressed domain.
Background
In content-based video understanding systems, a large amount of video typically needs to be processed. At present, more than 99% of internet video traffic is encoded with standards such as H.264 and H.265. Encoding reduces the volume of the video by a factor of tens to hundreds, but it also converts the image information in the video into an indirect representation that can be restored to the image frames of the video only by decoding. Most existing algorithms or systems for video recognition, comparison, retrieval, etc. must first decode the video into image frames and then process and analyze the resulting image sequence. However, video decoding is very computation- and time-consuming, which greatly limits the practicality and flexibility of application systems, especially video retrieval and comparison systems and settings that require real-time processing.
Therefore, research oriented to the compressed domain, i.e., schemes for understanding, comparing and identifying video content under partial-decoding conditions, addresses an urgent problem. Unlike conventional video processing methods, a compressed-domain-oriented video comparison method must operate directly on compressed data that is not decoded, or decoded as little as possible, omitting the extra decompression and recompression steps and thereby greatly reducing the overall processing time of the system. Taking the video content comparison task as a representative, the technical problem to be solved is how to exploit the efficient but coarse nature of video compressed-domain information and design a suitable network structure so that video content understanding can be completed efficiently.
Disclosure of Invention
In order to solve the above-mentioned problem in the prior art, namely the low efficiency of comparing video content using fully decoded video information, a first aspect of the present invention provides a compressed-domain-oriented video content comparison system, which comprises a first module, a second module and a classifier connected in sequence;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of a fusion feature vector of two input videos;
the classifier is a binary classification network configured to perform binary classification of the comparison result based on the L1 distance output by the second module.
In some preferred embodiments, the feature learning module is constructed based on a weight-shared twin convolutional neural network.
In some preferred embodiments, the second module obtains the L1 distance as follows: the element-wise absolute difference of the fusion feature vectors of the two input videos is computed to obtain the corresponding L1 distance vector.
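As a minimal illustration (plain Python, not the patent's implementation), the element-wise L1 distance can be sketched as:

```python
def l1_distance(u, v):
    """Element-wise absolute difference of two equal-length fusion feature
    vectors; the resulting vector (not its scalar sum) is what the
    classifier receives."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return [abs(a - b) for a, b in zip(u, v)]
```

For example, `l1_distance([1.0, 2.0], [0.5, 3.0])` yields `[0.5, 1.0]`.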
In a second aspect of the present invention, a method for optimizing a compressed domain-oriented video content comparison system is provided, where the method is used for optimizing the compressed domain-oriented video content comparison system, and includes:
training the first module based on a preset training sample to obtain an optimized first module;
constructing a new comparison system based on the optimized first module, the second module and the classifier;
and fixing the parameters of the optimized first module based on a preset training sample, and training a classifier in the new comparison system to obtain the optimized comparison system.
In some preferred embodiments, the "training of the first module" is performed using the contrastive loss function

L = (1/2N) · Σ_{n=1}^{N} [ Y·D_n² + (1 − Y)·max(m − D_n, 0)² ]

wherein N is the number of samples, D_n is the Euclidean distance between the fusion feature vectors of the two videos in the n-th sample pair, Y is the label indicating whether the two samples match, and m is a preset margin threshold.
In some preferred embodiments, the loss function used in training the classifier in the new comparison system is the cross-entropy loss of the classification.
In some preferred embodiments, the training sample is obtained by:
based on an offline video database, video clipping is carried out on copied video segments existing in different videos according to a label file, similar video segment pairs clipped according to the label file are used as positive samples, 1 video is randomly selected from other remaining video segments, and pairs formed by the videos and the original videos are used as negative samples.
The third aspect of the present invention provides a video content comparison method for a compressed domain, where the comparison method includes:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
wherein:
the obtaining method of the optimized comparison system comprises the following steps: and optimizing the video content comparison system facing the compressed domain based on the optimization method of the video content comparison system facing the compressed domain.
In a fourth aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, and the programs are adapted to be loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system or the compressed domain-oriented video content comparison method.
In a fifth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the optimization method of the video content comparison system facing the compressed domain or the video content comparison method facing the compressed domain.
The invention has the beneficial effects that:
1. according to the invention, the compressed domain information of the video is fully used, the deep twin neural network is designed, the high-level semantic information of the video content can be effectively extracted, and the high speed and the high performance of the comparison of the video content are ensured. By using the compressed domain information instead of the information of the video full decoding, the calculation amount of the video content understanding task is greatly reduced.
2. The invention designs a multi-modal fusion scheme for compressed-domain information, so that the different modalities of the compressed domain are fused effectively and a representation of high-level video semantics combining spatio-temporal information is constructed. The deep twin neural network makes effective use of several kinds of coarse compressed-domain information, improving the precision of video content comparison.
3. The invention exploits the property of the contrastive loss in the deep twin neural network, namely that the feature distance of a positive sample pair is made as small as possible and that of a negative pair as large as possible, so that the network learns an effect similar to a large-margin SVM (support vector machine) classifier; the learned video features are more discriminative and the network performance is more robust.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a block diagram of a compressed-domain-oriented video content comparison system according to an embodiment of the present invention;
FIG. 2 is an algorithmic framework schematic of a deep twin neural network;
fig. 3 is a flowchart illustrating an optimization method of a compressed domain-oriented video content comparison system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention relates to a video content comparison system facing a compressed domain, which comprises a first module, a second module and a classifier which are connected in sequence as shown in figure 1;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of a fusion feature vector of two input videos;
the classifier is a binary classification network configured to perform binary classification of the comparison result based on the L1 distance output by the second module.
For the purpose of more clearly illustrating the present invention, reference is now made to the following detailed description taken in conjunction with the accompanying drawings.
The video content comparison system facing the compressed domain comprises a first module, a second module and a classifier which are connected in sequence.
This embodiment further includes a video compressed-domain information extraction module: before comparing video contents, the compressed-domain information of each video in the pair to be compared must be extracted. The videos are partially decoded and the compressed-domain information, including I frames, motion vectors and residuals, is extracted.
The video compressed-domain information extraction module in this embodiment uses the core video codec framework of FFmpeg with the codec flow of an H.264 stream; for example, when decoding an I frame, entropy decoding, inverse quantization and inverse transformation are performed on the bitstream. For a stream in which motion vectors are present in macroblock prediction, before entropy decoding is performed, the prediction mode or motion vector (MV) of each macroblock and its coded block pattern (CBP) must first be determined, and entropy decoding is then performed on luminance and chrominance separately. The FFmpeg-based source code is reworked in C++ so that the key decoding steps are retained while unnecessary decoding information and processing is skipped, achieving efficient extraction of the compressed-domain information. In addition, because the whole network in this embodiment is trained end to end, a mixed C++/Python build is needed in engineering so that the compressed-domain information extracted from FFmpeg with C++ can exchange data directly with training under the PyTorch framework.
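The patent's extractor modifies FFmpeg's decoder in C++; the stock FFmpeg CLI does not expose motion vectors or residuals, but the I-frame part can be approximated with the standard `select` filter. A hedged sketch (file names are placeholders):

```python
def iframe_extract_cmd(src, out_pattern="iframes_%04d.png"):
    """Build an FFmpeg command line that keeps only the I-frames of `src`.

    select='eq(pict_type,I)' passes intra-coded frames through the filter
    graph; -vsync vfr writes one output image per selected frame.
    """
    return [
        "ffmpeg", "-i", src,
        "-vf", "select='eq(pict_type,I)'",
        "-vsync", "vfr",
        out_pattern,
    ]
```

Motion vectors and residuals live deeper in the decode loop and, as described above, are reached by modifying the decoder sources rather than through CLI options.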
1. First module
The first module comprises the feature learning module and the multi-modal compressed-domain information fusion module; together they form a video similarity discrimination network based on a deep twin network, used to obtain the representative feature vector of each video from the compressed-domain information of the input video pair.
The feature learning module is configured to obtain feature maps of multiple modalities from several kinds of compressed-domain information of the input video. As shown in fig. 2, the module is a weight-sharing twin convolutional neural network: it takes the compressed-domain information of a pair of videos as input and learns each kind of compressed-domain information, such as I frames and motion vectors, with a multi-stream convolutional neural network serving as one branch of the twin network. Specifically, a ResNet-34 backbone is used for I frames and a ResNet-18 backbone for motion vectors, and the feature map output by layer4 of the ResNet structure is taken as the output of the learning module. The feature learning network includes, but is not limited to, this configuration.
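A minimal PyTorch sketch of the weight-sharing twin structure follows. A tiny stand-in CNN replaces the ResNet-34/ResNet-18 backbones so the key point, one set of weights applied to both videos of the pair, stays visible; it is an illustration, not the patent's network:

```python
import torch
import torch.nn as nn

class TwinBranch(nn.Module):
    """Weight-shared twin ('siamese') feature learner, minimal sketch.

    The patent uses ResNet-34 for I frames and ResNet-18 for motion
    vectors, truncated after layer4; a tiny stand-in CNN is used here.
    """
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, video_a, video_b):
        # The SAME backbone (same parameters) processes both inputs:
        return self.backbone(video_a), self.backbone(video_b)
```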
The multi-modal compressed-domain information fusion module is configured to fuse the multi-modal feature maps output by the feature learning module into a fusion feature vector of the input video. Its inputs are the feature maps output by the compressed-domain learning networks of the different modalities. Fusion proceeds by first stacking the feature maps and then learning weights for the different modalities with a convolution layer of kernel size 1 that keeps the number of channels unchanged; the 1x1 convolution layer uses Kaiming initialization, and during training its learning rate is set to twice the initial learning rate of the network, which speeds up convergence and fuses the multi-modal information effectively. Assuming the two kinds of compressed-domain information are I frames and motion vectors, the feature learning module outputs feature map 1 (FeatureMap1) and feature map 2 (FeatureMap2), each of size (N, C, T, W, H). These are concatenated along the channel dimension to give feature map 3 (FeatureMap3) of size (N, 2C, T, W, H), which is then re-weighted across channels by the conv1x1 to give the final feature map (FeatureMap) of size (N, C, T, W, H). A Flatten (flattening) operation then yields the single fused video-level feature vector, completing the fusion of the multi-modal information.
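The stack-then-reweight fusion described above can be sketched in PyTorch as follows, using a 1x1x1 3-D convolution since the feature maps are five-dimensional (N, C, T, W, H); the channel counts are illustrative:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Fuse per-modality feature maps: stack on channels, re-weight with a
    kernel-size-1 convolution, then flatten to one video-level vector.

    Kaiming initialization follows the description above; the per-layer
    learning-rate doubling belongs to optimizer setup and is omitted here.
    """
    def __init__(self, channels):
        super().__init__()
        # 2C -> C: learns a weighting between the two modalities
        self.conv1x1 = nn.Conv3d(2 * channels, channels, kernel_size=1)
        nn.init.kaiming_normal_(self.conv1x1.weight)

    def forward(self, feat_iframe, feat_mv):
        stacked = torch.cat([feat_iframe, feat_mv], dim=1)  # (N, 2C, T, W, H)
        fused = self.conv1x1(stacked)                       # (N, C, T, W, H)
        return fused.flatten(start_dim=1)                   # (N, C*T*W*H)
```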
2. Second module
A second module configured to obtain an L1 distance of a fused feature vector of two input videos.
The second module obtains the L1 distance as follows: the element-wise absolute difference of the fusion feature vectors of the two input videos is computed to obtain the corresponding L1 distance vector.
3. Classifier
The classifier is a binary classification network configured to perform binary classification of the comparison result based on the L1 distance output by the second module. In this embodiment the binary classification network is a fully connected layer whose two output neurons correspond to "similar" and "dissimilar", from which it can be decided whether a video is a copied video.
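A hedged PyTorch sketch of such a classifier, assuming the 512-length fusion vector mentioned in the training description:

```python
import torch
import torch.nn as nn

# A single fully connected layer maps the element-wise L1 distance vector
# (length 512 per the training description) to two logits interpreted as
# "similar" vs "dissimilar".
classifier = nn.Linear(512, 2)

def decide(l1_vec):
    """Return 1 where the pair is judged similar (a copy), else 0."""
    return classifier(l1_vec).argmax(dim=1)
```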
A second embodiment of the present invention provides an optimization method for a compressed domain-oriented video content comparison system, which is used for optimizing the compressed domain-oriented video content comparison system.
Training samples must be constructed before optimization. This embodiment adopts an offline sampling method, which effectively yields the large number of positive and negative sample pairs required for training. The public data set VCDB is sampled offline: copied video segments that appear in different videos are clipped according to the annotation file; each pair of similar video segments clipped according to the annotation file is used as a positive sample, and for each original video, 1 video is randomly selected from the remaining video segments to form a negative sample pair with it. Repeating this procedure completes the construction of the data set.
The optimization method of the embodiment, as shown in fig. 3, includes the following steps:
and S100, training the first module based on a preset training sample to obtain an optimized first module.
The obtained training samples are fed in batches into the feature learning module of the first module, and feature maps of the different kinds of compressed-domain information are obtained by forward propagation; the feature maps are then sent to the multi-modal compressed-domain information fusion module of the first module to obtain the single feature vector of each video, and the network is trained by back-propagation with the contrastive loss. The contrastive loss is defined as

L = (1/2N) · Σ_{n=1}^{N} [ Y·D_n² + (1 − Y)·max(m − D_n, 0)² ]

wherein D_n = ||X_1 − X_2||_2 = sqrt( Σ_{i=1}^{P} (X_1,i − X_2,i)² ) is the Euclidean distance between the fusion feature vectors X_1 and X_2 of the two videos in the n-th sample pair, P is the feature dimension of the fusion feature vector, Y is the label indicating whether the two samples match (Y = 1 means the two samples are similar or matched, Y = 0 means they do not match), m is the preset margin threshold, N is the number of samples, and W is the length of the fusion feature vector output by the first module, set to 512 here.
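Under the definitions above, the contrastive loss can be sketched in PyTorch as follows (the margin value is illustrative):

```python
import torch

def contrastive_loss(x1, x2, y, m=1.0):
    """Contrastive loss matching the formula above.

    x1, x2: (N, P) fusion feature vectors of the paired videos
    y:      (N,) float labels, 1 = matching pair, 0 = non-matching
    m:      margin threshold (value here is arbitrary)
    """
    d = torch.norm(x1 - x2, p=2, dim=1)                                # D_n
    per_pair = y * d.pow(2) + (1 - y) * torch.clamp(m - d, min=0).pow(2)
    return per_pair.mean() / 2                                         # (1/2N) * sum
```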
And S200, constructing a new comparison system based on the optimized first module, the second module and the classifier.
Based on the optimized parameters obtained in step S100, the parameters of the first module are fixed, and the new comparison system is constructed together with the second module and the classifier.
And S300, fixing the parameters of the optimized first module based on a preset training sample, and training a classifier in the new comparison system to obtain the optimized comparison system.
During the training in this step, the whole network is back-propagated using the cross-entropy loss of the classification. The training end condition is set as a number of iterations and/or a preset convergence criterion; the forward and backward propagation described above are repeated for the set number of iterations until the network converges, at which point training stops.
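The second training stage, with the first module frozen and only the classifier updated under the cross-entropy loss, can be sketched in PyTorch as follows; module, loader and hyperparameter names are illustrative:

```python
import torch
import torch.nn as nn

def train_classifier_stage(first_module, classifier, loader, epochs=1, lr=1e-3):
    """Stage-two training sketch: freeze the optimized first module and
    update only the classifier with the cross-entropy loss.

    `loader` yields (inputs_a, inputs_b, label) batches.
    """
    for p in first_module.parameters():
        p.requires_grad = False            # fix the stage-one weights
    first_module.eval()
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for a, b, label in loader:
            with torch.no_grad():          # first module is not updated
                va, vb = first_module(a), first_module(b)
            logits = classifier((va - vb).abs())   # element-wise L1 distance
            loss = ce(logits, label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```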
It should be noted that the training process of this embodiment adopts a staged training scheme with two loss functions: the contrastive loss of step S100 and the cross-entropy loss of step S300.
This embodiment may also be implemented as a system, namely an optimization system of the compressed-domain-oriented video content comparison system, which comprises: a first training module, an intermediate system construction module and a second training module.
The first training module is configured to perform training of the first module based on a preset training sample to obtain an optimized first module;
the intermediate system construction module is configured to construct a new comparison system based on the optimized first module, the second module and the classifier;
and the second training module is configured to fix the parameters of the optimized first module based on a preset training sample, and train the classifier in the new comparison system to obtain the optimized comparison system.
As shown in fig. 2, after the two videos are each passed through the feature learning module for feature extraction and through the multi-modal compressed-domain information fusion module for multi-modal information fusion, the fused feature vectors are differenced element by element to obtain the difference between the two fusion feature vectors, which is then classified by the fully connected layer to obtain the decision result. When the first module is trained, the contrastive loss is computed on the feature vectors after multi-modal fusion and then back-propagated for optimization; when the classifier is optimized, the parameters of the first module are kept unchanged, and back-propagation uses the cross-entropy loss on the classification results of the training samples.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and the related description of the optimization method described above may refer to the corresponding description in the foregoing system embodiment, and the specific working process and the related description of the optimization system described above may refer to the corresponding description in the foregoing optimization method embodiment, which are not repeated herein.
A video content comparison method for a compressed domain according to a third embodiment of the present invention includes:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
the obtaining method of the optimized comparison system comprises the following steps: and optimizing the video content comparison system facing the compressed domain based on the optimization method of the video content comparison system facing the compressed domain.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and the related description of the above-described video content comparison method for a compressed domain may refer to the corresponding description in the embodiments of the video content comparison system for a compressed domain and the optimization method for a video content comparison system for a compressed domain, and are not described herein again.
A storage device according to a fourth embodiment of the present invention stores a plurality of programs, which are suitable for being loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system or the compressed domain-oriented video content comparison method.
A processing apparatus according to a fifth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to realize the optimization method of the video content comparison system facing the compressed domain or the video content comparison method facing the compressed domain.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing embodiments, and are not described herein again.
It should be noted that, the video content comparison system for the compressed domain provided in the foregoing embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the above described functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program, when executed by a Central Processing Unit (CPU), performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings; however, those skilled in the art will readily understand that the scope of the present invention is obviously not limited to these specific embodiments. Those skilled in the art may make equivalent changes or substitutions to the related technical features without departing from the principle of the invention, and the technical solutions after such changes or substitutions will fall within the protection scope of the invention.
Claims (10)
1. A compressed domain-oriented video content comparison system, characterized by comprising a first module, a second module and a classifier which are connected in sequence;
the first module comprises a feature learning module and a multi-modal compressed domain information fusion module; the feature learning module is configured to obtain feature maps of multiple modalities based on multiple kinds of compressed domain information of the input video; the multi-modal compressed domain information fusion module is configured to perform information fusion on the multi-modal feature maps output by the feature learning module to obtain a fusion feature vector of the input video;
the second module is configured to obtain an L1 distance of a fusion feature vector of two input videos;
the classifier is a two-class network and is configured to perform binary classification of the comparison result based on the L1 distance output by the second module.
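As an illustration outside the claim language, the claimed pipeline — shared-weight feature learning over multiple compressed-domain modalities, fusion, element-wise L1 distance, and a two-class decision — can be sketched numerically. All dimensions, the tanh activations, and the concatenation-based fusion scheme are assumptions for this sketch, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the learned parameters (shapes are arbitrary).
W_feat = rng.standard_normal((4, 8))   # shared ("twin") feature weights
W_fuse = rng.standard_normal((16, 8))  # fusion over two 8-dim modality maps
w_clf = rng.standard_normal(8)         # two-class classifier weights

def feature_maps(modalities):
    """Feature learning module: one feature map per compressed-domain
    modality (e.g. I-frame data, motion vectors), with shared weights."""
    return [np.tanh(m @ W_feat) for m in modalities]

def fuse(maps):
    """Fusion module: concatenate the modality maps and project them
    to a single fusion feature vector."""
    return np.tanh(np.concatenate(maps) @ W_fuse)

def compare(video_a, video_b):
    """Second module + classifier: element-wise L1 distance between the
    two fusion vectors, then a sigmoid two-class decision."""
    l1 = np.abs(fuse(feature_maps(video_a)) - fuse(feature_maps(video_b)))
    return 1.0 / (1.0 + np.exp(-(l1 @ w_clf)))

# Two compressed-domain modalities of dimension 4 per "video".
va = [rng.standard_normal(4), rng.standard_normal(4)]
score = compare(va, va)  # identical inputs: L1 is all zeros, sigmoid(0) = 0.5
```

With identical inputs the L1 vector is zero, so the untrained classifier outputs exactly 0.5; the training of claims 4 to 6 is what turns this score into a meaningful match decision.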
2. The system according to claim 1, wherein the feature learning module is constructed based on a weight-sharing twin (Siamese) convolutional neural network.
3. The system according to claim 1, wherein the second module obtains the L1 distance by:
performing an element-wise difference on the fusion feature vectors of the two input videos to obtain the corresponding L1 distance.
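A minimal numeric reading of claim 3, under the assumption that "element-based difference" means the element-wise absolute difference: the distance kept is a vector of per-element differences (summing it would give the scalar L1 norm):

```python
import numpy as np

f1 = np.array([0.2, -0.5, 1.0])  # fusion feature vector of video 1
f2 = np.array([0.1, -0.1, 1.0])  # fusion feature vector of video 2

l1_vector = np.abs(f1 - f2)  # per-element |f1 - f2|, fed to the classifier
l1_scalar = l1_vector.sum()  # summing yields the scalar L1 norm
```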
4. An optimization method for a compressed domain-oriented video content comparison system, used for optimizing the compressed domain-oriented video content comparison system according to any one of claims 1 to 3, the method comprising:
training the first module based on a preset training sample to obtain an optimized first module;
constructing a new comparison system based on the optimized first module, the optimized second module and the classifier;
and fixing the parameters of the optimized first module based on a preset training sample, and training a classifier in the new comparison system to obtain the optimized comparison system.
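The second training stage of claim 4 — freeze the optimized first module, then train only the classifier — can be sketched with a toy frozen embedding and plain logistic regression on the element-wise distances. The tanh stand-in for the first module, the synthetic data, and the bare gradient-descent optimizer are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def frozen_fusion(x):
    # stand-in for the optimized first module; its parameters stay fixed
    return np.tanh(x)

# Toy pairs: positives are near-duplicates, negatives are unrelated.
pairs = []
for _ in range(20):
    a = rng.standard_normal(6)
    pairs.append((a, a + 0.01 * rng.standard_normal(6), 1))
for _ in range(20):
    pairs.append((rng.standard_normal(6), rng.standard_normal(6), 0))

def l1_features(a, b):
    return np.abs(frozen_fusion(a) - frozen_fusion(b))

# Train ONLY the classifier (logistic regression, cross-entropy loss).
w, bias, lr = np.zeros(6), 0.0, 0.5
for _ in range(300):
    for a, b, y in pairs:
        x = l1_features(a, b)
        p = 1.0 / (1.0 + np.exp(-(x @ w + bias)))
        grad = p - y              # d(cross-entropy)/d(logit)
        w -= lr * grad * x
        bias -= lr * grad

acc = np.mean([(l1_features(a, b) @ w + bias > 0) == y for a, b, y in pairs])
```

Because the first module is frozen, only `w` and `bias` move; on this easily separable toy data the classifier reaches near-perfect training accuracy.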
5. The method as claimed in claim 4, wherein the training of the first module is performed using a loss function L,
wherein N is the number of samples, D_n is the Euclidean distance between the fusion feature vectors of the two videos in the n-th sample pair, Y is a label indicating whether the two samples match, and m is a preset threshold.
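The quantities claim 5 defines (N sample pairs, Euclidean distance D_n, match label Y, margin m) are exactly those of the standard contrastive loss, so the formula for L is presumably:

```latex
L = \frac{1}{2N} \sum_{n=1}^{N} \left[ Y \, D_n^{2} + (1 - Y) \, \max\left(m - D_n,\; 0\right)^{2} \right]
```

Matched pairs (Y = 1) are pulled together by the D_n^2 term; unmatched pairs (Y = 0) are pushed apart until their distance exceeds the margin m.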
6. The method as claimed in claim 5, wherein the loss function used in training the classifier in the new comparison system is the binary cross-entropy loss.
7. The method for optimizing a compressed domain-oriented video content comparison system according to any one of claims 4-6, wherein the training samples are obtained by:
based on an offline video database, clipping the copied video segments present in different videos according to an annotation file; using the pairs of similar video segments clipped according to the annotation file as positive samples; and, for each original video, randomly selecting one video from the remaining video segments to form a negative-sample pair with it.
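The sampling scheme of claim 7, sketched with a hypothetical annotation format — all field names and file names below are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical annotation file: each entry says a copied segment appears
# in two different videos between times t0 and t1.
annotations = [
    {"src": "vid_a.mp4", "dst": "vid_b.mp4", "t0": 3.0, "t1": 9.5},
    {"src": "vid_c.mp4", "dst": "vid_d.mp4", "t0": 0.0, "t1": 4.2},
]
other_segments = ["vid_e.mp4", "vid_f.mp4", "vid_g.mp4"]  # unrelated videos

def build_pairs(annotations, other_segments):
    """Positive: the two clipped copies of the same content.
    Negative: the original clip vs. one randomly chosen unrelated segment."""
    positives, negatives = [], []
    for ann in annotations:
        clip_src = (ann["src"], ann["t0"], ann["t1"])
        clip_dst = (ann["dst"], ann["t0"], ann["t1"])
        positives.append((clip_src, clip_dst, 1))
        negatives.append((clip_src, random.choice(other_segments), 0))
    return positives + negatives

pairs = build_pairs(annotations, other_segments)
```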
8. A compressed domain-oriented video content comparison method, characterized in that the comparison method comprises the following steps:
acquiring a video pair to be compared;
respectively carrying out partial decoding on two videos in the video pair to be compared, and extracting video compression domain information;
obtaining a comparison result through an optimized comparison system;
wherein:
the optimized comparison system is obtained by optimizing the compressed domain-oriented video content comparison system of any one of claims 1 to 3 with the optimization method for a compressed domain-oriented video content comparison system according to any one of claims 4 to 7.
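End to end, the three steps of claim 8 look as follows. `partial_decode` is a placeholder: in practice the bitstream would be partially decoded to expose compressed-domain signals (e.g. motion vectors, residuals) without full pixel reconstruction, and the final callable stands in for the optimized comparison system:

```python
def partial_decode(video_path):
    """Placeholder for partial decoding: returns per-modality
    compressed-domain arrays instead of fully decoded frames."""
    return {"iframe_residuals": [0.1, 0.2], "motion_vectors": [0.0, -0.3]}

def compare_videos(path_a, path_b, comparison_system):
    info_a = partial_decode(path_a)   # steps 1-2: acquire + partially decode
    info_b = partial_decode(path_b)
    return comparison_system(info_a, info_b)  # step 3: optimized system

# Toy "optimized system": declares a match when all modalities coincide.
is_match = compare_videos("query.mp4", "candidate.mp4",
                          lambda x, y: x == y)
```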
9. A storage device, having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system according to any one of claims 4 to 7 or the compressed domain-oriented video content comparison method according to claim 8.
10. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is suitable for being loaded and executed by a processor to implement the optimization method of the compressed domain-oriented video content comparison system according to any one of claims 4 to 7 or the compressed domain-oriented video content comparison method according to claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011086137.5A CN112215908B (en) | 2020-10-12 | 2020-10-12 | Compressed domain-oriented video content comparison system, optimization method and comparison method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112215908A (en) | 2021-01-12 |
CN112215908B CN112215908B (en) | 2022-12-02 |
Family
ID=74052819
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011086137.5A Active CN112215908B (en) | 2020-10-12 | 2020-10-12 | Compressed domain-oriented video content comparison system, optimization method and comparison method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112215908B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112990273A (en) * | 2021-02-18 | 2021-06-18 | 中国科学院自动化研究所 | Compressed domain-oriented video sensitive character recognition method, system and equipment |
CN114445918A (en) * | 2022-02-21 | 2022-05-06 | 支付宝(杭州)信息技术有限公司 | Living body detection method, device and equipment |
CN114666571A (en) * | 2022-03-07 | 2022-06-24 | 中国科学院自动化研究所 | Video sensitive content detection method and system |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6763069B1 (en) * | 2000-07-06 | 2004-07-13 | Mitsubishi Electric Research Laboratories, Inc | Extraction of high-level features from low-level features of multimedia content |
CN110163079A (en) * | 2019-03-25 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Video detecting method and device, computer-readable medium and electronic equipment |
CN110175266A (en) * | 2019-05-28 | 2019-08-27 | 复旦大学 | A method of it is retrieved for multistage video cross-module state |
WO2019242222A1 (en) * | 2018-06-21 | 2019-12-26 | 北京字节跳动网络技术有限公司 | Method and device for use in generating information |
CN111046766A (en) * | 2019-12-02 | 2020-04-21 | 武汉烽火众智数字技术有限责任公司 | Behavior recognition method and device and computer storage medium |
CN111242173A (en) * | 2019-12-31 | 2020-06-05 | 四川大学 | RGBD salient object detection method based on twin network |
CN111401267A (en) * | 2020-03-19 | 2020-07-10 | 山东大学 | Video pedestrian re-identification method and system based on self-learning local feature characterization |
CN111626178A (en) * | 2020-05-24 | 2020-09-04 | 中南民族大学 | Compressed domain video motion recognition method and system based on new spatio-temporal feature stream |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6763069B1 (en) * | 2000-07-06 | 2004-07-13 | Mitsubishi Electric Research Laboratories, Inc | Extraction of high-level features from low-level features of multimedia content |
WO2019242222A1 (en) * | 2018-06-21 | 2019-12-26 | 北京字节跳动网络技术有限公司 | Method and device for use in generating information |
CN110163079A (en) * | 2019-03-25 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Video detecting method and device, computer-readable medium and electronic equipment |
CN110175266A (en) * | 2019-05-28 | 2019-08-27 | 复旦大学 | A method of it is retrieved for multistage video cross-module state |
CN111046766A (en) * | 2019-12-02 | 2020-04-21 | 武汉烽火众智数字技术有限责任公司 | Behavior recognition method and device and computer storage medium |
CN111242173A (en) * | 2019-12-31 | 2020-06-05 | 四川大学 | RGBD salient object detection method based on twin network |
CN111401267A (en) * | 2020-03-19 | 2020-07-10 | 山东大学 | Video pedestrian re-identification method and system based on self-learning local feature characterization |
CN111626178A (en) * | 2020-05-24 | 2020-09-04 | 中南民族大学 | Compressed domain video motion recognition method and system based on new spatio-temporal feature stream |
Non-Patent Citations (1)
Title |
---|
吴晓雨 (Wu Xiaoyu) et al., "Special video classification with multimodal feature fusion and multi-task learning" (多模态特征融合与多任务学习的特种视频分类), 《光学精密工程》 (Optics and Precision Engineering) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112990273A (en) * | 2021-02-18 | 2021-06-18 | 中国科学院自动化研究所 | Compressed domain-oriented video sensitive character recognition method, system and equipment |
CN112990273B (en) * | 2021-02-18 | 2021-12-21 | 中国科学院自动化研究所 | Compressed domain-oriented video sensitive character recognition method, system and equipment |
CN114445918A (en) * | 2022-02-21 | 2022-05-06 | 支付宝(杭州)信息技术有限公司 | Living body detection method, device and equipment |
CN114666571A (en) * | 2022-03-07 | 2022-06-24 | 中国科学院自动化研究所 | Video sensitive content detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN112215908B (en) | 2022-12-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112215908B (en) | Compressed domain-oriented video content comparison system, optimization method and comparison method | |
CN111382555B (en) | Data processing method, medium, device and computing equipment | |
US8787692B1 (en) | Image compression using exemplar dictionary based on hierarchical clustering | |
CN109614517B (en) | Video classification method, device, equipment and storage medium | |
TWI744827B (en) | Methods and apparatuses for compressing parameters of neural networks | |
KR102299958B1 (en) | Systems and methods for image compression at multiple, different bitrates | |
WO2021205065A1 (en) | Training a data coding system comprising a feature extractor neural network | |
TWI806199B (en) | Method for signaling of feature map information, device and computer program | |
CN112668559A (en) | Multi-mode information fusion short video emotion judgment device and method | |
US10972749B2 (en) | Systems and methods for reconstructing frames | |
CN111970509B (en) | Video image processing method, device and system | |
CN112990273B (en) | Compressed domain-oriented video sensitive character recognition method, system and equipment | |
WO2021205066A1 (en) | Training a data coding system for use with machines | |
CN116978011B (en) | Image semantic communication method and system for intelligent target recognition | |
CN115481283A (en) | Audio and video feature extraction method and device, electronic equipment and computer readable storage medium | |
CN113628116B (en) | Training method and device for image processing network, computer equipment and storage medium | |
KR102315077B1 (en) | System and method for the detection of multiple compression of image and video | |
US20220377342A1 (en) | Video encoding and video decoding | |
CN117014693A (en) | Video processing method, device, equipment and storage medium | |
CN116074574A (en) | Video processing method, device, equipment and storage medium | |
CN116778376B (en) | Content security detection model training method, detection method and device | |
US7747093B2 (en) | Method and apparatus for predicting the size of a compressed signal | |
Grycuk et al. | Neural video compression based on SURF scene change detection algorithm | |
KR20240090245A (en) | Scalable video coding system and method for machines | |
WO2023222313A1 (en) | A method, an apparatus and a computer program product for machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||