CN114554220A - Method for over-limit compression and decoding of fixed scene video based on abstract features - Google Patents
Method for over-limit compression and decoding of fixed scene video based on abstract features
- Publication number
- CN114554220A (application number CN202210038155.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- foreground target
- decoding
- snapshot
- abstract features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/124—Quantisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/142—Detection of scene cut or scene change
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
- H04N19/23—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with coding of regions that are present throughout a whole video segment, e.g. sprites, background or mosaic
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/625—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- H04N19/87—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving scene cut or scene change detection in combination with video compression
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
- H04N19/91—Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Discrete Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a method for over-limit compression and decoding of fixed scene video based on abstract features, comprising the following steps: 1) extracting a background image from the original video by background modeling and compression-coding it; 2) extracting abstract features of each foreground target with a foreground target extraction module built on instance segmentation, keypoint detection and related algorithms; 3) taking a snapshot of each foreground target and compressing it; 4) packaging the video background compressed data, the foreground target abstract features and the snapshot compressed data; 5) decompressing the video compressed data by pre-decoding; 6) feeding the foreground target abstract features and snapshots into the generator of a trained generative adversarial network; 7) fusing each frame's generated foreground target decoded images with the background image; 8) reconstructing the fused video frames to obtain the decoded video. The method achieves an extremely high compression ratio for fixed scene video, markedly improving storage efficiency and extending the retention time of surveillance video.
Description
Technical Field
The invention relates to the technical field of deep learning for computer vision, and in particular to a method for over-limit compression and decoding of fixed scene video based on abstract features.
Background
Common video compression coding mainly removes redundant information based on low-level features such as texture, edges and the motion of image blocks, without fully exploiting the high-level abstract features contained in the video content. The explosive development of deep learning in computer vision has made high-level abstract understanding of images and videos technically feasible. Supported by big data and high-performance parallel computing, deep convolutional neural networks have revolutionized the extraction of high-level image and video features. Unlike traditional, manually designed image feature extraction, a convolutional neural network can automatically learn high-level features with stronger expressive power from big data; these features play a crucial role in image understanding and video structuring. By exploiting the high-level feature extraction capability of deep convolutional neural network models on widely available video big data, extracting the more expressive high-level abstract feature information in a video and removing the large amount of abstract redundancy it contains can greatly improve video compression performance, reduce storage space and transmission bandwidth, and open a new path toward better long-term storage and transmission of video.
Therefore, how to provide a video compression method that improves the compression ratio by extracting high-level abstract feature information from video is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a method for over-limit compression and decoding of fixed scene video based on abstract features, which greatly reduces storage space by extracting and storing the high-level abstract feature information of the video, so as to solve the above technical problems.
(II) technical scheme
To achieve this purpose, the invention provides the following technical scheme: a method for over-limit compression and decoding of fixed scene video based on abstract features, comprising an encoder and a decoder. The method comprises the following steps:
1. and (5) video compression.
And (4) disassembling the original video into image frames, and sending the image frames into an encoder for processing. The encoder comprises two modules: background modeling and foreground object extraction.
The background modeling module performs foreground subtraction on each video frame using a background modeling algorithm based on a Gaussian mixture model to obtain a background image. After all video frames are processed, the multi-frame background images are merged by union into a single background image, which is then subjected to discrete cosine transform, quantization and entropy coding to obtain the video background compressed data.
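The per-pixel match-and-update idea of background modeling can be sketched as follows. This is a simplified single-Gaussian running-average stand-in for the full Gaussian mixture model the method uses; the function name and the parameters `lr` and `k` are illustrative assumptions:

```python
import numpy as np

def update_background(bg_mean, bg_var, frame, lr=0.05, k=2.5):
    """One step of a simplified (single-Gaussian) background model.

    The patent uses a full Gaussian *mixture* model; this running-average
    variant only illustrates the per-pixel match-and-update idea.
    Returns the updated mean/variance and a boolean foreground mask.
    """
    frame = frame.astype(np.float64)
    dist = np.abs(frame - bg_mean)
    fg_mask = dist > k * np.sqrt(bg_var)   # pixels that do not match the model
    upd = ~fg_mask                         # update only background-matching pixels
    bg_mean[upd] += lr * (frame[upd] - bg_mean[upd])
    bg_var[upd] += lr * (dist[upd] ** 2 - bg_var[upd])
    return bg_mean, bg_var, fg_mask
```

Pixels whose foreground mask is set are excluded from the background estimate, which is how the per-frame background images bg_i of the detailed steps below are obtained.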
The foreground target extraction module consists of an instance segmentation model and a keypoint detection model based on convolutional neural networks; it performs object instance segmentation and keypoint detection on each image frame to obtain the foreground target abstract features, which comprise the shape features and keypoint features of the foreground target.
After all video frames are processed, inter-frame target matching is performed using a method based on an IOU threshold over target detection boxes to obtain the correspondence of foreground targets across frames, and a snapshot is then extracted for each foreground target. The snapshot extraction algorithm is as follows: among the multi-frame shape features of each foreground target, only the single frame whose shape feature has the highest confidence output by the instance segmentation model is kept; that shape feature is used to crop the foreground target's image from the original video frame, yielding the target's snapshot, which is subjected to discrete cosine transform, quantization and entropy coding to obtain the foreground target snapshot compressed data. The purpose of the snapshot is to preserve the target's detail features, such as color and texture.
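The snapshot-selection rule (keep only the highest-confidence detection of each target and crop it from its frame) can be sketched as below; the `track`/`frame_lookup` structures and their field names are illustrative assumptions, not the patent's data layout:

```python
def select_snapshot(track, frame_lookup):
    """Pick one snapshot per foreground target (hedged sketch).

    `track` lists one target's per-frame detections, each a dict with
    keys 'frame', 'box' = (x_min, y_min, x_max, y_max) and 'score'
    (instance-segmentation confidence).  The detection with the highest
    confidence is kept and its box is cropped from the corresponding
    original frame (given as rows of pixels).
    """
    best = max(track, key=lambda d: d["score"])
    x0, y0, x1, y1 = best["box"]
    frame = frame_lookup[best["frame"]]
    crop = [row[x0:x1] for row in frame[y0:y1]]
    return best, crop
```

The returned crop would then go through DCT, quantization and entropy coding as described above.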
Finally, the foreground target abstract features, the snapshot compressed data and the background compressed data are compressed and packaged together to obtain the video compressed data. Video compression is then complete.
In the encoder, the background modeling module encodes the background of the original video into a single compressed image, removing background redundant information; by extracting abstract features and snapshots of foreground targets, the foreground target extraction module stores only multi-frame abstract features and single-frame snapshot compressed data for each foreground target in the original video, removing foreground redundant information. Compared with traditional video compression coding, this coding scheme greatly reduces the amount of data to be stored, thereby achieving over-limit compression.
2. Video pre-decoding.
When a user needs to view the video, video pre-decoding is performed first: the video compressed data packaged by the encoder is decompressed, recovering the foreground target abstract features, the foreground target snapshots and the video background image.
3. Video decoding.
The decoder of the invention consists of a convolutional neural network model based on a generative adversarial network architecture, comprising a generator and a discriminator. The generator's inputs are a foreground target snapshot and the foreground target abstract features, and its output is a foreground target decoded image. The discriminator assists the generator during training to improve the quality of the generated images: its inputs are the foreground target decoded image produced by the generator and the foreground target's image in the real video frame, and its output is a value between 0 and 1 indicating whether the discriminator judges the input image to be generated (0) or real (1).
(1) Decoder training process
The objective function of the training process is: L = L_GAN + L_L1 + L_VGG.
Here L_GAN is the adversarial loss: I_S and I_t are respectively the foreground target snapshot and the real foreground target image to be generated; R_S and R_t are response maps generated from the keypoints of I_S and I_t and serve as inputs to the generator; Î_t is the foreground target decoded image produced by the generator; and z is random noise.
L_L1 is the L1 loss, computing the minimum absolute error between the generated image and the real image.
L_VGG is the perceptual loss: the foreground target decoded image produced by the generator and the real foreground target image are fed into a public VGG pre-trained network model, and the minimum squared difference between their deep feature maps is computed.
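In a typical pix2pix-style formulation consistent with the definitions above (an assumed reconstruction; the patent's exact equations are not reproduced here), the three loss terms can be written as:

```latex
\begin{aligned}
L_{GAN} &= \mathbb{E}_{I_t}\big[\log D(I_t)\big]
         + \mathbb{E}_{I_S,\,z}\big[\log\big(1 - D(G(I_S, R_S, R_t, z))\big)\big],\\
L_{L1}  &= \big\lVert I_t - \hat{I}_t \big\rVert_1,
          \qquad \hat{I}_t = G(I_S, R_S, R_t, z),\\
L_{VGG} &= \sum_{k} \big\lVert \phi_k(I_t) - \phi_k(\hat{I}_t) \big\rVert_2^2,
\end{aligned}
```

where D is the discriminator, G the generator, and φ_k the k-th deep feature map of the public VGG pre-trained network.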
After training, only the generator needs to be retained in the decoder.
(2) Decoder decoding process
The multi-frame abstract features and the snapshot of each foreground target are read and fed to the generator in the decoder. The generator model obtains the target's pose, skeleton and similar information from the multi-frame abstract features, obtains its color, texture and similar information from the snapshot, and fuses them to generate the foreground target decoded image.
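The patent stores keypoint coordinates and mentions keypoint-derived response maps as generator inputs, without specifying their construction; per-keypoint Gaussian heatmaps, as commonly used in pose-guided generation, are assumed in this sketch:

```python
import numpy as np

def keypoint_response_map(keypoints, h, w, sigma=2.0):
    """Rasterize keypoint coordinates into Gaussian response maps.

    `keypoints` is a list of (x, y) coordinates; each produces one
    h-by-w channel peaking at the keypoint location.  The Gaussian
    heatmap form and `sigma` are assumptions, not the patent's spec.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.stack([
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        for x, y in keypoints
    ])
    return maps  # shape: (num_keypoints, h, w)
```

Stacked with the snapshot, such maps would form the conditional input from which the generator recovers pose and skeleton.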
The video background image is read, and all generated foreground target decoded images are fused with the background image to obtain reconstructed video frames; all reconstructed video frames are merged to obtain the decoded video.
(III) advantageous effects
Compared with the prior art, the invention provides a method for over-limit compression and decoding of fixed scene video based on abstract features with the following beneficial effects: the method achieves an extremely high compression ratio for fixed scene video and greatly saves storage space. Experiments show that, for fixed scene videos of different lengths and different target counts, the compressed data stored by the method occupies only 1/40 to 1/3 of the capacity of the corresponding H.264-encoded video, a compression ratio exceeding that of traditional video compression coding. The invention can be applied to various intelligent surveillance systems, significantly extending the retention time of surveillance video, and the target abstract features extracted during compression can also be used for abnormal behavior detection, traffic flow monitoring and the like.
Drawings
Fig. 1 is a block diagram of a fixed scene video over-limit compression and decoding method based on abstract features according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The overall structure of the proposed fixed scene video over-limit compression and decoding method is shown in Fig. 1 and mainly comprises two parts: an encoder and a decoder. During compression, the original video is input into the video encoder to obtain the video compressed data; during decoding, the video compressed data is first pre-decoded and then input into the decoder to generate the decoded video.
1. Video compression step
Step 1) Initialize the convolutional neural networks and load the instance segmentation model and the keypoint detection model into the GPU.
Step 2) Initialize the Gaussian mixture background model.
Step 3) Read the i-th frame of the original video.
Step 4) Input the video frame into the Gaussian mixture background model, match it against the model and update the model weights, obtaining the background modeling result bg_i.
Step 5) Perform instance segmentation on the current video frame with the instance segmentation model to obtain the instance segmentation results of m foreground targets, S_i = {box_j, mask_j | j = 1, 2, …, m}, where box_j is the rectangular detection box (x_min, y_min, x_max, y_max) of the j-th foreground target in the video frame, and mask_j is the mask of the j-th foreground target: a binary image with the same height and width as the video frame, equal to 1 in the region where the corresponding target appears and 0 elsewhere. In subsequent steps, the foreground target detection box refers to box_j from this step, the shape feature of a foreground target refers to mask_j from this step, and the foreground target spatio-temporal information refers to the current frame index i together with box_j, i.e. (i, x_min, y_min, x_max, y_max).
Step 6) Perform keypoint detection on each detected foreground target with the keypoint detection model to obtain the target's keypoint coordinates (x0, y0, x1, y1, …).
Step 7) Repeat steps 3 to 6 until all video frames are processed.
Step 8) Read the foreground target spatio-temporal information (i, x_min, y_min, x_max, y_max) detected by the instance segmentation model in step 5). Perform inter-frame foreground target matching using a method based on an IOU threshold over target detection boxes to obtain several matching lists, each containing the multi-frame spatio-temporal information of one foreground target in chronological order. For example, if p targets appear in the video, appearing in q_1, q_2, …, q_p frames respectively, then p matching lists of lengths q_1, q_2, …, q_p are obtained, where each item in a list is that target's spatio-temporal information in a different frame.
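The IOU-threshold matching of step 8) can be illustrated with a greedy sketch; the patent does not specify its exact matching strategy, so this first-match-wins rule and the 0.5 threshold are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_tracks(frames, thresh=0.5):
    """Greedy inter-frame matching by detection-box IOU (illustrative).

    `frames` is a list of per-frame box lists; each matched chain becomes
    one target's matching list of (frame index, box) spatio-temporal items.
    """
    tracks = []
    for i, boxes in enumerate(frames):
        for box in boxes:
            for tr in tracks:
                last_i, last_box = tr[-1]
                if last_i == i - 1 and iou(last_box, box) >= thresh:
                    tr.append((i, box))
                    break
            else:  # no existing track matched: start a new target
                tracks.append([(i, box)])
    return tracks
```

Each resulting list corresponds to one foreground target's chronologically ordered spatio-temporal information.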
Step 9) Take a snapshot of each foreground target as follows: according to the foreground target spatio-temporal information in each matching list from step 8), read the multi-frame shape features of each foreground target, keep only the single frame shape feature with the highest confidence output by the instance segmentation model, and use that shape feature to crop the foreground target's image from the original video frame, obtaining the target's snapshot. Discrete cosine transform, quantization and entropy coding are applied to the snapshot to obtain the foreground target snapshot compressed data, which is stored with the snapshot's spatio-temporal information as its file name, (i_s, x_min_s, y_min_s, x_max_s, y_max_s).jpg.
Step 10) The snapshot file name (i_s, x_min_s, y_min_s, x_max_s, y_max_s).jpg of each foreground target, its multi-frame spatio-temporal information (i, x_min, y_min, x_max, y_max) and its multi-frame keypoint coordinates (x0, y0, x1, y1, …) are merged, written into a csv file and stored; this csv file is the foreground target abstract feature file. Thus only a single-frame snapshot plus multi-frame target abstract features are retained for each foreground target.
Step 11) The per-frame background image sequence {bg_i | i = 1, 2, …, n} obtained in step 4) is merged by union, bg = bg_1 ∪ bg_2 ∪ bg_3 ∪ ··· ∪ bg_n, to obtain the complete video background image, which then undergoes discrete cosine transform, quantization and entropy coding to obtain the video background compressed data.
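The union bg = bg_1 ∪ bg_2 ∪ ··· ∪ bg_n can be read as filling each pixel from the first frame in which the background model is confident there; since the patent does not detail how overlapping estimates are resolved, the first-wins rule below is an assumption:

```python
import numpy as np

def merge_backgrounds(bg_frames, masks):
    """Merge per-frame background estimates into one full background image.

    `bg_frames` are the per-frame modeling results bg_i and `masks` mark
    pixels where the model was confident (True = valid background).
    A pixel takes the first valid value seen across the sequence.
    """
    out = np.zeros_like(bg_frames[0])
    filled = np.zeros(out.shape[:2], dtype=bool)
    for bg, m in zip(bg_frames, masks):
        take = m & ~filled       # valid here and not yet filled
        out[take] = bg[take]
        filled |= take
    return out
```

The merged image would then go through DCT, quantization and entropy coding as step 11) describes.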
Step 12) The foreground target snapshot compressed data from step 9), the foreground target abstract feature file from step 10) and the video background compressed data from step 11) are compressed and packaged as a whole to obtain the video compressed data.
2. Video decoding step
Step 1) Pre-decoding: decompress the video compressed data and recover the foreground target abstract features, the foreground target snapshots and the video background image.
Step 2) Initialize the convolutional neural network and load the trained generator network model into the GPU.
Step 3) Read the foreground target abstract feature file and obtain each foreground target's snapshot file name and the corresponding multi-frame abstract features.
Step 4) Input the foreground target snapshot, the snapshot's abstract features and the abstract features of the foreground target decoded images to be generated into the generator model to generate the foreground target decoded images.
Step 5) Repeat steps 3 to 4 until the foreground target abstract feature file is fully read and all foreground targets are decoded.
Step 6) Read the video background image.
Step 7) Fuse each frame's foreground target decoded images with the video background image to generate a reconstructed video frame; merge all reconstructed video frames until every frame of the video is reconstructed, obtaining the decoded video.
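Fusing the generated foreground decodes with the background can be sketched as mask-based pasting at each target's stored detection box; the exact fusion operator is not specified in the patent, so this minimal form is an assumption:

```python
import numpy as np

def compose_frame(background, fg_images, fg_masks, fg_boxes):
    """Paste each generated foreground decode onto the background.

    Each foreground image is placed at its stored detection box
    (x_min, y_min, x_max, y_max) and written only through its shape
    mask, leaving the background visible elsewhere.
    """
    frame = background.copy()
    for img, mask, (x0, y0, x1, y1) in zip(fg_images, fg_masks, fg_boxes):
        region = frame[y0:y1, x0:x1]   # view into the output frame
        region[mask] = img[mask]       # paste only the masked pixels
    return frame
```

Repeating this per frame and concatenating the results yields the decoded video of step 7).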
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (9)
1. A method for over-limit compression and decoding of fixed scene video based on abstract features, characterized by comprising the following steps:
compressing the fixed scene video data by using an encoder;
and decoding the compressed video data by using a decoder to obtain a decoded video.
2. The method according to claim 1, wherein the method for compressing and decoding the video with the fixed scene based on the abstract features comprises: in an encoder, a background modeling method is adopted to extract a background image of an original video, and then the extracted background image is compressed and encoded to obtain video background compressed data.
3. The method according to claim 1, wherein the method for compressing and decoding the video with the fixed scene based on the abstract features comprises: in an encoder, a foreground target extraction module comprising an object instance segmentation and key point detection algorithm is adopted to extract the features of a foreground target in a video frame to obtain foreground target abstract features, wherein the foreground target abstract features comprise the shape features and the key point features of the foreground target.
4. The method according to claim 1, wherein the method for compressing and decoding the video with the fixed scene based on the abstract features comprises: and the encoder extracts a snapshot from the foreground target acquired by the foreground target extraction module by using a foreground target snapshot extraction algorithm, and performs compression coding on the snapshot to obtain foreground target snapshot compressed data.
5. The method according to claim 2 or 3, wherein the method for compressing and decoding the video with the fixed scene based on the abstract features comprises: an encoder only stores abstract features and snapshot compressed data for each foreground target in a video by extracting the abstract features and the snapshots of the foreground targets;
and only storing the video background compressed data for the video background. And compressing and packaging the abstract features of the foreground target, the snapshot compressed data and the background compressed data to obtain video compressed data.
6. The method according to claim 5, wherein the method for compressing and decoding the video with the fixed scene based on the abstract features comprises: and when decoding, decompressing the video compressed data, and recovering the abstract characteristics of the foreground target, the snapshot of the foreground target and the background image of the video.
7. The method according to claim 1, wherein the method for compressing and decoding the video with the fixed scene based on the abstract features comprises: in the decoder, a deep learning model based on a generative adversarial network (GAN) is adopted for video decoding.
8. The method of claim 3 or 4, wherein the method for compressing and decoding the video with the fixed scene based on the abstract features comprises: and in a decoder, inputting the abstract features of the foreground target and the snapshot of the foreground target into a generator for generating a countermeasure network, and reconstructing to obtain a decoded image of the foreground target.
9. The method according to claim 2 or 8, wherein the method for compressing and decoding the video with the fixed scene based on the abstract features comprises: and the decoder fuses the foreground target decoded image and the background image of each frame in the video to obtain a reconstructed video frame, and merges and reconstructs all the reconstructed video frames to obtain a decoded video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210038155.9A CN114554220B (en) | 2022-01-13 | 2022-01-13 | Fixed scene video overrun compression and decoding method based on abstract features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210038155.9A CN114554220B (en) | 2022-01-13 | 2022-01-13 | Fixed scene video overrun compression and decoding method based on abstract features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114554220A true CN114554220A (en) | 2022-05-27 |
CN114554220B CN114554220B (en) | 2023-07-28 |
Family
ID=81670725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210038155.9A Active CN114554220B (en) | 2022-01-13 | 2022-01-13 | Fixed scene video overrun compression and decoding method based on abstract features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114554220B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6205260B1 (en) * | 1996-12-30 | 2001-03-20 | Sharp Laboratories Of America, Inc. | Sprite-based video coding system with automatic segmentation integrated into coding and sprite building processes |
US20030179294A1 (en) * | 2002-03-22 | 2003-09-25 | Martins Fernando C.M. | Method for simultaneous visual tracking of multiple bodies in a closed structured environment |
CN101536525A (en) * | 2006-06-08 | 2009-09-16 | 欧几里得发现有限责任公司 | Apparatus and method for processing video data |
CN103179402A (en) * | 2013-03-19 | 2013-06-26 | 中国科学院半导体研究所 | Video compression coding and decoding method and device |
WO2016013147A1 (en) * | 2014-07-22 | 2016-01-28 | Panasonic IP Management Co., Ltd. | Encoding method, decoding method, encoding apparatus and decoding apparatus |
CN108184126A (en) * | 2017-12-27 | 2018-06-19 | 生迪智慧科技有限公司 | Video coding and coding/decoding method, the encoder and decoder of snapshot image |
CN109246488A (en) * | 2017-07-04 | 2019-01-18 | 北京航天长峰科技工业集团有限公司 | A kind of video abstraction generating method for safety and protection monitoring system |
CN112954393A (en) * | 2021-01-21 | 2021-06-11 | 北京博雅慧视智能技术研究院有限公司 | Target tracking method, system, storage medium and terminal based on video coding |
- 2022-01-13: application CN202210038155.9A granted as patent CN114554220B (en), status Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6205260B1 (en) * | 1996-12-30 | 2001-03-20 | Sharp Laboratories Of America, Inc. | Sprite-based video coding system with automatic segmentation integrated into coding and sprite building processes |
US20030179294A1 (en) * | 2002-03-22 | 2003-09-25 | Martins Fernando C.M. | Method for simultaneous visual tracking of multiple bodies in a closed structured environment |
CN101536525A (en) * | 2006-06-08 | 2009-09-16 | 欧几里得发现有限责任公司 | Apparatus and method for processing video data |
CN103179402A (en) * | 2013-03-19 | 2013-06-26 | 中国科学院半导体研究所 | Video compression coding and decoding method and device |
WO2016013147A1 (en) * | 2014-07-22 | 2016-01-28 | Panasonic IP Management Co., Ltd. | Encoding method, decoding method, encoding apparatus and decoding apparatus |
CN109246488A (en) * | 2017-07-04 | 2019-01-18 | 北京航天长峰科技工业集团有限公司 | A kind of video abstraction generating method for safety and protection monitoring system |
CN108184126A (en) * | 2017-12-27 | 2018-06-19 | 生迪智慧科技有限公司 | Video coding and coding/decoding method, the encoder and decoder of snapshot image |
CN112954393A (en) * | 2021-01-21 | 2021-06-11 | 北京博雅慧视智能技术研究院有限公司 | Target tracking method, system, storage medium and terminal based on video coding |
Non-Patent Citations (3)
Title |
---|
QIWEI CHEN; YIMING WANG: "A small target detection method in infrared image sequences based on compressive sensing and background subtraction", Proceedings of the 2013 IEEE International Conference on Signal Processing, Communication and Computing (ICSPCC 2013) * |
FENG Jie: "Background modeling and foreground object segmentation in the H.264 compressed domain", Journal of Jilin University (Engineering and Technology Edition) * |
CHEN Weijun et al.: "A survey of object detection algorithms based on convolutional neural networks" *
Also Published As
Publication number | Publication date |
---|---|
CN114554220B (en) | 2023-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110225341B (en) | Task-driven code stream structured image coding method | |
US20210266565A1 (en) | Compression for deep neural network | |
US11657264B2 (en) | Content-specific neural network distribution | |
EP3583777A1 (en) | A method and technical equipment for video processing | |
CN106778571B (en) | Digital video feature extraction method based on deep neural network | |
Xia et al. | An emerging coding paradigm VCM: A scalable coding approach beyond feature and signal | |
CN110738153A (en) | Heterogeneous face image conversion method and device, electronic equipment and storage medium | |
CN113284203B (en) | Point cloud compression and decompression method based on octree coding and voxel context | |
CN116233445B (en) | Video encoding and decoding processing method and device, computer equipment and storage medium | |
CN115131675A (en) | Remote sensing image compression method and system based on reference image texture migration | |
Löhdefink et al. | On low-bitrate image compression for distributed automotive perception: Higher peak snr does not mean better semantic segmentation | |
Löhdefink et al. | GAN-vs. JPEG2000 image compression for distributed automotive perception: Higher peak SNR does not mean better semantic segmentation | |
Abd-Alzhra et al. | Image compression using deep learning: methods and techniques | |
JPH0984002A (en) | Processing method and system of digital picture | |
CN111898638B (en) | Image processing method, electronic device and medium fusing different visual tasks | |
KR20160078984A (en) | Method and apparatus for building an estimate of an original image from a low-quality version of the original image and an epitome | |
CN114554220B (en) | Fixed scene video overrun compression and decoding method based on abstract features | |
CN111277835A (en) | Monitoring video compression and decompression method combining yolo3 and flownet2 network | |
CN116452930A (en) | Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment | |
CN115239563A (en) | Point cloud attribute lossy compression device and method based on neural network | |
CN115147317A (en) | Point cloud color quality enhancement method and system based on convolutional neural network | |
CN113422965A (en) | Image compression method and device based on generation countermeasure network | |
CN113902000A (en) | Model training, synthetic frame generation, video recognition method and device and medium | |
CN115471875B (en) | Multi-code-rate pedestrian recognition visual feature coding compression method and device | |
Lei et al. | An end-to-end face compression and recognition framework based on entropy coding model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |