CN110113616B

CN110113616B - Multi-level monitoring video efficient compression coding and decoding device and method

Info

Publication number: CN110113616B
Application number: CN201910488842.9A
Authority: CN
Inventors: 殷海兵
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2019-06-05
Filing date: 2019-06-05
Publication date: 2021-06-01
Anticipated expiration: 2039-06-05
Also published as: CN110113616A

Abstract

The invention discloses a device and a method for efficiently compressing, encoding and decoding multi-level surveillance video, and belongs to the technical field of metropolitan-scale video surveillance with massive numbers of cameras. The method comprises the following levels: (1) specific semantic object coding: a specific semantic object d_n is detected and the object is reconstructed; a temporal trajectory is found by tracking, a key point sequence is detected by key point detection, and the key point sequence is transmitted to the decoder; (2) long-term background frame modeling: a plurality of scene categories are set and distinguished by background frame scene index numbers; the encoder detects the scene category and transmits the background frame index number to the decoder; (3) short-term background frame modeling: a prediction of the short-term background frame of the current frame is obtained by a multi-mode prediction method; the multi-mode reference prediction is selected by optimization, and the coding control parameters are transmitted to the decoder; (4) foreground coding: the prediction residual is encoded with HEVC to generate the foreground code stream, from which the reconstructed foreground is obtained by decoding; the prediction residual is transmitted to the decoder through a channel.

Description

Multi-level monitoring video efficient compression coding and decoding device and method

Technical Field

The invention relates to the technical field of metropolitan area level video monitoring application of massive cameras, in particular to a device and a method for efficiently compressing, encoding and decoding multi-level monitoring videos.

Background

Most cameras are used for security, and their video signals have the following characteristics: (1) the background does not change, or changes little, over a period of time, so background frame modeling makes more efficient coding possible than in applications such as broadcast television and video websites; (2) public security projects such as Safe City and Sharp Eyes (Xueliang) deploy a large number of cameras, most of whose data is of no use and is never viewed by people, while most of the information is consumed by machines; (3) intelligent security applications usually focus on target objects with specific semantics in a scene, such as pedestrians, vehicles, human faces and license plates, and these specific semantic objects are the focus of practical applications such as city-level retrieval and big data analysis.

The prior art has the following shortcomings. The early MPEG-4 object-oriented coding techniques also targeted retrieval-oriented applications. However, machine vision and object detection technologies were not mature enough before 2000, so the standard could not really be applied in practice. In recent years, with the development of deep learning and the steadily growing computing power of computing platforms, high-performance detection of target objects with specific semantics has become possible. There have also been breakthroughs in end-to-end image coding frameworks based on deep learning, and the high-dimensional feature vectors expressed by deep learning can serve as compact retrieval descriptors, which makes it possible to extract, represent and encode deep features driven by the dual targets of compression and retrieval.

However, the work above explores the problem from different perspectives. For scenarios of massive camera cluster perception and machine understanding, the requirements on video coding and compression differ greatly from those of traditional video coding. How to exploit the multi-dimensional space-time-camera redundancy of video data and achieve effective compression, while preserving the efficiency of machine perception and understanding, remains unsolved.

Disclosure of Invention

To address the problems in the prior art, the invention focuses on the characteristics described above, makes full use of the multi-dimensional space-time-camera redundancy of video data, and provides a targeted multi-level efficient video coding framework.

A high-efficiency compression coding method for multi-level surveillance videos comprises the following levels:

(1) specific semantic object coding: a specific semantic object d_n is detected and the object is reconstructed; a temporal trajectory is found by tracking, key point sequences are detected by key point detection for all versions of the target object on the trajectory, and the structured key point sequence information is transmitted to the decoder;

(2) long-term background frame modeling: a plurality of scene categories are set and distinguished by background frame scene index numbers; a background frame a_n is constructed by offline training when the camera is installed; the encoder detects the scene category to obtain the long-term background frame and transmits the background frame index number to the decoder;

(3) short-term background frame modeling: assuming that decoded reconstructed versions of the frames f_{n-1} and f_{n-2} preceding the current frame f_n have been obtained, a prediction of the short-term background frame of the current frame is obtained by a multi-mode prediction method; a suitable reference frame and weighted prediction coefficients are selected by optimizing over the multi-mode reference prediction, and these coding control parameters are transmitted to the decoder;

(4) foreground coding: the prediction residual is transmitted to the decoder through a channel.

Further, in step (1), a deep learning detector is used to detect the specific semantic object d_n; the semantic objects include pedestrians, vehicles and human faces. Feature extraction, representation and encoding driven by the dual targets of compression and retrieval are performed, and the decoded reconstructed object is obtained by deconvolution.

Further, in step (4), for the specific semantic objects the prediction residual c_n is directly set to 0.

The specific semantic object is decoded as follows: the decoder decodes the structured key point sequence information and the object obtained by deconvolution decoding, and the decoded versions of the target object in adjacent frames are reconstructed by interpolation using a geometric method.

The long-term background frame is decoded as follows: the decoder reconstructs the background frame from the scene category index number of the long-term background frame.

The short-term background frame is decoded as follows: the decoder obtains it from the reference frame and the weighted prediction coefficients.

The prediction residual is decoded as follows: the decoder decodes the foreground code stream.

Further, the specific semantic object is decoded by deconvolution reconstruction of the target object, and the final decoded video combines the reconstructed object regions with the decoded short-term background frame and the decoded prediction residual.

A multi-level monitoring video high-efficiency compression coding device comprises a specific semantic object coding module, a long-term background frame modeling module, a short-term background frame modeling module and a foreground coding module.

The specific semantic object coding module detects a specific semantic object d_n and reconstructs the object; it finds a temporal trajectory by tracking, detects key point sequences by key point detection for all versions of the target object on the trajectory, and transmits the structured key point sequence information to the decoder.

The long-term background frame modeling module sets a plurality of scene categories and distinguishes them by background frame scene index numbers; a background frame a_n is constructed by offline training when the camera is installed; the encoder detects the scene category to obtain the long-term background frame and transmits the background frame index number to the decoder.

The short-term background frame modeling module assumes that decoded reconstructed versions of the frames f_{n-1} and f_{n-2} preceding the current frame f_n have been obtained; it obtains a prediction of the short-term background frame of the current frame by a multi-mode prediction method, selects a suitable reference frame and weighted prediction coefficients, and transmits these coding control parameters to the decoder.

The foreground coding module forms the prediction residual, which is transmitted to the decoder through a channel.

Further, the specific semantic object coding module uses a deep learning detector to detect the specific semantic object d_n; the semantic objects include pedestrians, vehicles and human faces. Feature extraction, representation and encoding driven by the dual targets of compression and retrieval are performed, and the decoded reconstructed object is obtained by deconvolution.

A decoding device for processing the multi-level monitoring video high-efficiency compression coding method comprises an object decoding module, a long-term background frame decoding module, a short-term background frame decoding module and a foreground decoding module. The object decoding module uses a decoder to decode the structured key point sequence information and the object obtained by deconvolution decoding, and reconstructs the decoded versions of the target object in adjacent frames by interpolation using a geometric method.

The long-term background frame decoding module reconstructs the long-term background frame from the scene category index number of the long-term background frame.

The short-term background frame decoding module obtains the short-term background frame from the reference frame and the weighted prediction coefficients.

The foreground decoding module decodes the foreground code stream.

Further, the final decoded video combines the outputs of the above decoding modules.

The invention has the following beneficial effects:

(1) the compression performance is greatly improved;

(2) the deep features of visual objects support both decoding reconstruction and retrieval.

Drawings

FIG. 1 is a block diagram of multi-level efficient predictive coding;

FIG. 2 is a block diagram of multi-level efficient predictive decoding.

Detailed Description

The technical scheme of the invention is further explained by combining the drawings in the specification.

The technical problems to be solved by the invention are as follows:

(1) long-term and short-term background frame modeling that makes full use of the multi-dimensional space-time-camera redundancy of video data in a massive camera cluster;

(2) deep feature coding of video semantic objects for machine understanding, driven by the dual targets of compression coding and retrieval;

(3) spatio-temporal multi-level feature coding of semantic objects.

As shown in FIGS. 1 and 2, the technical solution of the present invention is as follows:

(1) specific semantic object coding

A specific semantic object d_n (such as a pedestrian, vehicle or human face recognized by a machine) is detected with a deep learning detector (SSD or YOLO). Feature extraction, representation and encoding are driven by the dual targets of compression and retrieval; the feature extraction uses unstructured deep features represented by CNN high-dimensional feature vectors, and the decoded reconstructed object is obtained by deconvolution (black rectangular box in the figure).

Considering the temporal correlation of the video signal, a target object usually remains in a camera's view for a period of time. After it has been coded and described once, its temporal trajectory can be found by tracking, and key point sequences are detected by key point detection for all versions of the target object on the trajectory; both the tracking and the key point detection use existing techniques. Only the structured key point sequence information is transmitted to the decoder. The decoder decodes the structured key point sequence information and the object obtained by deconvolution decoding, and reconstructs the decoded versions of the target object in adjacent frames by interpolation using a geometric method.
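The interpolation of an object between frames can be illustrated with a minimal sketch. It assumes that key points are 2-D pixel coordinates, that the same key points are known at two frames of the trajectory, and that intermediate frames are filled in by linear interpolation; the text does not specify the geometric method, and the names `interpolate_keypoints` and `bounding_box` are hypothetical.

```python
def interpolate_keypoints(kp_start, kp_end, t):
    """Linearly interpolate matching key points; t=0 gives kp_start, t=1 gives kp_end."""
    if len(kp_start) != len(kp_end):
        raise ValueError("key point sequences must have the same length")
    return [
        ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
        for (x0, y0), (x1, y1) in zip(kp_start, kp_end)
    ]


def bounding_box(keypoints):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the key points."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))


# Key points of one object at two frames of its trajectory (illustrative values).
kp_frame_0 = [(0, 0), (10, 0)]
kp_frame_2 = [(4, 2), (14, 2)]

# Estimated key points and object box for the frame halfway between them.
kp_frame_1 = interpolate_keypoints(kp_frame_0, kp_frame_2, 0.5)
box_frame_1 = bounding_box(kp_frame_1)
```

In such a scheme the decoder would place the deconvolution-reconstructed object at the interpolated position, so only the key points, not the object pixels, need to be sent for the intermediate frames.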

(2) Long term background frame modeling

In security applications, the camera background is fixed most of the time; for example, late at night there may be no moving target in view. If a background frame can be constructed in this case, differential coding based on the background frame can greatly improve compression efficiency. The distributions of luminance and chrominance of background frame pixels differ across seasons, illumination and weather conditions. The background modeling method therefore sets a plurality of scene categories according to different combinations of season, illumination and weather, and distinguishes the scene categories by background frame scene index numbers. A background frame a_n is constructed by offline training when the camera is installed. The encoder detects the scene category to obtain the long-term background frame and transmits the background frame index number to the decoder, from which the decoder can reconstruct the long-term background frame.
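The index-based signalling of the long-term background frame can be sketched as follows. This is only an illustration under stated assumptions: frames are small grayscale images stored as nested lists, both encoder and decoder already hold the same offline-trained background frames, and the encoder picks the scene category whose background has the smallest mean absolute difference to the current frame. The text does not say how the scene category is detected, and all names and category values here are hypothetical.

```python
from itertools import product

# Scene categories as combinations of season, illumination and weather;
# the position in this list serves as the background frame scene index.
SEASONS = ("spring", "summer", "autumn", "winter")
ILLUMINATION = ("day", "night")
WEATHER = ("clear", "rain")
SCENE_CATEGORIES = list(product(SEASONS, ILLUMINATION, WEATHER))


def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized frames."""
    total, count = 0, 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def select_scene_index(frame, backgrounds):
    """Encoder side: index of the stored background closest to the frame."""
    return min(range(len(backgrounds)),
               key=lambda i: mean_abs_diff(frame, backgrounds[i]))


def decode_long_term_background(index, backgrounds):
    """Decoder side: look up the background frame from the received index."""
    return backgrounds[index]


# Two stored backgrounds (a dark and a bright scene) and a bright current frame.
backgrounds = [[[10, 10], [10, 10]], [[200, 200], [200, 200]]]
frame = [[190, 205], [198, 201]]
index = select_scene_index(frame, backgrounds)
long_term_background = decode_long_term_background(index, backgrounds)
```

Only `index` has to be transmitted, which is why the cost of the long-term background is negligible once the background frames have been installed on both sides.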

(3) Short term background frame modeling

The long-term background frame only describes the common background over a relatively long time. During the day, however, real scenes are complex, with more moving targets, occlusions and similar regions, and the coding scheme must decode and reconstruct every frame. To make full use of short-term temporal correlation, the invention constructs a short-term background frame that exploits short-term temporal redundancy as far as possible. Assume that decoded reconstructed versions of the frames f_{n-1} and f_{n-2} preceding the current frame f_n have been obtained. A multi-mode prediction method, for example linear weighted prediction, is then used to obtain the prediction of the short-term background frame of the current frame. By optimizing over the multi-mode reference prediction, a suitable reference frame and weighted prediction coefficients are selected; these coding control parameters are transmitted to the decoder, so the decoder can obtain the same prediction.
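A minimal sketch of linear weighted prediction with mode selection is given below. It assumes a small, fixed set of candidate weight pairs applied to the two previously reconstructed frames and a sum-of-squared-errors selection criterion; the actual candidate modes and the optimization criterion are not given in the text, and the names `weighted_prediction`, `choose_weights` and `CANDIDATE_WEIGHTS` are hypothetical.

```python
def weighted_prediction(rec_prev1, rec_prev2, w1, w2):
    """Predict a frame as w1 * rec_prev1 + w2 * rec_prev2, pixel by pixel."""
    return [
        [w1 * p1 + w2 * p2 for p1, p2 in zip(row1, row2)]
        for row1, row2 in zip(rec_prev1, rec_prev2)
    ]


def sum_squared_error(frame_a, frame_b):
    """Sum of squared pixel differences between two equally sized frames."""
    return sum(
        (pa - pb) ** 2
        for row_a, row_b in zip(frame_a, frame_b)
        for pa, pb in zip(row_a, row_b)
    )


# Assumed candidate modes: copy f_{n-1}, copy f_{n-2}, average, linear extrapolation.
CANDIDATE_WEIGHTS = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (2.0, -1.0)]


def choose_weights(current, rec_prev1, rec_prev2, candidates=CANDIDATE_WEIGHTS):
    """Encoder side: pick the weight pair whose prediction is closest to the frame."""
    return min(
        candidates,
        key=lambda w: sum_squared_error(
            current, weighted_prediction(rec_prev1, rec_prev2, *w)),
    )


# Brightness rising linearly over three frames (illustrative values).
rec_prev2 = [[10, 20]]   # reconstructed f_{n-2}
rec_prev1 = [[12, 24]]   # reconstructed f_{n-1}
current = [[14, 28]]     # f_n, available only at the encoder

weights = choose_weights(current, rec_prev1, rec_prev2)
# Decoder side: the same prediction is rebuilt from the transmitted weights.
prediction = weighted_prediction(rec_prev1, rec_prev2, *weights)
```

Because the prediction depends only on already reconstructed frames and the transmitted weights, encoder and decoder stay in sync without sending the background frame itself.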

(4) Foreground (prediction residual) coding

Relative to the short-term background frame, the current frame f_n still contains some irregular foreground, namely the prediction residual. The prediction residual is transmitted to the decoder through a channel, and the decoder decodes it. Note that the target object regions with specific semantics must also be considered here: for these rectangular regions the prediction residual c_n is directly set to 0, and the regions are decoded by deconvolution reconstruction of the target object (black rectangular box in the figure). Finally, the decoded video is assembled from the predicted background, the decoded prediction residual and the reconstructed target object regions.
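How the residual and the object regions fit together can be sketched as follows. The sketch assumes that the residual is the pixel-wise difference between the current frame and the short-term background prediction, that object regions are rectangles given as (top, left, bottom, right) with exclusive end coordinates, and that the residual reaches the decoder unchanged. A real system would pass the residual through a lossy codec (the abstract names HEVC), which is omitted here; the function names are hypothetical.

```python
def foreground_residual(frame, background, object_boxes):
    """Residual c_n = f_n - background, forced to 0 inside every object box."""
    residual = [
        [f - b for f, b in zip(row_f, row_b)]
        for row_f, row_b in zip(frame, background)
    ]
    for top, left, bottom, right in object_boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                residual[y][x] = 0
    return residual


def compose_frame(background, residual, reconstructed_objects):
    """Decoder side: background + residual, then paste reconstructed objects."""
    out = [
        [b + c for b, c in zip(row_b, row_c)]
        for row_b, row_c in zip(background, residual)
    ]
    for (top, left, bottom, right), patch in reconstructed_objects:
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = patch[y - top][x - left]
    return out


frame = [[5, 6, 7], [8, 9, 10]]
background = [[5, 5, 5], [5, 5, 5]]   # short-term background prediction
box = (0, 1, 2, 2)                    # object covers column 1 of both rows
patch = [[60], [90]]                  # stand-in for the reconstructed object

residual = foreground_residual(frame, background, [box])
decoded = compose_frame(background, residual, [(box, patch)])
```

Zeroing the residual inside the boxes avoids spending bits twice on pixels that the semantic object layer already reconstructs.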

The invention integrates long-term and short-term background frames, foreground coding and semantic object feature coding into a multi-level video coding framework. Its main points are: long-term and short-term background frame modeling that exploits the multi-dimensional space-time-camera redundancy of video data in a massive camera cluster; deep feature coding of video semantic objects driven by the dual targets of compression coding and retrieval; and spatio-temporal multi-level feature coding of semantic objects.

Claims

1. A high-efficiency compression coding method for multi-level surveillance videos is characterized by comprising the following levels:

Transmitting the background frame index sequence number to a decoder;

(4) foreground coding: the prediction residual is transmitted to the decoder through a channel.

2. The method according to claim 1, wherein in step (1) the specific semantic object d_n is detected using a deep learning detector; the semantic objects comprise pedestrians, vehicles and human faces; feature extraction, representation and encoding driven by the dual targets of compression and retrieval are performed, and the decoded reconstructed object is obtained by deconvolution.

3. The method according to claim 1, wherein in step (4), for the specific semantic objects, the prediction residual c_n is directly set to 0.

4. A decoding method for processing the multi-level monitoring video high-efficiency compression coding method of claim 1, characterized in that the specific semantic object is decoded by using a decoder to decode the structured key point sequence information and the object obtained by deconvolution decoding, and the decoded versions of the target object in adjacent frames are reconstructed by interpolation using a geometric method.

5. The decoding method according to claim 4 for processing the multi-level surveillance video high efficiency compression coding method of claim 1, wherein the specific semantic object decoding is a deconvolution reconstruction of the target object, and the final video decoding is as follows:

6. A multi-level monitoring video high-efficiency compression coding device, characterized by comprising a specific semantic object coding module, a long-term background frame modeling module, a short-term background frame modeling module and a foreground coding module, wherein:

the specific semantic object coding module detects a specific semantic object d_n and reconstructs the object;

the long-term background frame modeling module sets a plurality of scene categories and distinguishes the scene categories by background frame scene index numbers; constructs a background frame a_n by offline training when the camera is installed; detects the scene category in the encoder to obtain the long-term background frame; and transmits the background frame index number to the decoder;

the foreground coding module forms the prediction residual, which is transmitted to the decoder through a channel.

7. The apparatus according to claim 6, wherein the specific semantic object coding module detects the specific semantic object d_n using a deep learning detector; the semantic objects comprise pedestrians, vehicles and human faces; feature extraction, representation and encoding driven by the dual targets of compression and retrieval are performed, and the decoded reconstructed object is obtained by deconvolution.

8. The apparatus according to claim 6, wherein for the specific semantic objects the prediction residual c_n is directly set to 0.

9. A decoding device for processing the multi-level monitoring video high-efficiency compression coding method of claim 1, characterized by comprising an object decoding module, a long-term background frame decoding module, a short-term background frame decoding module and a foreground decoding module, wherein the object decoding module uses a decoder to decode the structured key point sequence information and the object obtained by deconvolution decoding, and reconstructs the decoded versions of the target object in adjacent frames by interpolation using a geometric method;

the foreground decoding module decodes the foreground code stream.

10. The decoding apparatus for processing the multi-level surveillance video high efficiency compression encoding method as recited in claim 1, wherein the final video decoding is as follows: