CN112165619A

CN112165619A - Method for compressed storage of surveillance video

Info

Publication number: CN112165619A
Application number: CN202011018250.XA
Authority: CN
Inventors: 谢亚光; 廖义; 陈勇; 李日; 孙波
Original assignee: Hangzhou Arcvideo Technology Co ltd
Current assignee: Hangzhou Arcvideo Technology Co ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2021-01-01

Abstract

The invention discloses a method for compressed storage of surveillance video. The method comprises the following steps: three encoding quantization factors Qp1, Qp2 and Qp3, with Qp1 > Qp2 > Qp3, are determined in advance; initially, the first frame is encoded without any processing, uniformly using Qp2, and the YUV data of the first frame is stored in a cache; when the Nth frame is encoded, the whole frame is divided into a static background area and several moving rectangular areas, and the moving rectangular areas are further analyzed to obtain smaller areas containing moving sensitive objects; during encoding, the sensitive areas are given priority and encoded with quantization step Qp3, the motion areas are encoded with Qp2, and the static background area is encoded with Qp1; N is then set to N + 1 and these steps are repeated until all video frames are encoded. The method has the following beneficial effects: important video elements are stored with high fidelity, the encoding bitrate is greatly reduced, the total storage capacity and bandwidth requirements are reduced, and the storage and bandwidth costs are reduced.

Description

Method for compressed storage of surveillance video

Technical Field

The invention relates to the technical field of video coding, and in particular to a method for compressed storage of surveillance video.

Background

Video surveillance systems have developed over only about twenty years, moving from analog monitoring, through the boom in digital monitoring, to the current rise of network video monitoring. Now that IP technology is gradually becoming universal, it is worth reviewing the development history of video surveillance systems. From a technical point of view, this development is divided into the first-generation analog video surveillance system (CCTV), the second-generation digital video surveillance system (DVR) based on "PC + multimedia card", and the third-generation video surveillance system based on an IP network (IPVS). At present, mainstream video surveillance is IP video surveillance: IP cameras are deployed across road traffic, subways, shopping malls, residential communities, office buildings and the like, and a great amount of surveillance video is generated every day.

At present, a surveillance camera generally encodes video with H.264 or H.265 and then transmits it over the RTSP protocol to a streaming media server in the networking platform at the central end. The streaming media server then sends the video to a storage server. The stored video is generally used for later review or is sent to a video analysis platform for analysis. Mainstream surveillance services currently require video to be stored for at least 3 months, and some require storage for a year or more.

The current pain point: the volume of video data is huge, and the stored video must preserve certain sensitive information, such as human faces and license plates, at high fidelity, so that a person reviewing the footage can identify it accurately and artificial intelligence equipment can analyze it. However, when mainstream cameras encode video, they perform no analysis of the video content and apply a uniform coding method even if the picture is static (which is common in surveillance video) or only a small part of the picture contains motion or sensitive information. Fixed-bitrate coding is usually adopted; to cope with occasional complex scenes, the bitrate is usually set high, typically 4 Mbps for high-definition resolution. If the bitrate is set low, the image quality in complex scenes may be poor, which affects video review and the success rate of intelligent recognition of key objects. Calculated at 4 Mbps, one camera requires about 4 TB of storage for three months, and 1000 cameras require about 15,000 TB for one year. This is mass storage and brings a huge storage cost to security construction. A bitrate of 4 Mbps also brings a large transmission bandwidth cost.
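The storage figures above can be checked with a short calculation. This is a minimal sketch that assumes continuous recording, decimal units (1 TB = 10^12 bytes) and a 90-day "three months"; the function name `storage_terabytes` is hypothetical.

```python
def storage_terabytes(bitrate_mbps: float, days: float, cameras: int = 1) -> float:
    """Storage in decimal terabytes for continuous recording at a fixed bitrate."""
    bytes_per_second = bitrate_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    total_bytes = bytes_per_second * 86_400 * days * cameras
    return total_bytes / 1e12


# One camera, three months (assumed 90 days) at 4 Mbps: about 3.9 TB.
three_months = storage_terabytes(4, 90)

# 1000 cameras, one year at 4 Mbps: about 15,768 TB.
one_year_1000 = storage_terabytes(4, 365, cameras=1000)
```

Both results agree with the rounded figures quoted in the text (about 4 TB and about 15,000 TB), and they only work out if the rate is read as 4 megabits per second rather than megabytes.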

Disclosure of Invention

The invention provides a method for compressing and storing monitoring videos, which reduces the storage capacity and the bandwidth requirement, in order to overcome the defects in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for compressed storage of surveillance video, specifically comprising the following steps:

(1) determining three encoding quantization factors Qp1, Qp2 and Qp3 in advance, where Qp1 > Qp2 > Qp3;

(2) initially, encoding the first frame without any processing, uniformly using Qp2, and storing the YUV data of the first frame in a cache;

(3) when encoding the Nth frame, dividing the whole frame into a static background area and several moving rectangular areas, and further analyzing the moving rectangular areas to obtain smaller areas containing moving sensitive objects; during encoding, the sensitive areas are given priority and encoded with quantization step Qp3, the motion areas are encoded with quantization step Qp2, and the static background area is encoded with quantization step Qp1;

(4) setting N = N + 1 and repeating step (3) until all video frames are encoded.

In this method, each frame of the video is divided by motion detection into a static background area and several moving rectangular areas, and the moving rectangular areas are then further analyzed to obtain smaller areas containing moving sensitive objects. Different coding strategies are applied to the different areas. The sensitive areas are given priority and encoded with a smaller quantization step, so that their image quality stays high without consuming much bitrate; the motion areas are encoded with an ordinary quantization step, so that their image quality is good and the bitrate is moderate; the static background area is encoded with a larger quantization step, which saves more bitrate while keeping the image quality within an acceptable range. As a result, the overall encoding bitrate is greatly reduced. Through video content analysis, the invention divides the video content into areas of several importance levels and applies different processing and coding schemes to each level, so that important video elements are stored with high fidelity while the encoding bitrate is greatly reduced, thereby reducing the total storage capacity and bandwidth requirements and the storage and bandwidth costs.

Preferably, in the step (3), the specific operation method is as follows:

(31) dividing the current Nth frame, from top left to bottom right, into small square blocks of 16x16 pixels (256 pixels each); where the rightmost or bottom edge leaves fewer than 16 pixels, the block is padded with fixed pixel values to 16x16;

(32) comparing each square block with the 16x16 block at the corresponding position in the cached (N-1)th frame and calculating MeanSAD; if MeanSAD is less than a threshold ThrSAD, the current block is considered a static block, otherwise it is identified as a motion block, so that every block is finally identified as either a motion block or a static block;

(33) recursively merging adjacent motion blocks or rectangular areas into larger motion rectangles, wherein all 16x16 blocks inside a merged rectangle are marked as motion blocks; the static block area is marked as the Level 1 area, and the remaining area is the Level 2 area;

(34) running an intelligent object recognition algorithm on the Level 2 area to identify smaller rectangular areas containing key objects, changing the marks of the blocks corresponding to those rectangular areas to Level 3, which completes the image processing of the Nth frame, and storing the input data of the Nth frame in the cache;

(35) encoding with the predetermined quantization factors, with the Level 1 area set to Qp1, the Level 2 area set to Qp2, and the Level 3 area set to Qp3.
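Steps (31) and (35) can be illustrated with a small sketch. It assumes the luma plane is a list of rows of integers, that the "fixed pixels" used for padding are zeros, and that Qp1, Qp2 and Qp3 take the typical values quoted in the embodiment (35, 30, 25). The names `block_grid_size`, `pad_luma` and `qp_map` are hypothetical, and how the per-block Qp values are handed to a real encoder is not shown.

```python
import math

BLOCK = 16
# Typical values from the embodiment; configurable in practice.
QP_BY_LEVEL = {1: 35, 2: 30, 3: 25}


def block_grid_size(width, height, block=BLOCK):
    """Number of block columns and rows once the right/bottom edges are padded."""
    return math.ceil(width / block), math.ceil(height / block)


def pad_luma(luma, block=BLOCK, fill=0):
    """Pad a luma plane with a fixed value so both dimensions are multiples of block."""
    height, width = len(luma), len(luma[0])
    cols, rows = block_grid_size(width, height, block)
    padded = [row + [fill] * (cols * block - width) for row in luma]
    padded += [[fill] * (cols * block) for _ in range(rows * block - height)]
    return padded


def qp_map(levels):
    """Translate a per-block grid of levels (1, 2 or 3) into per-block Qp values."""
    return [[QP_BY_LEVEL[level] for level in row] for row in levels]
```

For a 1920x1080 frame, `block_grid_size(1920, 1080)` gives 120 columns and 68 rows, so the bottom row of blocks is half padding.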

Preferably, in step (32), the MeanSAD calculation method is as follows:

Let the luminance values of the 256 pixels of the block in the Nth frame be Y_i, where i = 1, 2, …, 256, and let the luminance values of the corresponding block in the (N-1)th frame be Z_i, where i = 1, 2, …, 256. Then:

MeanSAD = (|Y₁ – Z₁| + |Y₂ – Z₂| + |Y₃ – Z₃| + … + |Y₂₅₆ – Z₂₅₆| + 128) / 256.
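The MeanSAD formula and the static/motion decision of step (32) can be sketched as follows. This assumes blocks are 16x16 lists of integer luma values and that the division is an integer division, so that the +128 term acts as rounding; the source does not state the arithmetic type. The default threshold of 3 is the typical ThrSAD given in the embodiment, and the function names are hypothetical.

```python
def mean_sad(cur_block, prev_block):
    """Rounded mean of absolute luma differences between two 16x16 blocks."""
    total = sum(
        abs(y - z)
        for cur_row, prev_row in zip(cur_block, prev_block)
        for y, z in zip(cur_row, prev_row)
    )
    # +128 rounds to nearest before dividing by the 256 pixels (assumed integer math).
    return (total + 128) // 256


def is_motion_block(cur_block, prev_block, thr_sad=3):
    """A block is static when MeanSAD < ThrSAD, otherwise it is a motion block."""
    return mean_sad(cur_block, prev_block) >= thr_sad
```

A uniform difference of 2 per pixel yields a MeanSAD of 2 and the block stays static, while a uniform difference of 4 yields 4 and the block is marked as motion.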

Preferably, in step (33), two regions are considered adjacent when the minimum horizontal and vertical distances between their pixels are less than a threshold Thr1, and the two regions are merged as follows: the maximum and minimum horizontal and vertical coordinates of the pixels of the two regions are taken as the maximum and minimum horizontal and vertical coordinates of the merged rectangle, which determines the merged rectangle; merging continues recursively until no more rectangles can be merged, finally yielding several independent, disjoint rectangular regions whose mutual distances exceed Thr1.

Preferably, in step (33), a rectangular area including too few pixels is regarded as an invalid area, and all of the corresponding 16 × 16 blocks are set as still blocks again.
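The merging rule and the small-region filter can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: rectangles are `(x0, y0, x1, y1)` tuples with inclusive pixel coordinates aligned to the 16x16 block grid, the recursion is replaced by an equivalent loop that repeats until no pair can be merged, and the defaults (Thr1 = 16, minimum of 4 blocks) are the example values from the embodiment. All function names are hypothetical.

```python
def axis_gap(a_min, a_max, b_min, b_max):
    """Pixel distance between two inclusive intervals; 0 if they overlap."""
    return max(0, max(a_min, b_min) - min(a_max, b_max))


def are_adjacent(a, b, thr1=16):
    """Adjacent when both the horizontal and vertical gaps are below Thr1."""
    return (axis_gap(a[0], a[2], b[0], b[2]) < thr1
            and axis_gap(a[1], a[3], b[1], b[3]) < thr1)


def union(a, b):
    """Bounding rectangle from the min/max coordinates of both regions."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_rects(rects, thr1=16):
    """Merge adjacent rectangles repeatedly until no pair is adjacent."""
    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if are_adjacent(rects[i], rects[j], thr1):
                    rects[i] = union(rects[i], rects[j])
                    del rects[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan with the updated list
    return rects


def drop_small(rects, block=16, min_blocks=4):
    """Discard rectangles covering fewer than min_blocks 16x16 blocks."""
    def block_count(r):
        return ((r[2] - r[0] + 1) // block) * ((r[3] - r[1] + 1) // block)
    return [r for r in rects if block_count(r) >= min_blocks]
```

With Thr1 = 16, touching blocks (gap of 1 pixel) merge, while blocks separated by one static block (gap of 17 pixels) do not. Rectangles removed by `drop_small` correspond to blocks that are set back to static.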

The invention has the beneficial effects that: through video content analysis, the video content is divided into regions with multi-level importance, different processing and coding schemes are adopted for the video content with different importance levels, so that important video elements can be stored in high fidelity, the coding code rate is greatly reduced, the total storage capacity and bandwidth requirements are reduced, and the storage and bandwidth cost is reduced.

Drawings

FIG. 1 is a schematic diagram of one embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and detailed description.

In the embodiment shown in fig. 1, a method for compressed storage of surveillance video specifically includes the following steps:

(1) determining three encoding quantization factors Qp1, Qp2 and Qp3 in advance, where Qp1 > Qp2 > Qp3; these three values are configurable, with typical values of 35 for Qp1, 30 for Qp2 and 25 for Qp3.

(3) when encoding the Nth frame (N > 1), dividing the whole frame into a static background area and several moving rectangular areas, and further analyzing the moving rectangular areas to obtain smaller areas containing moving sensitive objects; during encoding, the sensitive areas are given priority and encoded with quantization step Qp3, the motion areas are encoded with quantization step Qp2, and the static background area is encoded with quantization step Qp1;

the specific operation method comprises the following steps:

(32) comparing each square block with the 16x16 block at the corresponding position in the cached (N-1)th frame and calculating MeanSAD; if MeanSAD is less than a threshold ThrSAD (a typical ThrSAD is 3), the current block is considered a static block, otherwise it is identified as a motion block, so that every block is finally identified as either a motion block or a static block. MeanSAD is calculated as follows: let the luminance values of the 256 pixels of the block in the Nth frame be Y_i, where i = 1, 2, …, 256, and let the luminance values of the corresponding block in the (N-1)th frame be Z_i, where i = 1, 2, …, 256; then MeanSAD = (|Y₁ – Z₁| + |Y₂ – Z₂| + … + |Y₂₅₆ – Z₂₅₆| + 128) / 256.

In step (33), two regions (a single 16 × 16 block also counts as a region) are considered adjacent when the minimum horizontal and vertical distances between their pixels are both smaller than a threshold Thr1 (Thr1 is configurable and is typically 16). The two regions are merged as follows: the maximum and minimum horizontal and vertical coordinates of the pixels of the two regions are taken as the maximum and minimum horizontal and vertical coordinates of the merged rectangle, which determines the merged rectangle; merging continues recursively until no more rectangles can be merged, finally yielding several independent, disjoint rectangular regions whose mutual distances exceed Thr1. A rectangular area containing too few pixels (for example, one containing fewer than 4 blocks of 16 × 16) is regarded as an invalid area, and all of its 16 × 16 blocks are set back to static blocks.

(34) running an intelligent object recognition algorithm on the Level 2 area (an existing algorithm is used; the specific algorithm is outside the scope of this patent) to identify smaller rectangular areas containing key objects, the key objects being human faces and license plates; changing the marks of the blocks corresponding to those rectangular areas to Level 3, which completes the image processing of the Nth frame; and storing the input data of the Nth frame in the cache;

(35) encoding with the predetermined quantization factors, with the Level 1 area set to Qp1, the Level 2 area set to Qp2, and the Level 3 area set to Qp3. The encoding is applicable to general video standards such as H.264, H.265, AVS2 and AVS3, but different standards require different settings of Qp1, Qp2 and Qp3.

For each video frame, the whole image is divided into three importance levels: Level 1 is the static area, usually the background or motionless objects; Level 2 is the ordinary motion area, such as moving pedestrians and vehicles; Level 3 is the part of the motion area that contains only important sensitive information, such as human faces or license plates. A Level 3 region is a rectangle that is as small as possible while completely containing the sensitive information. A Level 2 region is also based on a rectangle, one that encloses the Level 3 rectangle: the Level 2 region is that enclosing rectangle minus the area of the Level 3 rectangle.
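The three-level labeling can be sketched as a per-block grid. The sketch assumes rectangles are `(x0, y0, x1, y1)` tuples in inclusive pixel coordinates, and that every 16x16 block touched by a sensitive rectangle becomes Level 3 so that the sensitive object is fully covered; the source does not specify how rectangles that are not block-aligned are handled. The function name `level_map` is hypothetical.

```python
def level_map(cols, rows, motion_rects, sensitive_rects, block=16):
    """Per-block levels: 1 = static, 2 = ordinary motion, 3 = sensitive object."""
    levels = [[1] * cols for _ in range(rows)]  # everything starts as static

    def mark(rect, level):
        x0, y0, x1, y1 = rect
        for by in range(y0 // block, y1 // block + 1):
            for bx in range(x0 // block, x1 // block + 1):
                if 0 <= by < rows and 0 <= bx < cols:
                    levels[by][bx] = level

    for rect in motion_rects:
        mark(rect, 2)
    # Sensitive rectangles lie inside motion rectangles and override Level 2.
    for rect in sensitive_rects:
        mark(rect, 3)
    return levels
```

Marking Level 3 after Level 2 gives the "enclosing rectangle minus the Level 3 rectangle" shape described above without any explicit subtraction.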

As shown in FIG. 1, the small box is a Level 3 region containing a human face, and the large box excluding the small box is a Level 2 region containing an ordinary moving object, here mainly a walking person. The rest of the image is the static area, namely Level 1. The person standing on the left is also treated as Level 1 because that person is not moving. Level 1 usually occupies the largest proportion of the area, and often the whole frame is Level 1 (for example, at night when there are no pedestrians, the whole frame remains Level 1 for long periods). The Level 2 region is the next largest. The Level 3 region usually occupies the smallest area but contains the most important information.

For the Level 1 and Level 2 regions, a low-pass filtering algorithm (such as Gaussian filtering; the specific filtering algorithm is outside the scope of this patent) is applied to remove noise from the picture. Surveillance video at night is especially noisy, and low-pass filtering removes this noise effectively; the image becomes slightly blurred, but useless high-frequency information is removed, which is very favorable for compression. The filtering strength can be increased appropriately for the Level 1 region and reduced appropriately for the Level 2 region. For the Level 3 region, the original picture is kept unchanged so that as much detail as possible is retained.
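Level-dependent filtering can be sketched in pure Python as below. The source leaves the filter and its strengths open, so everything specific here is an assumption: a 3x3 binomial kernel stands in for Gaussian filtering, "strength" is modeled as the number of blur passes (2 for Level 1, 1 for Level 2, 0 for Level 3), and borders are handled by clamping coordinates. The names `blur3x3`, `PASSES_BY_LEVEL` and `denoise_by_level` are hypothetical.

```python
# 3x3 binomial kernel, a common small approximation of a Gaussian (weights sum to 16).
KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))

# Assumed strengths: stronger filtering for the static background.
PASSES_BY_LEVEL = {1: 2, 2: 1, 3: 0}


def blur3x3(img):
    """One low-pass pass over a luma plane, clamping coordinates at the borders."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for dy in (-1, 0, 1):
                yy = min(max(y + dy, 0), h - 1)
                for dx in (-1, 0, 1):
                    xx = min(max(x + dx, 0), w - 1)
                    acc += KERNEL[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = (acc + 8) // 16  # rounded division by the kernel sum
    return out


def denoise_by_level(img, levels, block=16):
    """Pick, per pixel, the image blurred as many times as its block's level requires."""
    versions = [img]
    for _ in range(max(PASSES_BY_LEVEL.values())):
        versions.append(blur3x3(versions[-1]))
    return [
        [versions[PASSES_BY_LEVEL[levels[y // block][x // block]]][y][x]
         for x in range(len(img[0]))]
        for y in range(len(img))
    ]
```

An isolated noise spike is left untouched in a Level 3 block, attenuated in a Level 2 block, and attenuated further in a Level 1 block.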

The method is applicable to any video standard, including but not limited to H.264, H.265, AVS2 and AVS3. The method does not use fixed-bitrate coding; instead it uses fixed quantization steps, with the same quantization step for all regions of the same Level: large for Level 1, medium for Level 2 and small for Level 3. (Note: the larger the quantization step, the higher the compression ratio but the greater the coding loss, and vice versa.) The specific quantization steps can be configured as required. As a result, the Level 3 region retains better image quality. The Level 2 region also retains good image quality while occupying little bitrate, because Gaussian filtering has removed useless high-frequency information. The Level 1 region uses Gaussian filtering and a larger quantization step; since the picture there is static, most coding blocks can be coded in skip mode, so the bitrate is small, the quality loss is acceptable, and the encoding computation is low.

The invention was run on a 2U, 2-node Intel Xeon rack server, with video streams from IP surveillance cameras as input. It achieves compression encoding of 30 high-definition channels and reduces the encoding bitrate by a factor of 8 to 10 on average, with essentially unchanged subjective image quality; the recognition rate of face and license plate information remains consistent with that of the original 4 Mbps stream, and storage is reduced by a factor of 8 to 10.

In summary, each frame of the video is divided by motion detection into a static background area and several moving rectangular areas, and the moving rectangular areas are further analyzed to obtain smaller areas containing moving sensitive objects, such as human faces and license plates. Different image processing is then applied to the different areas: the non-sensitive areas are denoised to remove unimportant high-frequency detail, while the sensitive object areas are kept unchanged. Different coding strategies are then applied to the different areas. The sensitive areas are given priority and encoded with a smaller quantization step, so that their image quality stays high without consuming much bitrate; the motion areas are encoded with an ordinary quantization step, so that their image quality is good and the bitrate is moderate; the static background area is encoded with a larger quantization step, which saves more bitrate while keeping the image quality within an acceptable range. As a result, the overall encoding bitrate is greatly reduced. Through video content analysis, the invention divides the video content into areas of several importance levels and applies different processing and coding schemes to each level, so that important video elements are stored with high fidelity while the encoding bitrate is greatly reduced, thereby reducing the total storage capacity and bandwidth requirements and the storage and bandwidth costs.

Claims

1. A method for compressing and storing monitoring videos is characterized by comprising the following steps:

2. The method for compressed storage of surveillance video according to claim 1, wherein in step (3), the specific operation method is as follows:

3. The method for compressed storage of surveillance video according to claim 2, wherein in step (32) the MeanSAD calculation method is as follows:

4. The method for compressed storage of surveillance video according to claim 2, wherein in step (33), two regions are adjacent when the minimum horizontal and vertical distances between their pixels are less than a threshold Thr1, and the two regions are merged as follows: the maximum and minimum horizontal and vertical coordinates of the pixels of the two regions are taken as the maximum and minimum horizontal and vertical coordinates of the merged rectangle, which determines the merged rectangle; merging continues recursively until no more rectangles can be merged, finally yielding several independent, disjoint rectangular regions whose mutual distances exceed Thr1.

5. The method for compressed storage of surveillance video as claimed in claim 4, wherein in step (33), the rectangular area containing too few pixels is taken as the invalid area, and all the corresponding 16x16 blocks are set as the still blocks again.