CN111062975B - Method for accelerating real-time target detection of video frames based on a perceptual hash algorithm
- Publication number
- CN111062975B (application CN201911124925.6A)
- Authority
- CN
- China
- Prior art keywords
- target detection
- picture
- target
- video
- video frames
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/262—Analysis of motion using transform domain methods, e.g. Fourier domain methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/223—Analysis of motion using block-matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20048—Transform domain processing
- G06T2207/20052—Discrete cosine transform [DCT]
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a method for accelerating real-time target detection of video frames based on a perceptual hash algorithm, and belongs to the technical field of video processing. Because adjacent frames are separated by a very short interval and their pictures change only subtly, the method uses the target detection result of the previous frame together with picture fingerprint information either to reuse the previous frame's detection result or to shrink the detection area of the current frame, thereby accelerating target detection.
Description
Technical Field
The invention relates to a method for accelerating real-time target detection of a video frame based on a perceptual hash algorithm, and belongs to the technical field of video processing.
Background
Object detection is a computer vision technique for detecting target objects, such as cars, buildings, and people, usually in pictures or videos. It locates each object in an image and draws a bounding box around it. The process generally has two steps: first the object is classified and its type determined, then a box is drawn around it.
The problem video object detection has to solve is the correct identification and localization of objects in every frame of a video. Compared with image object detection, video is highly redundant: it contains a large amount of temporal locality (similar content at different times) and spatial locality (similar appearance in different scenes), i.e. temporal-context information. Making full use of this temporal context resolves the heavy redundancy between consecutive frames and raises the detection speed; it can also improve detection quality and mitigate problems that video exhibits relative to still images, such as motion blur, defocus, partial occlusion, and deformation.
Existing target detection is mostly based on deep learning, whose computational load is enormous; a powerful GPU is required, GPU resources are relatively expensive, and the hardware cost is therefore high. Moreover, in most video target detection scenarios the camera view is largely a static background in which moving targets appear only briefly, so the detector keeps processing background frames that contain no target at all, wasting GPU resources.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: based on the characteristics of video itself, provide a method for accelerating real-time target detection of video frames that reduces the amount of computation, saves GPU resources, speeds up target detection, releases limited and expensive system resources, and shortens the target detection time.
A method for accelerating real-time target detection of video frames based on a perceptual hash algorithm comprises the following steps:
step 1, obtaining the first frame f1 of a video, calculating a picture fingerprint p1 of f1, and performing target detection to obtain a target box1;
step 2, obtaining the next frame f2 of the video, calculating a picture fingerprint p2 of f2, and comparing p1 with p2;
if p1 = p2, the target in f2 is considered to be box1 as well;
if p1 is not equal to p2, box1 is enlarged by a certain proportion to obtain box2;
step 3, deleting the area with the same size and position as box2 from both f1 and f2, and recalculating the picture fingerprints of the remaining pictures to obtain p3 and p4;
step 4, if p3 = p4, performing target detection on the area with the same size and position as box2 in the second frame to obtain a target box3; if p3 is not equal to p4, performing target detection on the full frame f2 before deletion to obtain a target box4 (an illustrative sketch of these steps follows).
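For illustration only, steps 1–4 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: `compute_fingerprint` and `detect_objects` are hypothetical stand-ins for the perceptual hash described below and for an arbitrary detector, frames are assumed to be NumPy arrays, boxes are (x1, y1, x2, y2) tuples, and "deleting" an area is approximated by zeroing it out.

```python
ENLARGE_RATIO = 0.5  # enlarge the previous box by 50% (one of the proportions mentioned above)

def enlarge(box, ratio, width, height):
    """Grow an (x1, y1, x2, y2) box by `ratio` of its size, clipped to the frame."""
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * ratio / 2, (y2 - y1) * ratio / 2
    return (max(0, x1 - dw), max(0, y1 - dh), min(width, x2 + dw), min(height, y2 + dh))

def mask_region(frame, box):
    """Return a copy of the frame with the box region zeroed out ("deleted")."""
    out = frame.copy()
    x1, y1, x2, y2 = [int(v) for v in box]
    out[y1:y2, x1:x2] = 0
    return out

def detect_next_frame(f1, box1, f2, compute_fingerprint, detect_objects):
    """Steps 2-4 for one new frame f2, given the previous frame f1 and its target box1."""
    h, w = f2.shape[:2]
    if compute_fingerprint(f1) == compute_fingerprint(f2):    # step 2: unchanged picture
        return box1                                           # reuse the previous result
    box2 = enlarge(box1, ENLARGE_RATIO, w, h)                 # step 2: p1 != p2 -> enlarge box1
    p3 = compute_fingerprint(mask_region(f1, box2))           # step 3: fingerprints of the
    p4 = compute_fingerprint(mask_region(f2, box2))           # frames outside box2
    if p3 == p4:                                              # step 4: change confined to box2
        x1, y1, x2, y2 = [int(v) for v in box2]
        bx1, by1, bx2, by2 = detect_objects(f2[y1:y2, x1:x2]) # detect only inside box2
        return (bx1 + x1, by1 + y1, bx2 + x1, by2 + y1)       # map back to frame coordinates
    return detect_objects(f2)                                 # otherwise detect on the full frame
```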
In one embodiment, the enlargement proportion is not particularly limited; it may be 10%, 30%, 50%, 100%, 150%, or the like, and may be set manually according to the target situation.
In one embodiment, the picture fingerprint is calculated by a hash method, and the specific steps include:
S1, reducing the size of the picture;
S2, simplifying the colors;
S3, applying a discrete cosine transform;
S4, taking the upper-left corner of the matrix obtained by the discrete cosine transform;
S5, calculating the average value of all values in the matrix obtained in S4; then setting, from the matrix obtained in S4, a 64-bit hash value of 0s and 1s in which values greater than or equal to the average are set to "1" and values smaller than the average are set to "0", thereby obtaining the picture fingerprint.
In one embodiment, S1 refers to reducing the picture to a size of 8 × 8.
In one embodiment, simplifying the color in S2 refers to conversion to 64 levels of gray.
In one embodiment, a 32 × 32 discrete cosine transform is used in S3.
In one embodiment, the upper-left corner taken in S4 is an 8 × 8 matrix.
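A minimal Python/OpenCV sketch of S1–S5 is given below as an illustration. Note that the embodiments reduce the picture to 8 × 8 (S1) yet apply a 32 × 32 discrete cosine transform (S3); the sketch follows the common perceptual-hash convention of resizing to 32 × 32 so that the DCT size matches the picture, which is an assumption on our part rather than the wording of the embodiments. A BGR color image (as produced by OpenCV) is assumed as input.

```python
import cv2
import numpy as np

def picture_fingerprint(image_bgr):
    """Perceptual-hash picture fingerprint along the lines of S1-S5
    (resize to 32x32 assumed so that it matches the 32x32 DCT of S3)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)        # S2: drop colour information
    small = cv2.resize(gray, (32, 32))                        # S1: shrink the picture
    levels = (small // 4).astype(np.float32)                  # S2: quantise 256 -> 64 grey levels
    dct = cv2.dct(levels)                                     # S3: 32x32 discrete cosine transform
    low = dct[:8, :8]                                         # S4: keep the top-left 8x8 block
    mean = low.mean()                                         # S5: mean of the 64 low-frequency values
    bits = (low >= mean).flatten()                            # >= mean -> 1, < mean -> 0
    return int("".join("1" if b else "0" for b in bits), 2)   # pack into a 64-bit fingerprint
```

Two pictures are then treated as unchanged when their fingerprints are equal, which is the comparison used in step 2 of the method.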
Advantageous effects
1. Reducing the amount of computation for target detection
2. Reducing repeated detection of near-identical video frames
3. Reducing the system stress on servers
4. Accelerating target detection and shortening the detection time
Drawings
FIG. 1 is a flowchart of the perceptual hash algorithm
FIG. 2 is a process flowchart of the present invention
Detailed Description
A video is composed of consecutive still pictures; by virtue of the persistence of human vision, these continuously changing still pictures form a moving video. A picture is two-dimensional, and its data record the pixel values together with their positions; a video is three-dimensional, adding time information, and is therefore more complex. A frame is the smallest component and most basic unit of a video sequence: a still image is one frame, and a video is an image sequence made up of consecutive frames, which together contain all of the video's information. A shot consists of a series of consecutive frames that generally describe the continuous motion of the same subject in the same scene. Within a shot the difference between adjacent frames is small; what is usually described is a change of the same subject, through continuous actions such as translation and zooming.
Because adjacent frames are separated by a very short interval and their pictures change only subtly, the invention uses the target detection result of the previous frame together with picture fingerprint information either to reuse the previous frame's detection result or to shrink the detection area of the current frame, thereby accelerating target detection.
The technical concept of the method is explained in detail as follows:
1) Read the video data to obtain the first frame f1, compute its picture fingerprint to obtain a hash value p1, and perform target detection to obtain the detected target box1;
2) Read the video data to obtain the next frame f2 and compute its picture fingerprint to obtain a hash value p2;
3) Compare the picture fingerprints p1 and p2:
If the picture fingerprint p2 of frame f2 is the same as the picture fingerprint p1 of frame f1, the two adjacent pictures have not changed, so the bounding box of frame f2 is that of frame f1; the target bounding box of frame f2 is therefore also box1;
If the picture fingerprint p2 of frame f2 differs from the picture fingerprint p1 of frame f1, the picture has changed; but since f1 and f2 are adjacent, the time interval between them is very short and the target object moves only a small distance, so the bounding box box1 of frame f1 is moderately enlarged, for example by 50%, to obtain box2;
4) After box2 is obtained, crop the box2 region out of both frame f1 and frame f2, leaving regions f1r and f2r, and recompute the picture fingerprints of the remaining regions, denoted p3 and p4 respectively;
5) Compare the picture fingerprints p3 and p4:
If the picture fingerprints are equal, regions f1r and f2r are considered to contain no target and the target of frame f2 lies inside box2; target detection is then performed only on the small box2 region, yielding the target bounding box box3 and reducing the amount of computation;
If the picture fingerprints are not equal, part of the target of frame f2 is considered to lie in f2r, and target detection is performed on the whole frame f2, yielding the target bounding box box4.
It can be seen from the above process that, by enlarging the bounding box box1 to box2 and hashing what remains outside it, whether and roughly how the target in the video has changed can be determined very quickly; only when a change is confirmed is an appropriately sized area detected again, avoiding a large amount of computation.
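As a usage illustration of the process just described, the per-frame decision can be wrapped in a simple capture loop. This again is a hedged sketch: it reuses the hypothetical `detect_next_frame` and `picture_fingerprint` helpers sketched earlier, and `detect_objects` remains a placeholder for an arbitrary detector.

```python
import cv2

def run_video(path, detect_objects):
    """Hypothetical driver loop: full detection on the first frame, then the
    fingerprint-based shortcut of the method for every subsequent frame."""
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        return
    prev_box = detect_objects(prev)              # step 1: detect on the first frame f1
    while True:
        ok, frame = cap.read()                   # read the next frame f2
        if not ok:
            break
        box = detect_next_frame(prev, prev_box, frame,
                                picture_fingerprint, detect_objects)
        prev, prev_box = frame, box              # f2 becomes the new "previous" frame
    cap.release()
```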
In the above method, the picture fingerprint may be calculated with the perceptual hash algorithm as follows:
1. The size is reduced.
Shrinking the picture is the fastest way to remove high frequencies and detail while keeping the overall light-dark structure.
The picture is reduced to a size of 8 × 8, 64 pixels in total, discarding the picture differences caused by different sizes and aspect ratios.
2. The color is simplified.
The reduced picture is converted to 64-level gray, i.e. every pixel takes one of only 64 gray values.
3. DCT (discrete cosine transform) is calculated.
The DCT decomposes the picture into a set of frequency components; although JPEG uses an 8 × 8 DCT, a 32 × 32 DCT is used here.
4. The DCT is reduced.
Although the DCT result is a 32 × 32 matrix, only the 8 × 8 matrix in the upper-left corner is retained, since this part carries the lowest frequencies of the picture.
5. The average value is calculated.
The average of all 64 values is calculated.
6. The DCT is further reduced.
From the 8 × 8 DCT matrix, a 64-bit hash of 0s and 1s is formed: entries greater than or equal to the DCT mean are set to "1", and entries below it are set to "0". The result does not tell us the true low-frequency values, only how each value compares with the mean; but as long as the overall structure of the picture is unchanged, the hash value is unchanged, so the fingerprint is insensitive to gamma correction or color-histogram adjustment.
7. A hash value is calculated.
The 64 bits are packed into a 64-bit integer; the order in which they are combined does not matter, as long as the same order is used for every picture. This integer is the picture fingerprint.
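A note on comparing fingerprints: the method as described treats two pictures as unchanged only when their 64-bit fingerprints are strictly equal (p1 = p2). A common relaxation, shown below purely as an aside and not part of the claims, tolerates a small Hamming distance between the two hashes.

```python
def hamming_distance(fp1, fp2):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(fp1 ^ fp2).count("1")

# Strict comparison used by the method:            unchanged = (p1 == p2)
# Relaxed comparison (assumption, not claimed):    unchanged = hamming_distance(p1, p2) <= 3
```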
Claims (7)
1. A method for accelerating real-time target detection of video frames based on a perceptual hash algorithm is characterized by comprising the following steps:
step 1, obtaining the first frame f1 of a video, calculating a picture fingerprint p1 of f1, and performing target detection to obtain a target box1;
step 2, obtaining the next frame f2 of the video, calculating a picture fingerprint p2 of f2, and comparing p1 with p2;
if p1 = p2, the target in f2 is considered to be box1 as well;
if p1 is not equal to p2, box1 is enlarged by a certain proportion to obtain box2;
step 3, deleting the area with the same size and position as box2 from both f1 and f2, and recalculating the picture fingerprints of the remaining pictures to obtain p3 and p4;
step 4, if p3 = p4, performing target detection on the area with the same size and position as box2 in the second frame to obtain a target box3; if p3 is not equal to p4, performing target detection on the full frame f2 before deletion to obtain a target box4;
the picture fingerprint is calculated by a Hash method.
2. The method for accelerating real-time target detection of video frames based on perceptual hashing algorithm of claim 1, wherein the enlargement proportion is 10%, 30%, 50%, 100%, or 150%.
3. The method for accelerating real-time target detection of video frames based on perceptual hashing algorithm of claim 1, wherein the picture fingerprints are calculated by a hashing method, comprising the specific steps of:
S1, reducing the size of the picture;
S2, simplifying the colors;
S3, applying a discrete cosine transform;
S4, taking the upper-left corner of the matrix obtained by the discrete cosine transform;
S5, calculating the average value of all values in the matrix obtained in S4; then setting, from the matrix obtained in S4, a 64-bit hash value of 0s and 1s in which values greater than or equal to the average are set to "1" and values smaller than the average are set to "0", thereby obtaining the picture fingerprint.
4. The method for accelerating real-time object detection of video frames based on perceptual hashing algorithm of claim 3, wherein S1 refers to reducing the picture to a size of 8 × 8.
5. The method for accelerating real-time object detection of video frames based on perceptual hashing algorithm of claim 3, wherein simplifying the colors in S2 refers to converting the picture to 64-level gray.
6. The method for accelerating real-time object detection of video frames based on perceptual hashing algorithm of claim 3, wherein in S3, 32 x 32 discrete cosine transform is used.
7. The method for accelerating real-time target detection of video frames based on perceptual hashing algorithm of claim 3, wherein the upper-left corner taken in S4 is an 8 × 8 matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911124925.6A CN111062975B (en) | 2019-11-18 | 2019-11-18 | Method for accelerating real-time target detection of video frame based on perceptual hash algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911124925.6A CN111062975B (en) | 2019-11-18 | 2019-11-18 | Method for accelerating real-time target detection of video frame based on perceptual hash algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111062975A CN111062975A (en) | 2020-04-24 |
CN111062975B (en) | 2022-07-08 |
Family
ID=70297870
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911124925.6A Active CN111062975B (en) | 2019-11-18 | 2019-11-18 | Method for accelerating real-time target detection of video frame based on perceptual hash algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111062975B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108897775A (en) * | 2018-06-01 | 2018-11-27 | 昆明理工大学 | A kind of rapid image identifying system and method based on perceptual hash |
CN110349191A (en) * | 2019-06-25 | 2019-10-18 | 昆明理工大学 | A kind of visual tracking method based on perceptual hash algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN111062975A (en) | 2020-04-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 211100 floor 5, block a, China Merchants high speed rail Plaza project, No. 9, Jiangnan Road, Jiangning District, Nanjing, Jiangsu (South Station area); Applicant after: JIANGSU AIJIA HOUSEHOLD PRODUCTS Co.,Ltd. Address before: 211100 No. 18 Zhilan Road, Science Park, Jiangning District, Nanjing City, Jiangsu Province; Applicant before: JIANGSU AIJIA HOUSEHOLD PRODUCTS Co.,Ltd. |
| GR01 | Patent grant | |