CN116665101B - Method for extracting key frames of monitoring video based on contourlet transformation - Google Patents

Method for extracting key frames of monitoring video based on contourlet transformation Download PDF

Info

Publication number
CN116665101B
CN116665101B CN202310625992.6A
Authority
CN
China
Prior art keywords
key frame
monitoring video
image
information
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310625992.6A
Other languages
Chinese (zh)
Other versions
CN116665101A (en)
Inventor
张云佐
张嘉煜
武存宇
张天
刘亚猛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shijiazhuang Tiedao University
Original Assignee
Shijiazhuang Tiedao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shijiazhuang Tiedao University filed Critical Shijiazhuang Tiedao University
Priority to CN202310625992.6A priority Critical patent/CN116665101B/en
Publication of CN116665101A publication Critical patent/CN116665101A/en
Application granted granted Critical
Publication of CN116665101B publication Critical patent/CN116665101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/90Dynamic range modification of images or parts thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/478Contour-based spectral representations or scale-space representations, e.g. by Fourier analysis, wavelet analysis or curvature scale-space [CSS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20172Image enhancement details
    • G06T2207/20192Edge enhancement; Edge preservation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30232Surveillance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for extracting key frames of surveillance video based on the contourlet transform. The method comprises the following steps: performing a multi-scale, multi-directional decomposition of the video sequence using the contourlet transform to obtain images containing rich directional and contour information; proposing a non-downsampled directional filter combination that filters and decomposes features in different directions, and fusing the decomposed features at different scales by the inverse contourlet transform to obtain a non-photosensitive contour feature map; and constructing a texture enhancement model from a nonlinear enhancement function to enhance image edge information and increase the contrast of the target contour. The method solves the problems that current methods are easily affected by changes in external illumination conditions and fail to extract directional detail information of the target, and effectively improves the accuracy of key frame extraction for the surveillance video field.

Description

Method for extracting key frames of monitoring video based on contourlet transformation
Technical Field
The invention relates to a method for extracting a monitoring video key frame based on contourlet transformation, and belongs to the technical field of computer vision.
Background
In recent years, networks have become pervasive in daily life, and the continuous development of internet media technology has reduced the cost of transmitting information over the network. Compared with traditional communication modes such as text messages, people increasingly prefer more intuitive modes such as video, and the continuous development of video technology has enriched the ways in which people communicate.
While emerging technologies continually bring convenience, video data inevitably shows explosive growth. In the surveillance video field, the falling manufacturing cost of hardware such as cameras allows them to be deployed at every corner of a city, while their uninterrupted operation inevitably generates large amounts of video data that cannot be processed in time. Intelligent surveillance video systems have found their way into many areas of people's daily lives.
In the intelligent office field, surveillance video systems are highly practical for production regulation and control, safety assurance of large public facilities, distance education, and so on. From the viewpoint of urban construction, they provide strong technical support for government efforts to build informatized public security engineering. In the field of traffic management, video surveillance plays an irreplaceable role: cameras installed at traffic junctions such as intersections and highway toll gates capture traffic violations by motor vehicles and thus help prevent traffic accidents to a certain extent. In the smart home field, surveillance likewise shows its value: outdoor monitoring equipment can detect whether strangers are at the door, safeguarding home security, while indoor monitoring equipment allows the state of the home to be known and monitored in real time, providing a good living environment.
Intelligent surveillance systems thus play a very important role in many fields; however, existing systems face problems such as the inability to store massive data completely, complicated retrieval of key information, and excessive labor cost. Key frames are a set of image or video frames that represent the most significant content of a video. Key frame extraction studies how to reflect the primary content of the original video to the greatest extent using the smallest possible sequence of still images. As a technique that reduces the redundancy of original video data and recombines its core content into a visual abstract, key frame extraction has attracted extensive attention from researchers at home and abroad. The technology of extracting static abstracts from video data with complex content and huge volume, and condensing them into one frame or several video frame sequences that fully express the video content, has therefore become a current research hot spot. Realizing this technology helps relieve the storage pressure of massive video, reduces the workload of video abstract retrieval, and thus saves labor and financial costs.
Although video retrieval technology is continuously updated and iterated, people still cannot efficiently and accurately obtain valuable information from massive video data. Surveillance video is typically captured as long, unclipped image sequences, which results in a large amount of information redundancy, often consisting of blank content in which no target appears. Meanwhile, existing key frame extraction techniques for surveillance video still have limitations: when illumination conditions change, sudden changes in brightness reduce key frame extraction efficiency. In addition, existing methods fail to adequately extract directional detail information of the target in the surveillance video. To solve these problems, a surveillance video key frame extraction method based on the contourlet transform is proposed. First, exploiting the time-frequency localization capability of the contourlet transform, the video sequence is decomposed into multiple scales and directions, yielding images containing rich directional and contour information. Then, a non-downsampled directional filter combination is proposed, and a non-photosensitive contour feature map is obtained by filtering and fusing features in different directions. Next, a texture enhancement model is constructed from a nonlinear enhancement function to improve the accuracy of key frame extraction by strengthening image edge information. Finally, a key frame screening model based on structural similarity is constructed to select and extract key frames.
Disclosure of Invention
A key frame extraction method based on the contourlet transform for the surveillance video field, characterized by comprising at least the following steps:
S1: collecting surveillance video from an intelligent surveillance video system to obtain a video data set;
S2: clipping the surveillance video and performing preprocessing such as graying to obtain a surveillance video sequence, and resizing it to a preset resolution (a preprocessing sketch is given after this list);
S3: performing a multi-scale, multi-directional decomposition of the processed surveillance video sequence to obtain images containing rich directional and contour information;
S4: selecting the image high-frequency components containing image edge information for decomposition by the non-downsampled directional filter combination to obtain different band-pass directional sub-bands, and performing contourlet reconstruction fusion of the band-pass directional sub-bands at different scales to obtain a non-photosensitive contour feature map;
S5: using the texture enhancement model to enhance the contour texture of the non-photosensitive contour feature map, so as to improve image quality and enhance the contrast of the gray-level information in the image;
S6: calculating the structural similarity of adjacent frames and forming a structural similarity curve to replace the true shot dynamic curve, with the minimum points of the structural similarity curve replacing the inflection points within the shot;
S7: putting all minima of the structural similarity curve into a set, named the key frame set.
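As a minimal illustration of steps S1 and S2, the Python sketch below reads a surveillance video, converts each frame to grayscale, and resizes it to a preset resolution. The use of OpenCV, the file name, and the 256x256 resolution are illustrative assumptions rather than part of the claimed method.

```python
# Illustrative preprocessing for S1-S2: frame extraction, graying, resizing.
# The input path and target resolution below are assumptions.
import cv2

def load_video_frames(path: str, size=(256, 256)):
    """Return the surveillance video as a list of grayscale frames at a preset resolution."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # graying preprocessing
        frames.append(cv2.resize(gray, size))           # adjust to preset resolution
    cap.release()
    return frames

frames = load_video_frames("surveillance.mp4")  # hypothetical input file
```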
Further, preprocessing the surveillance video and applying the multi-scale, multi-directional contourlet transform includes:
the contourlet transform is a discrete transform method specific to two-dimensional images that can efficiently process graphic geometries that contain contour information. One of the biggest features of the contourlet, which is different from other transformation methods, is the flexible multi-resolution and the different directionality, which means that it allows different directions within different scales. A non-downsampled contourlet transform is a contourlet transform that has the property of being translationally invariant, typically consisting of a non-downsampled pyramid filter and a non-downsampled direction filter.
The proposed algorithm first performs a non-downsampled pyramid decomposition of the surveillance video frame sequence to obtain the high-frequency and low-frequency components of each image. Because the high-frequency components contain the information about strong image changes, the invention selects the high-frequency components containing image edge information for decomposition by the non-downsampled directional filter combination, yielding different band-pass directional sub-bands.
Next, the non-downsampled pyramid decomposition is applied again to the low-frequency component containing the image gray-level information, and its high-frequency component is processed by the same operation to obtain further band-pass directional sub-bands.
Then, contourlet reconstruction fusion is performed on the band-pass directional sub-bands at the different scales, yielding the non-photosensitive contour feature map.
In place of the single non-downsampled directional filter of the conventional method, a non-downsampled directional filter combination is used, combining a quadrature filter, a smooth quadrature filter, and a two-dimensional filter based on the McClellan transform.
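Since no standard Python library provides the non-downsampled contourlet transform or the specific quadrature, smooth quadrature, and McClellan-based filters described above, the sketch below only illustrates the overall flow: split each scale into low- and high-frequency bands without downsampling, filter the high-frequency band in several directions, and fuse the responses across directions and scales. The Gaussian low-pass stage and the oriented kernels are stand-in assumptions.

```python
# Sketch of the decompose -> directional-filter -> fuse flow behind the
# non-photosensitive contour feature map. The Gaussian low-pass and the
# oriented kernels below are stand-in assumptions, not the patent's
# quadrature / smooth-quadrature / McClellan-based filter combination.
import cv2
import numpy as np

DIRECTIONAL_KERNELS = [
    np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32),   # vertical edges
    np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float32),   # horizontal edges
    np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], dtype=np.float32),   # ~45 degrees
    np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], dtype=np.float32),   # ~135 degrees
]

def contour_feature_map(gray: np.ndarray, levels: int = 2) -> np.ndarray:
    """Illustrative non-photosensitive contour feature map (normalised to [0, 1])."""
    low = gray.astype(np.float32)
    fused = np.zeros_like(low)
    for _ in range(levels):
        blurred = cv2.GaussianBlur(low, (9, 9), 2.0)   # stand-in non-downsampled pyramid stage
        high = low - blurred                           # high-frequency (edge) band, same size
        for kernel in DIRECTIONAL_KERNELS:             # directional filtering of the high band
            fused += np.abs(cv2.filter2D(high, -1, kernel))
        low = blurred                                  # recurse on the low-frequency band
    peak = fused.max()
    return fused / peak if peak > 0 else fused

example = np.random.randint(0, 256, (256, 256)).astype(np.uint8)
feature_map = contour_feature_map(example)
```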
Further, a texture enhancement model is designed for contour texture enhancement of a non-photosensitive contour feature map, comprising:
because the non-downsampling direction filter combination is used for filtering and acquiring information aiming at high-frequency components of an image, the extracted non-photosensitive contour feature map has the problems of poor image quality and unobvious gray contrast.
Therefore, the invention designs a texture enhancement model to enhance the contour texture of the non-photosensitive contour feature map, improving image quality and enhancing the contrast of the gray-level information in the image. The invention improves the gray-level contrast of the non-photosensitive contour feature map by nonlinear stretching.
The extracted high-frequency coefficients are enhanced according to a nonlinear enhancement function in order to increase the contrast of the contour texture information in the image. In the nonlinear enhancement function, x denotes the gray value of the non-photosensitive contour feature map before texture enhancement, γ is the resulting gray value, r is the parameter controlling the slope of the function, and α also influences the slope.
According to the formula, if the slope is too large, the gray-level contrast of the image becomes excessive and the image is distorted; if it is too small, the light-dark contrast of the image remains low and the enhancement effect of the texture enhancement model is not obvious. In view of this, the parameters are set to α = r = 0.5 in the present invention.
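The nonlinear enhancement function itself appears as a formula in the original document and is not reproduced here; the sketch below therefore uses a sign-preserving power-law stretch as an assumed stand-in, with r governing the slope and α scaling the output, and with the stated setting α = r = 0.5.

```python
# Assumed stand-in for the nonlinear enhancement function (the exact formula is
# not reproduced in this text): a sign-preserving power-law stretch whose slope
# is governed by r and whose output is scaled by alpha.
import numpy as np

def nonlinear_enhance(x: np.ndarray, alpha: float = 0.5, r: float = 0.5) -> np.ndarray:
    x = x.astype(np.float32)
    peak = float(np.abs(x).max()) or 1.0            # normalise coefficients to [-1, 1]
    xn = x / peak
    gamma = alpha * np.sign(xn) * np.abs(xn) ** r   # small coefficients are boosted relative to large ones
    return gamma * peak                             # map back to the original coefficient range

enhanced = nonlinear_enhance(np.linspace(-1.0, 1.0, 5))
```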
Further, a key frame screening criterion based on structural similarity is designed, which comprises the following steps:
since motion is a nonlinear process and the dimensions of the image data are high, the time complexity of computing the high dynamic curve is high. Therefore, a linearization description of the target motion process is required to achieve fast and accurate key frame extraction.
Various methods exist for such a linearized description that reduces the time complexity of the highly dynamic curve; the proposed method uses the structural similarity curve to replace the true shot dynamic curve and uses the minimum points of that curve to replace the inflection points within the shot.
In general, the overall structure of an image is characterized by its structural information; in addition, the gray-level and brightness information of the image strongly affects how the image is perceived. Such information is what observers typically attend to when viewing images.
Assuming a video sequence set S = {s_i | i = 1, 2, ..., n}, where x and y denote two adjacent video frames, the luminance, contrast, and structure comparison functions are defined as follows:
l(x, y) = (2 μ_x μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1)
c(x, y) = (2 σ_x σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2)
s(x, y) = (σ_xy + C_3) / (σ_x σ_y + C_3)
where μ_x and μ_y are the means of images x and y, reflecting their brightness information; σ_x and σ_y are the standard deviations of x and y, reflecting their contrast information; σ_xy is the covariance of x and y, reflecting the similarity of their structural information; and C_1, C_2, C_3 are small positive constants that prevent abnormal results when a denominator approaches zero. Combining these three components, the structural similarity of x and y is given by:
SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ
where α, β, and γ are all greater than 0 and adjust the relative importance of the three components; when α = β = γ = 1 and C_3 = C_2 / 2, this simplifies to:
SSIM(x, y) = ((2 μ_x μ_y + C_1)(2 σ_xy + C_2)) / ((μ_x^2 + μ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2))
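A minimal implementation of the simplified SSIM above is sketched below using global image statistics; the absence of local windows and the constants C_1 = (0.01 L)^2 and C_2 = (0.03 L)^2 follow common convention and are assumptions, not values fixed by the description.

```python
# Global-statistics implementation of the simplified SSIM above, plus the
# structural similarity curve over adjacent frames.
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_curve(frames):
    """Structural similarity of every pair of adjacent frames (the SSIMC)."""
    return [ssim(frames[i - 1], frames[i]) for i in range(1, len(frames))]
```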
however, due to the simplification, the minimum value point screened is not completely equivalent to the inflection point of the motion track of the target, so that iteration and optimization are required. The procedure is as follows.
Step 1: and extracting a complete video frame sequence of the monitoring video. The method comprises the steps of firstly cutting an input monitoring video sequence, so as to obtain a complete monitoring video frame sequence.
Step 2: and calculating the structural similarity of two adjacent frames. Structural similarity of two adjacent frames in the set of video sequences S is calculated according to equation (5-6) and stored as defined below as S SSIM Of the sets, set S SSIM The definition is as follows:
(2,SSIM(1,2))∪(3,SSIM(2,3))∪...∪(n,SSIM(n-1,n))
step 3: forming a structural similarity curve. The structural similarity curve (Structural Similarity Curve, SSIMC) is based on the set S formed by the structural similarity of two adjacent frames in step 2 SSIM Drawing toAnd (3) forming the finished product.
Step 4: a final keyframe set is formed. This method places all minima in the SSIMC into set Z and names them as candidate keyframe sets. Since the number of points K of the minimum value of the curve does not necessarily meet the required key frame number K 0 The key frame secondary extraction is performed according to the following rule.
(1) If K < K_0, all key frames in set Z are first extracted as final key frames and stored in the final key frame set F; then K_0 - K additional frames are obtained by linear interpolation and stored in F.
(2) If K = K_0, the key frames in the candidate key frame set Z are extracted directly as the final key frames and stored in the final key frame set F.
(3) If K > K_0, the peak signal-to-noise ratio (Peak Signal Noise Ratio, PSNR) of adjacent frames in the video sequence set S is first calculated. Next, the average peak signal-to-noise ratio PSNR_avg over the adjacent frame pairs is calculated and compared against the key frames in the candidate key frame set Z; if the frame-difference PSNR corresponding to a candidate key frame is greater than PSNR_avg, that frame is deleted from the candidate key frame set, finally completing the extraction of K_0 key frames. PSNR_avg is calculated as the mean of the PSNR values of all adjacent frame pairs in S.
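The sketch below illustrates Step 4 and rules (1) through (3). How the K_0 - K interpolated frames are chosen and how the average PSNR is formed are not fully specified above, so evenly spaced extra frame indices, PSNR_avg as the mean adjacent-frame PSNR, and an 8-bit dynamic range are assumptions.

```python
# Candidate-minima selection on the SSIM curve and secondary key frame
# extraction. curve[i] is assumed to be SSIM(frames[i], frames[i+1]).
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def select_key_frames(frames, curve, k0: int):
    """Return roughly k0 key frame indices from the candidate minima of the curve."""
    z = [i + 1 for i in range(1, len(curve) - 1)
         if curve[i] < curve[i - 1] and curve[i] < curve[i + 1]]   # candidate set Z
    if len(z) == k0:                                               # rule (2)
        return sorted(z)
    if len(z) < k0:                                                # rule (1): add K_0 - K frames
        extra = np.linspace(0, len(frames) - 1, k0 - len(z) + 2)[1:-1]
        return sorted(set(z) | {int(round(e)) for e in extra})
    pair_psnr = [psnr(frames[i - 1], frames[i]) for i in range(1, len(frames))]
    psnr_avg = float(np.mean(pair_psnr))                           # rule (3): prune redundant candidates
    kept = [i for i in z if pair_psnr[i - 1] <= psnr_avg]
    return sorted(kept)[:k0]
```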
drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a frame flow chart of a method for extracting key frames of a surveillance video based on contourlet transformation;
FIG. 2 is a schematic diagram of a structure of a contourlet transformation model according to the present invention;
FIG. 3 is a schematic diagram of a texture enhancement model according to the present invention;
FIG. 4 is a schematic diagram of a video keyframe screening model based on structural similarity according to the present invention;
FIG. 5 is a diagram of different filter characteristics;
FIG. 6 shows the first set of effectiveness test results of the present invention;
FIG. 7 shows the second set of effectiveness test results of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be further described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIG. 1, the framework of the method for extracting key frames of surveillance video based on the contourlet transform according to a first embodiment of the present invention includes:
S1: collecting surveillance video from an intelligent surveillance video system to obtain a video data set;
S2: clipping the surveillance video and performing preprocessing such as graying to obtain a surveillance video sequence, and resizing it to a preset resolution;
S3: performing a multi-scale, multi-directional decomposition of the processed surveillance video sequence to obtain images containing rich directional and contour information;
S4: selecting the image high-frequency components containing image edge information for decomposition by the non-downsampled directional filter combination to obtain different band-pass directional sub-bands, and performing contourlet reconstruction fusion of the band-pass directional sub-bands at different scales to obtain a non-photosensitive contour feature map;
the algorithm provided by the invention firstly carries out non-downsampling pyramid decomposition on the monitoring video frame sequence, thereby obtaining the high-frequency component and the low-frequency component of the image.
Because the high-frequency components contain the information about strong image changes, the invention selects the high-frequency components containing image edge information for decomposition by the non-downsampled directional filter combination, yielding different band-pass directional sub-bands.
The non-downsampled pyramid decomposition is then repeated on the low-frequency component containing the image gray-level information, and the same operation is applied to its high-frequency component to obtain further band-pass directional sub-bands.
Then, contourlet reconstruction fusion is performed on the band-pass directional sub-bands at the different scales, yielding the non-photosensitive contour feature map.
In place of the single non-downsampled directional filter of the conventional method, the present invention uses a non-downsampled directional filter combination, combining a quadrature filter, a smooth quadrature filter, and a two-dimensional filter based on the McClellan transform. FIG. 5 shows the feature maps of an original image after filtering with the different filters.
S5: using the texture enhancement model to enhance the contour texture of the non-photosensitive contour feature map, so as to improve image quality and enhance the contrast of the gray-level information in the image;
To this end, the invention designs a texture enhancement model to enhance the contour texture of the non-photosensitive contour feature map, improving image quality and enhancing the contrast of the gray-level information in the image. The invention improves the gray-level contrast of the non-photosensitive contour feature map by nonlinear stretching.
The extracted high-frequency coefficients are enhanced according to a nonlinear enhancement function in order to increase the contrast of the contour texture information in the image. In the nonlinear enhancement function, x denotes the gray value of the non-photosensitive contour feature map before texture enhancement, γ is the resulting gray value, r is the parameter controlling the slope of the function, and α also influences the slope.
According to the formula, if the slope is too large, the gray-level contrast of the image becomes excessive and the image is distorted; if it is too small, the light-dark contrast of the image remains low and the enhancement effect of the texture enhancement model is not obvious. In view of this, the parameters are set to α = r = 0.5 in the present invention.
S6: calculating the structural similarity of adjacent frames and forming a structural similarity curve to replace the true shot dynamic curve, with the minimum points of the structural similarity curve replacing the inflection points within the shot;
S7: putting all minima of the structural similarity curve into a set, named the key frame set.
In general, the overall structure of an image is characterized by its structural information; in addition, the gray-level and brightness information of the image strongly affects how the image is perceived. Such information is what observers typically attend to when viewing images.
Assuming a video sequence set S = {s_i | i = 1, 2, ..., n}, where x and y denote two adjacent video frames, the luminance, contrast, and structure comparison functions are defined as follows:
l(x, y) = (2 μ_x μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1)
c(x, y) = (2 σ_x σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2)
s(x, y) = (σ_xy + C_3) / (σ_x σ_y + C_3)
where μ_x and μ_y are the means of images x and y, reflecting their brightness information; σ_x and σ_y are the standard deviations of x and y, reflecting their contrast information; σ_xy is the covariance of x and y, reflecting the similarity of their structural information; and C_1, C_2, C_3 are small positive constants that prevent abnormal results when a denominator approaches zero. Combining these three components, the structural similarity of x and y is given by:
SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ
where α, β, and γ are all greater than 0 and adjust the relative importance of the three components; when α = β = γ = 1 and C_3 = C_2 / 2, this simplifies to:
SSIM(x, y) = ((2 μ_x μ_y + C_1)(2 σ_xy + C_2)) / ((μ_x^2 + μ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2))
however, due to the simplification, the minimum value point screened is not completely equivalent to the inflection point of the motion track of the target, so that iteration and optimization are required. The procedure is as follows.
Step 1: and extracting a complete video frame sequence of the monitoring video. The method comprises the steps of firstly cutting an input monitoring video sequence, so as to obtain a complete monitoring video frame sequence.
Step 2: and calculating the structural similarity of two adjacent frames. Structural similarity of two adjacent frames in the set of video sequences S is calculated according to equation (5-6) and stored as defined below as S SSIM Of the sets, set S SSIM The definition is as follows:
(2,SSIM(1,2))∪(3,SSIM(2,3))∪...∪(n,SSIM(n-1,n))
step 3: forming a structural similarity curve. The structural similarity curve (Structural Similarity Curve, SSIMC) is based on the set S formed by the structural similarity of two adjacent frames in step 2 SSIM Drawing to obtain the final product.
Step 4: a final keyframe set is formed. This method places all minima in the SSIMC into set Z and names them as candidate keyframe sets. Since the number of points K of the minimum value of the curve does not necessarily meet the required key frame number K 0 The key frame secondary extraction is performed according to the following rule.
(1) If K < K_0, all key frames in set Z are first extracted as final key frames and stored in the final key frame set F; then K_0 - K additional frames are obtained by linear interpolation and stored in F.
(2) If K = K_0, the key frames in the candidate key frame set Z are extracted directly as the final key frames and stored in the final key frame set F.
(3) If K > K_0, the peak signal-to-noise ratio (Peak Signal Noise Ratio, PSNR) of adjacent frames in the video sequence set S is first calculated. Next, the average peak signal-to-noise ratio PSNR_avg over the adjacent frame pairs is calculated and compared against the key frames in the candidate key frame set Z; if the frame-difference PSNR corresponding to a candidate key frame is greater than PSNR_avg, that frame is deleted from the candidate key frame set, finally completing the extraction of K_0 key frames. PSNR_avg is calculated as the mean of the PSNR values of all adjacent frame pairs in S.
an embodiment of the invention provides a monitoring video key frame extraction method terminal device based on contourlet transformation, which comprises one or more input devices, one or more output devices, one or more processors and a memory, wherein the memory is used for storing a computer program, and the processor is used for executing the computer program to realize the target detection method facing an unmanned aerial vehicle platform.
An embodiment of the present invention provides a computer readable storage medium storing a computer program, where the computer program when executed by a processor performs the above-mentioned method for extracting keyframes of surveillance video based on contourlet transformation.
To verify the validity of the above embodiment, the present invention is compared with state-of-the-art key frame extraction methods by computing precision, recall, and F1 score. Specifically, the Visor dataset, which contains rich local motion information, and the real-world abnormal-situation dataset UCF-Crime are used. The Visor dataset contains various human activities such as walking, jumping, and placing objects, and is used to judge the ability of the proposed method to extract target detail information. The UCF-Crime dataset covers a total of seven real-world anomaly categories. These abnormal behaviors are chosen because they affect normal life in the real world and therefore require accurate key frame extraction.
The precision, recall, and F1 score of each video in both datasets are presented as color-scale maps. Table 1 shows that the recall of the proposed method is clearly better than that of the comparison methods, a significant advantage, while the F1 score is also improved relative to the comparison methods. Table 2 shows that the advantage of the proposed method again lies mainly in improved recall. Video17 and Video18 depict acts of vandalism, and the proposed method is significantly better than the comparison methods in precision, recall, and F1 score on these videos. Fire is often accompanied by abrupt brightness changes in the surveillance shot; the comparison key frame extraction methods are easily affected by such illumination changes and cannot accurately capture key frames with abrupt brightness changes, so their extraction accuracy drops. In contrast, the proposed key frame extraction technique generates a non-photosensitive contour feature map through multi-filter fusion, so it maintains high accuracy on videos, such as fires, in which the brightness of the shot changes abruptly. In summary, the precision and recall of the key frames extracted by the proposed method are better than those of the comparison methods, which further demonstrates its effectiveness and also verifies its generalization ability.
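For reference, precision, recall, and F1 score can be computed as sketched below; the rule for matching an extracted key frame to a ground-truth key frame (here, a tolerance window of a few frames) is an assumption, since the matching criterion is not specified above.

```python
# Sketch of the precision / recall / F1 evaluation for extracted key frames.
def evaluate_key_frames(predicted, ground_truth, tolerance: int = 5):
    remaining = list(ground_truth)
    true_pos = 0
    for p in predicted:
        match = next((g for g in remaining if abs(p - g) <= tolerance), None)
        if match is not None:
            true_pos += 1
            remaining.remove(match)            # each ground-truth frame is matched at most once
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(evaluate_key_frames(predicted=[10, 52, 100], ground_truth=[12, 50, 98, 150]))
```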
Table 1. Comparison of results of different methods on dataset 1
Table 2. Comparison of results of different methods on dataset 2

Claims (2)

1. A key frame extraction method for the surveillance video field, characterized by comprising at least the following steps:
S1: collecting surveillance video from an intelligent surveillance video system to obtain a video data set;
S2: clipping the surveillance video and performing preprocessing such as graying to obtain a surveillance video sequence, and resizing it to a preset resolution;
S3: performing a multi-scale, multi-directional decomposition of the processed surveillance video sequence using the contourlet transform to obtain images containing rich directional and contour information; the specific operation is as follows: selecting the image high-frequency components containing image edge information and decomposing them with a non-downsampled directional filter combination, used in place of the conventional single non-downsampled directional filter, to obtain different band-pass directional sub-bands, and performing contourlet reconstruction fusion of the band-pass directional sub-bands at different scales to obtain a non-photosensitive contour feature map; the non-downsampled directional filter combination comprises a quadrature filter, a smooth quadrature filter, and a two-dimensional filter based on the McClellan transform;
S4: using the texture enhancement model to enhance the contour texture of the non-photosensitive contour feature map, so as to improve image quality and enhance the contrast of the gray-level information in the image;
S5: calculating the structural similarity of adjacent frames and forming a structural similarity curve to replace the true shot dynamic curve, replacing the inflection points within the shot with the minimum points of the structural similarity curve, putting these points into a set Z, and naming it the key frame set; the aim is to describe the target motion trajectory linearly so as to reduce the time complexity of the highly dynamic curve;
S6: performing a secondary key frame extraction on the key frame set to reconcile the number K of curve minimum points with the required number of key frames K_0, specifically: if K < K_0, all key frames in set Z are first extracted as final key frames and stored in the final key frame set F, and then K_0 - K additional frames are obtained by linear interpolation and stored in F; if K = K_0, the key frames in the candidate key frame set Z are extracted directly as the final key frames and stored in the final key frame set F; if K > K_0, the peak signal-to-noise ratio PSNR of adjacent frames in the video sequence set S is first calculated, then the average peak signal-to-noise ratio PSNR_avg over the adjacent frame pairs is calculated and compared against the key frames in the candidate key frame set Z, and if the frame-difference PSNR corresponding to a candidate key frame is greater than PSNR_avg, that frame is deleted from the candidate key frame set, finally completing the extraction of K_0 key frames.
2. The method according to claim 1, characterized in that the texture enhancement model improves the gray-level contrast of the non-photosensitive contour feature map by nonlinear stretching; the extracted high-frequency coefficients are enhanced according to a nonlinear enhancement function so as to increase the contrast of the contour texture information in the image;
in the nonlinear enhancement function, x denotes the gray value of the non-photosensitive contour feature map before texture enhancement, γ is the resulting gray value, r is the parameter controlling the slope of the function, and α also influences the slope.
CN202310625992.6A 2023-05-30 2023-05-30 Method for extracting key frames of monitoring video based on contourlet transformation Active CN116665101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310625992.6A CN116665101B (en) 2023-05-30 2023-05-30 Method for extracting key frames of monitoring video based on contourlet transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310625992.6A CN116665101B (en) 2023-05-30 2023-05-30 Method for extracting key frames of monitoring video based on contourlet transformation

Publications (2)

Publication Number Publication Date
CN116665101A CN116665101A (en) 2023-08-29
CN116665101B true CN116665101B (en) 2024-01-26

Family

ID=87716588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310625992.6A Active CN116665101B (en) 2023-05-30 2023-05-30 Method for extracting key frames of monitoring video based on contourlet transformation

Country Status (1)

Country Link
CN (1) CN116665101B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117934574A (en) * 2024-03-22 2024-04-26 深圳市智兴盛电子有限公司 Method, device, equipment and storage medium for optimizing image of automobile data recorder

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077749A (en) * 2014-06-17 2014-10-01 长江大学 Seismic data denoising method based on contourlet transformation
CN106210444A (en) * 2016-07-04 2016-12-07 石家庄铁道大学 Kinestate self adaptation key frame extracting method
CN109978802A (en) * 2019-02-13 2019-07-05 中山大学 High dynamic range images fusion method in compressed sensing domain based on NSCT and PCNN
CN113221674A (en) * 2021-04-25 2021-08-06 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120148149A1 (en) * 2010-12-10 2012-06-14 Mrityunjay Kumar Video key frame extraction using sparse representation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077749A (en) * 2014-06-17 2014-10-01 长江大学 Seismic data denoising method based on contourlet transformation
CN106210444A (en) * 2016-07-04 2016-12-07 石家庄铁道大学 Kinestate self adaptation key frame extracting method
CN109978802A (en) * 2019-02-13 2019-07-05 中山大学 High dynamic range images fusion method in compressed sensing domain based on NSCT and PCNN
CN113221674A (en) * 2021-04-25 2021-08-06 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Partition enhancement method for side-scan sonar images in the non-subsampled contourlet transform domain; Wu Helong et al.; Acta Armamentarii; pp. 1463-1470 *

Also Published As

Publication number Publication date
CN116665101A (en) 2023-08-29

Similar Documents

Publication Publication Date Title
Chen et al. Anomaly detection in surveillance video based on bidirectional prediction
Chen et al. Saliency detection via the improved hierarchical principal component analysis method
CN112396027A (en) Vehicle weight recognition method based on graph convolution neural network
CN116665101B (en) Method for extracting key frames of monitoring video based on contourlet transformation
CN113792606B (en) Low-cost self-supervision pedestrian re-identification model construction method based on multi-target tracking
CN112818849B (en) Crowd density detection algorithm based on context attention convolutional neural network for countermeasure learning
CN113449660A (en) Abnormal event detection method of space-time variation self-coding network based on self-attention enhancement
CN116343103B (en) Natural resource supervision method based on three-dimensional GIS scene and video fusion
CN113128360A (en) Driver driving behavior detection and identification method based on deep learning
Hu et al. Parallel spatial-temporal convolutional neural networks for anomaly detection and location in crowded scenes
Shen et al. An image enhancement algorithm of video surveillance scene based on deep learning
Wang et al. Paccdu: pyramid attention cross-convolutional dual unet for infrared and visible image fusion
Li et al. An end-to-end system for unmanned aerial vehicle high-resolution remote sensing image haze removal algorithm using convolution neural network
CN111626944B (en) Video deblurring method based on space-time pyramid network and against natural priori
CN117218545A (en) LBP feature and improved Yolov 5-based radar image detection method
CN116543333A (en) Target recognition method, training method, device, equipment and medium of power system
CN112818818B (en) Novel ultra-high-definition remote sensing image change detection method based on AFFPN
CN115564709A (en) Evaluation method and system for robustness of power algorithm model in confrontation scene
CN116453033A (en) Crowd density estimation method with high precision and low calculation amount in video monitoring scene
Zheng et al. Scene recognition model in underground mines based on CNN-LSTM and spatial-temporal attention mechanism
CN112926552A (en) Remote sensing image vehicle target recognition model and method based on deep neural network
CN112633142A (en) Power transmission line violation building identification method and related device
CN117409206B (en) Small sample image segmentation method based on self-adaptive prototype aggregation network
Wu et al. Research and Application of Big Data Technology in Video Intelligent Analysis
Lv et al. Video enhancement and super-resolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant