CN110602444B - Video summarization method based on the Weber-Fechner law and the temporal masking effect - Google Patents
Video summarization method based on the Weber-Fechner law and the temporal masking effect
- Publication number
- CN110602444B (application CN201910723748.7A)
- Authority
- CN
- China
- Prior art keywords
- frame
- video
- frames
- weber
- marked
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
A video summarization method based on the Weber-Fechner law and the temporal masking effect consists of Gaussian filtering, region blocking, determination of frame-difference Euclidean distances, construction of a Weber-Fechner model, threshold determination, construction of a denoising model, key frame extraction, and synthesis of the summary video from the key frames. The video area is processed block by block, which avoids missed detections caused by targets that are too far from the camera. Combining the Euclidean-distance frame difference method with the Weber-Fechner model copes effectively with complex surveillance environments and avoids repeated manual adjustment of the threshold. A denoising model based on the temporal masking effect filters out interference noise before the video is synthesized, improving the display quality of the synthesized video. The method does not depend on color information in the video frames and is therefore also effective for surveillance video captured at night.
Description
Technical Field
The invention belongs to the technical field of video analysis, and particularly relates to a video summarization method.
Background
With the mass adoption of communication tools and surveillance equipment and the rapid development of the film and television industry, the massive video data being generated not only puts enormous pressure on data storage but also hinders the rapid retrieval of key video information. Manual retrieval of key video information is inefficient, and human sensory fatigue easily leads to missed and false detections. Video summarization technology is therefore essential for browsing and using video data quickly and efficiently.
Current video summary generation methods mainly comprise methods based on key frame extraction, on the spatio-temporal transformation of moving objects, and on highlight scene identification. Key frame extraction methods include those based on motion analysis, shot boundaries, image content, and compressed video streams. The frame difference method is one of the commonly used motion-analysis methods; its basic principle is to extract moving targets from the pixel-wise temporal difference between two or three adjacent frames of an image sequence. In the frame difference method the choice of threshold is critical: a threshold that is too low fails to suppress noise, while one that is too high ignores detail changes in the image.
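By way of background illustration (not the invention's method), the following minimal sketch in Python with OpenCV shows the classical pixel-wise frame difference with a fixed threshold; the function name and the threshold value 25 are illustrative assumptions.

```python
import cv2

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Classical frame difference: mark pixels whose intensity change
    between consecutive frames exceeds a fixed threshold."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(prev_gray, curr_gray)
    # Pixels with change above the (fixed) threshold are treated as motion.
    return diff > threshold
```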
The disadvantages of the prior art described above are as follows:
(1) Existing video summarization methods process the global area of a video, so missed detections easily occur when a moving target is far from the camera.
(2) For moving target detection, existing frame difference methods are simplistic and mostly adopt a fixed threshold, which is hard to reconcile with complex practical environments.
(3) For summary video synthesis, existing methods directly use the frames exceeding the threshold, without fully considering the influence of noise.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the above disadvantages of the prior art and to provide a video summarization method based on the Weber-Fechner law and the temporal masking effect, which reduces the interference of noise in target detection, avoids missed detections, and handles video of complex environments.
The technical scheme adopted for solving the technical problems comprises the following steps:
(1) Gaussian filtering
Noise is removed from the 1st to the N-th frame of the video by Gaussian filtering, where N, the total number of frames of the video, is a finite positive integer.
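A minimal sketch of step (1) in Python with OpenCV; the kernel size and sigma are assumptions, as the patent does not specify the Gaussian parameters.

```python
import cv2

def gaussian_filter_frames(frames, ksize=(5, 5), sigma=1.0):
    """Denoise each frame with a Gaussian filter (step (1));
    kernel size and sigma are illustrative, not specified by the patent."""
    return [cv2.GaussianBlur(f, ksize, sigma) for f in frames]
```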
(2) Region partitioning
The 1st to the N-th frame of the video are divided, from left to right and top to bottom, into non-overlapping square pixel blocks with a side length of m pixels. The number of blocks in the width and height directions of a frame is rounded to an integer, and the 1st to the N-th frame are scale-transformed according to:
w=m×s (1)
h=m×t (2)
where w is the frame width, h is the frame height, s is the number of blocks in the horizontal direction (a positive integer), t is the number of blocks in the vertical direction (a positive integer), and m ∈ {16, ..., 64}.
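A minimal sketch of the scale transformation of step (2), assuming the block counts s and t are obtained by rounding w/m and h/m to the nearest integer (an interpretation of "rounding the number of the divided blocks"):

```python
import cv2

def partition_scale(frame, m=32):
    """Scale a frame so its width and height are exact multiples of the
    block side m (formulas (1) and (2)), then report the block grid."""
    h0, w0 = frame.shape[:2]
    s = max(1, round(w0 / m))   # blocks horizontally, so that w = m*s
    t = max(1, round(h0 / m))   # blocks vertically,  so that h = m*t
    resized = cv2.resize(frame, (m * s, m * t))
    return resized, s, t
```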
(3) Determining frame difference Euclidean distance
Determining a frame difference Euclidean distance D (k, i, j) of a jth pixel point of an ith block of a kth frame of a video according to formula (3):
where k ∈ {1, ..., N−2}, i ∈ {1, ..., s×t}, and j ∈ {1, ..., m²}; i runs over the blocks from left to right and top to bottom within a frame, i.e., the top-left block of a frame has i = 1 and the bottom-right block has i = s×t; j runs over the pixels from left to right and top to bottom within a block, i.e., the top-left pixel of a block has j = 1 and the bottom-right pixel has j = m²; and x(k, i, j) is the luminance component value of the j-th pixel of the i-th block of the k-th frame of the video.
The frame-difference Euclidean distance D(k, i) of the i-th block of the k-th frame of the video is determined according to formula (4):
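Formulas (3) and (4) appear as images in the source and are not reproduced in this text. By way of illustration only, the sketch below assumes a three-frame form consistent with k running over 1..N−2: the per-pixel squared term combines the luminance differences between frames k and k+1 and between frames k+1 and k+2, and D(k, i) is the Euclidean norm of these terms over the pixels of block i. This is an assumption, not the patent's exact formula.

```python
import numpy as np

def block_frame_distance(f0, f1, f2, m=32):
    """Per-block frame-difference Euclidean distance D(k, i).

    f0, f1, f2: luminance (grayscale) arrays of frames k, k+1, k+2,
    already scaled to h = m*t rows and w = m*s columns.
    Returns a (t, s) array of block distances (assumed form)."""
    d = (f1.astype(np.float64) - f0) ** 2 + (f2.astype(np.float64) - f1) ** 2
    t, s = d.shape[0] // m, d.shape[1] // m
    # Sum the squared per-pixel terms inside each m x m block, take sqrt.
    blocks = d.reshape(t, m, s, m).sum(axis=(1, 3))
    return np.sqrt(blocks)
```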
(4) Construction of the Weber-Fechner model
The maximum value p_k of the frame-difference Euclidean distances of the blocks of the k-th frame of the video is determined according to equation (5):
p_k = max{D(k,1), D(k,2), ..., D(k,s×t)} (5)
The maximum value α of the frame-difference Euclidean distances of the blocks of the 1st to the (N−2)-th frame of the video is determined according to equation (6):
α = max{p_1, p_2, ..., p_{N−2}} (6)
where the minimum value of α is 500.
The Weber-Fechner model β is constructed as:
β = a·lg α − b (7)
where a ∈ [3, 4] and b ∈ [5, 7].
(5) Determining a threshold value
The average value u of the per-frame maxima p_k of the frame-difference Euclidean distances over the first n frames of the video is determined according to formula (8):
u = (1/n) · Σ_{k=1}^{n} p_k (8)
where n ∈ {15, ..., 50}.
The threshold T is determined by formula (9):
T = β × u (9)
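Taking formulas (5) through (9) together, the per-video adaptive threshold can be computed as in the following sketch (Python with NumPy); the floor of 500 on α reflects the statement that the minimum value of α is 500, interpreted here as a clamp (an assumption):

```python
import numpy as np

def adaptive_threshold(block_dists, a=3.5, b=6.0, n=30):
    """Adaptive threshold T from formulas (5)-(9).
    block_dists: list of (t, s) arrays D(k, i), one per frame k = 1..N-2."""
    p = np.array([d.max() for d in block_dists])  # p_k, formula (5)
    alpha = max(p.max(), 500.0)                   # formula (6); assumed clamp at 500
    u = p[:n].mean()                              # formula (8): mean of p_1..p_n
    beta = a * np.log10(alpha) - b                # formula (7): beta = a*lg(alpha) - b
    return beta * u                               # formula (9): T = beta * u
```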
(6) Construction of the denoising model
The absolute value r of the difference between α and u is determined according to formula (10):
r = |α − u| (10)
where the minimum value of r is 26.
The denoising model f is constructed as:
f = round(c·lg r − d) (11)
where round(·) rounds to the nearest integer, c ∈ [0.5, 0.64], and d ∈ [0, 0.2].
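A minimal sketch of formulas (10) and (11); the floor of 26 on r is again interpreted as a clamp (an assumption):

```python
import math

def denoise_length(alpha, u, c=0.58, d=0.1):
    """Minimum run length f from the temporal-masking denoising model,
    formulas (10)-(11); runs shorter than f frames are treated as noise."""
    r = max(abs(alpha - u), 26.0)              # formula (10); assumed clamp at 26
    return int(round(c * math.log10(r) - d))   # formula (11)
```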
(7) Extracting key frames
1) The maximum value p_k of the frame-difference Euclidean distances of the blocks of the k-th frame of the video is compared with the threshold T; if p_k ≥ T, the k-th frame is marked 1, otherwise it is marked 0.
2) For the 1st to the (N−2)-th frame of the video, the frame marks are checked in playing order. If frames marked 1 occur consecutively and the run is longer than f frames, they are taken as key frames and saved to the designated folder. For frames marked 0 whose run length is at most f, the frames marked 0 are also saved as key frames to the designated folder when any one of the following holds (see the sketch following step 3)):
① the runs of frames marked 1 immediately before and immediately after these frames marked 0 in playing order are both longer than f frames;
② the first of the frames marked 0 is the 1st frame of the video, and the run of frames marked 1 immediately following them in playing order is longer than f frames;
③ the run of frames marked 1 immediately preceding these frames marked 0 in playing order is longer than f frames, and the last of the frames marked 0 is the (N−2)-th frame of the video.
3) For the (N−1)-th and N-th frames of the video: if the (N−2)-th frame is judged to be a key frame, the (N−1)-th and N-th frames are also extracted and saved to the designated folder.
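The following sketch (Python, 0-based indices; names illustrative) implements the marking and run-length bookkeeping of step (7) as read above:

```python
def extract_key_indices(p, T, f):
    """Mark frames 1..N-2 (indices 0..N-3) as 1 where p_k >= T, keep runs
    of 1-frames longer than f, and keep short 0-runs (length <= f) per
    rules (1)-(3) of step (7). Returns 0-based key frame indices."""
    marks = [1 if pk >= T else 0 for pk in p]          # frames 1..N-2
    # Collect (mark, start, length) runs in playing order.
    runs, start = [], 0
    for i in range(1, len(marks) + 1):
        if i == len(marks) or marks[i] != marks[start]:
            runs.append((marks[start], start, i - start))
            start = i
    keep = set()
    for idx, (mark, st, ln) in enumerate(runs):
        prev_long = idx > 0 and runs[idx - 1][2] > f
        next_long = idx + 1 < len(runs) and runs[idx + 1][2] > f
        if mark == 1 and ln > f:                       # long motion run
            keep.update(range(st, st + ln))
        elif mark == 0 and ln <= f:                    # short gap: rules 1-3
            at_head = st == 0                          # starts at frame 1
            at_tail = st + ln == len(marks)            # ends at frame N-2
            if (prev_long and next_long) or (at_head and next_long) \
                    or (prev_long and at_tail):
                keep.update(range(st, st + ln))
    if len(marks) - 1 in keep:        # frame N-2 kept: also keep N-1, N
        keep.update({len(marks), len(marks) + 1})
    return sorted(keep)
```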
(8) Key frame composite video
The key frames saved to the designated folder in step (7) are combined, in playing order, into the summary video.
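For step (8), a minimal sketch using OpenCV's VideoWriter; the output path, codec, and frame rate are assumptions, since the patent does not specify them:

```python
import cv2

def write_summary(key_frames, path="summary.mp4", fps=25):
    """Step (8): write the extracted key frames, in playing order,
    to the summary video. Path, codec, and fps are assumptions."""
    h, w = key_frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in key_frames:
        writer.write(frame)
    writer.release()
```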
In the region blocking step (2) of the present invention, m is preferably 32.
In the step (4) of constructing the Weber-Fechner model, a is preferably 3.5 and b is preferably 6.
In the step (5) of determining the threshold, n is preferably 30.
In the step (6) of constructing the denoising model, c is preferably 0.58 and d is preferably 0.1.
The invention reads in a video and acquires its basic information; performs Gaussian filtering, grayscale conversion, and related operations on all video frames and partitions every frame into square blocks; computes the frame-difference Euclidean distances of corresponding blocks between adjacent frames, determines the maximum of the frame-difference Euclidean distances over all video frame blocks, and builds a model based on the Weber-Fechner law to determine the threshold adaptively; constructs a denoising model according to the temporal masking effect and extracts frames satisfying the stated conditions as key frames; and synthesizes the summary video from the key frames saved in the designated folder. This overcomes the low efficiency of manual retrieval of key video information and the missed and false detections caused by human sensory fatigue, and improves the retrieval accuracy of key video information.
The invention has the following advantages:
(1) The video area is processed block by block, which effectively avoids missed detections caused by targets that are too far from the camera.
(2) The Euclidean-distance frame difference method is combined with a model built on the Weber-Fechner law of human vision to determine a reasonable threshold for each video when detecting moving targets, which copes effectively with complex real surveillance environments and avoids repeatedly re-tuning the threshold for videos with different content.
(3) The denoising model based on the temporal masking effect filters out interference noise before the video is synthesized, effectively improving the display quality of the synthesized video.
(4) The invention does not depend on color information in the video frames and is therefore also effective for surveillance video captured at night.
Drawings
FIG. 1 is a flowchart of example 1 of the present invention.
Fig. 2 is a partial image of a yard surveillance video.
Fig. 3 is a partial image of an elevator surveillance video.
Fig. 4 is a partial image of a checkout counter surveillance video.
Fig. 5 is a partial image of a road monitoring video.
Fig. 6 is a partial image of a residential community entrance surveillance video.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, but the present invention is not limited to the examples.
Example 1
In fig. 1, the video summarization method based on the Weber-Fechner law and the temporal masking effect of the present embodiment consists of the following steps:
(1) Gaussian filtering
Noise is removed from the 1st to the N-th frame of the video by Gaussian filtering, where N, the total number of frames of the video, is a finite positive integer.
(2) Region partitioning
The 1st to the N-th frame of the video are divided, from left to right and top to bottom, into non-overlapping square pixel blocks with a side length of m pixels. The number of blocks in the width and height directions of a frame is rounded to an integer, and the 1st to the N-th frame are scale-transformed according to:
w=m×s (1)
h=m×t (2)
where w is the frame width, h is the frame height, s is the number of blocks in the horizontal direction (a positive integer), t is the number of blocks in the vertical direction (a positive integer), and m is 32.
(3) Determining frame difference Euclidean distance
Determining a frame difference Euclidean distance D (k, i, j) of a jth pixel point of an ith block of a kth frame of a video according to formula (3):
where k ∈ {1, ..., N−2}, i ∈ {1, ..., s×t}, and j ∈ {1, ..., m²}; i runs over the blocks from left to right and top to bottom within a frame, i.e., the top-left block of a frame has i = 1 and the bottom-right block has i = s×t; j runs over the pixels from left to right and top to bottom within a block, i.e., the top-left pixel of a block has j = 1 and the bottom-right pixel has j = m², here 32² = 1024; and x(k, i, j) is the luminance component value of the j-th pixel of the i-th block of the k-th frame of the video.
The frame-difference Euclidean distance D(k, i) of the i-th block of the k-th frame of the video is determined according to formula (4):
(4) Construction of the Weber-Fechner model
The maximum value p_k of the frame-difference Euclidean distances of the blocks of the k-th frame of the video is determined according to equation (5):
p_k = max{D(k,1), D(k,2), ..., D(k,s×t)} (5)
The maximum value α of the frame-difference Euclidean distances of the blocks of the 1st to the (N−2)-th frame of the video is determined according to equation (6):
α = max{p_1, p_2, ..., p_{N−2}} (6)
where the minimum value of α is 500.
The Weber-Fechner model β is constructed as:
β = a·lg α − b (7)
where a ∈ [3, 4] and b ∈ [5, 7]; in this embodiment, a = 3.5 and b = 6.
(5) Determining a threshold value
The average value u of the per-frame maxima p_k of the frame-difference Euclidean distances over the first n frames of the video is determined according to formula (8):
u = (1/n) · Σ_{k=1}^{n} p_k (8)
where n ∈ {15, ..., 50}; in this embodiment, n = 30.
The threshold T is determined by formula (9):
T = β × u (9)
(6) Construction of the denoising model
The absolute value r of the difference between α and u is determined according to formula (10):
r = |α − u| (10)
where the minimum value of r is 26.
The denoising model f is constructed as:
f = round(c·lg r − d) (11)
where round(·) rounds to the nearest integer, c ∈ [0.5, 0.64], and d ∈ [0, 0.2]; in this embodiment, c = 0.58 and d = 0.1.
(7) Extracting key frames
1) The maximum value p_k of the frame-difference Euclidean distances of the blocks of the k-th frame of the video is compared with the threshold T; if p_k ≥ T, the k-th frame is marked 1, otherwise it is marked 0.
2) For the 1st to the (N−2)-th frame of the video, the frame marks are checked in playing order. If frames marked 1 occur consecutively and the run is longer than f frames, they are taken as key frames and saved to the designated folder. For frames marked 0 whose run length is at most f, the frames marked 0 are also saved as key frames to the designated folder when any one of the following holds:
① the runs of frames marked 1 immediately before and immediately after these frames marked 0 in playing order are both longer than f frames;
② the first of the frames marked 0 is the 1st frame of the video, and the run of frames marked 1 immediately following them in playing order is longer than f frames;
③ the run of frames marked 1 immediately preceding these frames marked 0 in playing order is longer than f frames, and the last of the frames marked 0 is the (N−2)-th frame of the video.
3) For the (N−1)-th and N-th frames of the video: if the (N−2)-th frame is judged to be a key frame, the (N−1)-th and N-th frames are also extracted and saved to the designated folder.
(8) Key frame composite video
The key frames saved to the designated folder in step (7) are combined, in playing order, into the summary video.
Example 2
The video summarization method based on the Weber-Fechner law and the temporal masking effect of this embodiment comprises the following steps:
(1) Gaussian filtering
This procedure is the same as in example 1.
(2) Region partitioning
The 1st to the N-th frame of the video are divided, from left to right and top to bottom, into non-overlapping square pixel blocks with a side length of m pixels. The number of blocks in the width and height directions of a frame is rounded to an integer, and the 1st to the N-th frame are scale-transformed according to:
w=m×s (1)
h=m×t (2)
where w is the frame width, h is the frame height, s is the number of blocks in the horizontal direction (a positive integer), t is the number of blocks in the vertical direction (a positive integer), and m is 16.
(3) Determining frame difference Euclidean distance
Determining a frame difference Euclidean distance D (k, i, j) of a jth pixel point of an ith block of a kth frame of a video according to formula (3):
where k ∈ {1, ..., N−2}, i ∈ {1, ..., s×t}, and j ∈ {1, ..., m²}; i runs over the blocks from left to right and top to bottom within a frame, i.e., the top-left block of a frame has i = 1 and the bottom-right block has i = s×t; j runs over the pixels from left to right and top to bottom within a block, i.e., the top-left pixel of a block has j = 1 and the bottom-right pixel has j = m², here 16² = 256; and x(k, i, j) is the luminance component value of the j-th pixel of the i-th block of the k-th frame of the video.
The frame-difference Euclidean distance D(k, i) of the i-th block of the k-th frame of the video is determined according to formula (4):
(4) Construction of the Weber-Fechner model
The maximum value p_k of the frame-difference Euclidean distances of the blocks of the k-th frame of the video is determined according to equation (5):
p_k = max{D(k,1), D(k,2), ..., D(k,s×t)} (5)
The maximum value α of the frame-difference Euclidean distances of the blocks of the 1st to the (N−2)-th frame of the video is determined according to equation (6):
α = max{p_1, p_2, ..., p_{N−2}} (6)
where the minimum value of α is 500.
The Weber-Fechner model β is constructed as:
β = a·lg α − b (7)
where a ∈ [3, 4] and b ∈ [5, 7]; in this embodiment, a = 3 and b = 5.
(5) Determining a threshold value
The average value u of the per-frame maxima p_k of the frame-difference Euclidean distances over the first n frames of the video is determined according to formula (8):
u = (1/n) · Σ_{k=1}^{n} p_k (8)
where, in this embodiment, n = 15.
The threshold value T is determined by equation (9):
T=β×u (9)
(6) Construction of the denoising model
The absolute value r of the difference between α and u is determined according to formula (10):
r = |α − u| (10)
where the minimum value of r is 26.
The denoising model f is constructed as:
f = round(c·lg r − d) (11)
where round(·) rounds to the nearest integer, c ∈ [0.5, 0.64], and d ∈ [0, 0.2]; in this embodiment, c = 0.5 and d = 0.
The other steps were the same as in example 1.
Example 3
(1) Gaussian filtering
This procedure is the same as in example 1.
(2) Region partitioning
The 1st to the N-th frame of the video are divided, from left to right and top to bottom, into non-overlapping square pixel blocks with a side length of m pixels. The number of blocks in the width and height directions of a frame is rounded to an integer, and the 1st to the N-th frame are scale-transformed according to:
w=m×s (1)
h=m×t (2)
where w is the frame width, h is the frame height, s is the number of blocks in the horizontal direction (a positive integer), t is the number of blocks in the vertical direction (a positive integer), and m is 64.
(3) Determining frame difference Euclidean distance
Determining a frame difference Euclidean distance D (k, i, j) of a jth pixel point of an ith block of a kth frame of a video according to formula (3):
where k ∈ {1, ..., N−2}, i ∈ {1, ..., s×t}, and j ∈ {1, ..., m²}; i runs over the blocks from left to right and top to bottom within a frame, i.e., the top-left block of a frame has i = 1 and the bottom-right block has i = s×t; j runs over the pixels from left to right and top to bottom within a block, i.e., the top-left pixel of a block has j = 1 and the bottom-right pixel has j = m², here 64² = 4096; and x(k, i, j) is the luminance component value of the j-th pixel of the i-th block of the k-th frame of the video.
The frame-difference Euclidean distance D(k, i) of the i-th block of the k-th frame of the video is determined according to formula (4):
(4) Construction of the Weber-Fechner model
The maximum value p_k of the frame-difference Euclidean distances of the blocks of the k-th frame of the video is determined according to equation (5):
p_k = max{D(k,1), D(k,2), ..., D(k,s×t)} (5)
The maximum value α of the frame-difference Euclidean distances of the blocks of the 1st to the (N−2)-th frame of the video is determined according to equation (6):
α = max{p_1, p_2, ..., p_{N−2}} (6)
where the minimum value of α is 500.
The Weber-Fechner model β is constructed as:
β = a·lg α − b (7)
where a ∈ [3, 4] and b ∈ [5, 7]; in this embodiment, a = 4 and b = 7.
(5) Determining a threshold value
The average value u of the per-frame maxima p_k of the frame-difference Euclidean distances over the first n frames of the video is determined according to formula (8):
u = (1/n) · Σ_{k=1}^{n} p_k (8)
where, in this embodiment, n = 15.
The threshold value T is determined by equation (9):
T=β×u (9)
(6) Construction of the denoising model
The absolute value r of the difference between α and u is determined according to formula (10):
r = |α − u| (10)
where the minimum value of r is 26.
The denoising model f is constructed as:
f = round(c·lg r − d) (11)
where round(·) rounds to the nearest integer, c ∈ [0.5, 0.64], and d ∈ [0, 0.2]; in this embodiment, c = 0.64 and d = 0.2.
The other steps were the same as in example 1.
In order to verify the beneficial effects of the invention, the inventors conducted experiments on test videos using the video summarization method based on the Weber-Fechner law and the temporal masking effect of embodiment 1.
1. Conditions of the experiment
The experimental environment was a computer running the Windows 10 (64-bit) operating system, configured with an Intel Core i7-7700HQ quad-core CPU and 16 GB of memory; the experiments were run on the MATLAB R2018a platform.
2. Test video introduction
The test videos were shot both in daytime and at night and include high-definition as well as blurred footage, with both complex and simple textures. Part of the courtyard video is shown in fig. 2, the elevator video in fig. 3, the checkout counter surveillance video in fig. 4, the road surveillance video in fig. 5, and the residential community entrance surveillance video in fig. 6. All of the above videos were region-cropped at specific positions; the attributes of each video are listed in Table 1.
TABLE 1 attributes of test videos
3. Evaluation method
Video summaries are commonly evaluated objectively and subjectively. Objective evaluation compares the quality of summary videos by means of evaluation functions; commonly used ones are accuracy, error rate, precision, recall, and F-score. Subjective evaluation has human assessors score the summary video or grade its quality.
(1) Objective evaluation
The precision P, recall R, and F-score are calculated as:
P = N_mAS / N_AS, R = N_mAS / N_US, F = 2 × P × R / (P + R)
where N_mAS, N_AS, and N_US are the number of matched key frames, the number of automatically extracted key frames, and the number of key frames manually selected by users, respectively.
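Assuming the standard definitions above (the formula images are not reproduced in this text), the three indices can be computed as follows; the function name is illustrative:

```python
def prf(n_matched, n_auto, n_user):
    """Precision, recall, and F-score from the matched (N_mAS),
    automatically extracted (N_AS), and user-selected (N_US) counts."""
    p = n_matched / n_auto
    r = n_matched / n_user
    return p, r, 2 * p * r / (p + r)

# e.g. prf(62, 72, 79) returns approximately (0.861, 0.785, 0.821)
```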
Comparison experiments were conducted with a frame difference method and the method of the invention. The experimental results are shown in Table 2; each entry in Table 2 is the average over the five test videos described above.
TABLE 2 precision, recall, F-score test results for different methods
As can be seen from Table 2, the precision, recall, and F-score of the method are 86.2%, 78.5%, and 79.6% respectively, each higher than the corresponding index of the frame difference method. Higher precision, recall, and F-score indicate a better synthesized summary video; the above analysis therefore shows that the method of the invention outperforms the frame difference method.
(2) Subjective evaluation
Twenty testers with normal vision, 10 male and 10 female, all aged between 18 and 24, were invited to participate in the subjective evaluation experiment. In a suitable indoor environment, each tester watched the test videos at a distance of 75 cm from a computer screen, learned the key information of each video, then watched the video summaries synthesized by the method and evaluated them according to the following criteria.
The evaluation grades are: good, fair, and poor. A good-grade summary loses essentially no key information, a fair-grade summary loses a little, and a poor-grade summary loses a great deal.
The evaluation grades given by each tester were converted into percentages; the subjective evaluation results are shown in Table 3.
TABLE 3 subjective evaluation results of users
As can be seen from Table 3, the five video summaries received a good rating of 74%, which substantially matches the recall rate of 78.5%, indicating that the key frames extracted by the method of the invention accord with the testers' evaluations; the fair rating was 20% and the poor rating 6%.
4. Conclusion
Under the same test data and evaluation criteria, the video summaries generated by the method of the invention achieve a higher F-score, recall, precision, and subjective quality than those of the frame difference method; they better reflect the key information of the video and improve the quality of the video summary.
Claims (5)
1. A video summarization method based on the Weber-Fechner law and the temporal masking effect, characterized by comprising the following steps:
(1) Gaussian filtering
removing noise from the 1st to the N-th frame of the video by Gaussian filtering, where N, the total number of frames of the video, is a finite positive integer;
(2) region partitioning
dividing the 1st to the N-th frame of the video, from left to right and top to bottom, into non-overlapping square pixel blocks with a side length of m pixels, rounding the number of blocks in the width and height directions of a frame to an integer, and scale-transforming the 1st to the N-th frame according to:
w=m×s (1)
h=m×t (2)
where w is the frame width, h is the frame height, s is the number of blocks in the horizontal direction (a positive integer), t is the number of blocks in the vertical direction (a positive integer), and m ∈ {16, ..., 64};
(3) determining frame difference Euclidean distance
Determining a frame difference Euclidean distance D (k, i, j) of a jth pixel point of an ith block of a kth frame of a video according to formula (3):
where k ∈ {1, ..., N−2}, i ∈ {1, ..., s×t}, and j ∈ {1, ..., m²}; i runs over the blocks from left to right and top to bottom within a frame, with the top-left block having i = 1 and the bottom-right block having i = s×t; j runs over the pixels from left to right and top to bottom within a block, with the top-left pixel having j = 1 and the bottom-right pixel having j = m²; and x(k, i, j) is the luminance component value of the j-th pixel of the i-th block of the k-th frame of the video;
the frame-difference Euclidean distance D(k, i) of the i-th block of the k-th frame of the video is determined according to formula (4):
(4) construction of the Weber-Fechner model
the maximum value p_k of the frame-difference Euclidean distances of the blocks of the k-th frame of the video is determined according to equation (5):
p_k = max{D(k,1), D(k,2), ..., D(k,s×t)} (5)
the maximum value α of the frame-difference Euclidean distances of the blocks of the 1st to the (N−2)-th frame of the video is determined according to equation (6):
α = max{p_1, p_2, ..., p_{N−2}} (6)
where the minimum value of α is 500;
the Weber-Fechner model β is constructed as:
β = a·lg α − b (7)
where a ∈ [3, 4] and b ∈ [5, 7];
(5) determining a threshold value
the average value u of the per-frame maxima p_k of the frame-difference Euclidean distances over the first n frames of the video is determined according to formula (8):
u = (1/n) · Σ_{k=1}^{n} p_k (8)
where n ∈ {15, ..., 50};
the threshold T is determined by formula (9):
T = β × u (9)
(6) construction of the denoising model
the absolute value r of the difference between α and u is determined according to formula (10):
r = |α − u| (10)
where the minimum value of r is 26;
the denoising model f is constructed as:
f = round(c·lg r − d) (11)
where round(·) rounds to the nearest integer, c ∈ [0.5, 0.64], and d ∈ [0, 0.2];
(7) extracting key frames
1) the maximum value p_k of the frame-difference Euclidean distances of the blocks of the k-th frame of the video is compared with the threshold T; if p_k ≥ T, the k-th frame is marked 1, otherwise it is marked 0;
2) for the 1st to the (N−2)-th frame of the video, the frame marks are checked in playing order; if frames marked 1 occur consecutively and the run is longer than f frames, they are taken as key frames and saved to the designated folder; for frames marked 0 whose run length is at most f, the frames marked 0 are also saved as key frames to the designated folder when any one of the following holds:
① the runs of frames marked 1 immediately before and immediately after these frames marked 0 in playing order are both longer than f frames;
② the first of the frames marked 0 is the 1st frame of the video, and the run of frames marked 1 immediately following them in playing order is longer than f frames;
③ the run of frames marked 1 immediately preceding these frames marked 0 in playing order is longer than f frames, and the last of the frames marked 0 is the (N−2)-th frame of the video;
3) for the (N−1)-th and N-th frames of the video: if the (N−2)-th frame is judged to be a key frame, the (N−1)-th and N-th frames are also extracted and saved to the designated folder;
(8) key frame composite video
combining the key frames saved to the designated folder in step (7), in playing order, into the summary video.
2. The video summarization method based on the Weber-Fechner law and the temporal masking effect of claim 1, wherein: in the region blocking step (2), m is 32.
3. The video summarization method based on the Weber-Fechner law and the temporal masking effect of claim 1, wherein: in the step (4) of constructing the Weber-Fechner model, a is 3.5 and b is 6.
4. The video summarization method based on the Weber-Fechner law and the temporal masking effect of claim 1, wherein: in the step (5) of determining the threshold, n is 30.
5. The video summarization method based on the Weber-Fechner law and the temporal masking effect of claim 1, wherein: in the step (6) of constructing the denoising model, c is 0.58 and d is 0.1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910723748.7A CN110602444B (en) | 2019-08-07 | 2019-08-07 | Video summarization method based on the Weber-Fechner law and the temporal masking effect
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910723748.7A CN110602444B (en) | 2019-08-07 | 2019-08-07 | Video summarization method based on the Weber-Fechner law and the temporal masking effect
Publications (2)
Publication Number | Publication Date |
---|---|
CN110602444A CN110602444A (en) | 2019-12-20 |
CN110602444B true CN110602444B (en) | 2020-10-02 |
Family
ID=68853612
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910723748.7A Active CN110602444B (en) | 2019-08-07 | 2019-08-07 | Video summarization method based on Weber-Fisher's law and time domain masking effect |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110602444B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011101448A2 (en) * | 2010-02-19 | 2011-08-25 | Skype Limited | Data compression for video |
CN109523562A (en) * | 2018-12-14 | 2019-03-26 | 哈尔滨理工大学 | A kind of Infrared Image Segmentation based on human-eye visual characteristic |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104112064B (en) * | 2014-07-01 | 2017-02-22 | 河南科技大学 | Method for establishing touch comfort level model based on Weber-Fechner law |
CN104331905A (en) * | 2014-10-31 | 2015-02-04 | 浙江大学 | Surveillance video abstraction extraction method based on moving object detection |
- 2019-08-07 CN CN201910723748.7A patent/CN110602444B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011101448A2 (en) * | 2010-02-19 | 2011-08-25 | Skype Limited | Data compression for video |
CN109523562A (en) * | 2018-12-14 | 2019-03-26 | 哈尔滨理工大学 | A kind of Infrared Image Segmentation based on human-eye visual characteristic |
Non-Patent Citations (1)
Title |
---|
A Generic Framework of User Attention Model and Its Application in Video Summarization; Yu-Fei Ma, Xian-Sheng Hua, Lie Lu, and Hong-Jiang Zhang; IEEE Transactions on Multimedia, Vol. 7, No. 5, October 2005; pp. 1-13 *
Also Published As
Publication number | Publication date |
---|---|
CN110602444A (en) | 2019-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106412619B (en) | A kind of lens boundary detection method based on hsv color histogram and DCT perceptual hash | |
CN101719144B (en) | Method for segmenting and indexing scenes by combining captions and video image information | |
CN108921130A (en) | Video key frame extracting method based on salient region | |
US7676085B2 (en) | Method and apparatus for representing a group of images | |
US8326042B2 (en) | Video shot change detection based on color features, object features, and reliable motion information | |
CN104063883B (en) | A kind of monitor video abstraction generating method being combined based on object and key frame | |
CN101329766B (en) | Apparatus, method and system for analyzing mobile image | |
CN104866616B (en) | Monitor video Target Searching Method | |
CN109145708B (en) | Pedestrian flow statistical method based on RGB and D information fusion | |
US20020146168A1 (en) | Anchor shot detection method for a news video browsing system | |
CN107220585A (en) | A kind of video key frame extracting method based on multiple features fusion clustering shots | |
WO2003051031A2 (en) | Method and apparatus for planarization of a material by growing and removing a sacrificial film | |
JP2002288658A (en) | Object extracting device and method on the basis of matching of regional feature value of segmented image regions | |
CN102117313A (en) | Video retrieval method and system | |
CN101982828A (en) | Methods of representing images and assessing the similarity between images | |
Omidyeganeh et al. | Video keyframe analysis using a segment-based statistical metric in a visually sensitive parametric space | |
CN101527786B (en) | Method for strengthening definition of sight important zone in network video | |
CN109978916B (en) | Vibe moving target detection method based on gray level image feature matching | |
CN110602444B (en) | Video summarization method based on the Weber-Fechner law and the temporal masking effect | |
Fernando et al. | Fade-in and fade-out detection in video sequences using histograms | |
CN111708907A (en) | Target person query method, device, equipment and storage medium | |
CN106375773B (en) | Altering detecting method is pasted in frame duplication based on dynamic threshold | |
Lie et al. | Video summarization based on semantic feature analysis and user preference | |
Patel | Key Frame Extraction Based on Block based Histogram Difference and Edge Matching Rate | |
CN113516609A (en) | Split screen video detection method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||