CN108921130B - Video key frame extraction method based on saliency region - Google Patents

Video key frame extraction method based on saliency region

Info

Publication number
CN108921130B
Authority
CN
China
Prior art keywords
similarity
frame
video
image
frame image
Prior art date
Legal status
Active
Application number
CN201810836824.0A
Other languages
Chinese (zh)
Other versions
CN108921130A (en)
Inventor
冯德瀛
张来刚
赵颖
楚晓华
Current Assignee
Liaocheng University
Original Assignee
Liaocheng University
Priority date: 2018-07-26
Filing date: 2018-07-26
Publication date: 2022-03-01
Application filed by Liaocheng University
Priority to CN201810836824.0A
Publication of CN108921130A
Application granted
Publication of CN108921130B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video key frame extraction method based on saliency regions, in the technical field of computer image processing and pattern recognition. The method first performs sampling conversion on the video data, converting the video into a continuous sequence of frame images; it then extracts the saliency regions in each frame image with a spectral residual model; the saliency regions are next sorted by their areas in the image, and color vectorization is applied to the top-ranked regions; finally, a similarity measure is computed between previous and next frame images from the vectorized saliency regions, and the video key frames are determined according to the size of the similarity. The key frame image sequence extracted by the invention effectively preserves the main content of the video.

Description

Video key frame extraction method based on saliency region
Technical Field
The invention mainly relates to the technical field of computer image processing and pattern recognition, in particular to a video key frame extraction method based on a saliency region.
Background
With the large-scale deployment of surveillance cameras in daily life, surveillance video is widely used, and its data volume is growing exponentially, posing huge challenges for the storage, organization and querying of video. In the field of security monitoring, how to effectively organize, manage and query massive video collections has become a hot topic of current research.
A video is made up of successive frames, each frame image representing a segment of the video's content. Because adjacent frames are temporally and spatially continuous, a frame image sequence contains a large amount of redundant information, which hinders efficient classification and retrieval of video. To address this problem, video key frame extraction techniques extract a small number of key frames from a video to represent its main content. Conventional video key frame extraction methods are classified into shot-boundary-based, image-content-based, motion-analysis-based, video-clustering-based and compressed-domain-based methods, among others.
A literature search of the prior art shows the following. In "Saliency Detection: A Spectral Residual Approach", Xiaodi Hou et al. obtain the frequency-domain information of an image by Fourier transform and then use the spectral residual method to detect salient regions of the image in the time domain, but the method is not applied to video key frame extraction. In the patent "A video key frame extraction method" (application No. CN201711165320.2, published March 27, 2018), Royuan et al. detect moving targets with the ViBe algorithm and an inter-frame difference method, determine global similarity with the global-feature peak signal-to-noise ratio, judge local similarity with SURF features, and finally combine the two similarities to obtain the target key frame sequence. That patent extracts the key frame sequence mainly from the perspective of the inter-frame difference method and does not distinguish foreground targets from background noise. Similarly, Qian et al., in the patent "A video key frame extraction algorithm" (application No. CN201711047162.0, published March 23, 2018), use an inter-frame difference method to extract key frames: the size of the effective area of a frame image is first calculated, the feature information of that area is then detected and compared with the previous and next frames, and key frames are finally extracted by computing the inter-frame similarity. Although that patent detects the effective image area, it only adopts a progressive scanning method and does not distinguish potential foreground targets from interfering background noise.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video key frame extraction method based on saliency regions. In a sequence of video frame images, the saliency regions of each frame image are detected with a spectral residual model, determining the potential foreground targets and avoiding the influence of irrelevant background noise; video frame images containing similar content are then filtered out by judging the color similarity of the saliency regions in the previous and next frame images, and the key frame images containing the main content are determined, laying a foundation for content-based video retrieval.
The invention is realized by the following technical scheme, which specifically comprises the following steps:
firstly, sampling and converting video data, and converting a video into a continuous frame image sequence;
then extracting a salient region in each frame of image by using a spectrum residual error model;
then, sorting the salient regions according to the areas of the salient regions in the image, and carrying out color vectorization on the salient regions which are sorted in front;
and finally, carrying out similarity measurement between the front frame image and the back frame image according to the vectorized salient region, and determining the video key frame according to the size of the similarity.
The sampling conversion of the video data refers to the following: a video is composed of continuous frame images; the video sampling frequency is set according to the total number of frames and the frame rate of the video, and the video is converted into a group of continuous frame image sequences according to that sampling frequency.
In the video data, let the total number of frames be $N_T$, the frame rate $n_f$, and the sampling frequency $n_s$. The sampled continuous frame image sequence is then
$$I = \{I_1, I_2, \ldots, I_N\},$$
where $N = \lfloor N_T / n_s \rfloor$ and $I_i$ denotes the i-th sampled frame image.
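For illustration, the sampling step can be sketched in Python with OpenCV as follows; the function name sample_frames and its defaults are illustrative assumptions and are not part of the patent.

```python
# Minimal sketch of the sampling step (assumes OpenCV is installed).
# sample_frames and its defaults are illustrative, not from the patent.
import cv2

def sample_frames(video_path, n_s=None):
    """Convert a video into a continuous frame image sequence, keeping every n_s-th frame."""
    cap = cv2.VideoCapture(video_path)
    if n_s is None:
        # As in the embodiment: set the sampling frequency equal to the
        # frame rate n_f, i.e. keep one frame per second of video.
        n_s = int(round(cap.get(cv2.CAP_PROP_FPS)))
    frames = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % n_s == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```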
Extracting the saliency regions in each frame image with the spectral residual model refers to the following: a Fourier transform is performed on each frame image in the frequency domain, the spectral residual is calculated, and the saliency regions of the image are then extracted in the time domain via the inverse Fourier transform.
Further, the steps of extracting the saliency regions with the spectral residual model include:
1) In the frequency domain, a Fourier transform is applied to the i-th frame image $I_i$ to obtain the amplitude spectrum $A(f)$ and phase spectrum $P(f)$, where
$$A(f) = \left|\mathcal{F}[I_i(x)]\right|, \qquad P(f) = \varphi\left(\mathcal{F}[I_i(x)]\right),$$
and the logarithmic amplitude spectrum $L(f)$ is calculated as $L(f) = \log(A(f))$;
2) an $n \times n$ local mean filter $h_n(f)$ is set and convolved with the logarithmic amplitude spectrum $L(f)$, and the spectral residual $R(f)$ is calculated as $R(f) = L(f) - h_n(f) * L(f)$;
3) the spectral residual $R(f)$ is inverse-Fourier-transformed in the time domain and smoothed with a Gaussian filter $g(x)$ to obtain the saliency regions $S_i$ corresponding to image $I_i$, where
$$S_i(x) = g(x) * \left|\mathcal{F}^{-1}\left[\exp\left(R(f) + jP(f)\right)\right]\right|^2.$$
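For illustration, a minimal NumPy/OpenCV sketch of the spectral residual computation is given below; the 3x-mean threshold used to binarize the saliency map into regions is an assumption, as the patent does not specify a segmentation rule.

```python
# Minimal sketch of the spectral residual saliency model (after Hou et al.).
# The 3x-mean threshold used to binarize the map is an assumption.
import cv2
import numpy as np

def spectral_residual_saliency(frame, n=3):
    """Return a binary saliency mask for one frame image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float64)
    F = np.fft.fft2(gray)
    A = np.abs(F)                          # amplitude spectrum A(f)
    P = np.angle(F)                        # phase spectrum P(f)
    L = np.log(A + 1e-8)                   # log-amplitude spectrum L(f)
    h = np.ones((n, n)) / (n * n)          # n x n local mean filter h_n(f)
    R = L - cv2.filter2D(L, -1, h)         # spectral residual R(f)
    # inverse transform with the original phase, then Gaussian smoothing g(x)
    S = np.abs(np.fft.ifft2(np.exp(R + 1j * P))) ** 2
    S = cv2.GaussianBlur(S, (9, 9), 2.5)
    mask = (S > 3.0 * S.mean()).astype(np.uint8) * 255
    return mask
```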
Sorting the saliency regions by their areas in the image and color-vectorizing the top-ranked saliency regions refers to the following: each frame image contains several saliency regions, and a saliency region with a larger area contains more of the potential foreground targets while suppressing the influence of background noise. The saliency regions are therefore sorted in the image by area from large to small, so that the similarity of the previous and next frame images can be judged using the several top-ranked saliency regions. To this end, color vectorization is carried out separately over the R, G, B channels for each saliency region, generating a corresponding color vector.
Further, the steps of sorting the saliency regions by their areas in the image and color-vectorizing the top-ranked regions include:
a) In the i-th frame image $I_i$, the extracted saliency regions are expressed as
$$S_i = \{S_i^1, S_i^2, \ldots\},$$
where $S_i^r$ is the r-th saliency region of image $I_i$. The regions in $S_i$ are sorted by area from large to small, and the top $z$ saliency regions are taken out as
$$S'_i = \{S_i^1, S_i^2, \ldots, S_i^z\};$$
b) For the r-th saliency region $S_i^r$ in $S'_i$, histogram statistics are computed separately over the R, G, B channels in the gray-value interval 0-255, generating the channel color vectors $V_R^r$, $V_G^r$ and $V_B^r$, which are further combined into the color vector corresponding to the r-th saliency region,
$$V_i^r = \left[V_R^r, V_G^r, V_B^r\right].$$
Accordingly, the color vector corresponding to the saliency regions $S'_i$ is
$$V_i = \left[V_i^1, V_i^2, \ldots, V_i^z\right].$$
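For illustration, a minimal sketch of this step follows; extracting the individual saliency regions as connected components of the binary saliency mask is an assumption, as are all helper names.

```python
# Minimal sketch of step three: sort saliency regions by area and build a
# 768-dimensional color vector (256 bins x R,G,B) for each of the top z regions.
import cv2
import numpy as np

def color_vectors(frame, mask, z=3):
    num, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    # label 0 is the background; rank the remaining regions by area, descending
    order = sorted(range(1, num),
                   key=lambda r: stats[r, cv2.CC_STAT_AREA],
                   reverse=True)[:z]
    vectors = []
    for r in order:
        pixels = frame[labels == r]        # (n_pixels, 3) B,G,R values
        v = np.concatenate([np.bincount(pixels[:, c], minlength=256)
                            for c in range(3)]).astype(np.float64)
        vectors.append(v)                  # one 768-dim vector per region
    return vectors
```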
Measuring the similarity between the previous and next frame images from the vectorized saliency regions and determining the video key frames according to the size of the similarity refers to the following: to determine the key frames in the frame image sequence, a similarity measure is computed between the previous and next frame images through the color vectors corresponding to their saliency regions. If the similarity is small, the content of the two frame images differs greatly and they can be determined as key frames; conversely, if the similarity is large, the content of the two frame images differs little, and the later frame can be removed from the frame image sequence. After the similarity measurement has been completed over all frame images, the remaining frame images constitute the key frame sequence of the video.
Further, the steps of measuring the similarity between the previous and next frame images from the vectorized saliency regions and determining the video key frames according to the size of the similarity include:
① To judge the similarity of the i-th frame image $I_i$ and the (i+1)-th frame image $I_{i+1}$, a cosine similarity measure is performed between their corresponding color vectors $V_i$ and $V_{i+1}$. Since the vector $V_i$ is formed from the $z$ component vectors $V_i^1, \ldots, V_i^z$, the similarity measure between $V_i$ and $V_{i+1}$ is converted into similarity measures between the $z$ corresponding component vectors $V_i^r$ and $V_{i+1}^r$, which can be expressed as
$$\mathrm{sim}\left(V_i^r, V_{i+1}^r\right) = \frac{V_i^r \cdot V_{i+1}^r}{\left\|V_i^r\right\|\,\left\|V_{i+1}^r\right\|}, \quad r = 1, \ldots, z;$$
② After the similarity measurement is completed for the $z$ corresponding component vectors $V_i^r$ and $V_{i+1}^r$, $z$ similarity values are obtained. If a similarity is larger, the saliency regions $S_i^r$ and $S_{i+1}^r$ in the two frame images contain similar content; if it is smaller, the content of $S_i^r$ and $S_{i+1}^r$ differs greatly. The overall similarity between frame images $I_i$ and $I_{i+1}$ is the minimum of the $z$ similarity values, which can be expressed as
$$\mathrm{sim}\left(I_i, I_{i+1}\right) = \min_{1 \le r \le z} \mathrm{sim}\left(V_i^r, V_{i+1}^r\right),$$
and therefore reflects the content difference between the previous and next frame images;
③ After the overall similarity between frame images $I_i$ and $I_{i+1}$ has been calculated, a similarity threshold $T$ is set. If $\mathrm{sim}(I_i, I_{i+1}) \le T$, the content of $I_i$ and $I_{i+1}$ differs greatly, and both $I_i$ and $I_{i+1}$ are kept as key frame images; if $\mathrm{sim}(I_i, I_{i+1}) > T$, the content of $I_{i+1}$ is highly similar to that of $I_i$, so $I_i$ is kept as a key frame image and $I_{i+1}$ is removed from the frame image sequence. After all frame images have been traversed in sequence, the key frame image sequence of the video is finally determined.
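For illustration, a minimal sketch of this measurement and filtering step follows, reusing the hypothetical color_vectors output above; treating a frame pair with no comparable regions as dissimilar is an assumption.

```python
# Minimal sketch of step four: per-region cosine similarity, overall similarity
# as the minimum over the z regions, and threshold-based key frame filtering.
import numpy as np

def cosine_sim(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def frame_similarity(V_i, V_j):
    z = min(len(V_i), len(V_j))
    if z == 0:
        return 0.0                 # no regions to compare: treat as dissimilar
    return min(cosine_sim(V_i[r], V_j[r]) for r in range(z))

def select_key_frames(vectors, T=0.8):
    """Return indices of key frames; a frame is dropped when it is too
    similar to the last frame that was kept."""
    keep = []
    last = None
    for i, V in enumerate(vectors):
        if last is None or frame_similarity(vectors[last], V) <= T:
            keep.append(i)
            last = i
    return keep
```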
The beneficial effects of the invention are as follows: the method extracts saliency regions from the frame image sequence with a spectral residual model, detecting potential foreground targets, avoiding the interference of background noise, and aiding the judgment of content similarity between previous and next frame images. By sorting the saliency regions in an image and representing them as independent color vectors, the color information of each saliency region is represented effectively and mutual influence among the several saliency regions is avoided. Judging the similarity between previous and next frame images through the color-vectorized saliency regions reflects the maximum difference between the two frames and facilitates the extraction of video key frames. Compared with the prior art, the key frame image sequence extracted by the method effectively preserves the main content of the video.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and these equivalents also fall within the scope of the present application.
The embodiment adopts a video key frame extraction method based on a saliency region, and the specific implementation steps are as follows:
1. and performing sampling conversion on the video data, and converting the video into a continuous frame image sequence.
In the video data, the total number of frames is $N_T$. Since the frame rate $n_f$ is 24, the sampling frequency $n_s$ is likewise set to 24, i.e. the video is sampled once every 1 s. The sampled continuous frame image sequence is
$$I = \{I_1, I_2, \ldots, I_N\},$$
where $N = \lfloor N_T / n_s \rfloor$.
2. Extract the saliency regions in each frame image with the spectral residual model.
For each image $I_i$ in the frame image sequence, a Fourier transform is first performed in the frequency domain, and the amplitude spectrum $A(f)$, the phase spectrum $P(f)$ and the logarithmic amplitude spectrum $L(f)$ are calculated. A 3 × 3 local mean filter $h_n(f)$ ($n = 3$) is then convolved with the logarithmic amplitude spectrum $L(f)$, and the spectral residual $R(f)$ is calculated. Finally, the spectral residual $R(f)$ is inverse-Fourier-transformed in the time domain and smoothed with a Gaussian filter $g(x)$ to extract the saliency regions $S_i$ in image $I_i$.
3. Sort the saliency regions by their areas in the images and color-vectorize the top-ranked saliency regions.
First, the saliency regions in image $I_i$ are arranged by area from large to small, and the top 3 saliency regions $S'_i = \{S_i^1, S_i^2, S_i^3\}$ are taken out. Then, for these 3 saliency regions, color vectorization is performed over the R, G, B channels in turn; each channel generates a 256-dimensional color vector, so each saliency region corresponds to a 768-dimensional color vector. Finally, $S'_i$ corresponds to 3 768-dimensional color vectors, expressed as $V_i = [V_i^1, V_i^2, V_i^3]$.
4. Measure the similarity between previous and next frame images from the vectorized saliency regions and determine the video key frames according to the size of the similarity.
First, a cosine similarity measure is performed between the color vectors $V_i$ and $V_{i+1}$ corresponding to the i-th frame image $I_i$ and the (i+1)-th frame image $I_{i+1}$. Since $V_i$ is composed of 3 component vectors $V_i^1, V_i^2, V_i^3$, the similarity measure between $V_i$ and $V_{i+1}$ is converted into similarity measures between the 3 corresponding component vectors $V_i^r$ and $V_{i+1}^r$. The minimum of the 3 computed similarity values is then taken as the content similarity between images $I_i$ and $I_{i+1}$. Finally, the similarity threshold is set to $T = 0.8$: if the similarity $\mathrm{sim}(I_i, I_{i+1}) \le 0.8$, both $I_i$ and $I_{i+1}$ are kept as key frame images; otherwise, if $\mathrm{sim}(I_i, I_{i+1}) > 0.8$, $I_i$ is kept as a key frame image and $I_{i+1}$ is removed from the frame image sequence. All frame images are traversed in sequence to obtain the final key frame image sequence.
The simulation experiment of the method of the invention is as follows:
in the experiment, 5 monitoring cameras are selected, 4 sections of videos are recorded by each monitoring camera, 20 sections of monitoring videos are selected in total, and performance test is performed on the video key frame extraction method based on the saliency region. For the 20 segments of monitoring videos, the monitoring duration, the total video frame number, the video sampling frame number, the key frame number and the ratio of the sampling frame number to the key frame number are respectively given. Table 1 gives detailed test data for 20 segments of surveillance video. As can be seen from table 1, the number of key frames of 20 segments of video is reduced to some extent compared to the number of frames after video sampling. Because the scenes recorded by the 5 monitoring cameras are different, the ratio of the frame number of the video recorded by the 5 monitoring cameras to the key frame number after sampling has a certain difference. However, the key frame sequence extracted from the video may represent the main content of the video.
Table 1: Key frame extraction performance for the 20 surveillance videos

No. | Camera   | Duration | Total frames | Sampled frames | Key frames | Ratio
 1  | Camera 1 | 05:55    |  8527 |  355 |  251 | 1.4
 2  | Camera 1 | 18:07    | 26104 | 1088 |  710 | 1.5
 3  | Camera 1 | 04:57    |  7138 |  297 |  129 | 2.3
 4  | Camera 1 | 19:05    | 27493 | 1146 |  639 | 1.8
 5  | Camera 2 | 00:32    |   771 |   32 |   17 | 1.9
 6  | Camera 2 | 23:21    | 33636 | 1402 | 1050 | 1.3
 7  | Camera 2 | 13:45    | 19801 |  825 |  574 | 1.4
 8  | Camera 2 | 05:50    |  8402 |  350 |  130 | 2.7
 9  | Camera 3 | 35:12    | 50690 | 2112 | 1045 | 2.0
10  | Camera 3 | 35:18    | 50835 | 2118 |  811 | 2.6
11  | Camera 3 | 35:16    | 50786 | 2116 | 1285 | 1.6
12  | Camera 3 | 29:10    | 42019 | 1751 | 1045 | 1.7
13  | Camera 4 | 23:36    | 33994 | 1416 |  982 | 1.4
14  | Camera 4 | 09:16    | 13367 |  557 |  390 | 1.4
15  | Camera 4 | 09:42    | 13971 |  582 |  460 | 1.3
16  | Camera 4 | 23:29    | 33833 | 1410 |  958 | 1.5
17  | Camera 5 | 23:29    | 33819 | 1409 |  786 | 1.8
18  | Camera 5 | 23:29    | 33816 | 1409 |  842 | 1.7
19  | Camera 5 | 15:01    | 21640 |  902 |  506 | 1.8
20  | Camera 5 | 15:29    | 22310 |  930 |  419 | 2.2

Claims (3)

1. The video key frame extraction method based on the saliency region is characterized by specifically comprising the following steps of:
step one, carrying out sampling conversion on video data, and converting a video into a continuous frame image sequence;
secondly, extracting a salient region in each frame of image by using a spectrum residual error model;
thirdly, sorting the salient regions according to the areas of the salient regions in the images, and carrying out color vectorization on the salient regions which are sorted in front;
wherein, according to their areas, the saliency regions are sorted from large to small in the image, so that the similarity of the previous and next frame images can be judged using the several top-ranked saliency regions, and color vectorization is carried out separately over the R, G, B channels for each saliency region to generate a corresponding color vector, the specific steps comprising:
1) in the i-th frame image $I_i$, the extracted saliency regions are expressed as $S_i = \{S_i^1, S_i^2, \ldots\}$, where $S_i^r$ is the r-th saliency region of image $I_i$; the regions in $S_i$ are sorted by area from large to small, and the top $z$ saliency regions are taken out as $S'_i = \{S_i^1, S_i^2, \ldots, S_i^z\}$;
2) for the r-th saliency region $S_i^r$ in $S'_i$, histogram statistics are computed separately over the R, G, B channels in the gray-value interval 0-255, generating the channel color vectors $V_R^r$, $V_G^r$ and $V_B^r$, which are further combined into the color vector $V_i^r = [V_R^r, V_G^r, V_B^r]$ corresponding to the r-th saliency region; accordingly, the color vector corresponding to $S'_i$ is $V_i = [V_i^1, V_i^2, \ldots, V_i^z]$;
Fourthly, similarity measurement is carried out between the front frame image and the back frame image according to the vectorized saliency region, and a video key frame is determined according to the size of the similarity;
wherein a similarity measure is computed between the previous and next frame images through the color vectors corresponding to their saliency regions: if the similarity is small, the content of the two frame images differs greatly and they can be determined as key frames; conversely, if the similarity is large, the content of the two frame images differs little and the later frame is removed from the frame image sequence; after the similarity measurement has been completed over all frame images, the remaining frame images constitute the key frame sequence of the video; the specific steps comprise:
1) to judge the similarity of the i-th frame image $I_i$ and the (i+1)-th frame image $I_{i+1}$, a cosine similarity measure is performed between their corresponding color vectors $V_i$ and $V_{i+1}$; since the vector $V_i$ is formed from $z$ component vectors $V_i^1, \ldots, V_i^z$, the similarity measure between $V_i$ and $V_{i+1}$ is converted into similarity measures between the $z$ corresponding component vectors $V_i^r$ and $V_{i+1}^r$, which can be expressed as
$$\mathrm{sim}\left(V_i^r, V_{i+1}^r\right) = \frac{V_i^r \cdot V_{i+1}^r}{\left\|V_i^r\right\|\,\left\|V_{i+1}^r\right\|};$$
2) after the similarity measurement is completed for the $z$ corresponding component vectors $V_i^r$ and $V_{i+1}^r$, $z$ similarity values are obtained; a larger similarity indicates that the saliency regions $S_i^r$ and $S_{i+1}^r$ in the two frame images contain similar content, while a smaller similarity indicates that their content differs greatly; the overall similarity between frame images $I_i$ and $I_{i+1}$ is the minimum of the $z$ similarity values, which can be expressed as
$$\mathrm{sim}\left(I_i, I_{i+1}\right) = \min_{1 \le r \le z} \mathrm{sim}\left(V_i^r, V_{i+1}^r\right),$$
and therefore reflects the content difference between the previous and next frame images;
3) after calculating the overall similarity between frame images $I_i$ and $I_{i+1}$, a similarity threshold $T$ is set; if $\mathrm{sim}(I_i, I_{i+1}) \le T$, the content of $I_i$ and $I_{i+1}$ differs greatly, and both $I_i$ and $I_{i+1}$ are kept as key frame images; if $\mathrm{sim}(I_i, I_{i+1}) > T$, the content of $I_{i+1}$ is highly similar to that of $I_i$, so $I_i$ is kept as a key frame image and $I_{i+1}$ is removed from the frame image sequence; after all frame images have been traversed in sequence, the key frame image sequence of the video is finally determined.
2. The method for extracting video key frames based on saliency regions as claimed in claim 1, wherein in step one, the video sampling frequency is set according to the total number of frames and the frame rate of the video, and the video is converted into a group of continuous frame image sequences according to the sampling frequency; in the video data, the total number of frames is $N_T$, the frame rate is $n_f$, and the sampling frequency is $n_s$; the sampled continuous frame image sequence is then $I = \{I_1, I_2, \ldots, I_N\}$, where $N = \lfloor N_T / n_s \rfloor$.
3. The method as claimed in claim 1, wherein in the second step, a Fourier transform is performed on each frame image in the frequency domain, the spectral residual is calculated, and the saliency regions of the image are then extracted in the time domain via the inverse Fourier transform, the specific steps including:
1) in the frequency domain, a Fourier transform is applied to the i-th frame image $I_i$ to obtain the amplitude spectrum $A(f)$ and phase spectrum $P(f)$, where $A(f) = |\mathcal{F}[I_i(x)]|$ and $P(f) = \varphi(\mathcal{F}[I_i(x)])$, and the logarithmic amplitude spectrum $L(f)$ is calculated as $L(f) = \log(A(f))$;
2) an $n \times n$ local mean filter $h_n(f)$ is set and convolved with the logarithmic amplitude spectrum $L(f)$, and the spectral residual $R(f)$ is calculated as $R(f) = L(f) - h_n(f) * L(f)$;
3) the spectral residual $R(f)$ is inverse-Fourier-transformed in the time domain and smoothed with a Gaussian filter $g(x)$ to obtain the saliency regions $S_i$ corresponding to image $I_i$, where $S_i(x) = g(x) * |\mathcal{F}^{-1}[\exp(R(f) + jP(f))]|^2$.
CN201810836824.0A 2018-07-26 2018-07-26 Video key frame extraction method based on saliency region Active CN108921130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810836824.0A CN108921130B (en) 2018-07-26 2018-07-26 Video key frame extraction method based on saliency region


Publications (2)

Publication Number Publication Date
CN108921130A CN108921130A (en) 2018-11-30
CN108921130B true CN108921130B (en) 2022-03-01

Family

ID=64418225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810836824.0A Active CN108921130B (en) 2018-07-26 2018-07-26 Video key frame extraction method based on saliency region

Country Status (1)

Country Link
CN (1) CN108921130B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597980B (en) * 2018-12-17 2023-04-28 北京嘀嘀无限科技发展有限公司 Target object clustering method and device
CN109815852A (en) * 2019-01-03 2019-05-28 深圳壹账通智能科技有限公司 Smart city event management method, device, computer equipment and storage medium
WO2020199198A1 (en) * 2019-04-04 2020-10-08 深圳市大疆创新科技有限公司 Image capture control method, image capture control apparatus, and movable platform
CN110290426B (en) * 2019-06-24 2022-04-19 腾讯科技(深圳)有限公司 Method, device and equipment for displaying resources and storage medium
CN110267041B (en) * 2019-06-28 2021-11-09 Oppo广东移动通信有限公司 Image encoding method, image encoding device, electronic device, and computer-readable storage medium
CN110399847B (en) * 2019-07-30 2021-11-09 北京字节跳动网络技术有限公司 Key frame extraction method and device and electronic equipment
CN111400528B (en) * 2020-03-16 2023-09-01 南方科技大学 Image compression method, device, server and storage medium
CN111444826B (en) * 2020-03-25 2023-09-29 腾讯科技(深圳)有限公司 Video detection method, device, storage medium and computer equipment
CN111639601B (en) * 2020-05-31 2022-05-13 石家庄铁道大学 Video key frame extraction method based on frequency domain characteristics
CN111738117B (en) * 2020-06-12 2023-12-19 鞍钢集团矿业有限公司 Deep learning-based detection method for electric bucket tooth video key frame
CN112949560B (en) * 2021-03-24 2022-05-24 四川大学华西医院 Method for identifying continuous expression change of long video expression interval under two-channel feature fusion
CN114897762B (en) * 2022-02-18 2023-04-07 众信方智(苏州)智能技术有限公司 Automatic positioning method and device for coal mining machine on coal mine working face
CN114422807B (en) * 2022-03-28 2022-10-21 麒麟软件有限公司 Transmission optimization method based on Spice protocol
CN114727021B (en) * 2022-04-19 2023-09-15 柳州康云互联科技有限公司 Cloud in-vitro diagnosis image data processing method based on video analysis
CN117475381B (en) * 2023-12-22 2024-03-29 济宁久邦工程机械设备有限公司 Real-time monitoring system for operation state of aerial working platform


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6965645B2 (en) * 2001-09-25 2005-11-15 Microsoft Corporation Content-based characterization of video frame sequences
KR101537174B1 (en) * 2013-12-17 2015-07-15 가톨릭대학교 산학협력단 Method for extracting salient object from stereoscopic video
CN103747240B (en) * 2013-12-25 2015-10-21 浙江大学 The vision significance filtering method of Fusion of Color and movable information
US9922411B2 (en) * 2015-11-30 2018-03-20 Disney Enterprises, Inc. Saliency-weighted video quality assessment
CN106952286B (en) * 2017-03-21 2019-09-06 中国人民解放军火箭军工程大学 Dynamic background Target Segmentation method based on movement notable figure and light stream vector analysis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104754403A (en) * 2013-12-27 2015-07-01 Tcl集团股份有限公司 Method and system for video sequential alignment
CN103761738A (en) * 2014-01-22 2014-04-30 杭州匡伦科技有限公司 Method for extracting video sequence key frame in three-dimensional reconstruction
CN105100688A (en) * 2014-05-12 2015-11-25 索尼公司 Image processing method, image processing device and monitoring system
CN104954791A (en) * 2015-07-01 2015-09-30 中国矿业大学 Method for selecting key frame from wireless distributed video coding for mine in real time
CN105574063A (en) * 2015-08-24 2016-05-11 西安电子科技大学 Image retrieval method based on visual saliency

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Guan Hong et al., "Application of Image Technology in HD Program Monitoring", Radio & TV Broadcast Engineering, 2010-10-15, pp. 154-155 *
Tan Jingjing, "Video Quality Assessment Algorithm Based on Human Visual Characteristics", China Masters' Theses Full-text Database, Information Science and Technology, No. 2, 2015-02-15, Sections 5.1 and 5.2.2 *

Also Published As

Publication number Publication date
CN108921130A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN108921130B (en) Video key frame extraction method based on saliency region
CN109919981B (en) Multi-feature fusion multi-target tracking method based on Kalman filtering assistance
CN104376003B (en) A kind of video retrieval method and device
CN105574515B (en) A kind of pedestrian recognition methods again under non-overlapping visual field
Omidyeganeh et al. Video keyframe analysis using a segment-based statistical metric in a visually sensitive parametric space
Priya et al. Edge strength extraction using orthogonal vectors for shot boundary detection
Huang et al. A novel method for video moving object detection using improved independent component analysis
CN110969101A (en) Face detection and tracking method based on HOG and feature descriptor
CN104504162B (en) A kind of video retrieval method based on robot vision platform
CN111242003B (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN109271902B (en) Infrared weak and small target detection method based on time domain empirical mode decomposition under complex background
CN115393788B (en) Multi-scale monitoring pedestrian re-identification method based on global information attention enhancement
CN114821482A (en) Vector topology integrated passenger flow calculation method and system based on fisheye probe
Zhang et al. The target tracking method based on camshift algorithm combined with sift
Guangjing et al. Research on static image recognition of sports based on machine learning
CN110830734B (en) Abrupt change and gradual change lens switching identification method and system
Yi et al. Adaptive threshold based video shot boundary detection framework
Prabakaran et al. Key frame extraction analysis based on optimized convolution neural network (ocnn) using intensity feature selection (ifs)
CN113496159B (en) Multi-scale convolution and dynamic weight cost function smoke target segmentation method
Patil et al. Detection and tracking of moving object: A survey
CN111060887B (en) Gm-APD laser radar low signal-to-noise ratio echo data signal extraction method based on concave-convex search
CN113888428A (en) Infrared dim target detection method and device based on local contrast
Wang et al. Tracking salient keypoints for human action recognition
CN112926676B (en) False target identification method and device and computer equipment
Yunzuo et al. Key Frame Extraction Algorithm of Surveillance Video Based on Quaternion Fourier Significance Detection.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant