CN112770116A - Method for extracting video key frame by using video compression coding information

Info

Publication number: CN112770116A
Application number: CN202011642920.5A
Authority: CN
Inventors: 艾达; 梁嘉倩
Original assignee: Xian University of Posts and Telecommunications
Current assignee: Xian University of Posts and Telecommunications
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-05-07
Anticipated expiration: 2040-12-31
Also published as: CN112770116B

Abstract

A method for extracting video key frames using video compression coding information comprises extracting depth and frame bit number features, shot switching detection, and key frame extraction. The invention uses two compressed-domain features of the video bitstream, the coding unit depth information and the frame bit number, to detect shot switches, obtain shot segments, and extract key frames. Because the video is processed in the compressed domain without decompression, the computation is reduced and the processing time is shortened. Experimental results show that, compared with the existing method, the accuracy is improved by 12.1%, the recall rate by 5.3%, and the F value by 8.4%, and the extracted key frames express the main content of the original video well. The method has the advantages of a small amount of computation, high efficiency, high accuracy, and high processing speed, and can be used for processing video images.

Description

Method for extracting video key frame by using video compression coding information

Technical Field

The invention belongs to the technical field of digital video retrieval, and particularly relates to a method for extracting video key frames by using video compression coding information.

Background

With the rapid development of multimedia and network technology, the volume of video data is growing rapidly, and how to manage videos effectively and quickly obtain the important information in them has become a research hotspot. Against this background, key frame extraction has become an effective way of solving the problem: extracting key frames greatly reduces the amount of video data while still expressing the important information of the original video, which saves retrieval time and improves video retrieval efficiency.

Scholars at home and abroad have carried out a great deal of research on key frame extraction. According to the video data being processed, the methods can be divided into key frame extraction in the pixel domain and key frame extraction in the compressed domain. Pixel-domain methods operate after the video has been completely decompressed; the amount of computation is large, the efficiency is low, and real-time requirements are difficult to meet. Compressed-domain video processing works directly on the compressed video data, which has a small data volume, and processes the video with no decompression or only partial decompression, so the processing speed can be greatly improved. Key frame extraction in the compressed domain has therefore attracted wide attention.

Ali Reza et al. proposed a method for extracting key frames in the H.265/HEVC compressed domain, which uses a normalized histogram of the intra-frame prediction modes extracted from H.265/HEVC coded video to detect similar frames, classifies the similar frames using fuzzy c-means clustering, and extracts key frames. Zhu Zhiming et al. proposed a video summary key frame extraction method in the video coding compressed domain, which counts, at the decoding end, the number of luma prediction modes of the intra-coded PU blocks, constructs a mode feature vector, clusters the mode feature vectors with an adaptive clustering algorithm fused with the iterative self-organizing data analysis algorithm (ISODATA) to obtain candidate key frames, and then filters the candidate key frames by similarity to remove redundant frames and obtain the final key frames.

What these methods have in common is that they use intra-frame prediction mode values as features and their experiments address only the all-intra mode, so video frames are processed slowly, the processing time is long, and the methods are not practical.

Disclosure of Invention

The technical problem to be solved by the present invention is to overcome the disadvantages of the above video frame processing methods and to provide a method for extracting video key frames using video compression coding information that requires no decoding and has a small amount of computation, high processing speed, and high extraction efficiency.

The technical solution adopted to solve the above technical problem comprises the following steps:

(1) Extracting depth and frame bit number features

Determine the rate-distortion cost J of the coding unit according to equation (1):

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the (x, y)-th pixel in the coding unit, x ∈ {1, 2, …, H}, y ∈ {1, 2, …, W}, W × H is the video resolution, λ ≥ 0 is the Lagrange coefficient, W and H are finite positive integers, and W > H.
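As a hedged reading of equation (1), assuming the cost takes the usual Lagrangian form that sums distortion and weighted rate over the pixels, an expression consistent with the symbols defined above can be written as follows. The summation limits are an assumption taken from the stated ranges of x and y.

```latex
J = \sum_{x=1}^{H} \sum_{y=1}^{W} \left( D_{x,y} + \lambda \, R_{x,y} \right) \tag{1}
```

Under this form, λ = 0 reduces the cost to pure distortion, while a larger λ penalizes partitions that spend more bits.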

Determine the depth feature vector F_n of the coded frame according to equation (2):

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the n-th coded frame of the video, n ∈ {1, 2, …, N}, N is the total number of frames of the video and is a finite positive integer, round() is the round-up function, and f_α is the depth value of a coding unit, taking any one of the values 0, 1, 2 and 3.
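A minimal sketch of how a depth feature vector of this kind might be assembled, assuming the coding unit depths of one frame are already available as a 2-D grid. The helper names, the row-by-row ordering, and the `block` parameter are hypothetical; the text does not state how the vector length α is derived, so the round-up grid count below is only one plausible reading of the round() function.

```python
import math


def depth_feature_vector(depth_map):
    """Flatten a 2-D grid of CU depth values into one vector, row by row.

    depth_map: list of rows, each a list of depth values in {0, 1, 2, 3}.
    The row-major ordering is an assumption made for illustration.
    """
    vector = [d for row in depth_map for d in row]
    if any(d not in (0, 1, 2, 3) for d in vector):
        raise ValueError("CU depth values must be 0, 1, 2 or 3")
    return vector


def grid_size(width, height, block=8):
    """Number of block x block cells covering a width x height frame.

    Partial cells at the right and bottom edges are rounded up. The block
    size is a hypothetical parameter, not a value given in the text.
    """
    return math.ceil(width / block) * math.ceil(height / block)
```

For instance, `grid_size(352, 240, 8)` gives 1320 cells for the 352 × 240 resolution used in Example 1 under this assumed 8 × 8 grid.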

Determine the frame bit number R_n according to equation (3):

(2) Shot switching detection

Count the frame bit number R_n of each coded frame and plot it as a line graph for analysis. Mark each position where the bit number gradually increases and then gradually decreases as a shot switch. The frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N. This yields K shot segments, where K is a finite positive integer.
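The rise-then-fall rule can be sketched in code as a search for peaks in the sequence of frame bit numbers. This is an illustration under stated assumptions, not the patented procedure: the text describes analysis of a line graph, whereas the sketch marks local maxima, and the `min_ratio` threshold used to ignore small fluctuations is hypothetical.

```python
def detect_shot_switches(frame_bits, min_ratio=2.0):
    """Return frame indices where the bit number rises and then falls.

    frame_bits: list of per-frame bit numbers R_n.
    min_ratio: hypothetical threshold; a peak only counts when it is at
    least min_ratio times the mean bit number of the whole sequence.
    """
    if len(frame_bits) < 3:
        return []
    mean_bits = sum(frame_bits) / len(frame_bits)
    switches = []
    for n in range(1, len(frame_bits) - 1):
        rising = frame_bits[n] > frame_bits[n - 1]
        falling = frame_bits[n] >= frame_bits[n + 1]
        if rising and falling and frame_bits[n] >= min_ratio * mean_bits:
            switches.append(n)
    return switches


def split_into_shots(num_frames, switches):
    """Cut the range [0, num_frames) at each switch; return (start, end) pairs."""
    bounds = [0] + list(switches) + [num_frames]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]
```

Each returned pair has length M = end − start, and the number of pairs plays the role of K.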

(3) Extracting key frames

Determine the Laplacian matrix L according to equation (4):

where F_i and F_j denote the depth feature vectors of the i-th and j-th coded frames, respectively, i ∈ {1, 2, …, N}, j ∈ {1, 2, …, N}.

Determine the eigenvectors y corresponding to the first K eigenvalues of L according to equation (5), and construct the N × K matrix Y according to equation (6):

L × y = β × D × y (5)

Y = [y₁, y₂, …, y_K] (6)

where y₁, y₂, …, y_K are, in order, the N × 1 eigenvectors corresponding to the first K eigenvalues.
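A sketch of equations (5) and (6) using NumPy, under several assumptions: the similarity between F_i and F_j is a Gaussian kernel with a hypothetical width `sigma`, L is the unnormalized Laplacian (degree matrix minus similarity matrix), D in equation (5) is read as the degree matrix, and "the first K eigenvalues" is read as the K smallest. None of these choices is confirmed by the text.

```python
import numpy as np


def spectral_embedding(features, k, sigma=1.0):
    """Return (beta, Y): the K smallest generalized eigenvalues and the
    N x K matrix whose columns are the matching eigenvectors.

    features: N x alpha array-like, one depth feature vector F_n per row.
    sigma: hypothetical Gaussian kernel width.
    """
    F = np.asarray(features, dtype=float)
    # Pairwise squared distances via the Gram matrix (avoids an N x N x alpha array).
    sq_norms = np.sum(F ** 2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (F @ F.T)
    sq_dist = np.maximum(sq_dist, 0.0)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))   # assumed similarity matrix
    degree = W.sum(axis=1)                      # diagonal of D
    L = np.diag(degree) - W                     # assumed Laplacian
    # L y = beta D y is equivalent to (D^-1/2 L D^-1/2) z = beta z with y = D^-1/2 z.
    inv_sqrt = 1.0 / np.sqrt(degree)
    L_sym = inv_sqrt[:, None] * L * inv_sqrt[None, :]
    beta, Z = np.linalg.eigh(L_sym)             # eigenvalues in ascending order
    Y = inv_sqrt[:, None] * Z[:, :k]
    return beta[:k], Y
```

The rows of Y give one K-dimensional point per frame, which is the input a K-means step would cluster.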

Perform K-means clustering on the matrix Y, and determine the distance d_m between the cluster center μ and all other frames in the shot according to equation (7):

d_m = ||y_m − μ||₂ (7)

where m ∈ {1, 2, …, M}, M is the length of each shot, M is a finite positive integer, and M < N.

The frame with the smallest distance d_m is taken as the key frame.
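The selection rule of equation (7) can be sketched as follows. The function name and the default choice of μ are assumptions: when no center is supplied, the sketch uses the mean row of the shot, which presumes the frames of one shot form one cluster. A center obtained from K-means can be passed in instead.

```python
import math


def key_frame_of_shot(rows, start, end, mu=None):
    """Return the frame index in [start, end) whose row is nearest to mu.

    rows: list of length-K lists, the rows y_m of the matrix Y.
    mu: cluster center; if None, the mean row of the shot is used
        (an assumption made for this sketch).
    """
    shot = rows[start:end]
    if not shot:
        raise ValueError("shot segment is empty")
    k = len(shot[0])
    if mu is None:
        mu = [sum(r[j] for r in shot) / len(shot) for j in range(k)]
    # Euclidean distance d_m = ||y_m - mu||_2 for every frame of the shot.
    dists = [math.sqrt(sum((r[j] - mu[j]) ** 2 for j in range(k))) for r in shot]
    return start + dists.index(min(dists))
```

Calling this once per shot segment yields one key frame per segment, K key frames in total.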

In step (1), extracting depth and frame bit number features, W takes a value from 176 to 7680, H takes a value from 144 to 4320, and N takes a value from 1000 to 7000.

In step (2), shot switching detection, K takes a value from 5 to 20.

The invention uses two compressed-domain features of the video bitstream, the CU depth value and the frame bit number, to detect shot switches, obtain shot segments, and extract key frames. Because the video is processed in the compressed domain without decompression, the computation is reduced and the processing time is shortened. Experimental results show that, compared with the existing method, the accuracy is improved by 12.1%, the recall rate by 5.3%, and the F value by 8.4%, and the extracted key frames express the main content of the original video well. The method has the advantages of a small amount of computation, high efficiency, high accuracy, and high processing speed, and can be used for processing video images.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following drawings and examples, but the present invention is not limited to these examples.

Example 1

Taking the video sequence "A New Horizon, segment 02" from the international VSUMM dataset as an example, the method for extracting video key frames using video compression coding information in this embodiment comprises the following steps (see FIG. 1):

(1) Extracting depth and frame bit number features

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the (x, y)-th pixel in the coding unit, x ∈ {1, 2, …, H}, y ∈ {1, 2, …, W}, W × H is the video resolution, λ ≥ 0 is the Lagrange coefficient, W and H are finite positive integers, and W > H; in this embodiment W is 352 and H is 240.

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the n-th coded frame of the video, n ∈ {1, 2, …, N}, N is the total number of frames of the video and is a finite positive integer (N is 1797 in this embodiment), round() is the round-up function, f_α is the depth value of a coding unit, taking any one of the values 0, 1, 2 and 3, and the specific value of f_α is determined according to the value of n.

Determine the frame bit number R_n according to equation (3):

(2) Shot switching detection

Count the frame bit number R_n of each coded frame and plot it as a line graph for analysis. Mark each position where the bit number gradually increases and then gradually decreases as a shot switch. The frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N. This yields K shot segments, where K is a finite positive integer. In this embodiment K is 13, and the values of M are 376, 232, 128, 108, 80, 76, 72, 80, 116, 120, 68, 72 and 108.

(3) Extracting key frames

Determine the Laplacian matrix L according to equation (4):

L × y = β × D × y (5)

Y = [y₁, y₂, …, y_K] (6)

where y₁, y₂, …, y_K are, in order, the N × 1 eigenvectors corresponding to the first K eigenvalues; K takes the same value as in step (2), and N takes the same value as in step (1).

d_m = ||y_m − μ||₂ (7)

where m ∈ {1, 2, …, M}, M is the length of each shot, M is a finite positive integer, M < N, and the specific values of M are the same as in step (2).

The frame with the smallest distance d_m is taken as the key frame.

Example 2

Taking the video sequence Ocean Floor Legacy as an example, the method for extracting video key frames using video compression coding information in this embodiment comprises the following steps:

(1) Extracting depth and frame bit number features

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the (x, y)-th pixel in the coding unit, x ∈ {1, 2, …, H}, y ∈ {1, 2, …, W}, W × H is the video resolution, λ ≥ 0 is the Lagrange coefficient, W and H are finite positive integers, and W > H; in this embodiment W is 176 and H is 144.

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the n-th coded frame of the video, n ∈ {1, 2, …, N}, N is the total number of frames of the video and is a finite positive integer (N is 1000 in this embodiment), round() is the round-up function, f_α is the depth value of a coding unit, taking any one of the values 0, 1, 2 and 3, and the specific value of f_α is determined according to the value of n.

Determine the frame bit number R_n according to equation (3):

(2) Shot switching detection

Count the frame bit number R_n of each coded frame and plot it as a line graph for analysis. Mark each position where the bit number gradually increases and then gradually decreases as a shot switch. The frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N. This yields K shot segments, where K is a finite positive integer. In this embodiment K is 5, and the values of M are 336, 216, 112, 96 and 296.

(3) Extracting key frames

Determine the Laplacian matrix L according to equation (4):

where F_i and F_j denote the depth feature vectors of the i-th and j-th coded frames, respectively, i ∈ {1, 2, …, N}, j ∈ {1, 2, …, N}.

L × y = β × D × y (5)

Y = [y₁, y₂, …, y_K] (6)

d_m = ||y_m − μ||₂ (7)

The frame with the smallest distance d_m is taken as the key frame.

Example 3

Taking the video sequence exceptional Terrane as an example, the method for extracting video key frames using video compression coding information in this embodiment comprises the following steps:

(1) Extracting depth and frame bit number features

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the (x, y)-th pixel in the coding unit, x ∈ {1, 2, …, H}, y ∈ {1, 2, …, W}, W × H is the video resolution, λ ≥ 0 is the Lagrange coefficient, W and H are finite positive integers, and W > H; in this embodiment W is 7680 and H is 4320.

Determine the depth feature vector F_n of the coded frame according to equation (2):

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the n-th coded frame of the video, n ∈ {1, 2, …, N}, N is the total number of frames of the video and is a finite positive integer (N is 7000 in this embodiment), round() is the round-up function, f_α is the depth value of a coding unit, taking any one of the values 0, 1, 2 and 3, and the specific value of f_α is determined according to the value of n.

Determine the frame bit number R_n according to equation (3):

(2) Shot switching detection

Count the frame bit number R_n of each coded frame and plot it as a line graph for analysis. Mark each position where the bit number gradually increases and then gradually decreases as a shot switch. The frames between two adjacent shot switches form one shot segment of length M, where M is a finite positive integer and M < N. This yields K shot segments, where K is a finite positive integer. In this embodiment K is 20, and the values of M are 156, 196, 596, 1068, 316, 452, 196, 96, 468, 240, 496, 176, 152, 376, 192, 112, 412, 336, 240 and 396.

(3) Extracting key frames

Determine the Laplacian matrix L according to equation (4):

L × y = β × D × y (5)

Y = [y₁, y₂, …, y_K] (6)

d_m = ||y_m − μ||₂ (7)

The frame with the smallest distance d_m is taken as the key frame.

To verify the beneficial effects of the present invention, the inventors carried out a comparison experiment between the method for extracting video key frames using video compression coding information of Example 1 and the method of "HEVC intra frame based compressed domain video summary" (hereinafter "comparison document 1"). Accuracy, recall rate, and F value were used as comprehensive indicators for evaluating the quality of the video summary; the experimental and calculated results are shown in Table 1.

The accuracy is determined as follows:

where N_m is the number of key frames extracted by the experimental method that match the user summary, and N_AS is the number of key frames extracted by the experimental method.

The recall rate is determined as follows:

where N_US is the number of key frames in the user summary.

The F value is determined as follows:

Table 1 Experimental results

As can be seen from Table 1, compared with the method of comparison document 1, the method of the present invention performs significantly better: the accuracy is improved by 12.1%, the recall rate by 5.3%, and the F value by 8.4%.
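The three indicators can be sketched in code from the symbol definitions N_m, N_AS and N_US. The ratios below are an assumption: accuracy is taken as N_m / N_AS, recall as N_m / N_US, and the F value as their harmonic mean, which are the conventional definitions for these indicators. The function name is hypothetical.

```python
def summary_scores(n_matched, n_extracted, n_user):
    """Return (accuracy, recall, f_value) for one video summary.

    n_matched: N_m, extracted key frames that match the user summary.
    n_extracted: N_AS, key frames extracted by the method.
    n_user: N_US, key frames in the user summary.
    The formulas are the conventional ones and are assumed here.
    """
    accuracy = n_matched / n_extracted if n_extracted else 0.0
    recall = n_matched / n_user if n_user else 0.0
    total = accuracy + recall
    f_value = 2.0 * accuracy * recall / total if total else 0.0
    return accuracy, recall, f_value
```

For instance, with illustrative counts that are not taken from Table 1, `summary_scores(8, 10, 16)` gives an accuracy of 0.8 and a recall of 0.5.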

Claims

1. A method for extracting video key frames using video compression coding information, comprising the following steps:

(1) Extracting depth and frame bit number features

where D_{x,y} and R_{x,y} denote the distortion and the number of coding bits, respectively, of the (x, y)-th pixel in a coding unit, x ∈ {1, 2, …, H}, y ∈ {1, 2, …, W}, W × H is the video resolution, λ ≥ 0 is the Lagrange coefficient, W and H are finite positive integers, and W > H;

F_n = {f₁, f₂, …, f_α} (2)

where n denotes the n-th coded frame of the video, n ∈ {1, 2, …, N}, N is the total number of frames of the video and is a finite positive integer, round() is the round-up function, and f_α is the depth value of a coding unit, taking any one of the values 0, 1, 2 and 3;

determining the frame bit number R_n according to equation (3):

(2) Shot switching detection

counting the frame bit number R_n of each coded frame and plotting it as a line graph for analysis, marking each position where the bit number gradually increases and then gradually decreases as a shot switch, the frames between two adjacent shot switches forming one shot segment of length M, where M is a finite positive integer and M < N, thereby obtaining K shot segments, where K is a finite positive integer;

(3) Extracting key frames

determining the Laplacian matrix L according to equation (4):

where F_i and F_j denote the depth feature vectors of the i-th and j-th coded frames, respectively, i ∈ {1, 2, …, N}, j ∈ {1, 2, …, N};

L × y = β × D × y (5)

Y = [y₁, y₂, …, y_K] (6)

where y₁, y₂, …, y_K are, in order, the N × 1 eigenvectors corresponding to the first K eigenvalues;

d_m = ||y_m − μ||₂ (7)

where m ∈ {1, 2, …, M}, M is the length of each shot segment, M is a finite positive integer, and M < N;

the frame with the smallest distance d_m is taken as the key frame.

2. The method for extracting video key frames using video compression coding information according to claim 1, wherein: in step (1), extracting depth and frame bit number features, W takes a value from 176 to 7680, H takes a value from 144 to 4320, and N takes a value from 1000 to 7000.

3. The method for extracting video key frames using video compression coding information according to claim 1, wherein: in step (2), shot switching detection, K takes a value from 5 to 20.