CN113591587A - Method for extracting content key frame of motion video - Google Patents

Method for extracting content key frame of motion video

Info

Publication number
CN113591587A
Authority
CN
China
Prior art keywords
video
slice
frame
image
clustering
Prior art date
Legal status
Pending
Application number
CN202110749819.8A
Other languages
Chinese (zh)
Inventor
冯子亮
刘恒宇
韩震博
何旭东
窦芙蓉
唐玄霜
张欣
贺思睿
何思迪
张炬
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date: 2021-07-02
Filing date: 2021-07-02
Publication date: 2021-11-02
2021-07-02: Application filed by Sichuan University
2021-07-02: Priority to CN202110749819.8A
2021-11-02: Publication of CN113591587A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for extracting content key frames from a motion video. A dynamic spatio-temporal slice position selection method determines the slice position from the activity intensity map of the motion video, which improves the extraction of key information and lets the extracted slice express the key motion information in the video more effectively. In the distance calculation of the clustering algorithm, both the similarity and the temporal attribute of the slice images are considered, which improves the accuracy of key frame identification. Together, these measures improve the extraction of content key information from motion videos.

Description

Method for extracting content key frame of motion video
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method for extracting content key frames from a motion video.
Background
Digital video has become an important channel for disseminating information over networks. As the number of videos on the network grows rapidly, quickly and effectively finding a required clip among a large number of videos has become a focus of attention; this is the video content retrieval problem.
A video consists of continuously changing image frames; the frames that can effectively represent the main content of the video are called video content key frames. Video content key frame extraction is an important means of addressing video content retrieval and plays an important role in video similarity analysis, video content summarization, and related tasks.
Motion video analysis often involves motion type analysis, motion pose estimation, human behavior recognition, and so on; processing only the video key frames instead of all video frames effectively reduces the amount of computation and improves the efficiency of motion video analysis.
Spatio-temporal slicing is a technique for summarizing video content along the temporal and spatial dimensions. Specifically, the video is unfolded into a sequence of images along the time dimension; each image is sliced along the spatial dimension by extracting one row or one column of its pixels, and the slices together form a video spatio-temporal slice image that serves as a summary image of the video; processing this summary image yields information such as the content key frames of the video.
Conventional spatio-temporal slicing typically slices at a fixed position, which may fail to capture the key information in the video, and it ignores the temporal continuity of the slices when processing the video slice image, so the extracted content key frames are not accurate enough.
To address these problems, the invention provides a method for extracting video content key frames from motion videos based on dynamic spatio-temporal slice clustering: dynamically computing the spatio-temporal slice position improves the extraction of key motion information, and taking the temporal attribute of the slice images into account in the distance calculation of the clustering algorithm improves the accuracy of key frame identification, which together improve the extraction of key information from motion videos.
Disclosure of Invention
A method for extracting content key frames of a motion video comprises the following steps.
Step 1, calculating an activity intensity map of the motion video:
step 1.1, converting each frame of the motion video to grayscale;
step 1.2, starting from the 2nd frame, computing the per-pixel absolute gray-level difference from the previous frame to obtain a difference image;
step 1.3, thresholding the difference image, setting to 0 any pixel whose gray value is below the threshold;
step 1.4, accumulating the difference images pixel by pixel and normalizing;
the normalized accumulated difference image is called the activity intensity map of the motion video.
A motion video here means a video whose background changes relatively slowly while the foreground changes substantially; its activity intensity map reflects the overall effect of foreground target change.
Step 2, taking the row with the maximum gray value in the activity intensity map of the motion video as the spatio-temporal slice position row.
The gray values of the main area of the activity intensity map are accumulated row by row;
the row with the maximum accumulated gray value is taken as the spatio-temporal slice position row.
The main area of the activity intensity map is the middle region of the video image excluding the upper and lower borders; it is where the foreground objects in the video are mainly active. To prevent foreground changes outside this region from affecting content key frame extraction, changes outside it are ignored.
Optionally, the column with the maximum gray value in the activity intensity map is taken as the spatio-temporal slice position column, and vertical slicing is subsequently performed column by column.
Step 3, performing a horizontal slicing operation on each frame of the motion video at the spatio-temporal slice position to obtain the spatio-temporal slice images, or slices.
The horizontal slicing operation takes one or more rows from an image; the resulting image is called a spatio-temporal slice image, or slice.
Step 4, stitching the spatio-temporal slice images to form the transverse video slice image.
During stitching, the slices are stacked top to bottom in the frame order in which the video is unfolded, forming a transverse video slice image that runs from top to bottom along the time direction.
Step 5, clustering the transverse video slice image along the time direction using a K-means clustering algorithm, with each slice image as the basic clustering unit.
The K-means clustering algorithm of step 5 comprises the following steps:
step 5.1, initializing the clusters: according to the set number of cluster centers, the slices of the transverse video slice image are evenly assigned to the clusters (categories) along the time direction;
step 5.2, according to the clustering result, recomputing and updating the cluster center of each category;
step 5.3, for the slice images lying between two adjacent cluster centers along the time direction, recomputing their distances to those centers and adjusting the boundary between the two adjacent categories accordingly;
step 5.4, repeating steps 5.2 and 5.3 following the iterative scheme of K-means clustering to obtain the clustering result and the key frame candidates.
The cluster center is the center of a category; it is a slice image along the time direction and can be represented by its video frame number.
The distance between a slice image and a cluster center is computed as the distance between the two slice images, represented by the product of the Euclidean distance between the slice images and their temporal distance (i.e., the inter-frame distance) along the time direction; the value is small when the two slice images are similar, and it also decreases as the inter-frame distance decreases.
The cluster center of each category is recomputed by a trial method: each slice of the category is tried as the cluster center in turn, the distances from the remaining slices of the category to that candidate are computed and accumulated, and the candidate with the minimum accumulated distance is taken as the cluster center of the category.
When adjusting the boundary between two adjacent categories, note that clustering proceeds along the time direction, so the slices above the boundary belong to the upper cluster center and the slices below it belong to the lower cluster center; the clustering result is therefore a clean boundary between the two categories, and no interleaving can occur.
The clustering boundary is adjusted as follows: starting from the upper and lower cluster centers, search toward each other to obtain the first slice image below the upper center and the first slice image above the lower center, and compute and accumulate the distance of each to its own cluster center; then, on the side with the smaller accumulated distance, take the next slice image in the same direction, compute its distance to that side's cluster center, and add it to that side's accumulation; repeat until all slices between the two cluster centers have been searched, which yields the new boundary between the two categories.
The clustering result consists of the boundaries and the cluster center of each category along the time direction after clustering finishes; the frame corresponding to each cluster-center slice image is a candidate video content key frame.
Step 6, merging categories with too few frames and readjusting the category boundaries.
If the number of frames in a category of the clustering result is less than the specified minimum-frame-count threshold, the category is removed and merged with the preceding and following categories; its frames are assigned to the preceding or following category according to the boundary-adjustment method of step 5.3.
In the final clustering result, the sequence formed by the cluster-center frames is the final sequence of video content key frames.
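For orientation, a compact end-to-end sketch of steps 1 to 6 is given below in Python (OpenCV/NumPy). All helper names (compute_activity_map, select_slice_row, build_slice_image, cluster_slices, merge_small_clusters) are hypothetical labels for the corresponding steps rather than terminology from the patent; minimal sketches of these helpers accompany the detailed description below.

    import cv2

    def extract_key_frames(video_path, k=4, diff_threshold=3, border=5, min_frames=4):
        # Hypothetical driver for steps 1-6; the helpers are illustrative names only,
        # sketched alongside the corresponding steps of the detailed description.
        cap = cv2.VideoCapture(video_path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))   # step 1.1: grayscale
        cap.release()

        activity = compute_activity_map(frames, diff_threshold)      # step 1: activity intensity map
        row = select_slice_row(activity, border)                     # step 2: slice position row
        slice_image = build_slice_image(frames, row)                 # steps 3-4: transverse slice image
        segments, centers = cluster_slices(slice_image, k)           # step 5: clustering along time
        segments, centers = merge_small_clusters(                    # step 6: merge small categories
            slice_image, segments, centers, min_frames)
        return centers                                               # frame numbers of the key frames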
Compared with the prior art, the invention has the following advantages: the dynamic spatio-temporal slice position improves the extraction of key video information and lets the extracted slice express the key motion information in the motion video more effectively; taking the temporal attribute of the slice images into account in the distance calculation of the clustering algorithm improves the accuracy of key frame identification; together, these measures improve the extraction of key information from motion videos. The method is also easy to understand, simple to compute, and robust.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Fig. 2 is a schematic diagram of the calculation flow of the motion video activity intensity map in the present invention.
Fig. 3 is a schematic diagram of a transverse video slice image in the present invention.
Detailed Description
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings; the described embodiments are only some, not all, of the embodiments of the invention.
A method for extracting a content key frame of a motion video, as shown in fig. 1, includes the following steps.
Step 1, calculating the activity intensity map of the motion video, as shown in fig. 2.
Step 1.1, converting each frame of the motion video to grayscale;
the graying process may use a conventional RGB to grayscale model.
Step 1.2, calculating a difference image frame by frame;
starting from the 2nd frame, the gray-value difference from the previous frame is computed pixel by pixel, and its absolute value is taken to obtain the difference image.
Step 1.3, carrying out threshold processing on the difference image;
the purpose of setting the threshold value is to reduce the influence of video noise in the future, and the threshold value can be set to be 1-5.
In this example, the threshold is set to 3, that is, the difference between the gray levels of two adjacent frames is less than 3, and the difference is cleared to 0.
Step 1.4, accumulating and normalizing all difference images of the video;
the gray values of the difference images are accumulated pixel by pixel, and the maximum value Gmax and minimum value Gmin of the accumulated gray values are obtained; [Gmin, Gmax] is then mapped linearly onto [0, 255] to obtain the normalized image.
The normalized image is referred to as the activity intensity map of the motion video.
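A minimal NumPy sketch of steps 1.2 to 1.4, assuming the frames are already grayscale uint8 arrays (the function name and signature are illustrative, not taken from the patent):

    import numpy as np

    def compute_activity_map(gray_frames, diff_threshold=3):
        """Steps 1.2-1.4: thresholded frame differencing, pixel-wise accumulation, normalization."""
        acc = np.zeros(gray_frames[0].shape, dtype=np.float64)
        for prev, cur in zip(gray_frames[:-1], gray_frames[1:]):
            diff = np.abs(cur.astype(np.int16) - prev.astype(np.int16))  # step 1.2: |frame_t - frame_(t-1)|
            diff[diff < diff_threshold] = 0                               # step 1.3: suppress small (noise) differences
            acc += diff                                                   # step 1.4: accumulate
        gmin, gmax = acc.min(), acc.max()
        # linear mapping of [Gmin, Gmax] onto [0, 255]
        return ((acc - gmin) / max(gmax - gmin, 1e-9) * 255.0).astype(np.uint8)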
In this example, let the width of the video image be X = 640, the height be Y = 480, and the length of the video be L = 80 frames, as shown in fig. 3 (1).
Step 2, taking the row with the maximum gray value in the activity intensity map of the motion video as the spatio-temporal slice position row.
In this example, the upper and lower borders are set to 5 pixels, so the rows of the main area of the video image span [5, 474], with row numbering starting from 0.
The gray values of this area of the activity intensity map are accumulated row by row;
the row with the maximum accumulated gray value is taken as the spatio-temporal slice position row.
In this example, assume the resulting spatio-temporal slice position row is p = 280, as indicated by the dashed line p in fig. 3 (1).
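A corresponding sketch of step 2, with the same assumptions (the 5-pixel border matches this example; the names are illustrative):

    import numpy as np

    def select_slice_row(activity_map, border=5):
        """Step 2: pick the row of the main area with the largest accumulated gray value."""
        main = activity_map[border:activity_map.shape[0] - border, :]   # main area, e.g. rows [5, 474] for height 480
        row_sums = main.sum(axis=1)                                      # accumulate gray values row by row
        return int(np.argmax(row_sums)) + border                         # convert back to a full-image row index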
Step 3, performing the horizontal slicing operation on each frame of the motion video.
In this example, the slice width is 1 pixel, so row 280 of each frame is extracted as the spatio-temporal slice image; the size of each slice image is X × 1, i.e., 640 × 1.
Step 4, stitching the spatio-temporal slice images to form the transverse video slice image.
During stitching, the slices are stacked top to bottom in the frame order in which the video is unfolded, forming a transverse video slice image ordered from top to bottom in time.
In this example, the size of the resulting transverse video slice image is X × L, i.e., 640 pixels wide by 80 frames tall, as shown in fig. 3 (3).
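Steps 3 and 4 then amount to taking the selected row from every frame and stacking the rows in frame order; a one-function sketch (slice width 1, as in this example; names are illustrative):

    import numpy as np

    def build_slice_image(gray_frames, row):
        """Steps 3-4: extract row `row` from each frame and stack the slices top to bottom.
        The result has one row per frame (L rows of width X)."""
        return np.stack([frame[row, :] for frame in gray_frames], axis=0)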
Step 5, clustering the transverse video slice image along the time direction using a K-means clustering algorithm, with each slice image as the basic clustering unit.
The purpose of step 5 is to find the boundaries and the center of each category from the slices.
Step 5.1, initializing the clustering by evenly distributing the slices of the transverse video slice image among the categories.
In this example, the number of cluster centers is set to 4, and the video frame numbers run from 0 to 79.
After the even distribution, the frame numbers of the four categories are [0 to 19], [20 to 39], [40 to 59], and [60 to 79].
Step 5.2, updating the cluster centers according to the clustering result.
The distance between two slice images i and j is calculated as:
dis(i, j) = abs(i - j) * d(p(i), p(j));
where i and j are the frame numbers of the two slice images, abs() denotes the absolute value, and d(p(i), p(j)) is the Euclidean distance between slice image p(i) and slice image p(j).
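As a hedged sketch, assuming the transverse slice image is stored as a NumPy array with one row per frame (L rows of width X) and that the function name is illustrative, the distance can be written as:

    import numpy as np

    def slice_distance(slice_image, i, j):
        """dis(i, j) = |i - j| * Euclidean distance between slice rows i and j."""
        diff = slice_image[i].astype(np.float64) - slice_image[j].astype(np.float64)
        return abs(i - j) * float(np.linalg.norm(diff))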
The cluster centers are updated by a trial method: each slice image in a category is tried as the cluster center in turn, the distances from the remaining slices of the category to that candidate are computed and accumulated, and the candidate with the minimum accumulated distance becomes the updated cluster center.
In this example, the frame numbers of the updated cluster centers are 10, 35, 65, and 75.
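A minimal sketch of this trial-based update (assumed names; members is the list of frame numbers currently assigned to the category, and slice_distance is the sketch given above):

    def update_center(slice_image, members):
        """Step 5.2: try every member as the center; keep the one with the minimum accumulated distance."""
        best_center, best_cost = members[0], float("inf")
        for candidate in members:
            cost = sum(slice_distance(slice_image, candidate, m) for m in members if m != candidate)
            if cost < best_cost:
                best_center, best_cost = candidate, cost
        return best_center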
Step 5.3, adjusting the boundary between two adjacent categories according to the new cluster centers.
For the slice images lying between two adjacent cluster centers along the time direction, the distances to the cluster centers are recomputed, and the boundary between the two categories is adjusted accordingly.
In this example, the centers of the first and second classes are initially frame 10 and frame 35, and the boundary between them lies between frame 19 (end of the first class) and frame 20 (start of the second class); all frames from 11 to 34 are searched from both ends toward the middle, i.e., downward from frame 11 on the first-class side and upward from frame 34 on the second-class side, until the two searches meet.
First, the distance between frame 11 (upper side) and the first-class center, frame 10, is computed; assume it is 5, so the upper accumulated distance is 5.
Then the distance between frame 34 (lower side) and the second-class center, frame 35, is computed; assume it is 10, so the lower accumulated distance is 10.
Since the upper accumulated distance (5) is smaller than the lower accumulated distance (10), frame 12 is taken next on the upper side.
The distance between frame 12 and the first-class center, frame 10, is computed; assume it is 2, so the upper accumulated distance becomes 5 + 2 = 7.
Since the upper accumulated distance (7) is still smaller than the lower accumulated distance (10), frame 13 is taken next on the upper side, and so on.
The new boundary between the first and second classes is finally obtained; assume it falls between frame 23 (first class) and frame 24 (second class).
The boundaries between the second and third classes and between the third and fourth classes are computed in the same way.
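The opposite-direction search can be sketched as follows (assumed names; upper_c and lower_c are the frame numbers of two adjacent cluster centers, and the return value is the last frame kept by the upper category):

    def adjust_boundary(slice_image, upper_c, lower_c):
        """Step 5.3: walk inward from both centers, always extending the side whose accumulated
        distance is smaller, until every slice between the centers has been assigned.
        Assumes at least two slices lie between the two centers, as in the example above."""
        up, down = upper_c + 1, lower_c - 1                       # first slice below / above each center
        up_acc = slice_distance(slice_image, upper_c, up)         # upper accumulated distance
        down_acc = slice_distance(slice_image, lower_c, down)     # lower accumulated distance
        while down - up > 1:                                      # slices strictly between `up` and `down` remain
            if up_acc <= down_acc:
                up += 1
                up_acc += slice_distance(slice_image, upper_c, up)
            else:
                down -= 1
                down_acc += slice_distance(slice_image, lower_c, down)
        return up              # frames <= up join the upper category; frames >= down join the lower category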
Step 5.4, repeating steps 5.2 and 5.3 following the iterative scheme of K-means clustering to obtain the clustering result and the key frame candidates.
In this example, the frame ranges of the four resulting classes are [0 to 25], [26 to 28], [29 to 47], and [48 to 79].
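Tying the pieces together, a hedged sketch of the whole of step 5, with even initialization followed by alternating center updates and boundary adjustments (the names and the fixed iteration cap are assumptions):

    def cluster_slices(slice_image, k, max_iter=20):
        """Steps 5.1-5.4: even split, then alternate the 5.2 center update and the 5.3 boundary adjustment."""
        n = slice_image.shape[0]
        edges = [round(n * t / k) for t in range(k + 1)]               # step 5.1: even split of frames 0..n-1
        segments = [(edges[t], edges[t + 1] - 1) for t in range(k)]
        centers = [0] * k
        for _ in range(max_iter):
            # step 5.2: recompute the center of each category with the trial method
            centers = [update_center(slice_image, list(range(s, e + 1))) for s, e in segments]
            # step 5.3: adjust every boundary between adjacent cluster centers
            new_segments = list(segments)
            for t in range(k - 1):
                b = adjust_boundary(slice_image, centers[t], centers[t + 1])
                new_segments[t] = (new_segments[t][0], b)
                new_segments[t + 1] = (b + 1, new_segments[t + 1][1])
            if new_segments == segments:                               # step 5.4: stop when the boundaries stabilize
                break
            segments = new_segments
        return segments, centers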
Step 6, merging classes with too few frames and readjusting the class boundaries to obtain the final video content key frames.
In this example, the minimum-frame-count threshold is set to 4; since the second class contains only 3 frames, which is below the threshold, it must be merged.
The boundary between the neighbouring classes is then adjusted according to step 5.3, giving the final frame ranges of the three classes: [0 to 27], [28 to 47], and [48 to 79].
The frame numbers of the three cluster centers, updated according to step 5.2, are 8, 37, and 73.
This is the final video content key frame sequence.
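Finally, a hedged sketch of step 6 (assumed names; segments holds each category's first and last frame numbers and centers the center frame numbers; how a small first or last category is handled is not spelled out in the patent, so it is simply folded into its single neighbour here):

    def merge_small_clusters(slice_image, segments, centers, min_frames=4):
        """Step 6: remove categories with fewer than `min_frames` frames, re-split their
        frames between the neighbouring categories, then refresh the remaining centers."""
        segments, centers = list(segments), list(centers)
        i = 0
        while i < len(segments):
            start, end = segments[i]
            if end - start + 1 >= min_frames or len(segments) == 1:
                i += 1
                continue
            if i == 0:                                   # fold a small first category into the next one
                segments[1] = (start, segments[1][1])
            elif i == len(segments) - 1:                 # fold a small last category into the previous one
                segments[i - 1] = (segments[i - 1][0], end)
            else:                                        # middle: re-split between both neighbours (step 5.3)
                b = adjust_boundary(slice_image, centers[i - 1], centers[i + 1])
                segments[i - 1] = (segments[i - 1][0], b)
                segments[i + 1] = (b + 1, segments[i + 1][1])
            del segments[i], centers[i]
        # refresh each remaining center with the trial method of step 5.2
        centers = [update_center(slice_image, list(range(s, e + 1))) for s, e in segments]
        return segments, centers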
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, some or all of the technical features may be replaced by equivalents, or the order of the steps may be changed, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions. The values of the various thresholds and ranges may vary depending on the particular situation.

Claims (5)

1. A method for extracting a content key frame of a motion video is characterized by comprising the following steps:
step 1, calculating an activity intensity map of the motion video;
step 2, taking a row with the maximum gray value in the motion video activity intensity map as a space-time slice position row;
step 3, performing horizontal slicing operation on each frame of video image at the position line of the space-time slice to obtain a space-time slice image;
step 4, splicing the time-space slice images to form transverse video slice images;
step 5, clustering the transverse video slice images along the time direction by using a K-means clustering algorithm;
and step 6, merging classes with too few frames and readjusting the boundaries to obtain the final video content key frames.
2. The method of claim 1, wherein said step 1 comprises:
step 1.1, converting each frame of the motion video to grayscale;
step 1.2, starting from the 2nd frame, computing the per-pixel absolute gray-level difference from the previous frame to obtain a difference image;
step 1.3, thresholding the difference image, setting to 0 any pixel whose gray value is below the threshold;
and step 1.4, accumulating the difference images pixel by pixel and normalizing to obtain the activity intensity map of the motion video.
3. The method of claim 1, wherein said step 2 comprises:
accumulating the gray values of the main area of the activity intensity map row by row, and taking the row with the maximum accumulated gray value as the spatio-temporal slice position row;
optionally, taking the column with the maximum gray value in the activity intensity map as the spatio-temporal slice position column, with vertical slicing subsequently performed column by column.
4. The method of claim 1, wherein said step 4 comprises:
for horizontal spatio-temporal slice images, stitching them top to bottom in the frame order in which the video is unfolded, forming a transverse video slice image that runs from top to bottom along the time direction;
optionally, for vertical spatio-temporal slice images, stitching them left to right in the frame order in which the video is unfolded, forming a vertical video slice image that runs from left to right along the time direction.
5. The method of claim 1, wherein said step 5 comprises:
step 5.1, evenly distributing the slices of the transverse video slice image among the categories according to the set number of cluster centers;
step 5.2, according to the clustering result, recomputing and updating the cluster center of each category;
step 5.3, recomputing the distances between the slice images lying between adjacent cluster centers and those centers, and adjusting the corresponding boundaries;
step 5.4, repeating steps 5.2 and 5.3 following the K-means clustering scheme until the clustering result and the key frame candidates are obtained;
wherein the distance between a slice image and a cluster center is the distance between the two slice images, represented by the product of the Euclidean distance between the slice images and their inter-frame distance along the time direction.
CN202110749819.8A 2021-07-02 2021-07-02 Method for extracting content key frame of motion video Pending CN113591587A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110749819.8A CN113591587A (en) 2021-07-02 2021-07-02 Method for extracting content key frame of motion video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110749819.8A CN113591587A (en) 2021-07-02 2021-07-02 Method for extracting content key frame of motion video

Publications (1)

Publication Number Publication Date
CN113591587A 2021-11-02

Family

ID=78245573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110749819.8A Pending CN113591587A (en) 2021-07-02 2021-07-02 Method for extracting content key frame of motion video

Country Status (1)

Country Link
CN (1) CN113591587A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014022254A2 (en) * 2012-08-03 2014-02-06 Eastman Kodak Company Identifying key frames using group sparsity analysis
CN105139421A (en) * 2015-08-14 2015-12-09 西安西拓电气股份有限公司 Video key frame extracting method of electric power system based on amount of mutual information
CN109858406A (en) * 2019-01-17 2019-06-07 西北大学 A kind of extraction method of key frame based on artis information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SATYABRATA MAITY et al.: "A Novel Approach for Human Action Recognition from Silhouette Images" *
向东 et al.: "Dynamic video key frame extraction model based on improved K-Means" *
孙淑敏 et al.: "Key frame extraction based on an improved K-means algorithm" *
韩震博: "Key frame extraction algorithm based on bidirectional adaptive spatio-temporal slicing" *

Similar Documents

Publication Publication Date Title
US6904159B2 (en) Identifying moving objects in a video using volume growing and change detection masks
JP4976608B2 (en) How to automatically classify images into events
JP4907938B2 (en) Method of representing at least one image and group of images, representation of image or group of images, method of comparing images and / or groups of images, method of encoding images or group of images, method of decoding images or sequence of images, code Use of structured data, apparatus for representing an image or group of images, apparatus for comparing images and / or group of images, computer program, system, and computer-readable storage medium
US20130004081A1 (en) Image recognition device, image recognizing method, storage medium that stores computer program for image recognition
CN106937114B (en) Method and device for detecting video scene switching
CN106446015A (en) Video content access prediction and recommendation method based on user behavior preference
CN108629783B (en) Image segmentation method, system and medium based on image feature density peak search
CN110866896B (en) Image saliency target detection method based on k-means and level set super-pixel segmentation
JP2003016448A (en) Event clustering of images using foreground/background segmentation
JP5097280B2 (en) Method and apparatus for representing, comparing and retrieving images and image groups, program, and computer-readable storage medium
US8320664B2 (en) Methods of representing and analysing images
WO2009143279A1 (en) Automatic tracking of people and bodies in video
CN109446967B (en) Face detection method and system based on compressed information
CN113112519B (en) Key frame screening method based on interested target distribution
CN111797707B (en) Clustering-based shot key frame extraction method
CN111583279A (en) Super-pixel image segmentation method based on PCBA
WO2017088479A1 (en) Method of identifying digital on-screen graphic and device
Shi et al. Adaptive graph cut based binarization of video text images
CN108710883B (en) Complete salient object detection method adopting contour detection
CN110188625B (en) Video fine structuring method based on multi-feature fusion
CN111222508A (en) ROI-based house type graph scale identification method and device and computer equipment
Cózar et al. Logotype detection to support semantic-based video annotation
US20030179824A1 (en) Hierarchical video object segmentation based on MPEG standard
KR20030027953A (en) Automatic natural content detection in video information
EP2325801A2 (en) Methods of representing and analysing images

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 2021-11-02)