CN108737814B - Video shot detection method based on dynamic mode decomposition

Video shot detection method based on dynamic mode decomposition

Info

Publication number: CN108737814B
Authority: CN (China)
Prior art keywords: mode, video, amplitude, background, matrix
Legal status: Active
Application number: CN201810049786.4A
Other languages: Chinese (zh)
Other versions: CN108737814A
Inventors: 毕重科 (Bi Chongke), 原野 (Yuan Ye), 张加万 (Zhang Jiawan)
Assignee (current and original): Tianjin University
Application filed by Tianjin University; priority to CN201810049786.4A
Publication of application CN108737814A; application granted and published as CN108737814B

Classifications

    • H04N 17/002: Diagnosis, testing or measuring for television systems or their details, for television cameras
    • G06V 20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/49: Scenes; scene-specific elements in video content; segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

The invention discloses a video shot detection method based on dynamic mode decomposition, which comprises the following steps: step one, acquiring image data from a video shot to establish a time-series matrix $X_1^N$; step two, solving the time-series matrix through dynamic mode decomposition to obtain a linear correlation coefficient matrix S of two consecutive frames; step three, solving a foreground mode and a background mode in the video according to the linear correlation coefficient matrix S; step four, respectively solving the amplitudes $A_{amp}(t)$ of the foreground mode and the background mode according to the linear correlation coefficient matrix S; step five, judging whether the amplitude of the background mode exceeds a preset threshold; if it does, outputting the shot boundary, otherwise returning to step one. The method amplifies the temporal feature weight of the background (or foreground) mode, reduces the influence of noise (brightness, texture, and the like) on shot detection, and effectively reduces missed detections and false detections.

Description

Video shot detection method based on dynamic mode decomposition
Technical Field
The invention relates to the technical field of video shot detection methods, in particular to a video shot detection method based on dynamic mode decomposition.
Background
Most videos, such as movies and documentaries, are composed of stories, scenes, and shots: a story contains multiple scenes, and one scene contains multiple shots. Directors typically use shot transitions to switch between shots more smoothly, making the video more coherent. Common shot transitions include hard cut, fade, and dissolve, of which the hard cut is the most common. In a fade, the transition is very slow: the scene and the subject in the shot hardly change, and only the luminance changes slowly, so it is difficult to detect. In a dissolve, the image of the following shot is superimposed on the image of the preceding shot; at the boundary between the two shots the visual features of both overlap, so conventional methods produce erroneous detection results.
At present, the main approaches to shot detection are as follows. The first detects shot boundaries by comparing pixels, pixel blocks, or color histograms. The second performs shot detection from video content such as texture, color, and shape. In addition, shot boundaries can be detected using cosine similarity, weighted edge information, machine learning, and so on. The main problems of these methods are: first, in shot detection based on pixel, pixel-block, and histogram comparison, a drastic illumination change at a non-boundary frame easily produces erroneous results. Second, when the object (or camera) moves fast, the accuracy of the algorithm is greatly reduced. Third, because the images of the preceding and following shots overlap during a dissolve transition, shot detection based on video content such as texture, color, and shape has low accuracy. Finally, methods based on cosine similarity, weighted edge information, machine learning, and the like have low accuracy and recall when processing dissolve and fade transitions.
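For reference, a minimal numpy sketch (ours, not part of the patent) of the histogram-comparison baseline described above, showing how a pure illumination change already inflates the frame-to-frame distance and can trigger a false boundary:

```python
import numpy as np

def histogram_difference(frame_a: np.ndarray, frame_b: np.ndarray, bins: int = 64) -> float:
    """Chi-square-style distance between grayscale histograms of two frames."""
    ha, _ = np.histogram(frame_a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(frame_b, bins=bins, range=(0, 255), density=True)
    return float(np.sum((ha - hb) ** 2 / (ha + hb + 1e-12)))

# Same scene, but a lamp switches on: the histogram shifts and the distance
# grows, even though there is no shot boundary.
rng = np.random.default_rng(0)
scene = rng.integers(0, 200, size=(120, 160)).astype(np.float64)
brighter = np.clip(scene + 50, 0, 255)
print(histogram_difference(scene, brighter))  # large value -> false boundary
```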
Therefore, a method is needed whose shot-boundary feature weight remains stable in videos without shot transitions, suppressing false detections, while yielding a significantly changed and easily recognized feature weight in videos with shot transitions.
The problems of existing algorithms can be summarized in two points. First, in videos without shot transitions, drastic illumination changes or fast motion of the photographed object (or camera) can cause existing algorithms to report false detections, resulting in low accuracy. Second, in videos with shot transitions, a too-small color difference between scenes, slowly changing light (fade), or overlapping foreground objects (dissolve) can cause existing algorithms to miss many boundaries, resulting in low recall.
Disclosure of Invention
The invention aims to overcome the defects in the background art and provides a video shot detection method based on dynamic mode decomposition that amplifies the temporal feature weight of the background (or foreground) mode, reduces the influence of noise (brightness, texture, and the like) on shot detection, effectively reduces missed detections and false detections, and at the same time achieves high accuracy and recall in shot boundary detection.
The technical scheme of the invention is as follows:
a video shot detection method based on dynamic mode decomposition comprises the following steps:
step one, acquiring image data from a video shot to establish a time-series matrix $X_1^N$;
step two, solving the time-series matrix through dynamic mode decomposition to obtain a linear correlation coefficient matrix S of two consecutive frames;
step three, solving a foreground mode and a background mode in the video according to the linear correlation coefficient matrix S;
step four, respectively solving the amplitudes $A_{amp}(t)$ of the foreground mode and the background mode according to the linear correlation coefficient matrix S;
step five, judging whether the amplitude of the background mode exceeds a preset threshold; if it does, outputting the shot boundary, and otherwise returning to step one.
In step two, the linear correlation coefficient matrix S is

$$S = U^* X_2^N V \Sigma^{-1}$$
The background mode in step three is

$$x_{\text{background}}(t) = a_p \varphi_p e^{\omega_p t}, \qquad \|\omega_p\| \approx 0$$
The foreground mode in step three is

$$x_{\text{foreground}}(t) = \sum_{j \neq p} a_j \varphi_j e^{\omega_j t}$$
The amplitude in step four is

$$A_{amp}(t) = a_j e^{\omega_j t}, \qquad j = 1, \ldots, r$$
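For illustration, the five steps above can be sketched end-to-end in a few lines of numpy. This is a minimal, illustrative reading of the method, not the patented implementation: frame decoding, truncation of small singular values, and the calibrated threshold are assumed to be handled by the caller, and $\Delta t$ is taken as one frame.

```python
import numpy as np

def detect_shot_boundary(frames: np.ndarray, threshold: float) -> bool:
    """Run steps one to five on one window; `frames` holds N flattened frames
    as columns, shape (pixels, N). Returns True if the last frame is a boundary."""
    X1, X2 = frames[:, :-1], frames[:, 1:]                  # step one: X_1^{N-1}, X_2^N
    U, sig, Vh = np.linalg.svd(X1, full_matrices=False)     # economy SVD of X_1^{N-1}
    S = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / sig)  # step two (truncate tiny sig in practice)
    mu, Y = np.linalg.eig(S)                                # S y_j = mu_j y_j
    omega = np.log(mu.astype(complex))                      # Fourier modes, Delta t = 1 frame
    p = int(np.argmin(np.abs(omega)))                       # step three: background, |omega_p| ~ 0
    Phi = U @ Y                                             # DMD modes phi_j = U y_j
    a = np.linalg.lstsq(Phi, frames[:, 0].astype(complex), rcond=None)[0]
    t = np.arange(frames.shape[1])
    amp_bg = np.abs(a[p] * np.exp(omega[p] * t))            # step four: background amplitude
    return bool(abs(amp_bg[-1] - np.mean(amp_bg[:-1])) > threshold)  # step five
```

Sliding this window over the video and reporting the final frame whenever the test fires reproduces the loop of steps one through five.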
Compared with the prior art, the invention has the advantages that:
The invention treats the video data as a matrix data sequence and extracts temporal features directly from the matrix data, as shown in fig. 4. A background mode and a series of foreground modes in the video are then extracted using dynamic mode decomposition, and the background (or foreground) of each frame can be recovered from the background (or foreground) mode and its corresponding amplitude. The invention can easily be extended to other types of shot transitions.
Second, the invention uses the amplitude of the background mode to detect shot boundaries. For video data without shot boundaries, the background mode is relatively constant and its amplitude is relatively stable; in contrast, when a shot boundary exists in the video, the background mode and the corresponding amplitude change drastically, as shown in fig. 5. This improves the accuracy of shot detection.
Third, the invention selects a color space suited to dynamic mode decomposition (as shown in FIG. 6), which reduces missed detections and false detections.
Fourth, the invention was tested on 150 minutes of video (including hard cuts, dissolves, and fades), as shown in fig. 2 and fig. 3. The experiments verify the effectiveness of the method and show improved recall and accuracy of shot detection; for arbitrarily complex movie sequences the method delivers constant detection quality without the need to adjust parameters.
Drawings
Fig. 1 is a flowchart of a video shot detection method based on dynamic mode decomposition according to the present invention.
Fig. 2 shows the three shot transition types in the present invention: hard cut, fade, and dissolve.
Fig. 3 is a video shot detection framework in the present invention.
FIG. 4 shows a dynamic mode algorithm of the present invention decomposing a video into a mode matrix, a temporal feature matrix, and a weight matrix.
FIG. 5 shows the analysis in the present invention that if the amplitude of the background mode changes dramatically, the final frame of the video window is a shot boundary.
Fig. 6 shows that the interference of noise with shot detection can be greatly reduced by using a suitable color space for the dynamic mode decomposition in the invention.
Fig. 7 shows the video without shot boundaries but with a strong illumination change and the amplitude detection result thereof.
Fig. 8 shows a video without shot boundaries but with a foreground object moving rapidly and its amplitude detection results.
FIG. 9 shows the hard-cut shot transition and its amplitude detection result.
Fig. 10 shows the fade shot transition and its amplitude detection result.
Fig. 11 shows the dissolve shot transition and its amplitude detection result.
FIG. 12 is a diagram illustrating the boundary judgment of the preferred embodiment, i.e. determining shot boundaries by comparing the amplitude characteristics of the background mode. (a) Video data containing a hard cut. (b) Amplitudes extracted from the video by the dynamic mode decomposition method; the red line represents the background-mode amplitude and the other colored lines represent the foreground-mode amplitudes.
Fig. 13 is a diagram of increasing the temporal feature weights using different color spaces in the preferred embodiment. (a) Video data without shot boundaries. (b) Amplitude lines extracted from the video by the dynamic mode decomposition method in grayscale. (c) Amplitude lines extracted from the video in the HSV color space by the DMD method. In (b) and (c), the red line represents the background-mode amplitude and the other colored lines represent the foreground-mode amplitudes.
Detailed Description
The invention is further illustrated by the following specific examples and the accompanying drawings. The examples are intended to help those skilled in the art understand the invention better and are not intended to limit it in any way.
As shown in fig. 1, the present invention provides a video shot detection method based on dynamic mode decomposition, which includes the following steps. Step one 101: acquiring image data from a video shot to establish a time-series matrix $X_1^N$.
Video is data with strong spatio-temporal correlation, and each video shot can be viewed as a sequence of snapshots of a potentially complex non-linear dynamical system. Dynamic mode decomposition is a mathematical method whose focus is to find coherent spatio-temporal patterns in high-dimensional data from complex systems with temporal dynamics. Its strength is to discover and exploit background changes in complex systems, which is the key to solving shot detection.
The video background mode is extracted using dynamic mode decomposition. First, the video stream is defined as data uniformly sampled over N frames, from $x_1$ to $x_N$, with time interval $\Delta t$. The temporal representation of the video data is:

$$X_1^N = [x_1, x_2, \ldots, x_N]$$

where $x_i$ ($i \le N$) is the image of the i-th frame of the video. Assuming a linear mapping process, the Koopman operator A maps the data at the i-th time step to the (i+1)-th:

$$x_{i+1} = A x_i$$

As a result,

$$X_2^N = A X_1^{N-1} = [Ax_1, Ax_2, \ldots, Ax_{N-1}]$$

Taking the singular value decomposition of $X_1^{N-1}$ yields

$$X_1^{N-1} = U \Sigma V^*$$

where U is a unitary matrix ($U \in \mathbb{C}^{m \times l}$), Σ is a diagonal matrix ($\Sigma \in \mathbb{C}^{l \times l}$), and V is a unitary matrix ($V \in \mathbb{C}^{(n-1) \times l}$). The parameter l is the rank of $X_1^{N-1}$.
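A minimal numpy sketch of this step, assuming the frames have already been decoded to grayscale arrays (the function names are ours, for illustration only):

```python
import numpy as np

def build_snapshot_matrices(frames):
    """Flatten N grayscale frames of shape (h, w) into columns of X, then split
    into X_1^{N-1} = [x_1 ... x_{N-1}] and X_2^N = [x_2 ... x_N]."""
    X = np.stack([f.ravel().astype(np.float64) for f in frames], axis=1)
    return X[:, :-1], X[:, 1:]

def truncated_svd(X1, l=None):
    """Economy SVD X_1^{N-1} = U Sigma V*, optionally truncated to rank l."""
    U, sig, Vh = np.linalg.svd(X1, full_matrices=False)
    if l is not None:
        U, sig, Vh = U[:, :l], sig[:l], Vh[:l, :]
    return U, sig, Vh
```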
Step two 102: solving the time-series matrix through dynamic mode decomposition to obtain a linear correlation coefficient matrix S of two consecutive frames. Substituting the singular value decomposition into $X_2^N = A X_1^{N-1}$ and applying the similarity transformation $V \Sigma^{-1}$, the transformation matrix S can be derived as:

$$S = U^* X_2^N V \Sigma^{-1}$$

The basic idea of the dynamic mode decomposition algorithm is the eigendecomposition of S:

$$S y_j = \mu_j y_j$$

The eigenvalues of the S matrix approximate those of the Koopman operator A, analogous to the Ritz values computed in the Arnoldi algorithm.
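The corresponding computation, sketched in numpy (illustrative; small singular values should be truncated before inverting Σ):

```python
import numpy as np

def linear_correlation_matrix(U, sig, Vh, X2):
    """S = U* X_2^N V Sigma^{-1}; its eigenvalues approximate those of the
    Koopman operator A (the Ritz-value analogy in the text)."""
    S = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / sig)
    mu, Y = np.linalg.eig(S)  # mu_j, y_j with S y_j = mu_j y_j
    return S, mu, Y
```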
Step three 103, solving a foreground mode and a background mode in the video according to the linear correlation coefficient matrix S;
The dynamic modes $\varphi_j$ of the dynamic mode decomposition are expressed as:

$$\varphi_j = U y_j$$

In addition, the eigenvalues are transformed into Fourier modes to predict the temporal dynamics:

$$\omega_j = \ln(\mu_j)/\Delta t$$

The real part of $\omega_j$ corresponds to the growth or decay of the dynamic mode decomposition basis function; the imaginary part of $\omega_j$ corresponds to the oscillation of the dynamic mode. By $X_{DMD}(t) = A^t x_1$ the video can be reconstructed as

$$X_{DMD}(t) = \sum_{j=1}^{r} a_j \varphi_j e^{\omega_j t}$$

Obviously, if the video does not change over time, or changes very slowly, the corresponding Fourier mode $\omega_j$ lies near the origin of the complex plane, i.e. $\|\omega_j\| \approx 0$. Thus, the background mode represents a relatively static scene in the video, while the foreground modes represent objects or scenes in relative motion. A background mode and foreground modes can therefore be obtained, i.e.

$$x_{\text{background}}(t) = a_p \varphi_p e^{\omega_p t}, \qquad \|\omega_p\| \approx 0$$

$$x_{\text{foreground}}(t) = \sum_{j \neq p} a_j \varphi_j e^{\omega_j t}$$
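A sketch of the mode selection under the stated criterion $\|\omega_j\| \approx 0$, with $\Delta t$ defaulting to one frame (an assumption on our part):

```python
import numpy as np

def split_background_foreground(U, Y, mu, dt=1.0):
    """Compute phi_j = U y_j and omega_j = ln(mu_j)/dt, then take the mode with
    ||omega_j|| closest to zero as the background mode."""
    Phi = U @ Y
    omega = np.log(mu.astype(complex)) / dt
    p = int(np.argmin(np.abs(omega)))              # background mode index
    foreground = [j for j in range(omega.size) if j != p]
    return Phi, omega, p, foreground
```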
Step four 104: respectively solving the amplitudes $A_{amp}(t)$ of the foreground mode and the background mode according to the linear correlation coefficient matrix S:

$$A_{amp}(t) = a_j e^{\omega_j t}, \qquad j = 1, \ldots, r$$

In the formula for $A_{amp}(t)$, the temporal characteristics $\omega_j$ are obtained by calculating the temporal background (foreground) modes of the spatio-temporal feature matrix; they are the features of the background and foreground in the video. The weight matrix represents the weight of each frame of the image in the different modes, including the weight of the background mode and the weights of the foreground modes. In general, the amplitude varies with time.
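One plausible reading of this step in numpy, with the initial coefficients $a_j$ fitted from the first frame of the window (the fitting step is our assumption; the patent does not spell it out):

```python
import numpy as np

def mode_amplitudes(Phi, omega, x1, n_frames):
    """Temporal weights |a_j exp(omega_j t)| for t = 0..n_frames-1, one row per
    mode; the coefficients a_j are fitted from the first frame x1."""
    a = np.linalg.lstsq(Phi, x1.astype(complex), rcond=None)[0]
    t = np.arange(n_frames)
    return np.abs(a[:, None] * np.exp(np.outer(omega, t)))  # shape (modes, frames)
```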
Step five 105, judging whether the amplitude of the background mode exceeds a preset threshold value; and if the amplitude of the background mode exceeds a preset threshold value, outputting the shot boundary, and otherwise, returning to the step one.
The judgment in step five compares the amplitude characteristics of the background mode. Dynamic mode decomposition extracts the background and foreground modes in a video: the background mode represents a relatively static background or scene, while the foreground modes represent relatively moving or changing objects or scenes in the same video. In general, one background mode and several foreground modes can be extracted from a video. The amplitude is the temporal feature weight of the background (or foreground) mode and represents the degree of variation of the background (or foreground) in the video. In video data containing a boundary, the background mode and its amplitude change drastically. The background-mode amplitude is compared with a threshold: if it exceeds the threshold, a shot boundary is judged to exist; if not, the video contains no shot boundary. The threshold is an average value obtained by calculating the background-mode amplitudes of 30 sets of videos in which no shot boundary exists.
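A hedged sketch of the decision rule and of the threshold calibration over boundary-free clips; the exact statistic averaged over the 30 clips is not specified in the text, so the deviation measure below is an assumption:

```python
import numpy as np

def calibrate_threshold(background_amplitudes):
    """Average final-frame amplitude deviation over clips known to contain no
    shot boundary (the text uses 30 such clips)."""
    return float(np.mean([abs(a[-1] - np.mean(a[:-1]))
                          for a in background_amplitudes]))

def is_shot_boundary(bg_amplitude, threshold):
    """Flag the window's last frame when the background-mode amplitude jumps."""
    return bool(abs(bg_amplitude[-1] - np.mean(bg_amplitude[:-1])) > threshold)
```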
Fig. 7 shows a set of video shots containing nine frames. There are no shot boundaries in these nine frames; only part of the illumination changes. Decomposing the video by dynamic mode decomposition shows that the background mode is relatively constant from frame 434 to frame 442. The pixel-block method, however, erroneously recognizes frames 437 and 440 as shot boundaries.
Fig. 8 shows a set of video shots containing nine frames. In this video an owl flies quickly out of the shot, but there is no shot boundary. Decomposing the video by dynamic mode decomposition shows that the amplitude of the background mode is relatively stable. The color-block algorithm, by contrast, misidentifies frames 6730 and 6732 as shot boundaries, because the image features of the owl change drastically between the two images.
Fig. 9 shows a set of video shots containing nine frames with a hard-cut shot transition; the last frame, 1785, is a shot boundary. Processing the video through dynamic mode decomposition shows that the amplitude of the background mode from frame 1777 to frame 1784 is very stable, while from frame 1777 to frame 1785 it changes very sharply. Therefore, frame 1785 is determined to be a shot boundary.
Fig. 10 shows a set of fade video shots containing nine frames; over time, the video image gradually darkens. Computing the video using dynamic mode decomposition shows that the background-mode amplitude from frame 1921 to frame 1927 is relatively constant, whereas from frame 1921 to frame 1928 it changes very sharply. Frame 1928 is therefore a shot boundary.
Fig. 11 shows a group of dissolve video shots. The amplitude of the background mode does not change dramatically when computed over frames 447 to 454, but a significant change appears when the computation is extended to frame 455. Thus, a shot boundary exists at frame 455.
Fig. 12 shows a set of video shots containing twelve frames. There is no shot boundary in the first eleven frames; the video depicts seawater washing over the coast. The sky is the background and does not change, so the shots do not switch either. Decomposing the video by dynamic mode decomposition shows that the amplitude of the background mode is constant over time, corresponding to the background pattern, i.e. the sky. The foreground modes correspond to the waves, i.e. the foreground elements in the shot, as the camera moves. However, there is a shot boundary at the last frame: the amplitudes from frame 5162 to frame 5172 are very flat, while the amplitudes from frame 5162 to frame 5173 change very sharply. Thus, frame 5173 is a shot boundary.
As shown in fig. 13, the luminance variation of the original video frames is not significant. However, when the RGB image is converted into a grayscale image, the amplitude of the background mode computed from the brightness changes varies dramatically. Therefore, the effect of brightness variation needs to be reduced through another color space. The HSV color space is selected, and only the hue and saturation channels are used to compute the background features, which reduces the noise in shot detection. In fig. 13 it can be seen that, after changing the color space to HSV, the amplitude of the background mode in a video without shot boundaries is very flat, while the detection condition of a drastic change in background-mode amplitude is still satisfied at shot boundaries. Using this color space eliminates the interference of noise with shot detection and improves the accuracy and recall of shot detection.
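A minimal sketch of the color-space step, using matplotlib's rgb_to_hsv and keeping only the hue and saturation channels as the per-frame feature vector (the function name is ours):

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def hs_features(frame_rgb):
    """Drop the value (brightness) channel so slow luminance drift does not
    move the data; return a flattened hue/saturation feature vector."""
    hsv = rgb_to_hsv(frame_rgb.astype(np.float64) / 255.0)  # (..., 3) in [0, 1]
    return hsv[..., :2].ravel()                             # H and S only
```

Feeding these vectors, instead of grayscale pixels, into the snapshot matrix of step one is what flattens the background-mode amplitude in fig. 13(c).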
The invention uses different color spaces to increase the temporal feature weight, thereby reducing the influence of brightness on shot detection. The traditional approach is to compare the luminance difference between adjacent video frames to determine shot boundaries; however, when the illumination changes drastically, the accuracy of such algorithms drops sharply. Thus, when the brightness of two adjacent frames changes noticeably during a shot transition, conventional shot detection algorithms generate noise. The brightness-noise problem persists when the shot boundary is solved by dynamic mode decomposition on grayscale input: when the original video image in the RGB color space is converted into a grayscale image and the background features are calculated by dynamic mode decomposition, the background features sometimes change very dramatically even though the video contains no shot transition and the brightness changes only slightly.
It should be understood that the embodiments and examples discussed herein are illustrative only and that modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the purview of the appended claims.

Claims (3)

1. A video shot detection method based on dynamic mode decomposition is characterized by comprising the following steps:
step one, acquiring time-series image data from a video shot to establish a time-series matrix $X_1^N$;
step two, solving the time-series matrix through dynamic mode decomposition to obtain a linear correlation coefficient matrix S of two consecutive frames, namely

$$S = U^* X_2^N V \Sigma^{-1}$$

wherein U is a unitary matrix, Σ is a diagonal matrix, V is a unitary matrix, and $X_2^N$ is the time-series matrix;
step three, solving a series of foreground modes in the video, namely the scenes in relative motion, and the background mode in the video, namely the relatively static scene, according to the linear correlation coefficient matrix S;
step four, respectively solving the amplitudes $A_{amp}(t)$ of the foreground mode and the background mode according to the linear correlation coefficient matrix S, namely

$$A_{amp}(t) = a_j e^{\omega_j t}, \qquad j = 1, \ldots, r$$

wherein $A_{amp}(t)$ denotes the amplitude, $\omega_j$ denotes the temporal characteristic of the background and foreground in the video, r denotes the number of dynamic modes $\mu_j$ to be used, $a_j$ is the coefficient corresponding to $\mu_j$, and t denotes time;
step five, judging whether the amplitude of the background mode exceeds a preset threshold value or not; and if the amplitude of the background mode exceeds a preset threshold value, outputting a video end frame as a shot boundary, and otherwise, returning to the step one.
2. The video shot detection method based on dynamic mode decomposition according to claim 1, wherein the background mode in step three is

$$x_{\text{background}}(t) = a_p \varphi_p e^{\omega_p t}$$

wherein $\varphi_j$ denotes the dynamic modes of the dynamic mode decomposition, $a_j$ is the coefficient corresponding to $\varphi_j$, the real part of $\omega_j$ corresponds to the growth or decay of the dynamic mode decomposition basis function, the imaginary part of $\omega_j$ corresponds to the oscillation of the dynamic mode, $\|\omega_{j=p}\| \approx 0$ where p is the low-rank (background) mode, and t denotes the time of the time-series matrix.
3. The method as claimed in claim 1, wherein the foreground mode in step three is

$$x_{\text{foreground}}(t) = \sum_{j \neq p} a_j \varphi_j e^{\omega_j t}$$

wherein r denotes the number of dynamic modes $\varphi_j$ to be used, $a_j$ is the coefficient corresponding to $\varphi_j$, $\varphi_j$ denotes the dynamic modes of the dynamic mode decomposition, the real part of $\omega_j$ corresponds to the growth or decay of the dynamic mode decomposition basis function, the imaginary part of $\omega_j$ corresponds to the oscillation of the dynamic mode, $\|\omega_{j=p}\| \approx 0$ where p is the low-rank (background) mode, and t denotes the time of the time-series matrix.
CN201810049786.4A 2018-01-18 2018-01-18 Video shot detection method based on dynamic mode decomposition Active CN108737814B (en)

Priority Applications (1)

CN201810049786.4A, priority date 2018-01-18, filing date 2018-01-18: Video shot detection method based on dynamic mode decomposition

Publications (2)

CN108737814A (application), published 2018-11-02
CN108737814B (grant), published 2021-04-30

Family

ID=63940394

Family Applications (1)

CN201810049786.4A, Active: Video shot detection method based on dynamic mode decomposition (filed 2018-01-18)

Country Status (1)

CN: CN108737814B

Families Citing this family (1)

CN111105438B *, priority 2019-11-12, published 2023-06-06, 安徽大学 (Anhui University): Motion detection method based on dynamic pattern decomposition, terminal equipment and computer-readable storage medium

* Cited by examiner, † Cited by third party

Patent Citations (4)

CN101236604A *, priority 2008-01-11, published 2008-08-06, 北京航空航天大学 (Beihang University): Fast lens boundary detection method
CN102025892A *, priority 2009-09-16, published 2011-04-20, 索尼株式会社 (Sony Corporation): Lens conversion detection method and device
CN102800095A *, priority 2012-07-17, published 2012-11-28, 南京特雷多信息科技有限公司: Lens boundary detection method
CN107424163A *, priority 2017-06-09, published 2017-12-01, 广东技术师范学院: A lens boundary detection method based on TextTiling

* Cited by examiner, † Cited by third party


Non-Patent Citations (2)

Bi, Chongke, et al. "A Dynamic Mode Decomposition Based Edge Detection Method for Art Image." IEEE Photonics Journal, vol. 9, no. 3, 2017, pp. 1-14. *
Lei, Zhen, et al. "An improved dual-histogram shot boundary detection algorithm based on background color invariants" (基于背景颜色不变量的改进双直方图镜头边界探测算法). Computer Engineering and Applications (计算机工程与应用), 2003, pp. 35-37. *

* Cited by examiner, † Cited by third party



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant