CN107766838B - Video scene switching detection method - Google Patents

Video scene switching detection method

Info

Publication number
CN107766838B
CN107766838B (application CN201711089563.2A)
Authority
CN
China
Prior art keywords
video scene
detection model
scene switching
layer
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711089563.2A
Other languages
Chinese (zh)
Other versions
CN107766838A (en
Inventor
苏许臣
朱立松
黄建杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cntv Wuxi Co ltd
Original Assignee
Cntv Wuxi Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cntv Wuxi Co ltd filed Critical Cntv Wuxi Co ltd
Priority to CN201711089563.2A priority Critical patent/CN107766838B/en
Publication of CN107766838A publication Critical patent/CN107766838A/en
Application granted granted Critical
Publication of CN107766838B publication Critical patent/CN107766838B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The invention discloses a video scene switching detection method in the technical field of multimedia information processing. Detection is completed through a video scene switching detection model, covering both the training of the model and its application. The method adopts a deep learning algorithm: the model's discrimination threshold is adjusted automatically to an optimum during training, so no threshold needs to be set manually; the model input includes the frame difference between the two frames, which speeds up convergence; and because the model uses batch normalization to prevent overfitting during training, its generalization ability is improved.

Description

Video scene switching detection method
Technical Field
The invention relates to a video detection method, in particular to a video scene switching detection method, and belongs to the technical field of multimedia information processing.
Background
A video is generally composed of multiple scenes, and each scene is composed of multiple video frames. Video scene detection means finding the frames of a video at which scene switching occurs, together with their positions. The detected positions can be used for fast, frame-accurate video editing, and the sequence of detected frames can serve as a rough summary of the whole video content.
At present, conventional video scene detection methods generally rely on manually designed features: computing the color-histogram similarity of adjacent frames, directly computing the frame difference, or detecting scene switching from the degree-of-change feature VH of the high-frequency subband coefficients of each frame in a video scene, where computing the high-frequency subband coefficients requires an algorithm such as the three-dimensional wavelet transform (see, for example, Chinese patent application No. 200810118534.9). All of these techniques compute a feature value and compare it with a threshold; a frame is judged to be a switch if the feature value exceeds (or falls below) the threshold. Adaptive-threshold variants of these techniques also exist, such as the adaptive-threshold video scene change detection method of Chinese patent application No. 201410466385.0, but the size of the sliding window and the preset value B still need to be set manually.
These traditional video scene detection methods extract features with classical mathematical algorithms. The design of such algorithms is complex, and their quality determines the final accuracy. In addition, traditional algorithms cannot avoid setting various thresholds, such as a similarity threshold or a sliding-window threshold. These thresholds must be chosen from experience, and how well they are set also determines the detection accuracy.
Disclosure of Invention
The main purpose of the invention is to provide a video scene switching detection method that trains a model on a large number of pre-prepared switching and non-switching frame pairs, extracts adjacent frames of the video to be detected and feeds them into the trained model in sequence, and finds the positions of all switching frames from the model output. No threshold needs to be specified, and the accuracy is high.
The purpose of the invention can be achieved by adopting the following technical scheme:
a video scene switching detection method is characterized in that detection is completed through a video scene switching detection model, and the detection comprises training of the video scene switching detection model and application of the video scene switching detection model.
Further, the training of the video scene change detection model includes the following steps:
step 11: defining parameters of a video scene switching detection model;
step 12: constructing a video scene switching detection model;
step 13: defining a loss function, and adopting cross entropy as the loss function;
step 14: defining an optimizer and adopting an Adam optimization algorithm;
step 15: defining an evaluation function to calculate the discrimination accuracy of the video scene switching detection model;
step 16: training and evaluating the video scene switching detection model, and saving the model parameters once every 20 training iterations.
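The patent names the training components (cross-entropy loss, an Adam optimizer, an evaluation function, checkpointing every 20 iterations) but does not give an implementation. A minimal numpy sketch of steps 13, 15 and 16, with all function names hypothetical, might look like:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Step 13: mean cross-entropy loss over a batch.
    probs: (N, 2) softmax outputs; labels: (N,) integer class ids."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))

def accuracy(probs, labels):
    """Step 15: fraction of frame pairs whose argmax class matches the label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

def should_checkpoint(step, interval=20):
    """Step 16: save model parameters once every `interval` training iterations."""
    return step > 0 and step % interval == 0
```

The optimizer itself (step 14, Adam) would come from whatever deep learning framework is used; the patent does not specify one.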
Further, the application of the video scene change detection model comprises the following steps:
step 21: sequentially reading each frame of the video to be detected and resizing it to 96×96;
step 22: inputting the current frame and the previous frame into the trained video scene switching detection model to obtain the output result of the video scene switching detection model;
step 23: and if the output result of the video scene switching detection model is a switching frame, outputting the current frame sequence number and storing the frame.
Further, the video scene switching detection model comprises a PAD layer, a plurality of convolution groups, a Reshape layer, a 512-unit fully connected layer, a 2-unit fully connected layer and a Softmax layer.
Further, the convolution groups include a convolution group of 9 × 9 × 32, a convolution group of 3 × 3 × 64, and a convolution group of 5 × 5 × 128.
Further, each convolution group comprises a convolution layer, a Relu layer, a pooling layer and a batch normalization layer.
Further, the convolution kernel size of the convolution group 9 × 9 × 32 is 9 × 9, and the output feature number is 32;
the convolution kernel of the convolution group 3 × 3 × 64 is 3 × 3, and the output feature number is 64;
the convolution kernel of the convolution group 5 × 5 × 128 is 5 × 5, and the output feature number is 128;
the step size of the pooling layer is 2×2.
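The shapes implied by the three convolution groups above can be checked with a small calculation: each group's 2×2 pooling halves the spatial size (assuming the convolutions themselves are shape-preserving, which the stated 96→48→24→12 progression requires), and the channel count becomes the group's output feature number:

```python
def conv_group_output_shape(h, w, channels_out, pool_stride=2):
    """Each convolution group halves the spatial size via its 2x2 pooling
    layer and sets the channel count to its output feature number."""
    return h // pool_stride, w // pool_stride, channels_out

shape = (96, 96, 9)  # PAD layer output: X1, X2 and X1-X2 stacked
for feats in (32, 64, 128):  # the three convolution groups in order
    shape = conv_group_output_shape(shape[0], shape[1], feats)

flattened = shape[0] * shape[1] * shape[2]  # Reshape layer: 12*12*128 = 18432
```

This reproduces the 12 × 12 × 128 output and the 1 × 18432 flattened vector described in the detailed embodiment.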
Further, the input of the video scene cut detection model is a pair of image frames, denoted X1 and X2, respectively, with the size of the image being 96 × 96 × 3.
Further, detection by the video scene switching detection model includes: inputting X1, X2 and X1−X2 into the PAD layer, where the three images are stacked together into a 96 × 96 × 9 matrix; passing this matrix through the 9 × 9 × 32 convolution group, whose output is a 48 × 48 × 32 matrix; and finally computing the probabilities of the switching-frame and non-switching-frame classes with the Softmax layer, taking the larger of the two as the final judgment output.
The invention has the beneficial technical effects that: the video scene switching detection method provided by the invention adopts a deep learning algorithm, and the discrimination threshold of the model is adjusted automatically to an optimum during training, so no threshold needs to be set; the model input includes the frame difference between the two frames, which speeds up convergence; and because the model uses batch normalization to prevent overfitting during training, its generalization ability is improved.
Drawings
Fig. 1 is a schematic diagram of a model structure of a preferred embodiment of a video scene change detection method according to the present invention;
FIG. 2 is a schematic diagram of a convolution group model in accordance with a preferred embodiment of the video scene change detection method of the present invention;
fig. 3 is a flowchart of a model application of a video scene change detection method according to a preferred embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention more clear and definite for those skilled in the art, the present invention is further described in detail below with reference to the examples and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
As shown in fig. 1, fig. 2, and fig. 3, in the video scene switching detection method provided in this embodiment, detection is completed through a video scene switching detection model, which includes training of the video scene switching detection model and application of the video scene switching detection model; the training of the video scene switching detection model comprises the following steps:
step 11: defining parameters of a video scene switching detection model;
step 12: constructing a video scene switching detection model;
step 13: defining a loss function, and adopting cross entropy as the loss function;
step 14: defining an optimizer and adopting an Adam optimization algorithm;
step 15: defining an evaluation function to calculate the discrimination accuracy of the video scene switching detection model;
step 16: training and evaluating the video scene switching detection model, and saving the model parameters once every 20 training iterations.
Further, the application of the video scene change detection model comprises the following steps:
step 21: sequentially reading each frame of the video to be detected and resizing it to 96×96;
step 22: inputting the current frame and the previous frame into the trained video scene switching detection model to obtain the output result of the video scene switching detection model;
step 23: and if the output result of the video scene switching detection model is a switching frame, outputting the current frame sequence number and storing the frame.
Further, in this embodiment, as shown in fig. 1 and fig. 2, the video scene switching detection model includes a PAD layer, a plurality of convolution groups, a Reshape layer, a 512-unit fully connected layer, a 2-unit fully connected layer, and a Softmax layer; the convolution groups include a convolution group of 9 × 9 × 32, a convolution group of 3 × 3 × 64, and a convolution group of 5 × 5 × 128; each convolution group comprises a convolution layer, a Relu layer, a pooling layer and a batch normalization layer.
Further, in the present embodiment, as shown in fig. 1, the convolution kernel size of the convolution group 9 × 9 × 32 is 9 × 9, and the output feature number is 32;
the convolution kernel of the convolution group 3 × 3 × 64 is 3 × 3, and the output feature number is 64;
the convolution kernel of the convolution group 5 × 5 × 128 is 5 × 5, and the output feature number is 128;
the step size of the pooling layer is 2×2.
Further, in the present embodiment, the input of the video scene switching detection model is a pair of image frames, denoted X1 and X2 respectively, each of size 96 × 96 × 3. Detection by the model proceeds as follows: X1, X2 and X1−X2 are input into the PAD layer, where the three images are stacked together into a 96 × 96 × 9 matrix; this matrix passes through the 9 × 9 × 32 convolution group, whose output is a 48 × 48 × 32 matrix; and finally the Softmax layer computes the probabilities of the switching-frame and non-switching-frame classes, the larger of which is taken as the final judgment output.
Further, in the present embodiment, the composition of the model is described first. As shown in fig. 1, the input to the model is a pair of image frames, denoted X1 and X2, each of size 96 × 96 × 3 (3 is the number of channels). X1, X2 and X1−X2 are input into the PAD layer, where the three images are stacked together into a 96 × 96 × 9 matrix. This passes through the first convolution group, which comprises a convolution layer, a Relu layer, a max-pooling layer and a batch normalization layer; the convolution kernel size is 9 × 9, the output feature number is 32, and the pooling step size is 2 × 2, so the output after this convolution group is a 48 × 48 × 32 matrix. The second convolution group (convolution kernel 3 × 3, output feature number 64) then outputs a 24 × 24 × 64 matrix, and the third convolution group (convolution kernel 5 × 5, output feature number 128) outputs a 12 × 12 × 128 matrix. The Reshape layer flattens this into a one-dimensional matrix of size 1 × 18432 (18432 = 12 × 12 × 128), and two fully connected layers reduce the output to 1 × 2. Finally, the Softmax output layer computes the probabilities of the two classes, representing the switching-frame and non-switching-frame probabilities respectively; the larger of the two is taken as the final judgment output. For example, an output of [0.886, 0.114] indicates a switching frame.
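The two framework-independent pieces of the forward pass above, the PAD layer's channel stacking and the final softmax decision, can be sketched directly in numpy (function names are illustrative, not from the patent):

```python
import numpy as np

def pad_layer(x1, x2):
    """PAD layer: stack X1, X2 and their difference X1-X2 along the channel
    axis, turning two 96x96x3 frames into one 96x96x9 input tensor."""
    return np.concatenate([x1, x2, x1 - x2], axis=-1)

def softmax(logits):
    """Numerically stable softmax over the two output units."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def decide(logits):
    """Take the larger of the two class probabilities as the judgment,
    e.g. probabilities [0.886, 0.114] indicate a switching frame."""
    probs = softmax(logits)
    label = "switch" if probs[0] > probs[1] else "no-switch"
    return label, probs
```

The convolution groups and fully connected layers between these two pieces would be built in whatever deep learning framework is used; the patent does not name one.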
In summary, the video scene switching detection method of this embodiment adopts a deep learning algorithm: the model's discrimination threshold is adjusted automatically to an optimum during training, so no threshold needs to be set; the model input includes the frame difference between the two frames, which speeds up convergence; and because the model uses batch normalization to prevent overfitting during training, its generalization ability is improved.
The above description is only for the purpose of illustrating the present invention and is not intended to limit its scope; any person skilled in the art may make substitutions or changes to the technical solution of the present invention and its conception within the scope of the present invention.

Claims (8)

1. A video scene switching detection method is characterized in that detection is completed through a video scene switching detection model, and the detection comprises training of the video scene switching detection model and application of the video scene switching detection model;
the training of the video scene switching detection model comprises the following steps:
step 11: defining parameters of a video scene switching detection model;
step 12: constructing a video scene switching detection model;
step 13: defining a loss function, and adopting cross entropy as the loss function;
step 14: defining an optimizer and adopting an Adam optimization algorithm;
step 15: defining an evaluation function to calculate the discrimination accuracy of the video scene switching detection model;
step 16: training and evaluating the video scene switching detection model, and storing the parameters once every 20 times of training.
2. The method according to claim 1, wherein the application of the video scene cut detection model comprises the following steps:
step 21: sequentially reading a frame of a video to be detected, and resizing it to 96×96;
step 22: inputting the current frame and the previous frame into the trained video scene switching detection model to obtain the output result of the video scene switching detection model;
step 23: and if the output result of the video scene switching detection model is a switching frame, outputting the current frame sequence number and storing the frame.
3. The method according to claim 1, wherein the video scene cut detection model comprises a PAD layer, a plurality of convolution groups, a Reshape layer, a full link layer 512, a full link layer 2, and a Softmax layer.
4. The method of claim 3, wherein said convolution group comprises convolution group 9 x 32, convolution group 3 x 64 and convolution group 5 x 128.
5. The method of claim 3, wherein each convolution group comprises a convolution layer, a Relu layer, a pooling layer, and a batch normalization layer.
6. The method of claim 5, wherein the convolution group has a convolution kernel size of 9 x 32 of 9 x9, and an output feature number of 32;
the convolution kernel of the convolution group 3 × 3 × 64 is 3 × 3, and the output feature number is 64;
the convolution kernel of the convolution group 5 × 5 × 128 is 5 × 5, and the output feature number is 128;
the step size of the pooling layer is 2×2.
7. A method as claimed in claim 3, wherein the input of the video scene cut detection model is a pair of image frames, denoted X1 and X2, respectively, the size of the image being 96X 3.
8. The method according to claim 1, wherein the detecting of the video scene cut detection model comprises: inputting X1, X2 and X1-X2 into a PAD layer, superposing three images together on the PAD layer to form a 96 × 96 × 9 matrix, and outputting the matrix after being subjected to a convolution group of 9 × 9 × 32 to form a 48 × 48 × 32 matrix; and finally, calculating the probability of outputting switching frames and non-switching frames by using a Softmax layer, and taking the larger one of the switching frames and the non-switching frames to represent the final judgment output result.
CN201711089563.2A 2017-11-08 2017-11-08 Video scene switching detection method Active CN107766838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711089563.2A CN107766838B (en) 2017-11-08 2017-11-08 Video scene switching detection method


Publications (2)

Publication Number Publication Date
CN107766838A CN107766838A (en) 2018-03-06
CN107766838B true CN107766838B (en) 2021-06-01

Family

ID=61273831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711089563.2A Active CN107766838B (en) 2017-11-08 2017-11-08 Video scene switching detection method

Country Status (1)

Country Link
CN (1) CN107766838B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110876143A (en) * 2018-08-31 2020-03-10 北京意锐新创科技有限公司 Method and device for preventing switching application system based on mobile payment equipment
CN110377794B (en) * 2019-06-12 2022-04-01 杭州当虹科技股份有限公司 Video feature description and duplicate removal retrieval processing method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SU654952A1 (en) * 1978-02-06 1979-03-30 Ставропольское высшее военное инженерное училище связи Device for teaching pupils to detect signals against the noise background
CN103458261A (en) * 2013-09-08 2013-12-18 华东电网有限公司 Video scene variation detection method based on stereoscopic vision
CN104615986A (en) * 2015-01-30 2015-05-13 中国科学院深圳先进技术研究院 Method for utilizing multiple detectors to conduct pedestrian detection on video images of scene change
CN105005772A (en) * 2015-07-20 2015-10-28 北京大学 Video scene detection method
CN105718890A (en) * 2016-01-22 2016-06-29 北京大学 Method for detecting specific videos based on convolution neural network
CN105930402A (en) * 2016-04-15 2016-09-07 乐视控股(北京)有限公司 Convolutional neural network based video retrieval method and system
CN106446930A (en) * 2016-06-28 2017-02-22 沈阳工业大学 Deep convolutional neural network-based robot working scene identification method


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Comparison of Scene Change Detection Algorithms for Videos;Bindu Reddy等;《2015 Fifth International Conference on Advanced Computing & Communication Technologies》;20150406;第84-89页 *
Hybrid approach for video compression based on scene change detection;Ankita P. Chauhan等;《 2013 IEEE International Conference on Signal Processing, Computing and Control (ISPCC)》;20131114;第1-5页 *
Scene Change Detection Using DCT Features in Transform Domain Video Indexing;S. Primechaev等;《 2007 14th International Workshop on Systems, Signals and Image Processing and 6th EURASIP Conference focused on Speech and Image Processing, Multimedia Communications and Services》;20071112;第369-372页 *
A low-complexity video scene switching detection method adapting to dynamic resolution changes (适配分辨率动态变化的低复杂度视频场景切换检测方法); 方宏俊 et al.; 《计算机科学》 (Computer Science); 20170228; vol. 44, no. 2; abstract *

Also Published As

Publication number Publication date
CN107766838A (en) 2018-03-06

Similar Documents

Publication Publication Date Title
CN109614922B (en) Dynamic and static gesture recognition method and system
CN108510485B (en) Non-reference image quality evaluation method based on convolutional neural network
CN109583340B (en) Video target detection method based on deep learning
CN110516536B (en) Weak supervision video behavior detection method based on time sequence class activation graph complementation
CN103366180B (en) A kind of cell image segmentation method based on automated characterization study
CN107145889B (en) Target identification method based on double CNN network with RoI pooling
CN106709453B (en) Sports video key posture extraction method based on deep learning
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
CN109740721B (en) Wheat ear counting method and device
CN109446922B (en) Real-time robust face detection method
CN111062278B (en) Abnormal behavior identification method based on improved residual error network
CN110533022B (en) Target detection method, system, device and storage medium
CN111079539B (en) Video abnormal behavior detection method based on abnormal tracking
CN107944354B (en) Vehicle detection method based on deep learning
CN112950477A (en) High-resolution saliency target detection method based on dual-path processing
CN110969164A (en) Low-illumination imaging license plate recognition method and device based on deep learning end-to-end
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN107463932A (en) A kind of method that picture feature is extracted using binary system bottleneck neutral net
CN114445651A (en) Training set construction method and device of semantic segmentation model and electronic equipment
CN107766838B (en) Video scene switching detection method
CN105825234A (en) Superpixel and background model fused foreground detection method
CN109978858B (en) Double-frame thumbnail image quality evaluation method based on foreground detection
CN104268845A (en) Self-adaptive double local reinforcement method of extreme-value temperature difference short wave infrared image
CN112446417B (en) Spindle-shaped fruit image segmentation method and system based on multilayer superpixel segmentation
CN110781936B (en) Construction method of threshold learnable local binary network based on texture description and deep learning and remote sensing image classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant