CN111079864A - Short video classification method and system based on optimized video key frame extraction - Google Patents

Short video classification method and system based on optimized video key frame extraction Download PDF

Info

Publication number
CN111079864A
Authority
CN
China
Prior art keywords
short video
frames
image
frame
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911420703.9A
Other languages
Chinese (zh)
Inventor
刘昱龙
范俊
顾湘余
李文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Quwei Science & Technology Co ltd
Original Assignee
Hangzhou Quwei Science & Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Quwei Science & Technology Co ltd filed Critical Hangzhou Quwei Science & Technology Co ltd
Priority to CN201911420703.9A priority Critical patent/CN111079864A/en
Publication of CN111079864A publication Critical patent/CN111079864A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a short video classification method and system based on optimized video key frame extraction, wherein the classification method comprises the following steps: S1, extracting dense frames from the short video; S2, calculating the information content of each image frame among the dense frames; S3, selecting the several image frames with the largest information content as key frames of the short video; and S4, splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting the short video category. The invention screens video frames based on information content, avoiding extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, and thereby improves the accuracy of classification.

Description

Short video classification method and system based on optimized video key frame extraction
Technical Field
The invention relates to the field of short video processing, in particular to a short video classification method and system based on optimized video key frame extraction.
Background
In recent years, short videos have been widely used and spread as carriers of information because they express content in a richer and more intuitive way. However, some lawbreakers use short videos to spread unhealthy or illegal content for illicit gain, so short videos should be classified and illegal videos filtered before they are published on a short video platform. Moreover, because individual interests and hobbies differ, the videos recommended to each user differ as well, which makes short video classification indispensable for recommending different categories of video to different users.
The current video classification processing flow is: 1) extract video frames; 2) classify the frames using a machine learning or deep learning method; 3) output the learned category as the final video category. For step 1, current methods mainly extract frames from the video at equal intervals or according to a fixed time difference. For step 2, as data volume grows, deep learning methods can effectively improve accuracy, and because multiple video frames are input as a sequence, Long Short-Term Memory networks (LSTM) are currently widely used to classify the video frames.
The prior art for short video classification has the following problems:
(1) When video frames are extracted at equal intervals, the selection is essentially random: the extracted frames are often motion-blurred because of video jitter or are solid-color frames produced by picture switching. If such frames are also fed into the subsequent model for training or prediction, classification accuracy is affected.
(2) Because a video frame sequence consists of two-dimensional images, while the LSTM approach requires one-dimensional vector inputs, a two-dimensional-to-one-dimensional mapping is needed between the video frame sequence and the LSTM. The method currently adopted feeds each image frame into a mainstream neural network architecture (ResNet-50, VGG-16) and takes the last fully connected layer as a one-dimensional vector, and finally the ordered one-dimensional vectors are input into the LSTM for classification. It is easy to see that this mainstream approach uses two network models, which consumes substantial computing resources and seriously degrades classification time efficiency.
The invention patent application with publication number CN 109977773 A discloses a human behavior recognition method and system based on multi-target detection 3D-CNN, comprising the following steps: 1) preprocessing a video and converting the video stream into image frames; 2) calibrating and cropping the target objects in the video using the relatively mature SSD detection technique; 3) establishing a feature extraction network structure for the image frame data and the calibrated cropped data; 4) establishing a feature fusion model and fusing the two kinds of features extracted in step 3); 5) classifying with a Softmax regression classifier; 6) fine-tuning the trained model for the actual application scene or a public data set. That method adopts 3D-CNN (3D Convolutional Neural Networks) in place of feature extraction + LSTM to classify the input directly, which reduces resource consumption, improves time efficiency, and makes the processing real-time.
However, the above patent application still suffers from extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, resulting in low classification accuracy. Therefore, how to improve classification accuracy through better selection of video frames is a problem to be solved in the field.
Disclosure of Invention
The invention aims to provide a short video classification method and system based on optimized video key frame extraction that address the defects of the prior art. The invention screens video frames based on information content, avoiding extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, and thereby improves the accuracy of classification.
In order to achieve the purpose, the invention adopts the following technical scheme:
a short video classification method based on optimized video key frame extraction comprises the following steps:
s1, extracting short video dense frames;
s2, calculating the information content of each image frame in the dense frame;
s3, selecting a plurality of image frames with the largest information amount as key frames of the short video;
and S4, splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
Further, the number of the dense frames is m times of the number of the key frames, and m is larger than or equal to 2.
Further, the step S2 is specifically:
s21, graying each image frame in the dense frame, wherein three color channels of the color image are represented by R, G, B, respectively, and the grayscale map Grad is:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
s22, calculating the information entropy of the grayed image frame:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
Further, the short video tensor is N × W × H × C, where N is the number of frames of the key frame, W corresponds to the width of each image frame, H corresponds to the height of each image frame, and C corresponds to the number of channels of each image frame.
Further, the 3D-CNN comprises a hardwired layer, three 3D convolution layers, two down-sampling layers, a fully connected layer and an output layer; the hardwired layer generates multiple channels of information by processing the key frames; the 3D convolution layers are used for extracting various features; the down-sampling layers are used for reducing the dimensionality of the features; the fully connected layer is used for combining the two-dimensional features into one-dimensional features; and the output layer comprises a Softmax classifier for classifying and outputting the short video category based on the one-dimensional features.
The invention also provides a short video classification system based on optimized video key frame extraction, which comprises the following steps:
the frame cutting module is used for extracting short video dense frames;
the information quantity calculating module is used for calculating the information quantity of each image frame in the dense frame;
the key frame selecting module is used for selecting a plurality of image frames with the largest information amount as key frames of the short video;
and the classification module is used for splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
Further, the number of the dense frames is m times of the number of the key frames, and m is larger than or equal to 2.
Further, the information amount calculation module includes:
the graying module is used for graying each image frame in the dense frames, three color channels of the color image are respectively represented by R, G, B, and then the grayscale image Grad is:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
the calculation module is used for calculating the information entropy of the grayed image frame:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
Further, the short video tensor is N × W × H × C, where N is the number of frames of the key frame, W corresponds to the width of each image frame, H corresponds to the height of each image frame, and C corresponds to the number of channels of each image frame.
Further, the 3D-CNN comprises a hardwired layer, three 3D convolution layers, two down-sampling layers, a fully connected layer and an output layer; the hardwired layer generates multiple channels of information by processing the key frames; the 3D convolution layers are used for extracting various features; the down-sampling layers are used for reducing the dimensionality of the features; the fully connected layer is used for combining the two-dimensional features into one-dimensional features; and the output layer comprises a Softmax classifier for classifying and outputting the short video category based on the one-dimensional features.
Compared with the prior art, the invention has the following effects:
(1) The method extracts dense frames and screens the video frames based on information content, so that the extracted short video key frames contain rich information; this avoids extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, and improves the accuracy of short video classification based on the key frames;
(2) The method screens the dense frames down to a preset number of key frames, improving classification accuracy without increasing the number of key frame feature extractions, so classification efficiency remains high;
(3) The invention uses a 3D-CNN instead of feature extraction + LSTM to classify the short video directly, i.e., a single model performs feature extraction and classification at the same time, which reduces resource consumption, improves time efficiency and makes the processing real-time;
(4) The invention adopts the 3D-CNN to extract the spatio-temporal features of the image frames so as to comprehensively obtain the information features of the short video.
Drawings
Fig. 1 is a flowchart of a short video classification method based on optimized video key frame extraction according to an embodiment;
FIG. 2 is a schematic diagram of a 3D-CNN network architecture;
fig. 3 is a structural diagram of a short video classification system based on optimized video key frame extraction according to the second embodiment.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
Example one
As shown in fig. 1, the present embodiment provides a method for classifying short video based on optimized video key frame extraction, including:
s1, extracting short video dense frames;
the 30fps, 60fps short video in the general sense means that the short video is composed of 30 or 60 pictures in 1 second, and 300 or 600 images if a short video of 10 seconds is used. If all image frames in the short video are processed, the performance of the system is undoubtedly affected, and the conventional short video frame-cutting process is generally to perform frame-cutting processing on the short video according to a certain time interval, decompose the short video into a plurality of image frames, for example, extract the image frames according to an interval of 1 second. As described above, the randomness of the video frames extracted at equal intervals in the video frame extraction process is too high, and there is usually a problem that the extracted frames generate motion blur due to video jitter or are pure color due to picture switching, and if these frames are also put into a subsequent model for training or prediction, the classification accuracy will be affected. Therefore, the short videos are classified by the short video classification method, the short videos are subjected to frame truncation processing like the traditional video classification method, but the image frames of the frame truncation are further screened by the short video classification method, the image frames extracted from one short video are subjected to subsequent analysis, and the influence of fuzzy frames or pure color frames on the short video classification is avoided. Meanwhile, in order to avoid the situation that the information of the short video cannot be well represented due to too small number of the screened image frames, the invention firstly extracts the dense frames of the short video, wherein the dense frames refer to frame cutting of the short video according to a smaller interval time.
In particular, the number n of dense frames is typically determined by the number of key frames needed to represent the short video information. For example, for a short video of duration T seconds, if N key picture frames are needed to represent its information, the prior art extracts N picture frames at intervals of T/N seconds and uses them directly as the key frames representing the short video. In the present application, the number n of dense frames is usually several times the number of key frames; for example, with n = 3N, n picture frames are extracted at intervals of T/n seconds.
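A minimal sketch of how this dense-frame extraction could be implemented is shown below; it assumes OpenCV (cv2) is available, and the function name and the default multiple of 3 are illustrative only, not part of the original disclosure.

```python
import cv2

def extract_dense_frames(video_path, num_keyframes, multiple=3):
    """Cut a short video into n = multiple * N dense frames sampled at
    (roughly) equal intervals of T/n seconds."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    n = multiple * num_keyframes          # number of dense frames
    step = max(total // n, 1)             # index stride corresponding to ~T/n seconds
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)              # BGR image of shape (H, W, 3)
        if len(frames) == n:
            break
    cap.release()
    return frames
```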
S2, calculating the information content of each image frame in the dense frame;
In order to avoid the influence of blurred frames or solid-color frames on short video classification, the invention calculates the information content of each image frame among the dense frames. In particular, the invention uses information entropy to characterize the information content of an image frame. Before calculating the information entropy, the original image must be converted to gray scale, because the image frames of a video are usually color images. Typically, a color image seen by the human eye is composed of the three RGB color components, each of which normally takes a value from 0 to 255. The image frames in the present invention are also usually in RGB format, but RGB values do not directly reflect the morphological features of an image; they only describe color mixing in optical terms. The color value of each pixel in a gray image, also called its gray level, is the depth of that point in a black-and-white image and generally ranges from 0 to 255, with white being 255 and black being 0. The gray value expresses how light or dark a pixel is, and the gray histogram counts, for each gray value, the number of pixels in a digital image with that value. A gray image has no color: its R, G and B components are all equal. For example, when the three RGB values are the same, an image with 256 gray levels is obtained; for instance, RGB(100,100,100) represents a gray level of 100 and RGB(50,50,50) represents a gray level of 50.
The mainstream graying methods at present include: maximum, mean, and weighted mean. Assuming that the generated gray scale image is represented by Grad and the three color channels of the color image are represented by R, G and B, respectively, then
Maximum method:
Grad(i,j)=max(R(i,j),G(i,j),B(i,j))
average value method:
Grad(i,j)=(R(i,j)+G(i,j)+B(i,j))/3
weighted average method:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
since the human eye is most sensitive to green and least sensitive to blue, the image is usually grayed by a weighted average method. The present invention does not limit the specific graying method, and preferably, the present invention grays an image by using a weighted average method.
Therefore, based on the grayed image, the information entropy calculation formula of the calculated image frame is as follows:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
The information entropy formula shows that the more evenly the gray values occur (i.e., the closer their probabilities of occurrence are to one another), the larger the information content of the image and the richer the image content.
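As a concrete illustration of the weighted-average graying and the information entropy calculation described above, the following sketch computes the information content of one image frame; it assumes NumPy and frames in OpenCV's BGR channel order, and the function names are illustrative only.

```python
import numpy as np

def weighted_gray(frame_bgr):
    """Weighted-average graying: Grad = 0.299*R + 0.587*G + 0.114*B.
    OpenCV stores channels in B, G, R order."""
    b = frame_bgr[..., 0].astype(np.float64)
    g = frame_bgr[..., 1].astype(np.float64)
    r = frame_bgr[..., 2].astype(np.float64)
    return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)

def image_entropy(gray):
    """Information entropy E = -sum_{i=0}^{255} P(i) * log2(P(i)),
    where P(i) is the probability that pixel value i appears in the image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                          # skip zero-probability values (0*log0 -> 0)
    return float(-(p * np.log2(p)).sum())
```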
S3, selecting a plurality of image frames with the largest information amount as key frames of the short video;
the invention adopts the information entropy to represent the information quantity of the image frame, and the larger the information entropy is, the larger the information quantity contained in the image frame is. In order to avoid the influence of fuzzy frames or pure color frames and the like on short video classification, reduce the data processing amount of feature extraction and improve the classification efficiency of short videos, dense frames are screened based on the information entropy, and a plurality of image frames with the largest information amount are selected as key frames of the short videos.
Specifically, the information entropies of the n image frames are calculated to obtain the information entropy set EN = {E1, E2, …, En}, where Ej (j = 1, 2, …, n) is the information entropy of the j-th video frame. After the information content of all image frames has been calculated, the entropy set is reordered from largest to smallest information content, and the corresponding set of image frame numbers PN = {P1, P2, …, Pn} is output according to the reordered entropy set. For example, suppose the information entropy set of the 1st to 4th video frames is EN = {3, 6, 2, 5}: the information entropy of the 1st image frame is 3, that of the 2nd is 6, that of the 3rd is 2, and that of the 4th is 5. After sorting the entropies from largest to smallest, the sorted set is {6, 5, 3, 2}, corresponding in turn to the 2nd, 4th, 1st and 3rd image frames, so PN = {2, 4, 1, 3}. Selecting the image frames with the largest information content as the key frames of the short video then means taking the image frames whose numbers appear in the first N positions of the PN sequence. If the number of key frames is 2, the frames corresponding to the first two elements of PN are taken. In general, for a sequence PN of length n, the image frames corresponding to the first N elements of the sequence are taken.
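The sorting and selection just described might look like the following sketch, reusing the illustrative weighted_gray and image_entropy helpers above; restoring the kept frames to temporal order before splicing is an assumption made here, since the 3D-CNN convolves over the time dimension.

```python
def select_keyframes(frames, num_keyframes):
    """Rank the dense frames by information entropy and keep the top N."""
    entropies = [image_entropy(weighted_gray(f)) for f in frames]   # EN = {E1, ..., En}
    order = sorted(range(len(frames)),
                   key=lambda j: entropies[j], reverse=True)        # PN (0-based indices)
    keep = sorted(order[:num_keyframes])    # first N entries, restored to temporal order
    return [frames[j] for j in keep]
```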
And S4, splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
With the development of deep learning Convolutional Neural Networks (CNN) in the image field, existing video feature extraction generally uses a 2D-CNN to extract features from the key frames and then merges the key frame features through a fusion algorithm. For a 2D-CNN, each image frame of the short video is usually treated as one feature map, so the 2D-CNN input is F = (W × H × C), where W corresponds to the width of each image frame, H to its height, and C to its number of channels; the network outputs a feature vector for each image frame, and these feature vectors are combined into the features of the short video.
However, the 2D-CNN performs independent feature extraction on each image frame as a static picture without considering motion information in a time dimension, and therefore, the present invention performs spatio-temporal feature extraction on the image frames by using the 3D-CNN to comprehensively obtain information features of short videos.
Therefore, after extracting N key frames, the invention splices the N selected video frames into a tensor of size N × W × H × C, where N is the number of key frames, W corresponds to the width of each image frame, H to its height, and C to its number of channels. One short video thus corresponds to one tensor of size N × W × H × C, and this short video tensor is used as the input of the 3D-CNN classification model.
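Splicing the N selected key frames into a single N × W × H × C tensor can be as simple as the following sketch, assuming NumPy and key frames of identical size; the axis ordering follows the text above.

```python
import numpy as np

def splice_to_tensor(keyframes):
    """Stack N key frames (each H x W x C) into one N x W x H x C tensor."""
    stacked = np.stack(keyframes, axis=0)       # (N, H, W, C)
    return stacked.transpose(0, 2, 1, 3)        # (N, W, H, C)
```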
The method utilizes the deep learning model 3D-CNN to extract the short video characteristics, and classifies the short videos based on the extracted characteristics. The 3D-CNN classification model is specifically generated as follows:
constructing a 3D-CNN convolutional neural network; training the 3D-CNN convolutional neural network through short video sample data to obtain a 3D-CNN classification model; and extracting short video characteristics based on the 3D-CNN classification model, and outputting short video categories.
The basic structural components of the 3D-CNN are 3D convolution layers, a hardwired layer, down-sampling layers, a fully connected layer and an output layer. The hardwired layer generates multiple channels of information by processing the original frames, and these channels are then processed separately. A 3D convolution layer uses three-dimensional convolution kernels and is responsible for extracting a variety of features. The down-sampling layers reduce the dimensionality of the feature maps, and the fully connected layer combines the two-dimensional features into one-dimensional features that are classified and output by the final output layer. A 3D-CNN typically contains one hardwired layer, one fully connected layer and one output layer; the numbers of 3D convolution layers and down-sampling layers can be chosen according to the actual situation and are not limited here. Each down-sampling layer is typically placed after a convolution layer, and the fully connected layer lies between the last convolution layer and the output layer.
FIG. 2 shows a 3D-CNN that includes three 3D convolution layers and two down-sampling layers. The channel information of each image frame is first extracted by the hardwired layer, and each channel is then convolved with a first 3D convolution kernel of size 7 × 7 × 3, where 7 × 7 is the spatial dimension and 3 is the temporal dimension. The result is down-sampled with a 2 × 2 window by the first down-sampling layer. The down-sampled features are convolved per channel with a second 3D convolution kernel of size 7 × 6 × 3, where 7 × 6 is the spatial dimension and 3 is the temporal dimension, and then down-sampled with a 3 × 3 window by the second down-sampling layer. A third 3D convolution kernel of size 7 × 4 (FIG. 2) then convolves in the spatial dimension only. Finally, a fully connected layer connects every feature produced by the third convolution with all features in the second down-sampling layer to generate the feature vector of the short video, and this feature vector is fed into the output layer for classification. The output layer comprises a Softmax classifier that classifies the input short video and outputs the result.
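A simplified PyTorch sketch of a 3D-CNN with three convolution stages and two down-sampling layers is given below. It is not the exact network of FIG. 2: the hardwired layer is replaced by the raw RGB channels, global average pooling stands in for the exact fully connected wiring, the channel counts are assumptions, and the N × W × H × C tensor is assumed to be permuted to the (batch, C, N, H, W) layout PyTorch expects.

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """Illustrative 3D-CNN: three 3D convolution stages, two down-sampling
    layers, a fully connected layer and a Softmax output."""

    def __init__(self, num_classes, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            # first 3D convolution, 7x7 spatial x 3 temporal
            nn.Conv3d(in_channels, 32, kernel_size=(3, 7, 7), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # 2x2 spatial down-sampling
            # second 3D convolution, 7x6 spatial x 3 temporal
            nn.Conv3d(32, 64, kernel_size=(3, 7, 6), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3)),        # 3x3 spatial down-sampling
            # third convolution, 7x4 spatial dimension only
            nn.Conv3d(64, 128, kernel_size=(1, 7, 4), padding=(0, 3, 2)),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)             # collapse to one value per channel
        self.fc = nn.Linear(128, num_classes)           # fully connected layer

    def forward(self, x):
        # x: (batch, C, N, H, W); apply torch.softmax to the returned logits
        # at inference time to obtain the Softmax class probabilities
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)
```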
After the 3D-CNN is constructed, it is trained on training data to obtain the 3D-CNN classification model. The method loads short video data labelled with category information, optimizes the network through the loss function of the 3D-CNN classification model, and thereby trains the 3D-CNN classification model. Once trained, the model performs feature learning on the short video tensor and outputs the short video category.
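A hedged sketch of this training step follows, assuming PyTorch, a data loader yielding (clip, label) pairs with clips already permuted to (batch, C, N, H, W), and cross-entropy as the loss function (the text does not name a specific loss, so this choice is an assumption).

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, num_epochs=10, lr=1e-3, device="cpu"):
    """Optimize the 3D-CNN classification model on labelled short-video tensors."""
    model = model.to(device)
    model.train()
    criterion = nn.CrossEntropyLoss()                       # classification loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for clips, labels in loader:                        # clips: (B, C, N, H, W)
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)          # logits vs. category labels
            loss.backward()
            optimizer.step()
    return model
```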
Example two
As shown in fig. 3, the present embodiment provides a short video classification system based on optimized video key frame extraction, including:
the frame cutting module is used for extracting short video dense frames;
the 30fps, 60fps short video in the general sense means that the short video is composed of 30 or 60 pictures in 1 second, and 300 or 600 images if a short video of 10 seconds is used. If all image frames in the short video are processed, the performance of the system is undoubtedly affected, and the conventional short video frame-cutting process is generally to perform frame-cutting processing on the short video according to a certain time interval, decompose the short video into a plurality of image frames, for example, extract the image frames according to an interval of 1 second. As described above, the randomness of the video frames extracted at equal intervals in the video frame extraction process is too high, and there is usually a problem that the extracted frames generate motion blur due to video jitter or are pure color due to picture switching, and if these frames are also put into a subsequent model for training or prediction, the classification accuracy will be affected. Therefore, the short videos are classified by the short video classification method, the short videos are subjected to frame truncation processing like the traditional video classification method, but the image frames of the frame truncation are further screened by the short video classification method, the image frames extracted from one short video are subjected to subsequent analysis, and the influence of fuzzy frames or pure color frames on the short video classification is avoided. Meanwhile, in order to avoid the situation that the information of the short video cannot be well represented due to too small number of the screened image frames, the invention firstly extracts the dense frames of the short video, wherein the dense frames refer to frame cutting of the short video according to a smaller interval time.
In particular, the number n of dense frames is typically determined by the number of key frames needed to represent the short video information. For example, for a short video of duration T seconds, if N key picture frames are needed to represent its information, the prior art extracts N picture frames at intervals of T/N seconds and uses them directly as the key frames representing the short video. In the present application, the number n of dense frames is usually several times the number of key frames; for example, with n = 3N, n picture frames are extracted at intervals of T/n seconds.
The information quantity calculating module is used for calculating the information quantity of each image frame in the dense frame;
In order to avoid the influence of blurred frames or solid-color frames on short video classification, the invention calculates the information content of each image frame among the dense frames. In particular, the invention uses information entropy to characterize the information content of an image frame. Before calculating the information entropy, the original image must be converted to gray scale, because the image frames of a video are usually color images. Typically, a color image seen by the human eye is composed of the three RGB color components, each of which normally takes a value from 0 to 255. The image frames in the present invention are also usually in RGB format, but RGB values do not directly reflect the morphological features of an image; they only describe color mixing in optical terms. The color value of each pixel in a gray image, also called its gray level, is the depth of that point in a black-and-white image and generally ranges from 0 to 255, with white being 255 and black being 0. The gray value expresses how light or dark a pixel is, and the gray histogram counts, for each gray value, the number of pixels in a digital image with that value. A gray image has no color: its R, G and B components are all equal. For example, when the three RGB values are the same, an image with 256 gray levels is obtained; for instance, RGB(100,100,100) represents a gray level of 100 and RGB(50,50,50) represents a gray level of 50.
The mainstream graying methods at present include: maximum, mean, and weighted mean. Assuming that the generated gray scale image is represented by Grad and the three color channels of the color image are represented by R, G and B, respectively, then
Maximum method:
Grad(i,j)=max(R(i,j),G(i,j),B(i,j))
average value method:
Grad(i,j)=(R(i,j)+G(i,j)+B(i,j))/3
weighted average method:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
since the human eye is most sensitive to green and least sensitive to blue, the image is usually grayed by a weighted average method. The present invention does not limit the specific graying method, and preferably, the present invention grays an image by using a weighted average method.
Therefore, based on the grayed image, the information entropy calculation formula of the calculated image frame is as follows:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
The information entropy formula shows that the more evenly the gray values occur (i.e., the closer their probabilities of occurrence are to one another), the larger the information content of the image and the richer the image content.
The key frame selecting module is used for selecting a plurality of image frames with the largest information amount as key frames of the short video;
the invention adopts the information entropy to represent the information quantity of the image frame, and the larger the information entropy is, the larger the information quantity contained in the image frame is. In order to avoid the influence of fuzzy frames or pure color frames and the like on short video classification, reduce the data processing amount of feature extraction and improve the classification efficiency of short videos, dense frames are screened based on the information entropy, and a plurality of image frames with the largest information amount are selected as key frames of the short videos.
Specifically, the information entropies of the n image frames are calculated to obtain the information entropy set EN = {E1, E2, …, En}, where Ej (j = 1, 2, …, n) is the information entropy of the j-th video frame. After the information content of all image frames has been calculated, the entropy set is reordered from largest to smallest information content, and the corresponding set of image frame numbers PN = {P1, P2, …, Pn} is output according to the reordered entropy set. For example, suppose the information entropy set of the 1st to 4th video frames is EN = {3, 6, 2, 5}: the information entropy of the 1st image frame is 3, that of the 2nd is 6, that of the 3rd is 2, and that of the 4th is 5. After sorting the entropies from largest to smallest, the sorted set is {6, 5, 3, 2}, corresponding in turn to the 2nd, 4th, 1st and 3rd image frames, so PN = {2, 4, 1, 3}. Selecting the image frames with the largest information content as the key frames of the short video then means taking the image frames whose numbers appear in the first N positions of the PN sequence. If the number of key frames is 2, the frames corresponding to the first two elements of PN are taken. In general, for a sequence PN of length n, the image frames corresponding to the first N elements of the sequence are taken.
And the classification module is used for splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
With the development of deep learning Convolutional Neural Networks (CNN) in the image field, existing video feature extraction generally uses a 2D-CNN to extract features from the key frames and then merges the key frame features through a fusion algorithm. For a 2D-CNN, each image frame of the short video is usually treated as one feature map, so the 2D-CNN input is F = (W × H × C), where W corresponds to the width of each image frame, H to its height, and C to its number of channels; the network outputs a feature vector for each image frame, and these feature vectors are combined into the features of the short video.
However, the 2D-CNN performs independent feature extraction on each image frame as a static picture without considering motion information in a time dimension, and therefore, the present invention performs spatio-temporal feature extraction on the image frames by using the 3D-CNN to comprehensively obtain information features of short videos.
Therefore, after extracting N key frames, the invention splices the N selected video frames into a tensor of size N × W × H × C, where N is the number of key frames, W corresponds to the width of each image frame, H to its height, and C to its number of channels. One short video thus corresponds to one tensor of size N × W × H × C, and this short video tensor is used as the input of the 3D-CNN classification model.
The method utilizes the deep learning model 3D-CNN to extract the short video characteristics, and classifies the short videos based on the extracted characteristics. The 3D-CNN classification model is specifically generated as follows:
constructing a 3D-CNN convolutional neural network; training the 3D-CNN convolutional neural network through short video sample data to obtain a 3D-CNN classification model; and extracting short video characteristics based on the 3D-CNN classification model, and outputting short video categories.
The basic structural components of the 3D-CNN are 3D convolution layers, a hardwired layer, down-sampling layers, a fully connected layer and an output layer. The hardwired layer generates multiple channels of information by processing the original frames, and these channels are then processed separately. A 3D convolution layer uses three-dimensional convolution kernels and is responsible for extracting a variety of features. The down-sampling layers reduce the dimensionality of the feature maps, and the fully connected layer combines the two-dimensional features into one-dimensional features that are classified and output by the final output layer. A 3D-CNN typically contains one hardwired layer, one fully connected layer and one output layer; the numbers of 3D convolution layers and down-sampling layers can be chosen according to the actual situation and are not limited here. Each down-sampling layer is typically placed after a convolution layer, and the fully connected layer lies between the last convolution layer and the output layer. The output layer comprises a Softmax classifier that classifies the input short video and outputs the result.
After the 3D-CNN is constructed, it is trained on training data to obtain the 3D-CNN classification model. The method loads short video data labelled with category information, optimizes the network through the loss function of the 3D-CNN classification model, and thereby trains the 3D-CNN classification model. Once trained, the model performs feature learning on the short video tensor and outputs the short video category.
In summary, the short video classification method and system based on optimized video key frame extraction provided by the invention, combined with the presentation form of a specific client, first extract dense frames and screen the video frames based on information content, so that the extracted short video key frames contain rich information; this avoids extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, and improves the accuracy of short video classification based on the key frames. Screening the dense frames down to a preset number of key frames improves classification accuracy without increasing the number of key frame feature extractions, so classification efficiency remains high. Using a 3D-CNN instead of feature extraction + LSTM classifies the short video directly, i.e., a single model performs feature extraction and classification at the same time, which reduces resource consumption, improves time efficiency and makes processing real-time. Finally, the 3D-CNN extracts the spatio-temporal features of the image frames so as to comprehensively obtain the information features of the short video.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A short video classification method based on optimized video key frame extraction is characterized by comprising the following steps:
s1, extracting short video dense frames;
s2, calculating the information content of each image frame in the dense frame;
s3, selecting a plurality of image frames with the largest information amount as key frames of the short video;
and S4, splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
2. The method of claim 1, wherein the number of dense frames is m times the number of key frames, and m ≧ 2.
3. The method for classifying short video according to claim 1, wherein the step S2 specifically comprises:
s21, graying each image frame in the dense frame, wherein three color channels of the color image are represented by R, G, B, respectively, and the grayscale map Grad is:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
s22, calculating the information entropy of the grayed image frame:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
4. The method of claim 1, wherein the short video tensor is of size N × W × H × C, where N is the number of key frames, W corresponds to the width of each image frame, H corresponds to the height of each image frame, and C corresponds to the number of channels of each image frame.
5. The short video classification method according to claim 4, wherein the 3D-CNN comprises a hardwired layer, three 3D convolution layers, two down-sampling layers, a fully connected layer and an output layer; the hardwired layer generates multiple channels of information by processing the key frames; the 3D convolution layers are used for extracting various features; the down-sampling layers are used for reducing the dimensionality of the features; the fully connected layer is used for combining the two-dimensional features into one-dimensional features; and the output layer comprises a Softmax classifier for classifying and outputting the short video category based on the one-dimensional features.
6. A system for classifying short video based on optimized video keyframe extraction, comprising:
the frame cutting module is used for extracting short video dense frames;
the information quantity calculating module is used for calculating the information quantity of each image frame in the dense frame;
the key frame selecting module is used for selecting a plurality of image frames with the largest information amount as key frames of the short video;
and the classification module is used for splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
7. The short video classification system according to claim 6, characterized in that the number of dense frames is m times the number of key frames, m ≧ 2.
8. The short video classification system according to claim 6, wherein the information amount calculation module comprises:
the graying module is used for graying each image frame in the dense frames, three color channels of the color image are respectively represented by R, G, B, and then the grayscale image Grad is:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
the calculation module is used for calculating the information entropy of the grayed image frame:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
9. The short video classification system of claim 6, wherein the short video tensor is of size N × W × H × C, where N is the number of key frames, W corresponds to the width of each image frame, H corresponds to the height of each image frame, and C corresponds to the number of channels of each image frame.
10. The short video classification system according to claim 9, wherein the 3D-CNN comprises one hardwired layer, three 3D convolution layers, two down-sampling layers, one fully connected layer and one output layer; the hardwired layer generates multiple channels of information by processing the key frames; the 3D convolution layers are used for extracting various features; the down-sampling layers are used for reducing the dimensionality of the features; the fully connected layer is used for combining the two-dimensional features into one-dimensional features; and the output layer comprises a Softmax classifier for classifying and outputting the short video category based on the one-dimensional features.
CN201911420703.9A 2019-12-31 2019-12-31 Short video classification method and system based on optimized video key frame extraction Pending CN111079864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911420703.9A CN111079864A (en) 2019-12-31 2019-12-31 Short video classification method and system based on optimized video key frame extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911420703.9A CN111079864A (en) 2019-12-31 2019-12-31 Short video classification method and system based on optimized video key frame extraction

Publications (1)

Publication Number Publication Date
CN111079864A true CN111079864A (en) 2020-04-28

Family

ID=70321331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911420703.9A Pending CN111079864A (en) 2019-12-31 2019-12-31 Short video classification method and system based on optimized video key frame extraction

Country Status (1)

Country Link
CN (1) CN111079864A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541390A (en) * 2020-10-30 2021-03-23 四川天翼网络服务有限公司 Frame-extracting dynamic scheduling method and system for violation analysis of examination video
CN112601068A (en) * 2020-12-15 2021-04-02 济南浪潮高新科技投资发展有限公司 Video data augmentation method, device and computer readable medium
JPWO2021229693A1 (en) * 2020-05-12 2021-11-18
CN113870259A (en) * 2021-12-02 2021-12-31 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
CN115086472A (en) * 2022-06-13 2022-09-20 泰州亚东广告传媒有限公司 Mobile phone APP management platform based on key frame information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966104A (en) * 2015-06-30 2015-10-07 孙建德 Three-dimensional convolutional neural network based video classifying method
CN107220585A (en) * 2017-03-31 2017-09-29 南京邮电大学 A kind of video key frame extracting method based on multiple features fusion clustering shots
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
WO2018166288A1 (en) * 2017-03-15 2018-09-20 北京京东尚科信息技术有限公司 Information presentation method and device
CN110414617A (en) * 2019-08-02 2019-11-05 北京奇艺世纪科技有限公司 A kind of video feature extraction method and device, video classification methods and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966104A (en) * 2015-06-30 2015-10-07 孙建德 Three-dimensional convolutional neural network based video classifying method
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
WO2018166288A1 (en) * 2017-03-15 2018-09-20 北京京东尚科信息技术有限公司 Information presentation method and device
CN107220585A (en) * 2017-03-31 2017-09-29 南京邮电大学 A kind of video key frame extracting method based on multiple features fusion clustering shots
CN110414617A (en) * 2019-08-02 2019-11-05 北京奇艺世纪科技有限公司 A kind of video feature extraction method and device, video classification methods and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
李丹锦: "Design and Implementation of a Video Classification Algorithm Based on Multimodal Face Features" *
杨曙光: "An Improved Deep Learning Video Classification Method" *
潘磊, 吴小俊, 尤媛媛: "Clustering-Based Video Shot Segmentation and Key Frame Extraction" *
裴颂文, 杨保国, 顾春华: "Research on Video Stream Classification with Fused Three-Dimensional Convolutional Neural Networks" *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2021229693A1 (en) * 2020-05-12 2021-11-18
JP7364061B2 (en) 2020-05-12 2023-10-18 日本電信電話株式会社 Learning devices, learning methods and learning programs
CN112541390A (en) * 2020-10-30 2021-03-23 四川天翼网络服务有限公司 Frame-extracting dynamic scheduling method and system for violation analysis of examination video
CN112541390B (en) * 2020-10-30 2023-04-25 四川天翼网络股份有限公司 Frame extraction dynamic scheduling method and system for examination video violation analysis
CN112601068A (en) * 2020-12-15 2021-04-02 济南浪潮高新科技投资发展有限公司 Video data augmentation method, device and computer readable medium
CN112601068B (en) * 2020-12-15 2023-01-24 山东浪潮科学研究院有限公司 Video data augmentation method, device and computer readable medium
CN113870259A (en) * 2021-12-02 2021-12-31 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
CN113870259B (en) * 2021-12-02 2022-04-01 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
WO2023098524A1 (en) * 2021-12-02 2023-06-08 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion evaluation method and apparatus, device, and storage medium
CN115086472A (en) * 2022-06-13 2022-09-20 泰州亚东广告传媒有限公司 Mobile phone APP management platform based on key frame information
CN115086472B (en) * 2022-06-13 2023-04-18 广东天讯达资讯科技股份有限公司 Mobile phone APP management system based on key frame information

Similar Documents

Publication Publication Date Title
Liu et al. Robust video super-resolution with learned temporal dynamics
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
EP4109392A1 (en) Image processing method and image processing device
US11741578B2 (en) Method, system, and computer-readable medium for improving quality of low-light images
US20230214976A1 (en) Image fusion method and apparatus and training method and apparatus for image fusion model
CN111292264A (en) Image high dynamic range reconstruction method based on deep learning
CN111612722B (en) Low-illumination image processing method based on simplified Unet full-convolution neural network
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
CN111985281B (en) Image generation model generation method and device and image generation method and device
CN112906649A (en) Video segmentation method, device, computer device and medium
CN112307982B (en) Human body behavior recognition method based on staggered attention-enhancing network
AU2015201623A1 (en) Choosing optimal images with preference distributions
CN111047543A (en) Image enhancement method, device and storage medium
CN112602088A (en) Method, system and computer readable medium for improving quality of low light image
CN110929099A (en) Short video frame semantic extraction method and system based on multitask learning
CN114627034A (en) Image enhancement method, training method of image enhancement model and related equipment
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
Diaz-Ramirez et al. Real-time haze removal in monocular images using locally adaptive processing
Parekh et al. A survey of image enhancement and object detection methods
CN115457015A (en) Image no-reference quality evaluation method and device based on visual interactive perception double-flow network
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN115471413A (en) Image processing method and device, computer readable storage medium and electronic device
CN110489584B (en) Image classification method and system based on dense connection MobileNet model
CN114299105A (en) Image processing method, image processing device, computer equipment and storage medium
Fu et al. Low-light image enhancement base on brightness attention mechanism generative adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination