CN111079864A - Short video classification method and system based on optimized video key frame extraction - Google Patents

Short video classification method and system based on optimized video key frame extraction Download PDF

Info

Publication number
CN111079864A
Authority
CN
China
Prior art keywords
short video
frames
image
frame
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911420703.9A
Other languages
Chinese (zh)
Inventor
刘昱龙
范俊
顾湘余
李文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Quwei Science & Technology Co ltd
Original Assignee
Hangzhou Quwei Science & Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Quwei Science & Technology Co ltd filed Critical Hangzhou Quwei Science & Technology Co ltd
Priority to CN201911420703.9A priority Critical patent/CN111079864A/en
Publication of CN111079864A publication Critical patent/CN111079864A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a short video classification method and system based on optimized video key frame extraction, wherein the classification method comprises the following steps: S1, extracting dense frames from the short video; S2, calculating the information content of each image frame among the dense frames; S3, selecting the several image frames with the largest information content as key frames of the short video; and S4, splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting the short video category. The invention screens video frames based on information content, avoiding extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, and thereby improves the accuracy of classification.

Description

Short video classification method and system based on optimized video key frame extraction
Technical Field
The invention relates to the field of short video processing, in particular to a short video classification method and system based on optimized video key frame extraction.
Background
In recent years, short videos have been widely used and spread as carriers of information because they express content in a richer and more intuitive way. However, some lawbreakers use short videos to spread unhealthy or illegal content for illicit gain, so short videos should be classified and illegal videos filtered before they are published on a short video platform. Moreover, because individual interests and hobbies differ, the videos recommended to each user differ as well, which makes short video classification indispensable for recommending different categories of video to different users.
The current video classification processing flow is: 1) extract video frames; 2) classify the frames using a machine learning or deep learning method; 3) output the learned category as the final video category. For step 1, current methods mainly extract frames from the video at equal intervals or according to a fixed time difference. For step 2, as data volume grows, deep learning methods can effectively improve accuracy, and because multiple video frames are input as a sequence, Long Short-Term Memory networks (LSTM) are currently widely used to classify the video frames.
The prior art for short video classification has the following problems:
(1) When video frames are extracted at equal intervals, the selection is essentially random: the extracted frames are often motion-blurred because of video jitter or are solid-color frames produced by picture switching. If such frames are also fed into the subsequent model for training or prediction, classification accuracy is affected.
(2) Because a video frame sequence consists of two-dimensional images, while the LSTM approach requires one-dimensional vector inputs, a two-dimensional-to-one-dimensional mapping is needed between the video frame sequence and the LSTM. The method currently adopted feeds each image frame into a mainstream neural network architecture (ResNet-50, VGG-16) and takes the last fully connected layer as a one-dimensional vector, and finally the ordered one-dimensional vectors are input into the LSTM for classification. It is easy to see that this mainstream approach uses two network models, which consumes substantial computing resources and seriously degrades classification time efficiency.
The invention patent application with publication number CN 109977773 A discloses a human behavior recognition method and system based on multi-target detection 3D-CNN, comprising the following steps: 1) preprocessing a video and converting the video stream into image frames; 2) calibrating and cropping the target objects in the video using the relatively mature SSD detection technique; 3) establishing a feature extraction network structure for the image frame data and the calibrated cropped data; 4) establishing a feature fusion model and fusing the two kinds of features extracted in step 3); 5) classifying with a Softmax regression classifier; 6) fine-tuning the trained model for the actual application scene or a public data set. That method adopts 3D-CNN (3D Convolutional Neural Networks) in place of feature extraction + LSTM to classify the input directly, which reduces resource consumption, improves time efficiency, and makes the processing real-time.
However, the above patent application still suffers from extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, resulting in low classification accuracy. Therefore, how to improve classification accuracy through better selection of video frames is a problem to be solved in the field.
Disclosure of Invention
The invention aims to provide a short video classification method and system based on optimized video key frame extraction that address the defects of the prior art. The invention screens video frames based on information content, avoiding extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, and thereby improves the accuracy of classification.
In order to achieve the purpose, the invention adopts the following technical scheme:
a short video classification method based on optimized video key frame extraction comprises the following steps:
s1, extracting short video dense frames;
s2, calculating the information content of each image frame in the dense frame;
s3, selecting a plurality of image frames with the largest information amount as key frames of the short video;
and S4, splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
Further, the number of the dense frames is m times of the number of the key frames, and m is larger than or equal to 2.
Further, the step S2 is specifically:
s21, graying each image frame in the dense frame, wherein three color channels of the color image are represented by R, G, B, respectively, and the grayscale map Grad is:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
s22, calculating the information entropy of the grayed image frame:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
Further, the short video tensor is N × W × H × C, where N is the number of frames of the key frame, W corresponds to the width of each image frame, H corresponds to the height of each image frame, and C corresponds to the number of channels of each image frame.
Further, the 3D-CNN comprises a hardwired layer, three 3D convolution layers, two down-sampling layers, a fully connected layer and an output layer; the hardwired layer generates multiple channels of information by processing the key frames; the 3D convolution layers are used for extracting various features; the down-sampling layers are used for reducing the dimensionality of the features; the fully connected layer is used for combining the two-dimensional features into one-dimensional features; and the output layer comprises a Softmax classifier for classifying and outputting the short video category based on the one-dimensional features.
The invention also provides a short video classification system based on optimized video key frame extraction, which comprises the following steps:
the frame cutting module is used for extracting short video dense frames;
the information quantity calculating module is used for calculating the information quantity of each image frame in the dense frame;
the key frame selecting module is used for selecting a plurality of image frames with the largest information amount as key frames of the short video;
and the classification module is used for splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
Further, the number of the dense frames is m times of the number of the key frames, and m is larger than or equal to 2.
Further, the information amount calculation module includes:
the graying module is used for graying each image frame in the dense frames, three color channels of the color image are respectively represented by R, G, B, and then the grayscale image Grad is:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
the calculation module is used for calculating the information entropy of the grayed image frame:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
Further, the short video tensor is N × W × H × C, where N is the number of frames of the key frame, W corresponds to the width of each image frame, H corresponds to the height of each image frame, and C corresponds to the number of channels of each image frame.
Further, the 3D-CNN comprises a hardwired layer, three 3D convolution layers, two down-sampling layers, a fully connected layer and an output layer; the hardwired layer generates multiple channels of information by processing the key frames; the 3D convolution layers are used for extracting various features; the down-sampling layers are used for reducing the dimensionality of the features; the fully connected layer is used for combining the two-dimensional features into one-dimensional features; and the output layer comprises a Softmax classifier for classifying and outputting the short video category based on the one-dimensional features.
Compared with the prior art, the invention has the following effects:
(1) The method extracts dense frames and screens the video frames based on information content, so that the extracted short video key frames contain rich information; this avoids extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, and improves the accuracy of short video classification based on the key frames;
(2) The method screens the dense frames down to a preset number of key frames, improving classification accuracy without increasing the number of key frame feature extractions, so classification efficiency remains high;
(3) The invention uses a 3D-CNN instead of feature extraction + LSTM to classify the short video directly, i.e., a single model performs feature extraction and classification at the same time, which reduces resource consumption, improves time efficiency and makes the processing real-time;
(4) The invention adopts the 3D-CNN to extract the spatio-temporal features of the image frames so as to comprehensively obtain the information features of the short video.
Drawings
Fig. 1 is a flowchart of a short video classification method based on optimized video key frame extraction according to an embodiment;
FIG. 2 is a schematic diagram of a 3D-CNN network architecture;
fig. 3 is a structural diagram of a short video classification system based on optimized video key frame extraction according to the second embodiment.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
Example one
As shown in fig. 1, the present embodiment provides a method for classifying short video based on optimized video key frame extraction, including:
s1, extracting short video dense frames;
the 30fps, 60fps short video in the general sense means that the short video is composed of 30 or 60 pictures in 1 second, and 300 or 600 images if a short video of 10 seconds is used. If all image frames in the short video are processed, the performance of the system is undoubtedly affected, and the conventional short video frame-cutting process is generally to perform frame-cutting processing on the short video according to a certain time interval, decompose the short video into a plurality of image frames, for example, extract the image frames according to an interval of 1 second. As described above, the randomness of the video frames extracted at equal intervals in the video frame extraction process is too high, and there is usually a problem that the extracted frames generate motion blur due to video jitter or are pure color due to picture switching, and if these frames are also put into a subsequent model for training or prediction, the classification accuracy will be affected. Therefore, the short videos are classified by the short video classification method, the short videos are subjected to frame truncation processing like the traditional video classification method, but the image frames of the frame truncation are further screened by the short video classification method, the image frames extracted from one short video are subjected to subsequent analysis, and the influence of fuzzy frames or pure color frames on the short video classification is avoided. Meanwhile, in order to avoid the situation that the information of the short video cannot be well represented due to too small number of the screened image frames, the invention firstly extracts the dense frames of the short video, wherein the dense frames refer to frame cutting of the short video according to a smaller interval time.
In particular, the number n of dense frames is typically determined by the number of key frames needed to represent the short video information. For example, for a short video of duration T seconds, if N key picture frames are needed to represent its information, the prior art extracts N picture frames at intervals of T/N seconds and uses them directly as the key frames representing the short video. In the present application, the number n of dense frames is usually several times the number of key frames; for example, with n = 3N, n picture frames are extracted at intervals of T/n seconds.
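A minimal sketch of how this dense-frame extraction could be implemented is shown below; it assumes OpenCV (cv2) is available, and the function name and the default multiple of 3 are illustrative only, not part of the original disclosure.

```python
import cv2

def extract_dense_frames(video_path, num_keyframes, multiple=3):
    """Cut a short video into n = multiple * N dense frames sampled at
    (roughly) equal intervals of T/n seconds."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    n = multiple * num_keyframes          # number of dense frames
    step = max(total // n, 1)             # index stride corresponding to ~T/n seconds
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)              # BGR image of shape (H, W, 3)
        if len(frames) == n:
            break
    cap.release()
    return frames
```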
S2, calculating the information content of each image frame in the dense frame;
In order to avoid the influence of blurred frames or solid-color frames on short video classification, the invention calculates the information content of each image frame among the dense frames. In particular, the invention uses information entropy to characterize the information content of an image frame. Before calculating the information entropy, the original image must be converted to gray scale, because the image frames of a video are usually color images. Typically, a color image seen by the human eye is composed of the three RGB color components, each of which normally takes a value from 0 to 255. The image frames in the present invention are also usually in RGB format, but RGB values do not directly reflect the morphological features of an image; they only describe color mixing in optical terms. The color value of each pixel in a gray image, also called its gray level, is the depth of that point in a black-and-white image and generally ranges from 0 to 255, with white being 255 and black being 0. The gray value expresses how light or dark a pixel is, and the gray histogram counts, for each gray value, the number of pixels in a digital image with that value. A gray image has no color: its R, G and B components are all equal. For example, when the three RGB values are the same, an image with 256 gray levels is obtained; for instance, RGB(100,100,100) represents a gray level of 100 and RGB(50,50,50) represents a gray level of 50.
The mainstream graying methods at present include: maximum, mean, and weighted mean. Assuming that the generated gray scale image is represented by Grad and the three color channels of the color image are represented by R, G and B, respectively, then
Maximum method:
Grad(i,j)=max(R(i,j),G(i,j),B(i,j))
average value method:
Grad(i,j)=(R(i,j)+G(i,j)+B(i,j))/3
weighted average method:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
since the human eye is most sensitive to green and least sensitive to blue, the image is usually grayed by a weighted average method. The present invention does not limit the specific graying method, and preferably, the present invention grays an image by using a weighted average method.
Therefore, based on the grayed image, the information entropy calculation formula of the calculated image frame is as follows:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
The information entropy formula shows that the more evenly the gray values occur (i.e., the closer their probabilities of occurrence are to one another), the larger the information content of the image and the richer the image content.
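As a concrete illustration of the weighted-average graying and the information entropy calculation described above, the following sketch computes the information content of one image frame; it assumes NumPy and frames in OpenCV's BGR channel order, and the function names are illustrative only.

```python
import numpy as np

def weighted_gray(frame_bgr):
    """Weighted-average graying: Grad = 0.299*R + 0.587*G + 0.114*B.
    OpenCV stores channels in B, G, R order."""
    b = frame_bgr[..., 0].astype(np.float64)
    g = frame_bgr[..., 1].astype(np.float64)
    r = frame_bgr[..., 2].astype(np.float64)
    return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)

def image_entropy(gray):
    """Information entropy E = -sum_{i=0}^{255} P(i) * log2(P(i)),
    where P(i) is the probability that pixel value i appears in the image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                          # skip zero-probability values (0*log0 -> 0)
    return float(-(p * np.log2(p)).sum())
```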
S3, selecting a plurality of image frames with the largest information amount as key frames of the short video;
the invention adopts the information entropy to represent the information quantity of the image frame, and the larger the information entropy is, the larger the information quantity contained in the image frame is. In order to avoid the influence of fuzzy frames or pure color frames and the like on short video classification, reduce the data processing amount of feature extraction and improve the classification efficiency of short videos, dense frames are screened based on the information entropy, and a plurality of image frames with the largest information amount are selected as key frames of the short videos.
Specifically, the information entropies of the n image frames are calculated to obtain the information entropy set EN = {E1, E2, …, En}, where Ej (j = 1, 2, …, n) is the information entropy of the j-th video frame. After the information content of all image frames has been calculated, the entropy set is reordered from largest to smallest information content, and the corresponding set of image frame numbers PN = {P1, P2, …, Pn} is output according to the reordered entropy set. For example, suppose the information entropy set of the 1st to 4th video frames is EN = {3, 6, 2, 5}: the information entropy of the 1st image frame is 3, that of the 2nd is 6, that of the 3rd is 2, and that of the 4th is 5. After sorting the entropies from largest to smallest, the sorted set is {6, 5, 3, 2}, corresponding in turn to the 2nd, 4th, 1st and 3rd image frames, so PN = {2, 4, 1, 3}. Selecting the image frames with the largest information content as the key frames of the short video then means taking the image frames whose numbers appear in the first N positions of the PN sequence. If the number of key frames is 2, the frames corresponding to the first two elements of PN are taken. In general, for a sequence PN of length n, the image frames corresponding to the first N elements of the sequence are taken.
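The sorting and selection just described might look like the following sketch, reusing the illustrative weighted_gray and image_entropy helpers above; restoring the kept frames to temporal order before splicing is an assumption made here, since the 3D-CNN convolves over the time dimension.

```python
def select_keyframes(frames, num_keyframes):
    """Rank the dense frames by information entropy and keep the top N."""
    entropies = [image_entropy(weighted_gray(f)) for f in frames]   # EN = {E1, ..., En}
    order = sorted(range(len(frames)),
                   key=lambda j: entropies[j], reverse=True)        # PN (0-based indices)
    keep = sorted(order[:num_keyframes])    # first N entries, restored to temporal order
    return [frames[j] for j in keep]
```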
And S4, splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
With the development of deep learning Convolutional Neural Networks (CNN) in the image field, existing video feature extraction generally uses a 2D-CNN to extract features from the key frames and then merges the key frame features through a fusion algorithm. For a 2D-CNN, each image frame of the short video is usually treated as one feature map, so the 2D-CNN input is F = (W × H × C), where W corresponds to the width of each image frame, H to its height, and C to its number of channels; the network outputs a feature vector for each image frame, and these feature vectors are combined into the features of the short video.
However, the 2D-CNN performs independent feature extraction on each image frame as a static picture without considering motion information in a time dimension, and therefore, the present invention performs spatio-temporal feature extraction on the image frames by using the 3D-CNN to comprehensively obtain information features of short videos.
Therefore, after extracting N key frames, the invention splices the N selected video frames into a tensor of size N × W × H × C, where N is the number of key frames, W corresponds to the width of each image frame, H to its height, and C to its number of channels. One short video thus corresponds to one tensor of size N × W × H × C, and this short video tensor is used as the input of the 3D-CNN classification model.
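Splicing the N selected key frames into a single N × W × H × C tensor can be as simple as the following sketch, assuming NumPy and key frames of identical size; the axis ordering follows the text above.

```python
import numpy as np

def splice_to_tensor(keyframes):
    """Stack N key frames (each H x W x C) into one N x W x H x C tensor."""
    stacked = np.stack(keyframes, axis=0)       # (N, H, W, C)
    return stacked.transpose(0, 2, 1, 3)        # (N, W, H, C)
```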
The method utilizes the deep learning model 3D-CNN to extract the short video characteristics, and classifies the short videos based on the extracted characteristics. The 3D-CNN classification model is specifically generated as follows:
constructing a 3D-CNN convolutional neural network; training the 3D-CNN convolutional neural network through short video sample data to obtain a 3D-CNN classification model; and extracting short video characteristics based on the 3D-CNN classification model, and outputting short video categories.
The basic structural components of the 3D-CNN are 3D convolution layers, a hardwired layer, down-sampling layers, a fully connected layer and an output layer. The hardwired layer generates multiple channels of information by processing the original frames, and these channels are then processed separately. A 3D convolution layer uses three-dimensional convolution kernels and is responsible for extracting a variety of features. The down-sampling layers reduce the dimensionality of the feature maps, and the fully connected layer combines the two-dimensional features into one-dimensional features that are classified and output by the final output layer. A 3D-CNN typically contains one hardwired layer, one fully connected layer and one output layer; the numbers of 3D convolution layers and down-sampling layers can be chosen according to the actual situation and are not limited here. Each down-sampling layer is typically placed after a convolution layer, and the fully connected layer lies between the last convolution layer and the output layer.
FIG. 2 shows a 3D-CNN that includes three 3D convolution layers and two down-sampling layers. The channel information of each image frame is first extracted by the hardwired layer, and each channel is then convolved with a first 3D convolution kernel of size 7 × 7 × 3, where 7 × 7 is the spatial dimension and 3 is the temporal dimension. The result is down-sampled with a 2 × 2 window by the first down-sampling layer. The down-sampled features are convolved per channel with a second 3D convolution kernel of size 7 × 6 × 3, where 7 × 6 is the spatial dimension and 3 is the temporal dimension, and then down-sampled with a 3 × 3 window by the second down-sampling layer. A third 3D convolution kernel of size 7 × 4 (FIG. 2) then convolves in the spatial dimension only. Finally, a fully connected layer connects every feature produced by the third convolution with all features in the second down-sampling layer to generate the feature vector of the short video, and this feature vector is fed into the output layer for classification. The output layer comprises a Softmax classifier that classifies the input short video and outputs the result.
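A simplified PyTorch sketch of a 3D-CNN with three convolution stages and two down-sampling layers is given below. It is not the exact network of FIG. 2: the hardwired layer is replaced by the raw RGB channels, global average pooling stands in for the exact fully connected wiring, the channel counts are assumptions, and the N × W × H × C tensor is assumed to be permuted to the (batch, C, N, H, W) layout PyTorch expects.

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """Illustrative 3D-CNN: three 3D convolution stages, two down-sampling
    layers, a fully connected layer and a Softmax output."""

    def __init__(self, num_classes, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            # first 3D convolution, 7x7 spatial x 3 temporal
            nn.Conv3d(in_channels, 32, kernel_size=(3, 7, 7), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # 2x2 spatial down-sampling
            # second 3D convolution, 7x6 spatial x 3 temporal
            nn.Conv3d(32, 64, kernel_size=(3, 7, 6), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3)),        # 3x3 spatial down-sampling
            # third convolution, 7x4 spatial dimension only
            nn.Conv3d(64, 128, kernel_size=(1, 7, 4), padding=(0, 3, 2)),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)             # collapse to one value per channel
        self.fc = nn.Linear(128, num_classes)           # fully connected layer

    def forward(self, x):
        # x: (batch, C, N, H, W); apply torch.softmax to the returned logits
        # at inference time to obtain the Softmax class probabilities
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)
```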
After the 3D-CNN is constructed, it is trained on training data to obtain the 3D-CNN classification model. The method loads short video data labelled with category information, optimizes the network through the loss function of the 3D-CNN classification model, and thereby trains the 3D-CNN classification model. Once trained, the model performs feature learning on the short video tensor and outputs the short video category.
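A hedged sketch of this training step follows, assuming PyTorch, a data loader yielding (clip, label) pairs with clips already permuted to (batch, C, N, H, W), and cross-entropy as the loss function (the text does not name a specific loss, so this choice is an assumption).

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, num_epochs=10, lr=1e-3, device="cpu"):
    """Optimize the 3D-CNN classification model on labelled short-video tensors."""
    model = model.to(device)
    model.train()
    criterion = nn.CrossEntropyLoss()                       # classification loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for clips, labels in loader:                        # clips: (B, C, N, H, W)
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)          # logits vs. category labels
            loss.backward()
            optimizer.step()
    return model
```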
Example two
As shown in fig. 3, the present embodiment provides a short video classification system based on optimized video key frame extraction, including:
the frame cutting module is used for extracting short video dense frames;
the 30fps, 60fps short video in the general sense means that the short video is composed of 30 or 60 pictures in 1 second, and 300 or 600 images if a short video of 10 seconds is used. If all image frames in the short video are processed, the performance of the system is undoubtedly affected, and the conventional short video frame-cutting process is generally to perform frame-cutting processing on the short video according to a certain time interval, decompose the short video into a plurality of image frames, for example, extract the image frames according to an interval of 1 second. As described above, the randomness of the video frames extracted at equal intervals in the video frame extraction process is too high, and there is usually a problem that the extracted frames generate motion blur due to video jitter or are pure color due to picture switching, and if these frames are also put into a subsequent model for training or prediction, the classification accuracy will be affected. Therefore, the short videos are classified by the short video classification method, the short videos are subjected to frame truncation processing like the traditional video classification method, but the image frames of the frame truncation are further screened by the short video classification method, the image frames extracted from one short video are subjected to subsequent analysis, and the influence of fuzzy frames or pure color frames on the short video classification is avoided. Meanwhile, in order to avoid the situation that the information of the short video cannot be well represented due to too small number of the screened image frames, the invention firstly extracts the dense frames of the short video, wherein the dense frames refer to frame cutting of the short video according to a smaller interval time.
In particular, the number n of dense frames is typically determined by the number of key frames needed to represent the short video information. For example, for a short video of duration T seconds, if N key picture frames are needed to represent its information, the prior art extracts N picture frames at intervals of T/N seconds and uses them directly as the key frames representing the short video. In the present application, the number n of dense frames is usually several times the number of key frames; for example, with n = 3N, n picture frames are extracted at intervals of T/n seconds.
The information quantity calculating module is used for calculating the information quantity of each image frame in the dense frame;
In order to avoid the influence of blurred frames or solid-color frames on short video classification, the invention calculates the information content of each image frame among the dense frames. In particular, the invention uses information entropy to characterize the information content of an image frame. Before calculating the information entropy, the original image must be converted to gray scale, because the image frames of a video are usually color images. Typically, a color image seen by the human eye is composed of the three RGB color components, each of which normally takes a value from 0 to 255. The image frames in the present invention are also usually in RGB format, but RGB values do not directly reflect the morphological features of an image; they only describe color mixing in optical terms. The color value of each pixel in a gray image, also called its gray level, is the depth of that point in a black-and-white image and generally ranges from 0 to 255, with white being 255 and black being 0. The gray value expresses how light or dark a pixel is, and the gray histogram counts, for each gray value, the number of pixels in a digital image with that value. A gray image has no color: its R, G and B components are all equal. For example, when the three RGB values are the same, an image with 256 gray levels is obtained; for instance, RGB(100,100,100) represents a gray level of 100 and RGB(50,50,50) represents a gray level of 50.
The mainstream graying methods at present include: maximum, mean, and weighted mean. Assuming that the generated gray scale image is represented by Grad and the three color channels of the color image are represented by R, G and B, respectively, then
Maximum method:
Grad(i,j)=max(R(i,j),G(i,j),B(i,j))
average value method:
Grad(i,j)=(R(i,j)+G(i,j)+B(i,j))/3
weighted average method:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
since the human eye is most sensitive to green and least sensitive to blue, the image is usually grayed by a weighted average method. The present invention does not limit the specific graying method, and preferably, the present invention grays an image by using a weighted average method.
Therefore, based on the grayed image, the information entropy calculation formula of the calculated image frame is as follows:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
The information entropy formula shows that the more evenly the gray values occur (i.e., the closer their probabilities of occurrence are to one another), the larger the information content of the image and the richer the image content.
The key frame selecting module is used for selecting a plurality of image frames with the largest information amount as key frames of the short video;
the invention adopts the information entropy to represent the information quantity of the image frame, and the larger the information entropy is, the larger the information quantity contained in the image frame is. In order to avoid the influence of fuzzy frames or pure color frames and the like on short video classification, reduce the data processing amount of feature extraction and improve the classification efficiency of short videos, dense frames are screened based on the information entropy, and a plurality of image frames with the largest information amount are selected as key frames of the short videos.
Specifically, the information entropies of the n image frames are calculated to obtain the information entropy set EN = {E1, E2, …, En}, where Ej (j = 1, 2, …, n) is the information entropy of the j-th video frame. After the information content of all image frames has been calculated, the entropy set is reordered from largest to smallest information content, and the corresponding set of image frame numbers PN = {P1, P2, …, Pn} is output according to the reordered entropy set. For example, suppose the information entropy set of the 1st to 4th video frames is EN = {3, 6, 2, 5}: the information entropy of the 1st image frame is 3, that of the 2nd is 6, that of the 3rd is 2, and that of the 4th is 5. After sorting the entropies from largest to smallest, the sorted set is {6, 5, 3, 2}, corresponding in turn to the 2nd, 4th, 1st and 3rd image frames, so PN = {2, 4, 1, 3}. Selecting the image frames with the largest information content as the key frames of the short video then means taking the image frames whose numbers appear in the first N positions of the PN sequence. If the number of key frames is 2, the frames corresponding to the first two elements of PN are taken. In general, for a sequence PN of length n, the image frames corresponding to the first N elements of the sequence are taken.
And the classification module is used for splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
With the development of deep learning Convolutional Neural Networks (CNN) in the image field, existing video feature extraction generally uses a 2D-CNN to extract features from the key frames and then merges the key frame features through a fusion algorithm. For a 2D-CNN, each image frame of the short video is usually treated as one feature map, so the 2D-CNN input is F = (W × H × C), where W corresponds to the width of each image frame, H to its height, and C to its number of channels; the network outputs a feature vector for each image frame, and these feature vectors are combined into the features of the short video.
However, the 2D-CNN performs independent feature extraction on each image frame as a static picture without considering motion information in a time dimension, and therefore, the present invention performs spatio-temporal feature extraction on the image frames by using the 3D-CNN to comprehensively obtain information features of short videos.
Therefore, after extracting N key frames, the invention splices the N selected video frames into a tensor of size N × W × H × C, where N is the number of key frames, W corresponds to the width of each image frame, H to its height, and C to its number of channels. One short video thus corresponds to one tensor of size N × W × H × C, and this short video tensor is used as the input of the 3D-CNN classification model.
The method utilizes the deep learning model 3D-CNN to extract the short video characteristics, and classifies the short videos based on the extracted characteristics. The 3D-CNN classification model is specifically generated as follows:
constructing a 3D-CNN convolutional neural network; training the 3D-CNN convolutional neural network through short video sample data to obtain a 3D-CNN classification model; and extracting short video characteristics based on the 3D-CNN classification model, and outputting short video categories.
The basic structural components of the 3D-CNN are 3D convolution layers, a hardwired layer, down-sampling layers, a fully connected layer and an output layer. The hardwired layer generates multiple channels of information by processing the original frames, and these channels are then processed separately. A 3D convolution layer uses three-dimensional convolution kernels and is responsible for extracting a variety of features. The down-sampling layers reduce the dimensionality of the feature maps, and the fully connected layer combines the two-dimensional features into one-dimensional features that are classified and output by the final output layer. A 3D-CNN typically contains one hardwired layer, one fully connected layer and one output layer; the numbers of 3D convolution layers and down-sampling layers can be chosen according to the actual situation and are not limited here. Each down-sampling layer is typically placed after a convolution layer, and the fully connected layer lies between the last convolution layer and the output layer. The output layer comprises a Softmax classifier that classifies the input short video and outputs the result.
After the 3D-CNN is constructed, it is trained on training data to obtain the 3D-CNN classification model. The method loads short video data labelled with category information, optimizes the network through the loss function of the 3D-CNN classification model, and thereby trains the 3D-CNN classification model. Once trained, the model performs feature learning on the short video tensor and outputs the short video category.
In summary, the short video classification method and system based on optimized video key frame extraction provided by the invention, combined with the presentation form of a specific client, first extract dense frames and screen the video frames based on information content, so that the extracted short video key frames contain rich information; this avoids extracted frames that are motion-blurred by video jitter or left as solid-color frames by picture switching, and improves the accuracy of short video classification based on the key frames. Screening the dense frames down to a preset number of key frames improves classification accuracy without increasing the number of key frame feature extractions, so classification efficiency remains high. Using a 3D-CNN instead of feature extraction + LSTM classifies the short video directly, i.e., a single model performs feature extraction and classification at the same time, which reduces resource consumption, improves time efficiency and makes processing real-time. Finally, the 3D-CNN extracts the spatio-temporal features of the image frames so as to comprehensively obtain the information features of the short video.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A short video classification method based on optimized video key frame extraction is characterized by comprising the following steps:
s1, extracting short video dense frames;
s2, calculating the information content of each image frame in the dense frame;
s3, selecting a plurality of image frames with the largest information amount as key frames of the short video;
and S4, splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
2. The method of claim 1, wherein the number of dense frames is m times the number of key frames, and m ≧ 2.
3. The method for classifying short video according to claim 1, wherein the step S2 specifically comprises:
s21, graying each image frame in the dense frame, wherein three color channels of the color image are represented by R, G, B, respectively, and the grayscale map Grad is:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
s22, calculating the information entropy of the grayed image frame:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
4. The method of claim 1, wherein the short video tensor is of size N × W × H × C, where N is the number of key frames, W corresponds to the width of each image frame, H corresponds to the height of each image frame, and C corresponds to the number of channels of each image frame.
5. The short video classification method according to claim 4, wherein the 3D-CNN comprises a hardwired layer, three 3D convolution layers, two down-sampling layers, a fully connected layer and an output layer; the hardwired layer generates multiple channels of information by processing the key frames; the 3D convolution layers are used for extracting various features; the down-sampling layers are used for reducing the dimensionality of the features; the fully connected layer is used for combining the two-dimensional features into one-dimensional features; and the output layer comprises a Softmax classifier for classifying and outputting the short video category based on the one-dimensional features.
6. A system for classifying short video based on optimized video keyframe extraction, comprising:
the frame cutting module is used for extracting short video dense frames;
the information quantity calculating module is used for calculating the information quantity of each image frame in the dense frame;
the key frame selecting module is used for selecting a plurality of image frames with the largest information amount as key frames of the short video;
and the classification module is used for splicing the key frames to generate a short video tensor, inputting the short video tensor into a 3D-CNN classification model, performing feature learning on the short video tensor based on the 3D-CNN classification model, and outputting a short video category.
7. The short video classification system according to claim 6, characterized in that the number of dense frames is m times the number of key frames, m ≧ 2.
8. The short video classification system according to claim 6, wherein the information amount calculation module comprises:
the graying module is used for graying each image frame in the dense frames, three color channels of the color image are respectively represented by R, G, B, and then the grayscale image Grad is:
Grad(i,j)=0.299*R(i,j)+0.587*G(i,j)+0.114*B(i,j)
the calculation module is used for calculating the information entropy of the grayed image frame:
E = -∑_{i=0}^{255} P(i) * log2(P(i))
wherein P(i) is the probability that pixel value i appears in the image, and image pixel values range from 0 to 255.
9. The short video classification system of claim 6, wherein the short video tensor is of size N × W × H × C, where N is the number of key frames, W corresponds to the width of each image frame, H corresponds to the height of each image frame, and C corresponds to the number of channels of each image frame.
10. The short video classification system according to claim 9, wherein the 3D-CNN comprises one hardwired layer, three 3D convolution layers, two down-sampling layers, one fully connected layer and one output layer; the hardwired layer generates multiple channels of information by processing the key frames; the 3D convolution layers are used for extracting various features; the down-sampling layers are used for reducing the dimensionality of the features; the fully connected layer is used for combining the two-dimensional features into one-dimensional features; and the output layer comprises a Softmax classifier for classifying and outputting the short video category based on the one-dimensional features.
CN201911420703.9A 2019-12-31 2019-12-31 Short video classification method and system based on optimized video key frame extraction Pending CN111079864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911420703.9A CN111079864A (en) 2019-12-31 2019-12-31 Short video classification method and system based on optimized video key frame extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911420703.9A CN111079864A (en) 2019-12-31 2019-12-31 Short video classification method and system based on optimized video key frame extraction

Publications (1)

Publication Number Publication Date
CN111079864A true CN111079864A (en) 2020-04-28

Family

ID=70321331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911420703.9A Pending CN111079864A (en) 2019-12-31 2019-12-31 Short video classification method and system based on optimized video key frame extraction

Country Status (1)

Country Link
CN (1) CN111079864A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541390A (en) * 2020-10-30 2021-03-23 四川天翼网络服务有限公司 Frame-extracting dynamic scheduling method and system for violation analysis of examination video
CN112601068A (en) * 2020-12-15 2021-04-02 济南浪潮高新科技投资发展有限公司 Video data augmentation method, device and computer readable medium
JPWO2021229693A1 (en) * 2020-05-12 2021-11-18
CN113870259A (en) * 2021-12-02 2021-12-31 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
CN115086472A (en) * 2022-06-13 2022-09-20 泰州亚东广告传媒有限公司 Mobile phone APP management platform based on key frame information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966104A (en) * 2015-06-30 2015-10-07 孙建德 Three-dimensional convolutional neural network based video classifying method
CN107220585A (en) * 2017-03-31 2017-09-29 南京邮电大学 A kind of video key frame extracting method based on multiple features fusion clustering shots
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
WO2018166288A1 (en) * 2017-03-15 2018-09-20 北京京东尚科信息技术有限公司 Information presentation method and device
CN110414617A (en) * 2019-08-02 2019-11-05 北京奇艺世纪科技有限公司 A kind of video feature extraction method and device, video classification methods and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104966104A (en) * 2015-06-30 2015-10-07 孙建德 Three-dimensional convolutional neural network based video classifying method
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
WO2018166288A1 (en) * 2017-03-15 2018-09-20 北京京东尚科信息技术有限公司 Information presentation method and device
CN107220585A (en) * 2017-03-31 2017-09-29 南京邮电大学 A kind of video key frame extracting method based on multiple features fusion clustering shots
CN110414617A (en) * 2019-08-02 2019-11-05 北京奇艺世纪科技有限公司 A kind of video feature extraction method and device, video classification methods and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
李丹锦: "Design and Implementation of a Video Classification Algorithm Based on Multimodal Face Features" *
杨曙光: "An Improved Deep Learning Video Classification Method" *
潘磊, 吴小俊, 尤媛媛: "Clustering-Based Video Shot Segmentation and Key Frame Extraction" *
裴颂文, 杨保国, 顾春华: "Research on Video Stream Classification with Fused Three-Dimensional Convolutional Neural Networks" *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2021229693A1 (en) * 2020-05-12 2021-11-18
JP7364061B2 (en) 2020-05-12 2023-10-18 日本電信電話株式会社 Learning devices, learning methods and learning programs
CN112541390A (en) * 2020-10-30 2021-03-23 四川天翼网络服务有限公司 Frame-extracting dynamic scheduling method and system for violation analysis of examination video
CN112541390B (en) * 2020-10-30 2023-04-25 四川天翼网络股份有限公司 Frame extraction dynamic scheduling method and system for examination video violation analysis
CN112601068A (en) * 2020-12-15 2021-04-02 济南浪潮高新科技投资发展有限公司 Video data augmentation method, device and computer readable medium
CN112601068B (en) * 2020-12-15 2023-01-24 山东浪潮科学研究院有限公司 Video data augmentation method, device and computer readable medium
CN113870259A (en) * 2021-12-02 2021-12-31 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
CN113870259B (en) * 2021-12-02 2022-04-01 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion assessment method, device, equipment and storage medium
WO2023098524A1 (en) * 2021-12-02 2023-06-08 天津御锦人工智能医疗科技有限公司 Multi-modal medical data fusion evaluation method and apparatus, device, and storage medium
CN115086472A (en) * 2022-06-13 2022-09-20 泰州亚东广告传媒有限公司 Mobile phone APP management platform based on key frame information
CN115086472B (en) * 2022-06-13 2023-04-18 广东天讯达资讯科技股份有限公司 Mobile phone APP management system based on key frame information

Similar Documents

Publication Publication Date Title
Liu et al. Robust video super-resolution with learned temporal dynamics
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
EP4109392A1 (en) Image processing method and image processing device
US11741578B2 (en) Method, system, and computer-readable medium for improving quality of low-light images
US20230214976A1 (en) Image fusion method and apparatus and training method and apparatus for image fusion model
CN111292264A (en) Image high dynamic range reconstruction method based on deep learning
CN111612722B (en) Low-illumination image processing method based on simplified Unet full-convolution neural network
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
CN111985281B (en) Image generation model generation method and device and image generation method and device
CN112906649A (en) Video segmentation method, device, computer device and medium
CN112307982B (en) Human body behavior recognition method based on staggered attention-enhancing network
AU2015201623A1 (en) Choosing optimal images with preference distributions
CN111047543A (en) Image enhancement method, device and storage medium
CN112602088A (en) Method, system and computer readable medium for improving quality of low light image
CN110929099A (en) Short video frame semantic extraction method and system based on multitask learning
CN114627034A (en) Image enhancement method, training method of image enhancement model and related equipment
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
Diaz-Ramirez et al. Real-time haze removal in monocular images using locally adaptive processing
Parekh et al. A survey of image enhancement and object detection methods
CN115457015A (en) Image no-reference quality evaluation method and device based on visual interactive perception double-flow network
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN115471413A (en) Image processing method and device, computer readable storage medium and electronic device
CN110489584B (en) Image classification method and system based on dense connection MobileNet model
CN114299105A (en) Image processing method, image processing device, computer equipment and storage medium
Fu et al. Low-light image enhancement base on brightness attention mechanism generative adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination