CN113326760A - Video classification method and device - Google Patents

Video classification method and device

Info

Publication number
CN113326760A
Authority
CN
China
Prior art keywords
video
classification
classified
generate
sampling
Prior art date
Legal status
Granted
Application number
CN202110578272.XA
Other languages
Chinese (zh)
Other versions
CN113326760B (en)
Inventor
马进
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202110578272.XA priority Critical patent/CN113326760B/en
Publication of CN113326760A publication Critical patent/CN113326760A/en
Application granted granted Critical
Publication of CN113326760B publication Critical patent/CN113326760B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a video classification method and a video classification device. The video classification method comprises the following steps: acquiring a video to be classified and dividing it into a plurality of corresponding video segments; performing feature extraction on the video frames contained in the video segments and generating a corresponding first video feature for each video segment; performing dimension reduction processing on the first video features to generate second video features; determining the video category of each video segment according to the second video features; and classifying the video to be classified according to the video categories of the video segments to generate a corresponding classification result.

Description

Video classification method and device
Technical Field
The embodiment of the application relates to the technical field of video processing, in particular to a video classification method. One or more embodiments of the present application also relate to a video classification apparatus, a computing device, and a computer-readable storage medium.
Background
With the continuous development of internet technology, the field of multimedia information processing has received more and more attention. As the pace of life accelerates, most users tend to use fragmented time to browse short videos published and shared on different social platforms, and short videos have become part of every aspect of users' lives. It is therefore increasingly important to manage short videos more effectively so as to provide more precise services to users.
Because a video is composed of a plurality of video frames, the picture in a video may remain unchanged for a certain time, for example changing once every period T. If T exceeds 50 ms, however, the user can clearly perceive that the video picture is discontinuous, which gives the user a poor viewing experience; such videos therefore need to be detected and identified during video management. At present, the degree of change between consecutive frames of a video is mostly analyzed manually or by image-processing techniques, and the judgment of whether a video belongs to the target video category produced in this way is often of low accuracy, so an effective method for overcoming this problem is urgently needed.
Disclosure of Invention
In view of the above, the present application provides a video classification method. One or more embodiments of the present application also relate to a video classification apparatus, a computing device, and a computer-readable storage medium, so as to solve the problem in the prior art that judging whether a video to be classified belongs to the target video category by analyzing the degree of change between consecutive frames, whether manually or by image-processing techniques, often yields results of low accuracy.
According to a first aspect of embodiments of the present application, there is provided a video classification method, including:
acquiring a video to be classified, and performing fragment division on the video to be classified to generate a plurality of corresponding video fragments;
extracting the characteristics of video frames contained in the plurality of video clips, and respectively generating corresponding first video characteristics for each video clip;
performing dimensionality reduction processing on the first video features to generate second video features, and respectively determining video categories corresponding to the video clips according to the second video features;
and classifying the videos to be classified according to the video categories respectively corresponding to the plurality of video clips to generate corresponding classification results.
According to a second aspect of embodiments of the present application, there is provided a video classification apparatus including:
the device comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is configured to acquire a video to be classified, perform fragment division on the video to be classified and generate a plurality of corresponding video fragments;
the feature extraction module is configured to perform feature extraction on video frames contained in the plurality of video clips, and generate corresponding first video features for each video clip respectively;
the dimension reduction processing module is configured to perform dimension reduction processing on the first video features, generate second video features, and respectively determine video categories corresponding to the video clips according to the second video features;
and the generation module is configured to classify the video to be classified according to the video categories respectively corresponding to the plurality of video clips, and generate corresponding classification results.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor implements the steps of the video classification method when executing the computer-executable instructions.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the video classification method.
An embodiment of the application realizes a video classification method and a video classification device. The video classification method comprises: acquiring a video to be classified and dividing it into a plurality of corresponding video segments; performing feature extraction on the video frames contained in the video segments and generating a corresponding first video feature for each video segment; performing dimension reduction processing on the first video features to generate second video features; determining the video category of each video segment according to the second video features; and classifying the video to be classified according to the video categories of the video segments to generate a corresponding classification result.
In the embodiment of the application, the video to be classified is divided into video segments, features are extracted from the video frames contained in the segments, the video category of each segment is determined from the feature-extraction result, and the video category of the video to be classified is judged comprehensively from the categories of the different segments, thereby improving the accuracy of the video classification result for the video to be classified.
Drawings
Fig. 1 is a flowchart of a video classification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a video classification process provided in one embodiment of the present application;
fig. 3 is a flowchart of the video classification method applied to the short video classification of the self-media domain according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
First, the noun terms to which one or more embodiments of the present application relate are explained.
A convolutional neural network: a Convolutional Neural Network (CNN) is a class of feedforward neural networks that contain convolution computations and have a deep structure, and is a representative algorithm of deep learning.
PPT video: a video whose picture does not change for a certain period of time, for example changing once every period T. Typically T exceeds 50 ms, so the human eye can clearly perceive that the video picture is discontinuous.
In the present application, a video classification method is provided. One or more embodiments of the present application are also directed to a video classification apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
The video classification method provided by the embodiment of the application can be applied to any field in which videos need to be classified, such as the classification of live videos in the live-streaming field, of film and television videos in the film-and-television field, and of recorded short videos in the self-media field. For ease of understanding, the embodiments of the present application describe in detail the application of the video classification method to the classification of recorded short videos in the self-media field, but the method is not limited thereto.
In the following, taking as an example the case where the video classification method is applied to the classification of recorded short videos in the self-media field, the video to be classified acquired in the video classification method can be understood as a recorded short video on which video classification is to be performed.
In specific implementation, the video to be classified in the embodiment of the present application may be presented on display terminals such as large-scale video playing devices, game consoles, desktop computers, smart phones, tablet computers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptops, and e-book readers.
Referring to fig. 1, fig. 1 shows a flowchart of a video classification method according to an embodiment of the present application, including the following steps:
Step 102, acquiring a video to be classified, and performing segment division on the video to be classified to generate a plurality of corresponding video segments.
Specifically, the video to be classified is a video that needs to be classified. Video classification in the embodiment of the application means determining whether the video to be classified is a PPT video, that is, a video in which the user can clearly perceive that the picture is discontinuous. Because a video is composed of a plurality of video frames, the picture may remain unchanged for a certain time, for example changing once every period T. If T exceeds 50 ms, the user clearly perceives the discontinuity of the video picture, and such a video can impair the user's viewing experience; the embodiment of the application therefore needs to identify and detect the video to be classified to determine whether it is a PPT video.
In practical applications, the classification standard of the video to be classified may be determined according to actual requirements, and is not limited herein.
Specifically, performing segment division on the video to be classified means dividing it into a plurality of different video segments, or extracting a plurality of different video segments from it. In the embodiment of the present application, the video to be classified may be divided into a plurality of video segments of the same or different durations, or a plurality of different video segments of the same or different durations may be extracted from it. Among the extracted segments, any two or more may contain the same video frames; that is, any two or more extracted video segments may overlap.
After the video to be classified is subjected to segment division, video categories corresponding to a plurality of generated video segments can be determined, and the video category of the video to be classified is determined according to the video categories corresponding to different video segments.
In specific implementation, the video to be classified is segmented to generate a plurality of corresponding video segments, that is, video frames contained in the video to be classified are sampled to generate a plurality of video frame sets of the video to be classified, wherein each video frame set corresponds to one video segment of the video to be classified, and each video frame set contains a plurality of video frames of the video to be classified.
Further, sampling the video frames contained in the video to be classified to generate a plurality of video frame sets means sliding a sampling window over the video frames of the video to be classified with a preset sliding step, and taking the video frames extracted by the sampling window at each position during the sliding as one video frame set of the video to be classified, thereby generating the plurality of video frame sets.
Specifically, since the video to be classified is composed of a plurality of video frames, segment division is performed by sampling the video frames contained in the video to be classified, thereby generating a plurality of video frame sets. The video frames contained in each video frame set constitute one video segment of the video to be classified; that is, each video frame set corresponds to one video segment and contains a plurality of video frames of the video to be classified.
For example, suppose the video to be classified contains 1000 video frames. The frames can be sampled by taking 100 consecutive frames as one sampling unit, setting the sampling interval to 100 frames, and then sampling continuously from the start frame of the video; this yields 10 video frame sets, that is, 10 video segments. If the sampling interval is set to a value smaller than 100 frames, the video frames in any two or more of the sampled video frame sets overlap, and the number of video frame sets obtained is greater than 10.
Further, the video to be classified may be sampled through a sampling window, that is, the sampling window starts from an initial sliding position and slides in the video frame of the video to be classified according to a preset sliding step length, so as to use the video frame extracted by each sampling window in the sliding process as the video clip of the video to be classified.
The start sliding position may be the position of the start video frame of the video to be classified, the position of the end video frame, or the position of any other video frame of the video to be classified.
Starting from the initial sliding position, sliding the sampling window according to the arrangement sequence of the video frames in the video to be classified and a preset sliding step length; the arrangement sequence is the arrangement sequence of the video frames between the start video frame and the end video frame of the video to be classified. Generally, video frames are arranged in a certain order to form a video expressing a certain meaning.
In addition, the embodiment of the application does not limit the size of the sampling window. In practical application, when video frames are sampled with a sampling window, the window size may be a fixed value or may change during the sliding of the window. If the size changes during sliding, it may change randomly or according to a certain rule, which can be determined by actual requirements and is not limited here. If a sampling window of fixed size is used to sample the video frames of the video to be classified, the window width can be set, for example, to 5000 ms and the sliding step to 2500 ms; with this sampling mode, the sampled video segments overlap.
In practical applications, the specific sampling manner may be determined according to actual requirements, and is not limited herein.
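For illustration, the following is a minimal Python sketch of this sliding-window sampling over frame indices; the function name and default parameters are illustrative assumptions, not taken from the embodiment itself.

from typing import List

def sample_clips(num_frames: int, window: int = 100, step: int = 100) -> List[List[int]]:
    # Slide a fixed-size window over frame indices 0..num_frames-1.
    # window: frames per segment; step: sliding step. step == window yields
    # disjoint segments, step < window yields overlapping segments.
    clips = []
    start = 0
    while start + window <= num_frames:
        clips.append(list(range(start, start + window)))
        start += step
    return clips

# The example above: 1000 frames, 100-frame units, interval 100 -> 10 segments.
assert len(sample_clips(1000, window=100, step=100)) == 10
# A smaller interval produces overlapping segments and more than 10 of them.
assert len(sample_clips(1000, window=100, step=50)) == 19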
After the video to be classified is subjected to segment division, the video category corresponding to the generated video segment can be determined, so that the video category of the video to be classified is determined according to the video categories corresponding to different video segments.
In the embodiment of the application, video frames of the video to be classified are sampled through a sampling window of variable or fixed size, that is, the video is divided into segments in this way, which simplifies the segment-division process and improves the efficiency of video classification.
Step 104, performing feature extraction on the video frames contained in the plurality of video segments, and respectively generating corresponding first video features for each video segment.
Specifically, after the video to be classified is segmented to generate corresponding video segments, the features of each video frame included in each video segment in the plurality of video segments can be extracted, and first video features corresponding to each video segment are generated, so that the video category of each video segment is determined by using the extracted first video features corresponding to each video segment, and the video category of the video to be classified is comprehensively judged according to the video category of each video segment.
In specific implementation, the feature extraction is performed on the video frames included in the plurality of video segments, specifically, the video frames included in the plurality of video segments may be input into a classification model, and the feature extraction is performed on the video frames by a feature extraction module of the classification model.
Specifically, the classification model may be a 3D convolutional neural network model, and its feature extraction module may be the convolution layers of that model. Inputting the plurality of video segments into the feature extraction module then means inputting the video frames contained in the segments into the convolution layers of the 3D convolutional neural network model, extracting the temporal and spatial features of each video frame of each segment with 3D convolution kernels, convolving the extracted features with different convolution kernels, and adding the convolution results as the output of the convolution layer, that is, generating the first video feature.
Because the 3D convolutional neural network model can process a plurality of frames of video frames simultaneously, the model is used for extracting the features, thereby being beneficial to improving the processing speed of video classification and ensuring the comprehensiveness and the accuracy of the extracted relevant features among the video frames, and further being beneficial to ensuring the accuracy of the generated video classification result.
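As a hedged sketch of this step, the PyTorch module below (the framework choice is an assumption; the embodiment names no library) applies 3D convolutions jointly over the time and spatial dimensions of a segment; all layer sizes are illustrative.

import torch
import torch.nn as nn

class ClipFeatureExtractor(nn.Module):
    def __init__(self, in_channels: int = 3, feat_channels: int = 64):
        super().__init__()
        # 3D kernels convolve jointly over time (frames) and space (H, W).
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(32, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        return self.conv(clip)  # the "first video feature" F1

# A toy segment of 16 RGB frames at 64x64 resolution.
f1 = ClipFeatureExtractor()(torch.randn(1, 3, 16, 64, 64))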
In specific implementation, the classification model is trained in the following way:
obtaining a training sample set of a pre-training classification model, wherein the training sample set comprises at least two videos and a video category corresponding to each video;
and taking the at least two videos as training samples and the video categories as sample labels, and training the pre-trained classification model to obtain the classification model.
Specifically, in the embodiment of the present application, a pre-training classification model may be trained by using positive and negative samples, where a positive sample is a video with a sample label of 1, a negative sample is a video with a sample label of 0, 1 represents that the video belongs to a target video category (belonging to a PPT video), and 0 represents that the video does not belong to a target video category (not belonging to a PPT video), and the obtained training sample set includes the positive and negative samples.
In practical application, during the training of the pre-trained classification model, a video may first be divided into segments, and the plurality of video segments generated by the division are input into the pre-trained classification model. Feature extraction is performed on the video frames contained in each segment by the feature extraction module of the model to generate a feature F1; dimension reduction is performed on F1 by the dimension-reduction processing module of the model to generate a feature F2; and the video is classified according to F2. The accuracy of the classification result is determined by comparing it with the label corresponding to the video; specifically, a loss value between the classification result and the label is calculated, and the model parameters are adjusted according to the loss value, thereby obtaining the classification model.
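A minimal Python sketch of one such training step is given below, assuming PyTorch and a model that maps a batch of segments to two-class logits (PPT / non-PPT); all names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               clips: torch.Tensor, labels: torch.Tensor) -> float:
    logits = model(clips)                    # forward pass over video segments
    loss = F.cross_entropy(logits, labels)   # loss between result and labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # adjust parameters by the loss
    return loss.item()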
After the classification model is obtained through training, in the application process of the classification model, the video category corresponding to the video clip of the video to be classified can be determined according to the same processing mode.
In addition, after the training sample set is obtained, it can be divided into a training set and a test set. After the pre-trained classification model has been trained on the training set, the classification capability of the classification model can be measured on the test set. Specifically, the classification capability can be evaluated by indexes such as accuracy, precision and recall: accuracy takes as its basis the proportion of correctly predicted samples, positive and negative, among all samples; precision, the proportion of correct predictions among the samples predicted as positive; and recall, the proportion of correctly predicted positive samples among all actual positive samples. The model parameters of the classification model are then adjusted according to the evaluation result to improve the accuracy of its output.
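The three indexes can be computed as in the following plain-Python sketch; the counting scheme is standard and the function name is illustrative.

def evaluate(preds, labels):
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    accuracy = (tp + tn) / len(labels)               # correct / total
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # found among actual positives
    return accuracy, precision, recall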
The embodiment of the application directly utilizes a convolutional neural network to extract the video features of the video to be classified. Since the convolutional neural network has a strong capacity to extract local, global and semantic features from the video frames, an algorithm based on it has stronger robustness and better generalization. During training, the convolutional neural network can learn the slight differences between consecutive frames; even if the picture changes slowly, these slight differences are still reflected in the video features through supervised learning, which helps to ensure the accuracy of the classification results obtained by classifying videos with the convolutional neural network.
Step 106, performing dimension reduction processing on the first video features to generate second video features, and respectively determining the video categories corresponding to the video segments according to the second video features.
Specifically, after the first video features corresponding to each video clip are extracted and obtained, the dimension reduction processing can be performed on the first video features, so that the video category of each video clip is determined by using the second video features generated by the dimension reduction processing, and the video category of the video to be classified is comprehensively judged according to the video category of each video clip.
In specific implementation, the classification model further comprises a dimension-reduction processing module. The first video features extracted by the convolution layers of the 3D convolutional neural network model for each video segment are multidimensional data; therefore, to reduce the complexity of the video classification process and improve its efficiency, in the embodiment of the application the first video features, once extracted, can be input into the dimension-reduction processing module of the classification model, which performs dimension reduction on them to generate the second video features.
Further, the classification model further comprises a classification module; therefore, the video categories corresponding to the video clips are respectively determined according to the second video features, specifically, the second video features can be input into the classification module, and the plurality of video clips are classified according to the second video features by the classification module to generate the video categories corresponding to the plurality of video clips respectively.
Specifically, the classification model comprises a dimension reduction processing module; in the case that the classification model is a 3D convolutional neural network model, the dimension reduction processing module may specifically be a multilayer perceptron (MLP), that is, a neural network including at least one hidden layer and composed of fully-connected layers, and an output of each hidden layer is transformed by an activation function.
The first video feature is processed by the hidden layers of the multilayer perceptron, thereby realizing its dimension reduction.
After the second video characteristics are obtained through the dimension reduction processing, the video categories respectively corresponding to the plurality of video clips can be determined according to the second video characteristics through a classification module of a classification model.
Specifically, the classification module may be a classifier (softmax), which determines whether a video segment contains PPT pictures (adjacent video frames differ so little that the user visually and obviously perceives the video picture as discontinuous) in order to determine whether the video segment belongs to a PPT video.
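A hedged sketch of these two modules, again in PyTorch with assumed layer widths, is as follows; in this view the head outputs class probabilities for inference, while training would typically use the logits directly.

import torch
import torch.nn as nn

class ClipClassifierHead(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 256, num_classes: int = 2):
        super().__init__()
        # A fully-connected hidden layer with an activation reduces the
        # flattened first video feature F1 to the second video feature F2.
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, f1: torch.Tensor) -> torch.Tensor:
        f2 = self.mlp(f1.flatten(1))                      # dimension reduction
        return torch.softmax(self.classifier(f2), dim=1)  # PPT / non-PPT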
Step 108, classifying the video to be classified according to the video categories respectively corresponding to the plurality of video segments, to generate a corresponding classification result.
In specific implementation, the videos to be classified are classified according to the video categories corresponding to the plurality of video clips, so as to generate corresponding classification results, which can be realized in the following manner:
clustering the video clips according to the video categories respectively corresponding to the video clips to generate a video clip set corresponding to each video category;
and determining the classification result of the video to be classified according to the number of the video clips contained in the video clip set.
Specifically, after the video category corresponding to each video segment is obtained, a vote can be taken over these categories to determine whether the video to be classified is a PPT video.
In practical application, the plurality of video segments can be clustered, that is, grouped, according to the video category of each segment. Specifically, the segments belonging to the target video category can form a group named G1, and the segments not belonging to the target video category a group named G2; the classification result of the video to be classified is then determined by the number of segments contained in each group. If G1 contains more segments than G2, the video to be classified is determined to belong to the target video category, that is, it is a PPT video; if G1 contains fewer segments than G2, the video is determined not to belong to the target video category, that is, it is not a PPT video.
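This grouping-and-voting rule amounts to a majority vote, as in the plain-Python sketch below; tie handling is an assumption, since the text does not specify it.

def is_ppt_video(clip_categories):
    g1 = [c for c in clip_categories if c == 1]   # segments in the target (PPT) category
    g2 = [c for c in clip_categories if c == 0]   # segments outside the target category
    return len(g1) > len(g2)  # True: the whole video is judged a PPT video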
Further, after the video to be classified has been classified and a corresponding classification result generated, it can be determined whether the video belongs to the target video category. If it does not, then in order to ensure the accuracy of the video classification result, the video to be classified can be classified a second time, which may specifically be realized in the following manner:
determining whether the video to be classified belongs to a target video category according to the classification result;
if not, uniformly sampling the video to be classified to obtain a corresponding sampling result;
and carrying out secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
Further, the video to be classified is secondarily classified according to the sampling result to generate a corresponding classification result, which can be specifically realized in the following manner:
inputting the video frames contained in the sampling result into a classification model, and performing feature extraction on the video frames through a feature extraction module of the classification model to generate a corresponding feature extraction result;
performing dimensionality reduction processing on the feature extraction result through a dimensionality reduction processing module of the classification model to generate a corresponding dimensionality reduction processing result;
and carrying out secondary classification on the video to be classified according to the dimension reduction processing result through a classification module of the classification model to generate a corresponding classification result.
Specifically, the target video category is the PPT video category. If it is determined from the classification result that the video to be classified does not belong to the target video category, that is, that it is not a PPT video, then, in order to ensure the accuracy of the video classification result, the video to be classified is further uniformly sampled, and the obtained sampling result is used to classify it a second time so as to determine its classification result.
In practical application, the sampling result can be input into the classification model, and the video frames it contains are processed by the model to generate the classification result of the video to be classified. Specifically, the video frames contained in the sampling result are input into the classification model; the feature extraction module (convolution layers) of the model performs feature extraction on the frames to generate a corresponding feature-extraction result; the dimension-reduction processing module (multilayer perceptron, MLP) performs dimension reduction on the feature-extraction result to generate a corresponding dimension-reduction result; and the classification module (softmax classifier) performs secondary classification of the video to be classified according to the dimension-reduction result, generating a corresponding classification result.
In addition, in the case that the video to be classified is sampled with a sampling window, after the video has been classified and a corresponding classification result generated, if it is determined from the classification result that the video does not belong to the target video category, the sampling frequency of the video to be classified is calculated according to the video duration of the video and the window size of the sampling window;
uniformly sampling the video to be classified according to the sampling frequency to generate a corresponding sampling result;
and carrying out secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
Specifically, after the video to be classified has been classified and a corresponding classification result generated, if it is determined from the classification result that the video does not belong to the target video category, that is, that it is not a PPT video, then, to ensure the accuracy of the video classification result, the video is uniformly sampled, and the obtained sampling result is used to classify it a second time so as to determine its classification result.
The sampling frequency of uniform sampling can be calculated according to the video duration of the video to be classified and the window size of the sampling window, so that the number of video frames obtained by uniform sampling is equal to the number of video frames extracted by each sampling window.
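The derivation of the uniform sampling frequency can be sketched as follows; the 40 ms frame interval (25 fps) and the exact formula are illustrative assumptions consistent with the description above.

def uniform_sample_timestamps(duration_ms: float,
                              window_ms: float = 5000.0,
                              frame_interval_ms: float = 40.0) -> list:
    frames_per_window = int(window_ms / frame_interval_ms)  # frames one window extracts
    step = duration_ms / frames_per_window                  # uniform spacing over the video
    return [i * step for i in range(frames_per_window)]     # sample timestamps (ms)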
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a video classification process according to an embodiment of the present application.
In fig. 2, a sampling window is slid over the video frames of the video to be classified with a preset sliding step, and the video frames extracted by the sampling window at each position are taken as one video frame set of the video to be classified. The video frames contained in each set are input into a 3D convolutional neural network; the convolution layers of the network extract features from the frames to obtain the first video feature; a multilayer perceptron performs dimension reduction on the first video feature to obtain the second video feature; and a softmax classifier assigns a video category to the video segment corresponding to each frame set based on the second video feature, producing the corresponding classification results (that is, determining whether each segment contains PPT pictures and hence whether it is PPT video). Finally, a vote is taken over the per-segment classification results, and the voting result determines whether the video to be classified is a PPT video.
In the embodiment of the application, the video to be classified is divided into video segments, features are extracted from the video frames contained in the segments, the video category of each segment is determined from the feature-extraction result, and the video category of the video to be classified is judged comprehensively from the categories of the different segments, thereby improving the accuracy of the video classification result. In addition, when it is determined from the video category that the video to be classified belongs to the target video category, that is, that it is a PPT video, the video is removed, i.e., not recommended to users, so as to improve the users' video-viewing experience.
Referring to fig. 3, the video classification method provided in the embodiment of the present application is further described by taking its application to short-video classification in the self-media field as an example. Fig. 3 shows a flowchart of the processing procedure of a video classification method according to an embodiment of the present application, which specifically includes the following steps:
Step 302, obtaining a short video to be classified.
Step 304, sliding a sampling window in the video frames of the short video to be classified based on a preset sliding step length, and taking a plurality of video frames extracted from each sampling window in the sliding process as a video frame set of the short video to be classified to generate a plurality of video frame sets of the short video to be classified.
Step 306, inputting the video frames included in the plurality of video frame sets into a classification model, performing feature extraction on the video frames through a feature extraction module of the classification model, and generating a first video feature for each video frame set respectively.
Step 308, inputting the first video features into a dimension-reduction processing module of the classification model, and performing dimension reduction on the first video features through the dimension-reduction processing module to generate second video features.
Step 310, inputting the second video characteristics into a classification module of a classification model, and classifying short video segments respectively corresponding to the plurality of video frame sets through the classification module according to the second video characteristics to generate video categories corresponding to the short video segments.
Step 312, classifying the short video to be classified according to the video categories to generate a corresponding classification result.
Step 314, determining whether the short video to be classified belongs to the target video category according to the classification result.
If not, go to step 316.
Step 316, uniformly sampling the short video to be classified to obtain a corresponding sampling result.
Specifically, according to the video duration of the short video to be classified and the window size of the sampling window, the sampling frequency of the short video to be classified is calculated, and the short video to be classified is uniformly sampled according to the sampling frequency to generate a corresponding sampling result.
Step 318, performing secondary classification on the short video to be classified according to the sampling result to generate a corresponding classification result.
Specifically, the sampling result is input into a classification model, and the sampling result is processed through the classification model to generate a classification result of secondary classification of the short video to be classified.
According to the embodiment of the application, the short videos to be classified are subjected to video segment division, the features of the video frames contained in the video segments are extracted, the video categories of the video segments are determined according to the feature extraction results, and the video categories of the short videos to be classified are comprehensively judged based on the video categories of different video segments, so that the accuracy of the video classification results of the short videos to be classified is improved.
Corresponding to the above method embodiment, the present application further provides an embodiment of a video classification apparatus, and fig. 4 shows a schematic structural diagram of a video classification apparatus provided in an embodiment of the present application. As shown in fig. 4, the apparatus includes:
an obtaining module 402, configured to obtain a video to be classified, perform segment division on the video to be classified, and generate a plurality of corresponding video segments;
a feature extraction module 404, configured to perform feature extraction on video frames included in the plurality of video segments, and generate corresponding first video features for each video segment respectively;
a dimension reduction processing module 406, configured to perform dimension reduction processing on the first video features, generate second video features, and determine video categories corresponding to the video segments according to the second video features;
the generating module 408 is configured to classify the video to be classified according to the video categories respectively corresponding to the plurality of video segments, and generate a corresponding classification result.
Optionally, the obtaining module 402 includes:
the sampling submodule is configured to sample video frames contained in the video to be classified and generate a plurality of video frame sets of the video to be classified, wherein each video frame set corresponds to one video clip of the video to be classified, and each video frame set contains a plurality of video frames of the video to be classified.
Optionally, the sampling sub-module includes:
the sliding unit is configured to slide a sampling window in a video frame of a video to be classified based on a preset sliding step length, and take a plurality of video frames extracted from each sampling window in the sliding process as a video frame set of the video to be classified so as to generate a plurality of video frame sets of the video to be classified.
Optionally, the feature extraction module 404 includes:
and the feature extraction sub-module is configured to input video frames contained in the plurality of video clips into a classification model, and feature extraction is carried out on the video frames through a feature extraction module of the classification model.
Optionally, the classification model comprises a dimension reduction processing module;
accordingly, the dimension reduction processing module 406 includes:
and the dimension reduction processing submodule is configured to input the first video feature into the dimension reduction processing module, and perform dimension reduction processing on the first video feature through the dimension reduction processing module to generate a second video feature.
Optionally, the classification model further comprises a classification module;
accordingly, the dimension reduction processing module 406 includes:
the input sub-module is configured to input the second video features into the classification module, and the classification module classifies the plurality of video clips according to the second video features to generate video categories corresponding to the plurality of video clips.
Optionally, the generating module 408 includes:
the clustering submodule is configured to cluster the video segments according to the video categories respectively corresponding to the video segments, and generate a video segment set corresponding to each video category;
the determining submodule is configured to determine a classification result of the video to be classified according to the number of the video clips contained in the video clip set.
Optionally, the video classification apparatus further includes:
the determining module is configured to determine whether the video to be classified belongs to a target video category according to the classification result;
if the determination result of the determining module is negative, the sampling module is run;
the sampling module is configured to uniformly sample the video to be classified to obtain a corresponding sampling result;
and the classification result generation module is configured to perform secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
Optionally, the classification result generating module includes:
the characteristic extraction result generation submodule is configured to input the video frames contained in the sampling result into a classification model, and perform characteristic extraction on the video frames through a characteristic extraction module of the classification model to generate corresponding characteristic extraction results;
the dimension reduction processing result generation submodule is configured to perform dimension reduction processing on the feature extraction result through a dimension reduction processing module of the classification model to generate a corresponding dimension reduction processing result;
and the classification result generation submodule is configured to perform secondary classification on the video to be classified according to the dimension reduction processing result through the classification module of the classification model to generate a corresponding classification result.
Optionally, the video classification apparatus further includes:
the calculation module is configured to calculate the sampling frequency of the video to be classified according to the video duration of the video to be classified and the window size of the sampling window under the condition that the video to be classified is determined not to belong to the target video category according to the classification result;
the sampling result generation module is configured to uniformly sample the video to be classified according to the sampling frequency and generate a corresponding sampling result;
and the classification generation module is configured to perform secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
Optionally, the classification model is trained by:
obtaining a training sample set of a pre-training classification model, wherein the training sample set comprises at least two videos and a video category corresponding to each video;
and taking the at least two videos as training samples and the video categories as sample labels, and training the pre-trained classification model to obtain the classification model.
The foregoing is a schematic view of a video classification apparatus according to the present embodiment. It should be noted that the technical solution of the video classification apparatus belongs to the same concept as the technical solution of the video classification method described above, and details that are not described in detail in the technical solution of the video classification apparatus can be referred to the description of the technical solution of the video classification method described above.
FIG. 5 illustrates a block diagram of a computing device 500 provided according to an embodiment of the present application. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes an access device 540 that enables computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or combinations of communication networks such as the internet. The access device 540 may include one or more of any type of network interface, wired or wireless, e.g., a Network Interface Card (NIC), an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of computing device 500 and other components not shown in FIG. 5 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 5 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
The processor 520 is configured to execute computer-executable instructions, and implements the steps of the video classification method when executing the computer-executable instructions.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the video classification method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the video classification method.
An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the video classification method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the video classification method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the video classification method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations, but those skilled in the art should understand that the embodiments of the present application are not limited by the described order of actions, because some steps may be performed in other orders or simultaneously according to the embodiments of the present application. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the embodiments of the application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in explaining the application. The description of the alternative embodiments is not exhaustive, nor does it limit the application to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments of the application and their practical application, thereby enabling others skilled in the art to understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (14)

1. A method of video classification, comprising:
acquiring a video to be classified, and performing segment division on the video to be classified to generate a plurality of corresponding video segments;
performing feature extraction on video frames contained in the plurality of video segments, and respectively generating a corresponding first video feature for each video segment;
performing dimension reduction processing on the first video features to generate second video features, and respectively determining, according to the second video features, video categories corresponding to the plurality of video segments;
and classifying the video to be classified according to the video categories respectively corresponding to the plurality of video segments, to generate a corresponding classification result.
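
As a minimal orientation aid, the toy Python sketch below walks the four claimed steps end to end on synthetic data. Every concrete choice in it (NumPy, mean-pooled pixel values as the first video features, a random projection as the dimension reduction, a stand-in linear classifier) is an illustrative assumption, not something the claim specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
video = rng.random((100, 64, 64, 3))      # 100 decoded RGB frames, 64x64

# Step 1: segment division -> ten segments of ten frames each.
segments = np.split(video, 10)

# Step 2: one first video feature per segment (here: mean-pooled pixels).
first_feats = np.stack([seg.mean(axis=(0, 1, 2)) for seg in segments])  # (10, 3)

# Step 3: dimension reduction to second video features (random projection).
second_feats = first_feats @ rng.random((3, 2))                         # (10, 2)

# Per-segment categories from a stand-in linear classifier (4 toy categories).
segment_cats = (second_feats @ rng.random((2, 4))).argmax(axis=1)

# Step 4: classify the whole video from the per-segment categories.
values, counts = np.unique(segment_cats, return_counts=True)
print("video category:", values[counts.argmax()])
```
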
2. The video classification method according to claim 1, wherein the performing segment division on the video to be classified to generate a plurality of corresponding video segments comprises:
sampling video frames contained in the video to be classified, and generating a plurality of video frame sets of the video to be classified, wherein each video frame set corresponds to one video segment of the video to be classified, and each video frame set contains a plurality of video frames of the video to be classified.
3. The video classification method according to claim 2, wherein the sampling the video frames contained in the video to be classified to generate a plurality of video frame sets of the video to be classified comprises:
sliding a sampling window over the video frames of the video to be classified based on a preset sliding stride, and taking the plurality of video frames extracted by each sampling window during the sliding process as one video frame set of the video to be classified, so as to generate the plurality of video frame sets of the video to be classified.
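
A minimal sketch of this sliding-window sampling, assuming integer frame indices stand in for decoded frames; the window size and stride values are illustrative only.

```python
def sliding_windows(num_frames: int, window_size: int, stride: int) -> list:
    """Slide a sampling window over the frame indices with a preset stride;
    each window's indices form one video frame set, i.e. one video segment."""
    return [list(range(start, start + window_size))
            for start in range(0, max(num_frames - window_size, 0) + 1, stride)]

# A 100-frame video, a 16-frame window, and a stride of 8 yield 11
# overlapping frame sets.
print(len(sliding_windows(100, 16, 8)))   # 11
```
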
4. The video classification method according to any one of claims 1 to 3, wherein the performing feature extraction on the video frames contained in the plurality of video segments comprises:
inputting the video frames contained in the plurality of video segments into a classification model, and performing feature extraction on the video frames through a feature extraction module of the classification model.
5. The video classification method according to claim 4, characterized in that the classification model comprises a dimension reduction processing module;
correspondingly, the performing dimension reduction processing on the first video feature to generate a second video feature comprises:
inputting the first video feature into the dimension reduction processing module, and performing dimension reduction processing on the first video feature through the dimension reduction processing module to generate the second video feature.
6. The video classification method according to claim 5, characterized in that the classification model further comprises a classification module;
correspondingly, the respectively determining, according to the second video features, the video categories corresponding to the plurality of video segments comprises:
inputting the second video features into the classification module, and classifying the plurality of video segments through the classification module according to the second video features, to generate the video categories respectively corresponding to the plurality of video segments.
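
Claims 4 to 6 describe one classification model with three modules. The PyTorch sketch below is one plausible arrangement, assuming a small 3D CNN as the feature extraction module and linear layers for the dimension reduction and classification modules; none of these layer choices come from the patent itself.

```python
import torch
import torch.nn as nn

class SegmentClassifier(nn.Module):
    """Illustrative three-module classification model: feature extraction,
    dimension reduction processing, and classification (claims 4 to 6)."""

    def __init__(self, num_classes: int, feat_dim: int = 512, reduced_dim: int = 128):
        super().__init__()
        # Feature extraction module: a tiny 3D CNN over a stack of frames,
        # producing the first video feature of a segment.
        self.feature_extractor = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        # Dimension reduction module: first video feature -> second video feature.
        self.dim_reducer = nn.Linear(feat_dim, reduced_dim)
        # Classification module: second video feature -> per-category scores.
        self.classifier = nn.Linear(reduced_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, channels, frames, height, width), one segment per item.
        first = self.feature_extractor(frames)   # first video features
        second = self.dim_reducer(first)         # second video features
        return self.classifier(second)           # per-segment category logits

model = SegmentClassifier(num_classes=10)
logits = model(torch.randn(2, 3, 16, 64, 64))    # two 16-frame segments
print(logits.shape)                              # torch.Size([2, 10])
```
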
7. The video classification method according to claim 1, wherein the classifying the video to be classified according to the video categories respectively corresponding to the plurality of video segments to generate a corresponding classification result comprises:
clustering the plurality of video segments according to the video categories respectively corresponding to the plurality of video segments, to generate a video segment set corresponding to each video category;
and determining the classification result of the video to be classified according to the number of video segments contained in each video segment set.
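
One straightforward reading of claim 7 is a majority vote: group the segments by predicted category and return the category whose segment set is largest. The helper below sketches that reading; the claim does not specify tie-breaking, so ties here simply fall to the category encountered first.

```python
from collections import Counter

def classify_from_segments(segment_categories: list) -> str:
    """Cluster segments by predicted category and take the category whose
    set contains the most segments as the video-level classification."""
    category, _count = Counter(segment_categories).most_common(1)[0]
    return category

print(classify_from_segments(["sports", "news", "sports", "sports"]))  # sports
```
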
8. The video classification method according to claim 1, further comprising:
determining whether the video to be classified belongs to a target video category according to the classification result;
if not, uniformly sampling the video to be classified to obtain a corresponding sampling result;
and carrying out secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
9. The video classification method according to claim 8, wherein the performing secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result comprises:
inputting the video frames contained in the sampling result into a classification model, and performing feature extraction on the video frames through a feature extraction module of the classification model to generate a corresponding feature extraction result;
performing dimensionality reduction processing on the feature extraction result through a dimensionality reduction processing module of the classification model to generate a corresponding dimensionality reduction processing result;
and carrying out secondary classification on the video to be classified according to the dimension reduction processing result through a classification module of the classification model to generate a corresponding classification result.
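
A hedged sketch of the two-pass flow in claims 8 and 9: keep the first-pass result when it already hits the target video category, otherwise re-classify from uniformly sampled frames. Here `classify_fn` stands in for the model's extract-reduce-classify chain, and all names are hypothetical.

```python
import numpy as np

def uniform_sample_indices(num_frames: int, num_samples: int) -> list:
    """Evenly spaced frame indices spanning the whole video (claim 8)."""
    return np.linspace(0, num_frames - 1, num=num_samples, dtype=int).tolist()

def classify_with_fallback(frames, first_result, target_category, classify_fn,
                           num_samples: int = 32):
    """Perform the secondary classification of claim 9 only when the first
    pass missed the target category."""
    if first_result == target_category:
        return first_result
    sampled = [frames[i] for i in uniform_sample_indices(len(frames), num_samples)]
    return classify_fn(sampled)

# Toy usage with stand-in frames and a stub classifier.
frames = list(range(100))
print(classify_with_fallback(frames, "other", "sports", lambda fs: "news"))  # news
```
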
10. The video classification method according to claim 3, further comprising:
under the condition that the video to be classified is determined not to belong to the target video category according to the classification result, calculating the sampling frequency of the video to be classified according to the video duration of the video to be classified and the window size of the sampling window;
uniformly sampling the video to be classified according to the sampling frequency to generate a corresponding sampling result;
and carrying out secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
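
Claim 10 derives the sampling frequency from the video duration and the sampling-window size but leaves the formula unstated; one natural assumption, sketched below, is to sample just enough frames to fill one window across the whole video.

```python
def sampling_frequency(duration_s: float, window_size_frames: int) -> float:
    """Frames per second to sample so that one window's worth of frames
    spans the video uniformly; this concrete formula is an assumption."""
    if duration_s <= 0:
        raise ValueError("video duration must be positive")
    return window_size_frames / duration_s

# A 120 s video and a 16-frame window -> roughly one frame every 7.5 s.
print(sampling_frequency(120.0, 16))   # 0.1333... frames per second
```
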
11. The video classification method according to claim 4 or 9, characterized in that the classification model is trained by:
obtaining a training sample set for a pre-trained classification model, wherein the training sample set comprises at least two videos and a video category corresponding to each video;
and training the pre-trained classification model by taking the at least two videos as training samples and the video categories as sample labels, to obtain the classification model.
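
Claim 11 only states that a pre-trained model is trained on videos labeled with their categories. A minimal PyTorch loop consistent with that, reusing the illustrative `SegmentClassifier` sketched after claim 6, might look as follows; the optimizer, learning rate, and one-sample batching are assumptions.

```python
import torch
import torch.nn as nn

def train_classifier(model: nn.Module, samples, epochs: int = 3) -> nn.Module:
    """Fine-tune a pre-trained classification model on (frames, label) pairs,
    where frames is a (1, C, T, H, W) tensor and label an integer category."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, label in samples:
            logits = model(frames)                        # per-video class scores
            loss = loss_fn(logits, torch.tensor([label]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```
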
12. A video classification apparatus, comprising:
an acquisition module configured to acquire a video to be classified, perform segment division on the video to be classified, and generate a plurality of corresponding video segments;
a feature extraction module configured to perform feature extraction on the video frames contained in the plurality of video segments, and respectively generate a corresponding first video feature for each video segment;
a dimension reduction processing module configured to perform dimension reduction processing on the first video features to generate second video features, and respectively determine, according to the second video features, the video categories corresponding to the plurality of video segments;
and a generation module configured to classify the video to be classified according to the video categories respectively corresponding to the plurality of video segments, and generate a corresponding classification result.
13. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor implements the steps of the video classification method according to any one of claims 1 to 11 when executing the computer-executable instructions.
14. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed by a processor, implement the steps of the video classification method according to any one of claims 1 to 11.
CN202110578272.XA 2021-05-26 2021-05-26 Video classification method and device Active CN113326760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110578272.XA CN113326760B (en) 2021-05-26 2021-05-26 Video classification method and device

Publications (2)

Publication Number Publication Date
CN113326760A (en) 2021-08-31
CN113326760B CN113326760B (en) 2023-05-09

Family

ID=77415087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110578272.XA Active CN113326760B (en) 2021-05-26 2021-05-26 Video classification method and device

Country Status (1)

Country Link
CN (1) CN113326760B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529475A (en) * 2021-12-29 2022-12-27 北京智美互联科技有限公司 Method and system for video traffic content detection and risk control

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165573A (en) * 2018-08-03 2019-01-08 百度在线网络技术(北京)有限公司 Method and apparatus for extracting video feature vector
CN110147700A (en) * 2018-05-18 2019-08-20 腾讯科技(深圳)有限公司 Video classification methods, device, storage medium and equipment
CN110443171A (en) * 2019-07-25 2019-11-12 腾讯科技(武汉)有限公司 Classification method, device, storage medium and the terminal of video file
CN110751030A (en) * 2019-09-12 2020-02-04 厦门网宿有限公司 Video classification method, device and system
CN110782920A (en) * 2019-11-05 2020-02-11 广州虎牙科技有限公司 Audio recognition method and device and data processing equipment
US20200210708A1 (en) * 2019-01-02 2020-07-02 Boe Technology Group Co., Ltd. Method and device for video classification
CN111933112A (en) * 2020-09-21 2020-11-13 北京声智科技有限公司 Awakening voice determination method, device, equipment and medium
CN112765403A (en) * 2021-01-11 2021-05-07 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108446390B (en) Method and device for pushing information
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN111026914B (en) Training method of video abstract model, video abstract generation method and device
CN114519809A (en) Audio-visual video analysis device and method based on multi-scale semantic network
CN109933782B (en) User emotion prediction method and device
Chen et al. Gifgif+: Collecting emotional animated gifs with clustered multi-task learning
CN111539290A (en) Video motion recognition method and device, electronic equipment and storage medium
Yan et al. Emotion recognition based on sparse learning feature selection method for social communication
CN111540364A (en) Audio recognition method and device, electronic equipment and computer readable medium
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN115481283A (en) Audio and video feature extraction method and device, electronic equipment and computer readable storage medium
Ullah et al. Improved deep CNN-based two stream super resolution and hybrid deep model-based facial emotion recognition
CN112464106B (en) Object recommendation method and device
CN114510564A (en) Video knowledge graph generation method and device
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
Wang Neural Network‐Based Dynamic Segmentation and Weighted Integrated Matching of Cross‐Media Piano Performance Audio Recognition and Retrieval Algorithm
CN113407772A (en) Video recommendation model generation method, video recommendation method and device
CN113326760B (en) Video classification method and device
Gavade et al. Improved deep generative adversarial network with illuminant invariant local binary pattern features for facial expression recognition
Balfaqih A Hybrid Movies Recommendation System Based on Demographics and Facial Expression Analysis using Machine Learning.
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
CN116453005A (en) Video cover extraction method and related device
Gavade et al. Facial Expression Recognition in Videos by learning Spatio-Temporal Features with Deep Neural Networks
CN113113048A (en) Speech emotion recognition method and device, computer equipment and medium
CN111782762A (en) Method and device for determining similar questions in question answering application and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant