CN113326760B - Video classification method and device - Google Patents

Video classification method and device

Info

Publication number
CN113326760B
CN113326760B (application CN202110578272.XA)
Authority
CN
China
Prior art keywords
video
classified
classification
generate
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110578272.XA
Other languages
Chinese (zh)
Other versions
CN113326760A (en)
Inventor
马进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202110578272.XA priority Critical patent/CN113326760B/en
Publication of CN113326760A publication Critical patent/CN113326760A/en
Application granted granted Critical
Publication of CN113326760B publication Critical patent/CN113326760B/en

Classifications

    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06N 3/045: Neural networks; architecture; combinations of networks
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a video classification method and apparatus. The video classification method includes: obtaining a video to be classified and dividing it into segments to generate a plurality of corresponding video segments; performing feature extraction on the video frames contained in the video segments to generate a corresponding first video feature for each segment; performing dimension reduction on the first video features to generate second video features, and determining the video category corresponding to each segment according to its second video feature; and classifying the video to be classified according to the video categories of the segments to generate a corresponding classification result.

Description

Video classification method and device
Technical Field
Embodiments of the present application relate to the technical field of video processing, and in particular to a video classification method. One or more embodiments of the present application further relate to a video classification apparatus, a computing device, and a computer-readable storage medium.
Background
With the continuous development of internet technology, the field of multimedia information processing has received increasing attention. As the pace of life accelerates, most users tend to use fragmented time to browse the short videos published and shared on different social platforms, and short videos have worked their way into every aspect of users' lives. It has therefore become increasingly important to manage short videos more efficiently in order to provide more accurate services to users.
Since a video is made up of a sequence of video frames, the picture in some videos does not change for a certain period of time, for example changing only once every period T. If T exceeds 50 ms, the user can clearly perceive that the video picture is discontinuous, which gives the user a poor viewing experience, so such videos need to be detected and identified during video management. At present, the degree of change between consecutive frames is mostly analyzed manually or with image-processing techniques, and the judgment of whether a video to be classified belongs to the target video category produced in this way has low accuracy, so an effective method is needed to overcome this problem.
Disclosure of Invention
In view of the foregoing, embodiments of the present application provide a video classification method. One or more embodiments of the present application further relate to a video classification apparatus, a computing device, and a computer-readable storage medium, so as to remedy the defect in the prior art that analyzing the degree of variation between consecutive frames manually or with image-processing techniques yields low-accuracy judgments of whether a video to be classified belongs to the target video category.
According to a first aspect of an embodiment of the present application, there is provided a video classification method, including:
obtaining a video to be classified, and dividing the video to be classified into segments to generate a plurality of corresponding video segments;
extracting features of video frames contained in the video clips, and generating corresponding first video features for each video clip respectively;
performing dimension reduction processing on the first video features to generate second video features, and respectively determining video categories corresponding to the video clips according to the second video features;
and classifying the videos to be classified according to the video categories respectively corresponding to the video clips to generate corresponding classification results.
According to a second aspect of embodiments of the present application, there is provided a video classification apparatus, including:
the acquisition module is configured to acquire a video to be classified and divide it into segments to generate a plurality of corresponding video segments;
the feature extraction module is configured to perform feature extraction on video frames contained in the plurality of video clips, and generate corresponding first video features for each video clip respectively;
the dimension reduction processing module is configured to perform dimension reduction processing on the first video features, generate second video features, and respectively determine video categories corresponding to the video clips according to the second video features;
The generation module is configured to classify the videos to be classified according to the video categories respectively corresponding to the plurality of video clips, and generate corresponding classification results.
According to a third aspect of embodiments of the present application, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, wherein the processor, when executing the computer-executable instructions, performs the steps of the video classification method.
According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the steps of the video classification method.
Embodiments of the present application implement a video classification method and apparatus, where the video classification method includes: obtaining a video to be classified and dividing it into segments to generate a plurality of corresponding video segments; performing feature extraction on the video frames contained in the segments to generate a corresponding first video feature for each segment; performing dimension reduction on the first video features to generate second video features, and determining the video category of each segment according to its second video feature; and classifying the video to be classified according to the categories of the segments to generate a corresponding classification result.
By dividing the video to be classified into segments and extracting features from the video frames contained in each segment, the method and apparatus determine the video category of each segment from the feature-extraction result and then judge the category of the video to be classified comprehensively from the categories of the different segments, thereby improving the accuracy of the classification result for the video to be classified.
Drawings
FIG. 1 is a flow chart of a video classification method provided in one embodiment of the present application;
FIG. 2 is a schematic diagram of a video classification process according to one embodiment of the present application;
FIG. 3 is a flow chart of applying the video classification method to short-video classification in the self-media domain according to one embodiment of the present application;
fig. 4 is a schematic structural diagram of a video classification device according to an embodiment of the present application;
FIG. 5 is a block diagram of a computing device provided in one embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.
The terminology used in one or more embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of one or more embodiments of the application. As used in this application in one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of the present application, a first may also be referred to as a second, and similarly a second as a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, terms related to one or more embodiments of the present application will be explained.
Convolutional neural network: convolutional neural networks (Convolutional Neural Networks, CNN) are a class of feedforward neural networks (Feedforward Neural Networks) that contain convolutional computations and have a deep structure, and are representative algorithms for deep learning.
PPT video: a video in which the picture does not change for a certain period of time, for example changing only once every period T. Typically T is longer than 50 ms, so the human eye can clearly perceive that the video picture is discontinuous.
In the present application, a video classification method is provided. One or more embodiments of the present application relate to a video classification apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments.
The video classification method provided by the embodiments of the present application can be applied to any field in which videos need to be classified, such as classifying live videos in the live-streaming field, classifying on-demand videos in the video-on-demand field, or classifying recorded short videos in the self-media field. For ease of understanding, the embodiments of the present application are described in detail below taking the application of the video classification method to classifying recorded short videos in the self-media field as an example, but the method is not limited thereto.
Taking the application of the video classification method to classifying recorded short videos in the self-media field as an example, the video to be classified obtained in the video classification method can be understood as a recorded short video on which video classification is to be performed.
In a specific implementation, the video to be classified in the embodiments of the present application may be presented on clients such as a large video playing device, a game console, a desktop computer, a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, an e-book reader, or another display terminal.
Referring to fig. 1, fig. 1 shows a flowchart of a video classification method according to an embodiment of the present application, including the following steps:
step 102, obtaining videos to be classified, and dividing the videos to be classified into segments to generate a plurality of corresponding video segments.
Specifically, the video to be classified is the video on which classification is to be performed. Video classification in the embodiments of the present application determines whether the video to be classified is a PPT video, i.e., a video whose picture discontinuity is clearly perceptible to the user. Since a video is composed of a sequence of video frames, the picture in some videos does not change for a certain period of time, for example changing only once every period T. If T exceeds 50 ms, the user can clearly perceive the discontinuity of the video picture, and such videos may degrade the user's viewing experience; therefore, in the embodiments of the present application, the video to be classified needs to be identified and detected to determine whether it is a PPT video.
In practical applications, however, the classification criterion for the video to be classified may be determined according to actual requirements, which is not limited herein.
Specifically, dividing the video to be classified into segments means dividing it into a number of different video segments, or extracting a number of different video segments from it. In the embodiments of the present application, the video to be classified may be divided into a plurality of segments of equal or unequal duration, or a plurality of different segments of equal or unequal duration may be extracted from it; among the extracted segments, any two or more may contain the same video frames, that is, the extracted segments may have overlapping portions.
After the video to be classified is segmented, the video categories corresponding to the generated video segments can be determined, so that the video categories of the video to be classified can be determined according to the video categories corresponding to different video segments.
In implementation, dividing the video to be classified into segments to generate a plurality of corresponding video segments means sampling the video frames contained in the video to generate a plurality of video-frame sets, where each set corresponds to one video segment of the video to be classified and contains a plurality of its video frames.
Further, sampling the video frames to generate the plurality of video-frame sets means sliding a sampling window across the video frames of the video to be classified with a preset sliding step, and taking the plurality of video frames extracted at each window position during the sliding as one video-frame set, thereby generating the plurality of video-frame sets of the video to be classified.
Specifically, since the video to be classified is composed of a sequence of video frames, it can be divided into segments by sampling the contained frames to generate a plurality of video-frame sets; the frames contained in each set form one video segment, that is, each set corresponds to one segment and contains a plurality of frames of the video to be classified.
For example, if the video to be classified contains 1000 video frames, the frames can be sampled by taking 100 consecutive frames as one sampling unit with a sampling interval of 100 frames, sampling continuously from the start frame of the video; this yields 10 video-frame sets, i.e., 10 video segments. If the sampling interval is set to a value smaller than 100 frames, any two or more of the sampled frame sets overlap, and more than 10 sets are obtained.
Further, the video to be classified can be sampled through sampling windows: starting from an initial sliding position, a sampling window slides across the video frames of the video to be classified with a preset sliding step, and the plurality of video frames extracted at each window position during the sliding are taken as one video segment of the video to be classified.
The initial sliding position may be the position of the start video frame of the video to be classified, the position of the end video frame, or the position of any other video frame of the video to be classified.
Starting from the initial sliding position, the sampling window slides according to the arrangement sequence of each video frame in the video to be classified and the preset sliding step length; the arrangement sequence is the arrangement sequence of each video frame from the start video frame to the end video frame of the video to be classified. Video frames are typically arranged in a certain order to form a video that expresses a certain meaning.
In addition, the size of the sampling window is not limited. In practical applications, when video frames are sampled with a sampling window, the window size may be a fixed value or may change continuously during the sliding. If the window size changes during the sliding, it may change randomly or according to a certain rule, as determined by actual requirements, which is not limited herein. If a fixed-size sampling window is used, for example, the window width may be set to 5000 ms and the sliding step to 2500 ms, in which case the sampled video segments have overlapping portions.
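To make the sliding-window sampling concrete, the following is a minimal sketch under assumed values (25 fps, and the 5000 ms window and 2500 ms step mentioned above); the function and variable names are hypothetical, not from the patent:

```python
def slide_sample(num_frames: int, fps: float = 25.0,
                 window_ms: int = 5000, step_ms: int = 2500) -> list[range]:
    """Return one range of frame indices per sampling-window position."""
    win = int(window_ms / 1000 * fps)    # frames per window: 125 at 25 fps
    step = int(step_ms / 1000 * fps)     # frames the window advances per slide
    windows = []
    start = 0                            # initial sliding position: start frame
    while start + win <= num_frames:
        windows.append(range(start, start + win))
        start += step                    # step < win, so adjacent windows overlap
    return windows

clips = slide_sample(1000)               # overlapping video-frame sets (segments)
```

Because the step (2500 ms) is half the window width (5000 ms), each pair of adjacent segments shares roughly half of its frames, matching the overlapping division described above.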
In practical applications, the specific sampling manner may be determined according to practical requirements, which is not limited herein.
After the video to be classified is segmented, the video category corresponding to the generated video segment can be determined, so that the video category of the video to be classified is determined according to the video categories corresponding to different video segments.
According to the embodiments of the present application, sampling the video frames of the video to be classified through a variable-size or fixed-size sampling window, i.e., dividing it into video segments, simplifies the segment-division process and improves video classification efficiency.
And 104, extracting features of video frames contained in the video clips, and respectively generating corresponding first video features for each video clip.
Specifically, after the video to be classified is divided into corresponding segments, features can be extracted from each video frame contained in each of the plurality of segments to generate the first video feature corresponding to each segment; the extracted first video features are then used to determine the video category of each segment, and the category of the video to be classified is judged comprehensively from the categories of the segments.
In implementation, feature extraction on the video frames contained in the plurality of segments can specifically be done by inputting the frames into a classification model and extracting features through the model's feature-extraction module.
Specifically, the classification model may be a 3D convolutional neural network. Inputting the plurality of video segments into the model's feature-extraction module means inputting the video frames contained in the segments into the convolutional layer of the 3D convolutional neural network; the temporal and spatial features of each frame in each segment are extracted through 3D convolution kernels, the extracted features are convolved with different kernels, and the convolution results are summed as the output of the convolutional layer, i.e., the first video feature is generated.
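As an illustration of 3D-convolutional feature extraction, here is a minimal PyTorch sketch; the channel counts, kernel sizes, and clip shape are assumptions for illustration, not the patent's actual architecture:

```python
import torch
import torch.nn as nn

# A clip is a tensor (batch, channels, frames, height, width); 3D kernels
# convolve across time and space simultaneously, so temporal and spatial
# features of the frames in a segment are extracted together.
feature_extractor = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),    # spatio-temporal features
    nn.ReLU(),
    nn.Conv3d(64, 128, kernel_size=3, padding=1),  # a second set of kernels
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),                       # pool to one vector per clip
    nn.Flatten(),                                  # -> first video feature F1
)

clip = torch.randn(1, 3, 16, 112, 112)             # one 16-frame RGB clip
first_video_feature = feature_extractor(clip)      # shape (1, 128)
```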
Because a 3D convolutional neural network can process multiple video frames simultaneously, using it for feature extraction helps increase the processing speed of video classification and ensures that the inter-frame features extracted are comprehensive and accurate, thereby ensuring the accuracy of the generated classification result.
In particular, the classification model is trained by:
acquiring a training sample set of a pre-training classification model, wherein the training sample set comprises at least two videos and video categories corresponding to each video;
and training the pre-training classification model by taking the at least two videos as training samples and the video categories as sample labels, to obtain the classification model.
Specifically, the embodiments of the present application can train the pre-training classification model with positive and negative samples: a positive sample is a video whose sample label is 1, indicating that it belongs to the target video category (is a PPT video), and a negative sample is a video whose sample label is 0, indicating that it does not belong to the target video category (is not a PPT video). Since the obtained training sample set contains both positive and negative samples, once the set is obtained, the pre-training classification model can be trained with them to obtain the classification model.
In practical applications, during training the video can first be divided into segments, and the plurality of segments generated by the division are input into the pre-training classification model. The model's feature-extraction module extracts features from the video frames contained in each segment to generate a feature F1; the model's dimension-reduction module reduces the dimensionality of F1 to generate a feature F2; the video is classified according to F2; and the accuracy of the classification result is determined by comparing it with the label corresponding to the video. Specifically, a loss value between the classification result and the label can be calculated and the model parameters adjusted according to the loss value, to obtain the classification model.
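A minimal sketch of the training step described above, assuming `model` is any module mapping a clip batch to two class logits and `labels` is a tensor of 0/1 class indices; all names are hypothetical:

```python
import torch.nn as nn

def train_step(model, optimizer, clip_batch, labels):
    """One update: classify clips, compare with 0/1 labels (1 = PPT video),
    and adjust the model parameters according to the loss value."""
    criterion = nn.CrossEntropyLoss()
    logits = model(clip_batch)         # internally: extract F1, reduce to F2, classify
    loss = criterion(logits, labels)   # loss between classification result and label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                   # parameter adjustment from the loss
    return loss.item()
```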
After the classification model is obtained through training, in the application process of the classification model, the video category corresponding to the video segment of the video to be classified can be determined according to the same processing mode.
In addition, after the training sample set is obtained, it may be divided into a training set and a test set. After the pre-training classification model is trained on the training set, the test set can be used to measure the model's classification ability. Specifically, the classification ability can be evaluated with indexes such as accuracy, precision, and recall: accuracy is the ratio of correct predictions to the total number of positive and negative cases; precision is the ratio of correct predictions among the samples predicted to be positive; and recall is the ratio of correctly predicted positives to all samples that are actually positive. The model can then be adjusted according to the evaluation result to improve the accuracy of its output.
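The evaluation indexes mentioned above can be computed from the confusion counts on the test set; a sketch with hypothetical names:

```python
def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, and recall from true/false positive/negative counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy":  (tp + tn) / total,                    # correct / all cases
        "precision": tp / (tp + fp) if tp + fp else 0.0,   # correct among predicted positives
        "recall":    tp / (tp + fn) if tp + fn else 0.0,   # found among actual positives
    }
```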
In the embodiments of the present application, the video features of the video to be classified are extracted directly with a convolutional neural network. A convolutional neural network has a strong ability to extract local, global, and semantic features from the video frames of the video to be classified, so the CNN-based algorithm is highly robust and generalizes well. During training, the network can learn the slight differences between consecutive frames; even if the picture changes slowly, these slight differences are still reflected in the video features through supervised learning, which guarantees the accuracy of the classification result obtained with the convolutional neural network.
And 106, performing dimension reduction processing on the first video features to generate second video features, and respectively determining video categories corresponding to the video clips according to the second video features.
Specifically, after the first video feature corresponding to each segment is extracted, dimension reduction is performed on it so that the second video features generated by the reduction can be used to determine the category of each segment, and the category of the video to be classified can then be judged comprehensively from the categories of the segments.
In implementation, the classification model further includes a dimension-reduction module. The first video feature extracted by the convolutional layer of the 3D convolutional neural network is multidimensional data; therefore, to reduce the complexity of the classification process and improve classification efficiency, after the first video feature is extracted it can be input into the model's dimension-reduction module, which reduces its dimensionality to generate the second video feature.
Further, the classification model also includes a classification module. Determining the video categories of the segments from the second video features can therefore specifically be done by inputting the second video features into the classification module, which classifies the plurality of segments according to the second video features and generates the category corresponding to each segment.
Specifically, in the case that the classification model is a 3D convolutional neural network, the dimension-reduction module may be a multi-layer perceptron (MLP), i.e., a neural network consisting of fully connected layers with at least one hidden layer, where the output of each hidden layer is transformed by an activation function.
And processing the first video features by utilizing hidden layers of the multi-layer perceptron, so that the dimension reduction processing of the first video features can be realized.
After the second video features are obtained through the dimension reduction processing, the video categories respectively corresponding to the video clips can be determined through a classification module of the classification model according to the second video features.
Specifically, the classification module may be a softmax classifier, which determines whether a video segment contains PPT pictures (two adjacent video pictures differ so little that the user visually perceives the picture as discontinuous) and thereby whether the segment belongs to the PPT-video category.
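A minimal sketch of the dimension-reduction module and classifier described above, with a single hidden layer; the 128 -> 32 sizes, the two-class output, and the meaning of label 1 are illustrative assumptions:

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(128, 32),   # hidden layer reduces F1 to the lower-dimensional F2
    nn.ReLU(),            # activation transforming the hidden-layer output
    nn.Linear(32, 2),     # two categories: contains / does not contain PPT pictures
    nn.Softmax(dim=-1),   # softmax classifier: per-clip category probabilities
)

first_video_feature = torch.randn(1, 128)
probs = head(first_video_feature)
is_ppt_clip = probs.argmax(dim=-1).item() == 1   # label 1 assumed to mean PPT
```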
And step 108, classifying the videos to be classified according to the video categories respectively corresponding to the video clips, and generating corresponding classification results.
In the implementation, the videos to be classified are classified according to the video categories respectively corresponding to the plurality of video clips, and corresponding classification results are generated, which can be realized in the following manner:
clustering the video clips according to the video categories corresponding to the video clips respectively to generate video clip sets corresponding to the video categories;
and determining the classification result of the video to be classified according to the number of the video clips contained in the video clip set.
Specifically, after the video category corresponding to each video segment is obtained, the segments' categories can be combined to vote on the video to be classified, so as to determine whether it is a PPT video.
In practical applications, the plurality of video segments can be clustered, i.e., grouped, according to the category of each segment: the segments belonging to the target video category are put into a group named G1, the segments not belonging to it into a group named G2, and the classification result of the video to be classified is determined from the number of segments contained in each group. If G1 contains more segments than G2, the video to be classified is determined to belong to the target video category, i.e., to be a PPT video; if G1 contains fewer segments than G2, it is determined not to belong to the target video category, i.e., not to be a PPT video.
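The voting described above reduces to a majority count over the per-segment categories; a sketch with hypothetical names (a tie is resolved here as "not PPT", which the description does not specify):

```python
def vote(clip_is_ppt: list[bool]) -> bool:
    g1 = sum(clip_is_ppt)          # segments in the target (PPT) category: group G1
    g2 = len(clip_is_ppt) - g1     # segments not in the target category: group G2
    return g1 > g2                 # PPT video iff G1 outnumbers G2

vote([True, True, False])          # -> True: classified as a PPT video
```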
Further, after the video to be classified has been classified and a corresponding classification result generated, it can be determined whether it belongs to the target video category; to ensure the accuracy of the classification result, the video to be classified can then be classified a second time, which can be realized in the following manner:
determining whether the video to be classified belongs to a target video category according to the classification result;
if not, uniformly sampling the video to be classified to obtain a corresponding sampling result;
and carrying out secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
Further, the video to be classified is secondarily classified according to the sampling result, and a corresponding classification result is generated, which can be realized specifically by the following modes:
inputting the video frames contained in the sampling results into a classification model, and carrying out feature extraction on the video frames through a feature extraction module of the classification model to generate corresponding feature extraction results;
performing dimension reduction processing on the feature extraction result through a dimension reduction processing module of the classification model to generate a corresponding dimension reduction processing result;
and performing secondary classification on the video to be classified according to the dimension-reduction result through the classification module of the classification model, to generate a corresponding classification result.
Specifically, if it is determined from the classification result that the video to be classified does not belong to the target video category, i.e., is not a PPT video, then in order to ensure the accuracy of the classification result, the video is uniformly sampled and the obtained sampling result is used to classify it a second time, so as to determine its classification result.
In practical applications, the sampling result can be input into the classification model so that the model processes the video frames contained in it and generates the classification result of the video to be classified. Specifically, the video frames contained in the sampling result are input into the classification model; the model's feature-extraction module (the convolutional layer) extracts features from the frames to generate a corresponding feature-extraction result; the dimension-reduction module (the multi-layer perceptron, MLP) reduces its dimensionality to generate a corresponding dimension-reduction result; and the classification module (the softmax classifier) classifies the video a second time according to the dimension-reduction result to generate a corresponding classification result.
In addition, when the video to be classified is sampled with a sampling window, after classification has generated a corresponding classification result and it is determined from that result that the video does not belong to the target video category, the sampling frequency for the video to be classified is calculated from its video duration and the window size of the sampling window;
uniformly sampling the video to be classified according to the sampling frequency to generate a corresponding sampling result;
and carrying out secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
Specifically, after the video to be classified has been classified and a corresponding classification result generated, if it is determined from the result that the video does not belong to the target video category, it is determined not to be a PPT video; however, to ensure the accuracy of the classification result, in this case the video is uniformly sampled and the obtained sampling result is used for a secondary classification to determine the final classification result.
The sampling frequency of the uniform sampling can be calculated from the video duration of the video to be classified and the window size of the sampling window, so that the number of video frames obtained by uniform sampling equals the number of frames extracted by each sampling window.
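A sketch of that calculation, choosing a uniform sampling interval so that the number of sampled frames equals the per-window frame count (25 fps and the 5000 ms window are assumed values; names hypothetical):

```python
def uniform_sample_indices(duration_ms: int, window_ms: int = 5000,
                           fps: float = 25.0) -> list[int]:
    frames_per_window = int(window_ms / 1000 * fps)   # frames one window extracts
    total_frames = int(duration_ms / 1000 * fps)      # frames in the whole video
    step = max(total_frames // frames_per_window, 1)  # uniform sampling interval
    return list(range(0, total_frames, step))[:frames_per_window]
```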
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a video classification process according to an embodiment of the present application.
In fig. 2, a sampling window slides across the video frames of the video to be classified with a preset sliding step, and the plurality of frames extracted at each window position during the sliding are taken as one video-frame set of the video to be classified. The frames contained in each set are input into a 3D convolutional neural network, whose convolutional layer extracts features from the frames to obtain the first video feature; a multi-layer perceptron then reduces its dimensionality to obtain the second video feature; based on the second video feature, a softmax classifier classifies the video segment corresponding to the set and obtains a corresponding classification result (i.e., determines whether each segment contains PPT pictures and hence whether it is a PPT video); finally, a vote is taken over the per-segment classification results to determine from the voting result whether the video to be classified is a PPT video.
According to the embodiments of the present application, the video to be classified is divided into segments and features are extracted from the video frames contained in each segment, so that the category of each segment is determined from the feature-extraction result and the category of the video to be classified is then judged comprehensively from the categories of the different segments, improving the accuracy of the classification result. In addition, when it is determined from the video category that the video to be classified belongs to the target category, i.e., is a PPT video, the video can be removed, i.e., not recommended to users, improving users' viewing experience.
Referring to fig. 3, the video classification method is further described below taking its application to short-video classification in the self-media field as an example. Fig. 3 shows a processing flowchart of the video classification method according to an embodiment of the present application, which specifically includes the following steps:
step 302, obtaining short videos to be classified.
Step 304, sliding a sampling window in video frames of the short video to be classified based on a preset sliding step length, and taking a plurality of video frames extracted by each sampling window in the sliding process as a video frame set of the short video to be classified so as to generate a plurality of video frame sets of the short video to be classified.
Step 306, inputting the video frames contained in the plurality of video frame sets into a classification model, and extracting features of the video frames through a feature extraction module of the classification model, so as to generate first video features for each video frame set.
Step 308, inputting the first video feature into a dimension reduction processing module of the classification model, and performing dimension reduction processing on the first video feature through the dimension reduction processing module to generate a second video feature.
Step 310, inputting the second video feature into a classification module of a classification model, and classifying, by the classification module, the short video segments corresponding to the multiple video frame sets respectively according to the second video feature, so as to generate a video category corresponding to the short video segment.
Step 312, classifying the short videos to be classified according to the video category, and generating a corresponding classification result.
And step 314, determining whether the short video to be classified belongs to a target video category according to the classification result.
If not, step 316 is executed.
And step 316, uniformly sampling the short videos to be classified to obtain corresponding sampling results.
Specifically, according to the video duration of the short video to be classified and the window size of the sampling window, the sampling frequency of the short video to be classified is calculated, and the short video to be classified is uniformly sampled according to the sampling frequency, so that a corresponding sampling result is generated.
And step 318, performing secondary classification on the short video to be classified according to the sampling result, and generating a corresponding classification result.
Specifically, the sampling result is input into a classification model, and the sampling result is processed through the classification model to generate a classification result of secondary classification of the short video to be classified.
According to the embodiment of the application, the short videos to be classified are divided into the video segments, the feature extraction is carried out on the video frames contained in the video segments, so that the video category of each video segment is determined according to the feature extraction result, and then the video category of the short videos to be classified is comprehensively judged based on the video categories of different video segments, so that the accuracy of the video classification result of the short videos to be classified is improved.
Corresponding to the above method embodiment, the present application further provides an embodiment of a video classification device, and fig. 4 shows a schematic structural diagram of a video classification device according to one embodiment of the present application. As shown in fig. 4, the apparatus includes:
the obtaining module 402 is configured to obtain a video to be classified and divide it into segments to generate a plurality of corresponding video segments;
the feature extraction module 404 is configured to perform feature extraction on video frames contained in the plurality of video clips, and generate corresponding first video features for each video clip respectively;
The dimension reduction processing module 406 is configured to perform dimension reduction processing on the first video feature, generate a second video feature, and respectively determine a video category corresponding to the video clip according to the second video feature;
the generating module 408 is configured to classify the video to be classified according to the video categories respectively corresponding to the video clips, and generate a corresponding classification result.
Optionally, the acquiring module 402 includes:
the sampling submodule is configured to sample video frames contained in the video to be classified, and generate a plurality of video frame sets of the video to be classified, wherein each video frame set corresponds to one video segment of the video to be classified, and each video frame set contains a plurality of video frames of the video to be classified.
Optionally, the sampling submodule includes:
the sliding unit is configured to slide sampling windows in video frames of the video to be classified based on a preset sliding step length, and take a plurality of video frames extracted by each sampling window in the sliding process as a video frame set of the video to be classified so as to generate a plurality of video frame sets of the video to be classified.
Optionally, the feature extraction module 404 includes:
and the feature extraction sub-module is configured to input video frames contained in the plurality of video clips into a classification model, and perform feature extraction on the video frames through the feature extraction module of the classification model.
Optionally, the classification model includes a dimension reduction processing module;
correspondingly, the dimension reduction processing module 406 includes:
the dimension reduction processing sub-module is configured to input the first video feature into the dimension reduction processing module, and dimension reduction processing is performed on the first video feature through the dimension reduction processing module so as to generate a second video feature.
Optionally, the classification model further comprises a classification module;
correspondingly, the dimension reduction processing module 406 includes:
the input sub-module is configured to input the second video features into the classification module, classify the plurality of video clips according to the second video features through the classification module, and generate video categories corresponding to the plurality of video clips respectively.
Optionally, the generating module 408 includes:
the clustering sub-module is configured to cluster the video clips according to the video categories corresponding to the video clips respectively, and generate video clip sets corresponding to the video categories;
And the determining submodule is configured to determine a classification result of the video to be classified according to the number of the video fragments contained in the video fragment set.
Optionally, the video classification device further includes:
the determining module is configured to determine whether the video to be classified belongs to a target video category according to the classification result;
if the operation result of the determining module is negative, the sampling module is operated;
the sampling module is configured to uniformly sample the video to be classified to obtain a corresponding sampling result;
the classification result generation module is configured to perform secondary classification on the video to be classified according to the sampling result, and generate a corresponding classification result.
Optionally, the classification result generating module includes:
the feature extraction result generation sub-module is configured to input video frames contained in the sampling result into a classification model, and perform feature extraction on the video frames through the feature extraction module of the classification model to generate corresponding feature extraction results;
the dimension reduction processing result generation sub-module is configured to perform dimension reduction processing on the feature extraction result through the dimension reduction processing module of the classification model to generate a corresponding dimension reduction processing result;
The classification result generation sub-module is configured to perform secondary classification on the video to be classified according to the dimension reduction processing result through the classification module of the classification model, and generate a corresponding classification result.
Optionally, the video classification device further includes:
the calculating module is configured to calculate the sampling frequency of the video to be classified according to the video duration of the video to be classified and the window size of the sampling window under the condition that the video to be classified does not belong to the target video category according to the classification result;
the sampling result generation module is configured to uniformly sample the video to be classified according to the sampling frequency to generate a corresponding sampling result;
and the classification generation module is configured to perform secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
Optionally, the classification model is trained by:
acquiring a training sample set of a pre-training classification model, wherein the training sample set comprises at least two videos and video categories corresponding to each video;
and training the pre-training classification model by taking the at least two videos as training samples and the video categories as sample labels, to obtain the classification model.
The above is a schematic solution of a video classification apparatus according to this embodiment. It should be noted that, the technical solution of the video classification device and the technical solution of the video classification method belong to the same concept, and details of the technical solution of the video classification device, which are not described in detail, can be referred to the description of the technical solution of the video classification method.
Fig. 5 illustrates a block diagram of a computing device 500 provided in accordance with one embodiment of the present application. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530 and database 550 is used to hold data.
Computing device 500 also includes access device 540, access device 540 enabling computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 540 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 5 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
The processor 520 is configured to execute computer-executable instructions which, when executed by the processor, perform the steps of the video classification method.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the video classification method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the video classification method.
An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the video classification method.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the video classification method belong to the same concept, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the video classification method.
The foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium can be adjusted appropriately according to the requirements of legislation and patent practice in each jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may take other order or occur simultaneously in accordance with the embodiments. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments of the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The above-disclosed preferred embodiments of the present application are provided only as an aid to the elucidation of the present application. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teachings of the embodiments of the present application. These embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This application is to be limited only by the claims and the full scope and equivalents thereof.

Claims (12)

1. A video classification method, comprising:
acquiring a video to be classified, sampling video frames contained in the video to be classified through a sampling window, and generating a plurality of corresponding video clips;
extracting features of video frames contained in the video clips, and generating corresponding first video features for each video clip respectively;
performing dimension reduction processing on the first video features to generate second video features, and respectively determining video categories corresponding to the video clips according to the second video features;
classifying the video to be classified according to the video categories respectively corresponding to the plurality of video clips, to generate a corresponding classification result; and
in a case that it is determined, according to the classification result, that the video to be classified does not belong to a target video category, calculating a sampling frequency for the video to be classified according to the video duration of the video to be classified and the window size of the sampling window; uniformly sampling the video to be classified according to the sampling frequency to generate a corresponding sampling result; and performing secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
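For illustration, the following Python sketch shows one plausible reading of the secondary-classification pass in claim 1. The claim does not fix a concrete formula, so the way the sampling frequency is derived from the video duration and the window size, the function names, and the classifier callable are all assumptions.

```python
from typing import Callable, Sequence

def secondary_classification(
    frames: Sequence,                     # decoded frames of the video to be classified
    duration_s: float,                    # video duration in seconds
    window_size: int,                     # sampling-window size, in frames
    classify: Callable[[Sequence], str],  # any frame-level classifier
    first_result: str,                    # result of the first classification pass
    target_category: str,
) -> str:
    """Re-classify the video when the first pass misses the target category."""
    if first_result == target_category:
        return first_result
    # Assumed formula: spread one window's worth of frames evenly over the
    # whole video, i.e. sampling_frequency = window_size / duration (frames/s).
    sampling_frequency = window_size / duration_s
    fps = len(frames) / duration_s
    step = max(1, int(fps / sampling_frequency))  # == len(frames) // window_size
    sampled = list(frames)[::step]                # uniform sampling result
    return classify(sampled)                      # secondary classification
```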
2. The video classification method according to claim 1, wherein sampling the video frames contained in the video to be classified through the sampling window to generate the corresponding plurality of video clips comprises:
sampling the video frames contained in the video to be classified through the sampling window to generate a plurality of video frame sets of the video to be classified, wherein each video frame set corresponds to one video clip of the video to be classified, and each video frame set contains a plurality of video frames of the video to be classified.
3. The video classification method according to claim 2, wherein sampling the video frames contained in the video to be classified through the sampling window to generate the plurality of video frame sets of the video to be classified comprises:
sliding the sampling window over the video frames of the video to be classified based on a preset sliding step size, and taking the plurality of video frames extracted by the sampling window at each position during the sliding as one video frame set of the video to be classified, so as to generate the plurality of video frame sets of the video to be classified.
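As a concrete illustration of the sliding-window sampling in claims 1 to 3, the sketch below slides a fixed-size window over the frame sequence; the window size and stride values are hypothetical, since the claims only require a preset sliding step length.

```python
from typing import List, Sequence

def sliding_window_frame_sets(frames: Sequence, window_size: int = 16,
                              stride: int = 8) -> List[list]:
    """Each window position yields one video frame set, i.e. one video clip."""
    frame_sets = []
    for start in range(0, len(frames) - window_size + 1, stride):
        frame_sets.append(list(frames[start:start + window_size]))
    return frame_sets

# 100 frames with a 16-frame window and stride 8 -> 11 overlapping frame sets
clips = sliding_window_frame_sets(list(range(100)))
```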
4. The video classification method according to any one of claims 1 to 3, wherein performing feature extraction on the video frames contained in the plurality of video clips comprises:
inputting the video frames contained in the plurality of video clips into a classification model, and performing feature extraction on the video frames through a feature extraction module of the classification model.
5. The video classification method according to claim 4, wherein the classification model comprises a dimension reduction processing module;
correspondingly, performing dimension reduction processing on the first video features to generate the second video features comprises:
inputting the first video features into the dimension reduction processing module, and performing dimension reduction processing on the first video features through the dimension reduction processing module to generate the second video features.
6. The video classification method according to claim 5, wherein the classification model further comprises a classification module;
correspondingly, determining the video categories corresponding to the video clips according to the second video features comprises:
inputting the second video features into the classification module, and classifying the plurality of video clips according to the second video features through the classification module to generate the video categories respectively corresponding to the plurality of video clips.
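One way to picture the three modules named in claims 4 to 6 is as stages of a single network. The sketch below is a minimal PyTorch arrangement under assumed layer types and sizes; the patent does not specify the backbone, the reduced dimension, or the number of categories.

```python
import torch
import torch.nn as nn

class ClipClassifier(nn.Module):
    """Hypothetical composition of the feature extraction, dimension
    reduction, and classification modules of claims 4-6."""
    def __init__(self, feat_dim: int = 2048, reduced_dim: int = 256,
                 num_categories: int = 10):
        super().__init__()
        # Feature extraction module (a stand-in for e.g. a CNN backbone)
        self.feature_extractor = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        # Dimension reduction module: first video feature -> second video feature
        self.dim_reduction = nn.Linear(feat_dim, reduced_dim)
        # Classification module: second video feature -> per-clip category logits
        self.classifier = nn.Linear(reduced_dim, num_categories)

    def forward(self, clip_frames: torch.Tensor) -> torch.Tensor:
        # clip_frames: (num_frames, channels, height, width) for one clip
        per_frame = self.feature_extractor(clip_frames)
        first_feature = per_frame.mean(dim=0)               # first video feature
        second_feature = self.dim_reduction(first_feature)  # second video feature
        return self.classifier(second_feature)              # category logits

logits = ClipClassifier()(torch.randn(16, 3, 64, 64))  # one 16-frame clip
```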
7. The video classification method according to claim 1, wherein classifying the video to be classified according to the video categories respectively corresponding to the plurality of video clips to generate a corresponding classification result comprises:
clustering the plurality of video clips according to the video categories respectively corresponding to the video clips, to generate a video clip set corresponding to each video category; and
determining the classification result of the video to be classified according to the number of video clips contained in each video clip set.
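A minimal sketch of the clip-set counting in claim 7 follows; treating the largest set as the video's overall category (majority voting) is an assumption, since the claim only says the result is determined from the number of clips in each set.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def classify_by_clip_sets(clip_categories: Sequence[str]) -> str:
    """Cluster clips by predicted category, then label the video by set size."""
    clip_sets: Dict[str, List[int]] = defaultdict(list)
    for idx, category in enumerate(clip_categories):
        clip_sets[category].append(idx)  # one video clip set per category
    return max(clip_sets, key=lambda c: len(clip_sets[c]))

# Four of seven clips look like "dance", so the video is classified as "dance".
print(classify_by_clip_sets(["dance", "music", "dance", "dance",
                             "other", "dance", "music"]))
```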
8. The video classification method according to claim 1, wherein performing secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result comprises:
inputting the video frames contained in the sampling result into a classification model, and performing feature extraction on the video frames through a feature extraction module of the classification model to generate a corresponding feature extraction result;
performing dimension reduction processing on the feature extraction result through a dimension reduction processing module of the classification model to generate a corresponding dimension reduction processing result;
performing secondary classification on the video to be classified according to the dimension reduction processing result through a classification module of the classification model, to generate the corresponding classification result.
9. The video classification method according to claim 4, wherein the classification model is trained by:
acquiring a training sample set for a pre-trained classification model, wherein the training sample set comprises at least two videos and a video category corresponding to each video; and
training the pre-trained classification model with the at least two videos as training samples and the video categories as sample labels, so as to obtain the classification model.
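By way of illustration, a minimal fine-tuning loop for claim 9 might look like the following; the tensor shapes, optimizer, and hyper-parameters are placeholders rather than values taken from the patent.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_classification_model(model: nn.Module, videos: torch.Tensor,
                               labels: torch.Tensor, epochs: int = 5) -> nn.Module:
    """Train a pre-trained classification model on (video, category) pairs."""
    loader = DataLoader(TensorDataset(videos, labels), batch_size=8, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()  # video categories serve as sample labels
    model.train()
    for _ in range(epochs):
        for batch, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch), targets)
            loss.backward()
            optimizer.step()
    return model
```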
10. A video classification apparatus, comprising:
an acquisition module, configured to acquire a video to be classified, sample video frames contained in the video to be classified through a sampling window, and generate a plurality of corresponding video clips;
a feature extraction module, configured to perform feature extraction on the video frames contained in the plurality of video clips, and to generate corresponding first video features for each video clip respectively;
a dimension reduction processing module, configured to perform dimension reduction processing on the first video features to generate second video features, and to determine, according to the second video features, video categories respectively corresponding to the video clips; and
a generation module, configured to classify the video to be classified according to the video categories respectively corresponding to the plurality of video clips, to generate a corresponding classification result; wherein the generation module further comprises: a calculating module, configured to calculate, in a case that it is determined according to the classification result that the video to be classified does not belong to a target video category, a sampling frequency for the video to be classified according to the video duration of the video to be classified and the window size of the sampling window; a sampling result generation module, configured to uniformly sample the video to be classified according to the sampling frequency to generate a corresponding sampling result; and a classification generation module, configured to perform secondary classification on the video to be classified according to the sampling result to generate a corresponding classification result.
11. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions, wherein the processor, when executing the computer-executable instructions, performs the steps of the video classification method of any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed by a processor, implement the steps of the video classification method of any one of claims 1 to 9.
CN202110578272.XA 2021-05-26 2021-05-26 Video classification method and device Active CN113326760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110578272.XA CN113326760B (en) 2021-05-26 2021-05-26 Video classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110578272.XA CN113326760B (en) 2021-05-26 2021-05-26 Video classification method and device

Publications (2)

Publication Number Publication Date
CN113326760A (en) 2021-08-31
CN113326760B (en) 2023-05-09

Family

ID=77415087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110578272.XA Active CN113326760B (en) 2021-05-26 2021-05-26 Video classification method and device

Country Status (1)

Country Link
CN (1) CN113326760B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165573A (en) * 2018-08-03 2019-01-08 百度在线网络技术(北京)有限公司 Method and apparatus for extracting video feature vector
CN110443171A (en) * 2019-07-25 2019-11-12 腾讯科技(武汉)有限公司 Classification method, device, storage medium and the terminal of video file
CN110782920A (en) * 2019-11-05 2020-02-11 广州虎牙科技有限公司 Audio recognition method and device and data processing equipment
CN111933112A (en) * 2020-09-21 2020-11-13 北京声智科技有限公司 Awakening voice determination method, device, equipment and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147700B (en) * 2018-05-18 2023-06-27 腾讯科技(深圳)有限公司 Video classification method, device, storage medium and equipment
CN109740670B (en) * 2019-01-02 2022-01-11 京东方科技集团股份有限公司 Video classification method and device
CN110751030A (en) * 2019-09-12 2020-02-04 厦门网宿有限公司 Video classification method, device and system
CN112765403A (en) * 2021-01-11 2021-05-07 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113326760A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN108446390B (en) Method and device for pushing information
CN109543714B (en) Data feature acquisition method and device, electronic equipment and storage medium
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN112465008B (en) Voice and visual relevance enhancement method based on self-supervision course learning
CN111026914B (en) Training method of video abstract model, video abstract generation method and device
CN111918130A (en) Video cover determining method and device, electronic equipment and storage medium
CN111477250A (en) Audio scene recognition method, and training method and device of audio scene recognition model
WO2014176580A2 (en) 2014-10-30 Content based search engine for processing unstructured digital
CN114519809A (en) Audio-visual video analysis device and method based on multi-scale semantic network
CN112584062B (en) Background audio construction method and device
CN111783712A (en) Video processing method, device, equipment and medium
CN114282047A (en) Small sample action recognition model training method and device, electronic equipment and storage medium
Yan et al. Emotion recognition based on sparse learning feature selection method for social communication
CN114297439A (en) Method, system, device and storage medium for determining short video label
CN111540364A (en) Audio recognition method and device, electronic equipment and computer readable medium
CN114140708A (en) Video processing method, device and computer readable storage medium
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN115481283A (en) Audio and video feature extraction method and device, electronic equipment and computer readable storage medium
CN113496156A (en) Emotion prediction method and equipment
CN112464106B (en) Object recommendation method and device
CN115935049A (en) Recommendation processing method and device based on artificial intelligence and electronic equipment
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
CN113326760B (en) Video classification method and device
CN113407772A (en) Video recommendation model generation method, video recommendation method and device
CN114510564A (en) Video knowledge graph generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant