CN112241470A - Video classification method and system - Google Patents

Video classification method and system

Info

Publication number
CN112241470A
Authority
CN
China
Prior art keywords
video
key frame
vector
classified
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011016801.9A
Other languages
Chinese (zh)
Other versions
CN112241470B (en)
Inventor
吉长江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp ltd
Original Assignee
Beijing Moviebook Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp ltd
Priority to CN202011016801.9A
Publication of CN112241470A
Application granted
Publication of CN112241470B
Legal status: Active

Classifications

    • G06F16/75: Information retrieval of video data; Clustering; Classification
    • G06F16/7328: Querying video data; Query by example, e.g. a complete video frame or video sequence
    • G06F16/783: Retrieval of video data characterised by using metadata automatically derived from the content
    • G06F16/7867: Retrieval of video data characterised by using manually generated metadata, e.g. tags, keywords, comments, title and artist information
    • G06F18/2414: Pattern recognition; Classification techniques based on distances to training or reference patterns; Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06F18/253: Pattern recognition; Fusion techniques of extracted features

Abstract

The application provides a video classification method and a video classification system. In the method, video data to be classified is first acquired and at least one video key frame of the video data to be classified is extracted; each video key frame is input into a preset target detection network for training, and a key frame vector of each video key frame is generated; the key frame vector of each video key frame is input into a preset depth feature video matrix model, which outputs a depth feature video matrix of the video data to be classified; finally, the video data to be classified is classified based on the depth feature video matrix and a classification label is generated. With the video classification method and system, different features of the video are fused to form a depth feature video matrix, which addresses the problem that multiple features of the video are not extracted comprehensively in the prior art and improves the accuracy of video classification.

Description

Video classification method and system
Technical Field
The present application relates to the field of video data processing, and in particular, to a video classification method and system.
Background
Video data is the most important source of big data, and video classification helps in understanding multimedia content; it plays an important role in applications such as content-based video retrieval, online video indexing, video archiving, and video identification.
In order to classify a video, it needs to be represented by a feature vector to facilitate the subsequent analysis. There are generally two types of video classification methods: classical manual-feature-based methods and deep-learning-based methods. A commonly used video classification approach today is to train a deep network on video frames: all frames of the same video are labeled with the video's label, and the model is trained on all of those frames. To classify a video, all of its frames are classified separately, and the majority label over all frames is then used as the label of the video.
Existing traditional feature extraction methods cannot represent the video structure well; even when the video is treated as a sequence of frames so that temporal information is considered, the video structure cannot be represented completely. In deep-learning-based methods, the fusion of multiple features is not comprehensive and accurate enough.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least to partially solve or mitigate them.
According to an aspect of the present application, there is provided a video classification method, including:
acquiring video data to be classified, and extracting at least one video key frame of the video data to be classified;
inputting each video key frame into a preset target detection network for training, and generating a key frame vector of each video key frame based on each trained video key frame;
inputting a key frame vector of each video key frame into a preset depth feature video matrix model, and outputting a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
and classifying the video data to be classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
Optionally, the inputting each of the video key frames into a preset target detection network for training, and generating a key frame vector of each of the video key frames based on each of the trained video key frames includes:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of the identification objects, and classifying the identification objects; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
Optionally, the generating a key frame vector of each video key frame based on the classified identification object includes:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame to generate an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
Optionally, the inputting the key frame vector of each video key frame into a preset depth feature video matrix model, and outputting the depth feature video matrix of the video data to be classified through the depth feature video matrix model includes:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
Optionally, the classifying the video data to be classified based on the depth feature video matrix to generate a classification label of the video data to be classified includes:
inputting the depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
According to another aspect of the present application, there is provided a video classification system comprising:
the video key frame extraction module is configured to acquire video data to be classified and extract at least one video key frame of the video data to be classified;
a key frame vector generation module configured to input each of the video key frames into a preset target detection network for training, and generate a key frame vector of each of the video key frames based on each of the trained video key frames;
the depth feature video matrix output module is configured to input a key frame vector of each video key frame into a preset depth feature video matrix model, and output a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
a video data classification module configured to classify the video data to be classified based on the depth feature video matrix, and generate a classification label of the video data to be classified.
Optionally, the key frame vector generating module is further configured to:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of the identification objects, and classifying the identification objects; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
Optionally, wherein the key frame vector generation module is further configured to:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame to generate an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
Optionally, the depth feature video matrix output module is further configured to:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
Optionally, the video data classification module is further configured to:
inputting the depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
The application provides a video classification method and a video classification system. In the method, video data to be classified is first acquired and at least one video key frame of the video data to be classified is extracted; each video key frame is input into a preset target detection network for training, and a key frame vector of each video key frame is generated based on each trained video key frame; the key frame vector of each video key frame is input into a preset depth feature video matrix model, which outputs a depth feature video matrix of the video data to be classified; finally, the video data to be classified is classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
Based on the video classification method and system provided by the application, each video is converted into a depth feature video matrix and the videos are then classified and verified, so that the accuracy of video classification can be further improved.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
fig. 1 is a schematic flow chart of a video classification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a video classification system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a computer-readable storage medium according to an embodiment of the application.
Detailed Description
Video classification methods fall into two categories. The first category comprises classical manual-feature-based methods, which extract traditional global and local features, for example shot counts, mean color histograms, Histograms of Oriented Gradients (HOG) and Histograms of Optical Flow (HOF), combined with classifiers (SVM, KNN, etc.) or clustering algorithms. In this first category, either video-based features or frame-based features are employed. Video-based features include, for example, the average shot length and the average number of faces. Frame-based features are extracted from a sequence of frames (all frames of a video or the key frames of its shots), and the features extracted from the frames are then mapped into one or more vectors that represent the entire video. Frame-based features may be global features, such as color histograms, or local features, such as Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), HOG, and so on.
The second category is a deep learning based approach that uses features extracted from selected key frames as the basis for video classification. Recently, with the advent of deep learning techniques, it has become easier to learn more powerful feature representations.
Fig. 1 is a flowchart illustrating a video classification method according to an embodiment of the present application. As can be seen from fig. 1, a video classification method provided in an embodiment of the present application may include:
step S101: acquiring video data to be classified, and extracting at least one video key frame of the video data to be classified;
step S102: inputting each video key frame into a preset target detection network for training, and generating a key frame vector of each video key frame based on each trained video key frame;
step S103: inputting the key frame vector of each video key frame into a preset depth feature video matrix model, and outputting a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
step S104: and classifying the video data to be classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
The application provides a video classification method. In the method, video data to be classified is first acquired and at least one video key frame of the video data to be classified is extracted; each video key frame is input into a preset target detection network for training, and a key frame vector of each video key frame is generated based on each trained video key frame; the key frame vector of each video key frame is input into a preset depth feature video matrix model, which outputs a depth feature video matrix of the video data to be classified; finally, the video data to be classified is classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
According to the video classification method, the different features of the video are extracted and fused to form the depth feature video matrix, and the video is classified based on the depth feature video matrix, so that various features of the video can be considered during video classification, and the accuracy of video classification is further improved. The following describes steps S101 to S104 in detail.
Step S101 is performed first, video data to be classified is acquired, and at least one key frame is extracted from the video data.
In the embodiment of the application, an experimental data set, namely the dev part of the BlipTv data set, is selected when the video data to be classified is acquired. The data set contains two sets of videos: a training set containing 5288 videos and a test set containing approximately 9000 videos.
After the video data to be classified is acquired from the experimental data set, the key frames in the video data to be classified are extracted. At least one key frame is extracted; the number of extracted key frames may be larger and is not limited in the present invention.
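The disclosure does not prescribe a particular key frame extraction algorithm. As a minimal sketch, assuming key frames are simply sampled at a fixed frame interval with OpenCV (the interval and the file name below are illustrative assumptions, not part of the disclosure):

import cv2

def extract_key_frames(video_path, every_n_frames=30):
    """Sample one frame every `every_n_frames` frames as a simple key-frame proxy."""
    capture = cv2.VideoCapture(video_path)
    key_frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            key_frames.append(frame)  # BGR image as a numpy array
        index += 1
    capture.release()
    return key_frames

# key_frames = extract_key_frames("video_0001.mp4")  # hypothetical file name

In practice a shot-boundary or frame-difference criterion could replace the fixed interval; the patent leaves this choice open.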
After the plurality of key frames are extracted, as described in step S102, the plurality of key frames are trained in a preset target detection network, and a key frame vector of each video key frame is generated based on each trained video key frame.
In an optional embodiment of the present application, when generating a key frame vector, each video key frame may be input into a preset RetinaNet network, and the identification objects included in each video key frame are detected by the RetinaNet network; secondly, the attribute parameters of each identification object are judged, and each identification object is classified, wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box; finally, a key frame vector of each video key frame is generated based on the classified identification objects.
The RetinaNet network is a single network consisting of a backbone network and two task-specific sub-networks: the backbone network is responsible for computing convolutional features over the whole image, the first sub-network performs an image classification task on the output of the backbone network, and the second sub-network is responsible for bounding box regression, so that the problem of unbalanced positive and negative samples in target detection can be alleviated.
In the embodiment of the invention, through the RetinaNet network, all identifiable objects in each video key frame can be identified quickly, and the category, the identification score and/or the bounding box of the identified objects are obtained. Further, each recognition object can be classified by each attribute of the object. And calculating a key frame vector for each video key frame based on the attributes of each recognition object.
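The patent does not name a specific RetinaNet implementation. As one possible sketch, assuming the COCO-pre-trained RetinaNet shipped with torchvision is used, the category, score and bounding box of every detected object in a key frame can be obtained as follows (the score threshold is an illustrative assumption):

import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# COCO-pre-trained RetinaNet: backbone plus classification and box-regression sub-networks.
model = torchvision.models.detection.retinanet_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_objects(frame_bgr, score_threshold=0.5):
    """Return (labels, scores, boxes) for one key frame given as a BGR numpy array."""
    image = to_tensor(frame_bgr[:, :, ::-1].copy())  # BGR -> RGB tensor in [0, 1]
    with torch.no_grad():
        output = model([image])[0]
    keep = output["scores"] > score_threshold
    return output["labels"][keep], output["scores"][keep], output["boxes"][keep]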
Alternatively, the process of calculating the key frame vector for each video key frame may be as follows:
s1, for each video key frame, calculating the occurrence frequency of each identification object in the video key frame, and generating an occurrence vector;
s2, calculating the sum of the identification scores of all the identification objects to generate a score vector;
s3, acquiring the object categories included in the video key frame and the times of each object category, and generating a binary vector;
and S4, acquiring the bounding box of each recognition object, inputting the bounding box into a preset neural network, and generating a ConvPool vector through neural network training.
Based on the identification objects of each video key frame, the key frame vector of each video key frame can thus be accurately determined through the calculation of the above four vectors; a sketch of the first three of these computations follows.
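A minimal sketch of the occurrence, score and binary vector computations, assuming the detector reports 1-based category ids for the 80 categories of the pre-trained model (the helper name and inputs are illustrative, not part of the disclosure; COCO label ids may need remapping to a dense 1-80 range):

import numpy as np

NUM_CLASSES = 80  # categories provided by the pre-trained detection model

def key_frame_vectors(labels, scores):
    """labels: iterable of 1-based category ids; scores: matching detection scores."""
    occurrence = np.zeros(NUM_CLASSES)  # OV: how often each category appears
    score_vec = np.zeros(NUM_CLASSES)   # SV: summed detection scores per category
    for label, score in zip(labels, scores):
        occurrence[int(label) - 1] += 1
        score_vec[int(label) - 1] += float(score)
    binary = (occurrence > 0).astype(np.float32)  # BV: 1 if the category appears at all
    return occurrence, score_vec, binary

The ConvPool vector additionally requires the small CNN described further below and is sketched separately there.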
After the key frame vectors of the video key frames are obtained, step S103 is executed to input the key frame vectors of the video key frames into a preset depth feature video matrix model, and output a depth feature video matrix of the video data to be classified through the depth feature video matrix model.
In an optional embodiment of the present application, the key frame vectors of each video key frame, i.e., the occurrence vector, the score vector, the binary vector, and the ConvPool vector, are input into a preset matrix for fusion to form a depth feature frame matrix of each video key frame; the depth feature frame matrices of the video key frames are then fused to form the depth feature video matrix of the video data to be classified.
Finally, step S104 is executed: the video data to be classified is classified based on the depth feature video matrix, and a classification label of the video data to be classified is generated.
The depth feature video matrix of the video data to be classified is input into a preset video classifier as the final feature vector of the video data to be classified, and the video classifier classifies the video data to be classified by adopting a random forest algorithm to obtain a classification result; finally, a classification label of the video data to be classified is generated based on the classification result.
For example, after the data to be classified is obtained, a plurality of video key frames are extracted from it. Define each video as V_i, where i denotes the video number. For each video, its key frames KF_ij are extracted, i.e. V_i = {KF_ij, j = 1, ..., S_i}, where j denotes the key frame number of video i and S_i is the number of key frames extracted from video V_i.
Then the extracted video key frames are passed through a RetinaNet network pre-trained on the COCO data set, which is used to detect the objects in each key frame KF_ij and classify them into one of the 80 categories provided by the pre-trained model, such as person, car, airplane, and horse. For each detected object, its class, its recognition score, and the bounding box surrounding it are obtained.
For each key frame KF_ij, four vectors are calculated:
(1) Occurrence vector OV_ij
This is an 80-dimensional vector that counts how many times each object category appears in the key frame KF_ij. For example, if two people, a dog, and a cat appear in the key frame, the entry corresponding to person in the 80-dimensional vector is 2, the entry for dog is 1, the entry for cat is 1, and the remaining entries are 0.
(2) Score vector SV_ij
Since each detected object has a detection score, a score vector of dimension 80 is computed, which contains the sums of the detection scores of the objects in the key frame KF_ij.
(3) Binary vector BV_ij
The binary vector is similar to the occurrence vector, but each entry is a binary indicator instead of a count of how many times each of the 80 classes is detected in the key frame. The vector is therefore an 80-dimensional vector; for example, if two people, a dog and a cat appear in the key frame, the entries for person, dog and cat are 1, and the remaining entries are 0.
(4) ConvPool vector convPV_ij
For each key frame KF_ij, the bounding regions of the detected objects are fed through a CNN that generates a 128-dimensional vector named convPV_ij. The network consists of two consecutive convolutional layers, a max pooling layer and an average pooling layer.
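The text does not give the layer sizes of this small CNN. A sketch of a network matching the stated structure (two consecutive convolutional layers, a max pooling layer and an average pooling layer, producing a 128-dimensional vector) is shown below; the kernel sizes, channel counts and input crop size are assumptions:

import torch
import torch.nn as nn

class ConvPoolNet(nn.Module):
    """Two conv layers -> max pooling -> global average pooling -> 128-dim convPV vector."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.max_pool = nn.MaxPool2d(kernel_size=2)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # one value per channel

    def forward(self, box_crop):          # box_crop: (N, 3, H, W) cropped bounding regions
        x = torch.relu(self.conv1(box_crop))
        x = torch.relu(self.conv2(x))
        x = self.max_pool(x)
        x = self.avg_pool(x)
        return x.flatten(1)               # (N, 128) convPV vectors

# conv_pv = ConvPoolNet()(torch.randn(1, 3, 64, 64))  # example with an assumed 64x64 crop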
Next, the four vectors are linked into a matrix named DFFM_ij (depth feature frame matrix), which collects the four vectors of key frame KF_ij:

DFFM_ij = [ OV_ij ; SV_ij ; BV_ij ; convPV_ij ]

Then the DFFM_ij matrices of all key frames KF_ij in each video V_i are fused to form the depth feature video matrix DFVM_i:

DFVM_i = [ DFFM_i1 ; DFFM_i2 ; ... ; DFFM_iS_i ]
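The exact layout of DFFM_ij and DFVM_i depends on the formula drawings, which are not reproduced in this text. One plausible sketch simply zero-pads the four per-frame vectors to a common length, stacks them as rows, and then stacks the per-frame matrices of all key frames (the padding scheme is an assumption):

import numpy as np

def depth_feature_frame_matrix(occurrence, score_vec, binary, conv_pv):
    """Stack OV, SV, BV (80-dim) and convPV (128-dim) into one DFFM matrix (assumed layout)."""
    vectors = [np.asarray(v, dtype=np.float32).ravel() for v in (occurrence, score_vec, binary, conv_pv)]
    width = max(len(v) for v in vectors)
    rows = [np.pad(v, (0, width - len(v))) for v in vectors]
    return np.stack(rows)                            # shape (4, 128) with zero padding

def depth_feature_video_matrix(frame_matrices):
    """Fuse the DFFM matrices of all key frames of one video into the DFVM."""
    return np.concatenate(frame_matrices, axis=0)    # shape (4 * S_i, 128)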
Finally, the depth feature video matrix is taken as the final feature vector of the video, and a random forest algorithm is adopted as the final video classifier to classify the video data to be classified and generate its classification label. According to the video classification method provided by the embodiment of the invention, using the depth feature video matrix as the final feature representation of the video and as the input of the video classifier can effectively improve the accuracy of video classification. The random forest algorithm refers to a classifier that trains on and predicts samples using a plurality of decision trees; classifying according to a feature vector in this way is well known to those skilled in the art, so the detailed description is omitted here.
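A sketch of this final classification step with scikit-learn's random forest; flattening the DFVM into a fixed-length vector (here by padding or truncating to an assumed maximum number of key frames) is an extra assumption, since a random forest expects equal-length inputs:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

MAX_KEY_FRAMES = 32  # assumed cap so that every video yields an equal-length feature vector

def flatten_dfvm(dfvm):
    """Pad or truncate the (4 * S_i, 128) DFVM and flatten it into one feature vector."""
    target_rows = 4 * MAX_KEY_FRAMES
    padded = np.zeros((target_rows, dfvm.shape[1]), dtype=np.float32)
    rows = min(target_rows, dfvm.shape[0])
    padded[:rows] = dfvm[:rows]
    return padded.ravel()

# train_dfvms, train_labels and test_dfvms are hypothetical, prepared as described above.
# classifier = RandomForestClassifier(n_estimators=200)
# classifier.fit([flatten_dfvm(m) for m in train_dfvms], train_labels)
# predicted_labels = classifier.predict([flatten_dfvm(m) for m in test_dfvms])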
The embodiment of the application provides a video classification method aiming at the problem of incomplete fusion of video features in the existing method, and the video classification method is a novel video classification method based on deep learning.
Based on the same inventive concept, as shown in fig. 2, an embodiment of the present application further provides a video classification system, including:
a video key frame extraction module 210 configured to acquire video data to be classified and extract at least one video key frame of the video data to be classified;
a key frame vector generation module 220 configured to input each video key frame into a preset target detection network for training, and generate a key frame vector of each video key frame based on each video key frame after training;
a depth feature video matrix output module 230 configured to input the key frame vector of each video key frame into a preset depth feature video matrix model, and output a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
and a video data classification module 240 configured to classify the video data to be classified based on the depth feature video matrix, and generate a classification label of the video data to be classified.
In another optional embodiment of the present application, the key frame vector generating module 220 may be further configured to:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of each identification object, and classifying each identification object; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
In another optional embodiment of the present application, the key frame vector generating module 220 may be further configured to:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame, and generating an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
In another optional embodiment of the present application, the depth feature video matrix output module 230 may be further configured to:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
In another optional embodiment of the present application, the video data classification module 240 may be further configured to:
inputting a depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
The application provides a video classification method and a video classification system. In the method, video data to be classified is first acquired and at least one video key frame of the video data to be classified is extracted; each video key frame is input into a preset target detection network for training, and a key frame vector of each video key frame is generated based on each trained video key frame; the key frame vector of each video key frame is input into a preset depth feature video matrix model, which outputs a depth feature video matrix of the video data to be classified; finally, the video data to be classified is classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
Based on the video classification method and system, different features of the video are fused to form a depth feature video matrix, which addresses the problem that multiple features of the video are not extracted comprehensively in the prior art and improves the accuracy of video classification.
An embodiment of the present application also provides a computing device. Referring to fig. 3, the computing device comprises a memory 320, a processor 310 and a computer program stored in the memory 320 and executable by the processor 310; the computer program is stored in a space 330 for program code in the memory 320 and, when executed by the processor 310, performs the method steps 331 of any of the methods according to the present invention.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 4, the computer readable storage medium comprises a storage unit for program code provided with a program 331' for performing the steps of the method according to the invention, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions. Which, when run on a computer, causes the computer to carry out the steps of the method according to the invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A video classification method, comprising:
acquiring video data to be classified, and extracting at least one video key frame of the video data to be classified;
inputting each video key frame into a preset target detection network for training, and generating a key frame vector of each video key frame based on each trained video key frame;
inputting a key frame vector of each video key frame into a preset depth feature video matrix model, and outputting a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
and classifying the video data to be classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
2. The method of claim 1, wherein inputting each of the video key frames into a pre-defined object detection network for training and generating a key frame vector for each of the video key frames based on each of the trained video key frames comprises:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of the identification objects, and classifying the identification objects; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
3. The method of claim 2, wherein generating a keyframe vector for each of the video keyframes based on the classified identified objects comprises:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame to generate an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
4. The method according to claim 3, wherein the inputting the key frame vector of each video key frame into a preset depth feature video matrix model, and outputting the depth feature video matrix of the video data to be classified through the depth feature video matrix model comprises:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
5. The method according to claim 4, wherein the classifying the video data to be classified based on the depth feature video matrix, and generating a classification label of the video data to be classified comprises:
inputting the depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
6. A video classification system comprising:
the video key frame extraction module is configured to acquire video data to be classified and extract at least one video key frame of the video data to be classified;
a key frame vector generation module configured to input each of the video key frames into a preset target detection network for training, and generate a key frame vector of each of the video key frames based on each of the trained video key frames;
the depth feature video matrix output module is configured to input a key frame vector of each video key frame into a preset depth feature video matrix model, and output a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
a video data classification module configured to classify the video data to be classified based on the depth feature video matrix, and generate a classification label of the video data to be classified.
7. The system of claim 6, wherein the key frame vector generation module is further configured to:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of the identification objects, and classifying the identification objects; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
8. The system of claim 7, wherein the key frame vector generation module is further configured to:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame to generate an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
9. The system of claim 8, wherein the depth feature video matrix output module is further configured to:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
10. The system of claim 9, wherein the video data classification module is further configured to:
inputting the depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
CN202011016801.9A 2020-09-24 2020-09-24 Video classification method and system Active CN112241470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011016801.9A CN112241470B (en) 2020-09-24 2020-09-24 Video classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011016801.9A CN112241470B (en) 2020-09-24 2020-09-24 Video classification method and system

Publications (2)

Publication Number Publication Date
CN112241470A true CN112241470A (en) 2021-01-19
CN112241470B CN112241470B (en) 2024-02-02

Family

ID=74171246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011016801.9A Active CN112241470B (en) 2020-09-24 2020-09-24 Video classification method and system

Country Status (1)

Country Link
CN (1) CN112241470B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160085733A1 (en) * 2005-10-26 2016-03-24 Cortica, Ltd. System and method thereof for dynamically associating a link to an information resource with a multimedia content displayed in a web-page
CN106709511A (en) * 2016-12-08 2017-05-24 华中师范大学 Urban rail transit panoramic monitoring video fault detection method based on depth learning
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
US20190266409A1 (en) * 2017-12-13 2019-08-29 Beijing Sensetime Technology Development Co., Ltd Methods and apparatuses for recognizing video and training, electronic device and medium
CN108470077A (en) * 2018-05-28 2018-08-31 广东工业大学 A kind of video key frame extracting method, system and equipment and storage medium
CN111368656A (en) * 2020-02-21 2020-07-03 华为技术有限公司 Video content description method and video content description device
CN111680614A (en) * 2020-06-03 2020-09-18 安徽大学 Abnormal behavior detection method based on video monitoring

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
AMAL DANDASHI et al.: "Video Classification Methods: Multimodal Techniques", Recent Trends in Computer Applications, page 33 *
刘亚南: "Research on an event library construction method based on video sample classification", China Masters' Theses Full-text Database, Information Science and Technology, pages 138-1476 *
刘倩兰: "Research on the classification and recognition of moving targets in intelligent video surveillance", 《电脑迷》, page 146 *
赵伟: "System design of video data mining", 《硅谷》, pages 53-54 *
赵泽宇: "Research on image classification and language description methods for emotional semantics", China Masters' Theses Full-text Database, Information Science and Technology, pages 138-2636 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191205A (en) * 2021-04-03 2021-07-30 国家计算机网络与信息安全管理中心 Method for identifying special scene, object, character and noise factor in video
CN113642422A (en) * 2021-07-27 2021-11-12 东北电力大学 Continuous Chinese sign language recognition method

Also Published As

Publication number Publication date
CN112241470B (en) 2024-02-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant