CN112241470A - Video classification method and system - Google Patents

Video classification method and system

Info

Publication number
CN112241470A
Authority
CN
China
Prior art keywords
video
key frame
vector
classified
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011016801.9A
Other languages
Chinese (zh)
Other versions
CN112241470B (en)
Inventor
吉长江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp ltd
Original Assignee
Beijing Moviebook Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp ltd
Priority to CN202011016801.9A
Publication of CN112241470A
Application granted
Publication of CN112241470B
Legal status: Active

Classifications

    • G06F16/75: Information retrieval of video data; Clustering; Classification
    • G06F16/7328: Querying video data; Query by example, e.g. a complete video frame or video sequence
    • G06F16/783: Retrieval of video data characterised by using metadata automatically derived from the content
    • G06F16/7867: Retrieval of video data characterised by using manually generated metadata, e.g. tags, keywords, comments, title and artist information
    • G06F18/2414: Pattern recognition; Classification techniques based on distances to training or reference patterns; Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06F18/253: Pattern recognition; Fusion techniques of extracted features

Abstract

The application provides a video classification method and a video classification system. In the method, video data to be classified is first acquired and at least one video key frame of the video data to be classified is extracted; each video key frame is input into a preset target detection network for training, and a key frame vector of each video key frame is generated; the key frame vector of each video key frame is input into a preset depth feature video matrix model, which outputs a depth feature video matrix of the video data to be classified; finally, the video data to be classified is classified based on the depth feature video matrix and a classification label is generated. With the video classification method and system, different features of the video are fused to form a depth feature video matrix, which addresses the problem that multiple features of the video are not extracted comprehensively in the prior art and improves the accuracy of video classification.

Description

Video classification method and system
Technical Field
The present application relates to the field of video data processing, and in particular, to a video classification method and system.
Background
Video data is the most important source of big data, and video classification helps in understanding multimedia content; it plays an important role in applications such as content-based video retrieval, online video indexing, video archiving, and video identification.
In order to classify a video, it needs to be represented by a feature vector to facilitate the subsequent analysis. There are generally two types of video classification methods: classical manual-feature-based methods and deep-learning-based methods. A commonly used video classification approach today is to train a deep network on video frames: all frames of the same video are labeled with the video's label, and the model is trained on all of those frames. To classify a video, all of its frames are classified separately, and the majority label over all frames is then used as the label of the video.
Existing traditional feature extraction methods cannot represent the video structure well; even when the video is treated as a sequence of frames so that temporal information is considered, the video structure cannot be represented completely. In deep-learning-based methods, the fusion of multiple features is not comprehensive and accurate enough.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least to partially solve or mitigate them.
According to an aspect of the present application, there is provided a video classification method, including:
acquiring video data to be classified, and extracting at least one video key frame of the video data to be classified;
inputting each video key frame into a preset target detection network for training, and generating a key frame vector of each video key frame based on each trained video key frame;
inputting a key frame vector of each video key frame into a preset depth feature video matrix model, and outputting a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
and classifying the video data to be classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
Optionally, the inputting each of the video key frames into a preset target detection network for training, and generating a key frame vector of each of the video key frames based on each of the trained video key frames includes:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of the identification objects, and classifying the identification objects; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
Optionally, the generating a key frame vector of each video key frame based on the classified identification object includes:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame to generate an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
Optionally, the inputting the key frame vector of each video key frame into a preset depth feature video matrix model, and outputting the depth feature video matrix of the video data to be classified through the depth feature video matrix model includes:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
Optionally, the classifying the video data to be classified based on the depth feature video matrix to generate a classification label of the video data to be classified includes:
inputting the depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
According to another aspect of the present application, there is provided a video classification system comprising:
the video key frame extraction module is configured to acquire video data to be classified and extract at least one video key frame of the video data to be classified;
a key frame vector generation module configured to input each of the video key frames into a preset target detection network for training, and generate a key frame vector of each of the video key frames based on each of the trained video key frames;
the depth feature video matrix output module is configured to input a key frame vector of each video key frame into a preset depth feature video matrix model, and output a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
a video data classification module configured to classify the video data to be classified based on the depth feature video matrix, and generate a classification label of the video data to be classified.
Optionally, the key frame vector generating module is further configured to:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of the identification objects, and classifying the identification objects; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
Optionally, wherein the key frame vector generation module is further configured to:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame to generate an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
Optionally, the depth feature video matrix output module is further configured to:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
Optionally, the video data classification module is further configured to:
inputting the depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
The application provides a video classification method and a video classification system. In the method, video data to be classified is first acquired and at least one video key frame of the video data to be classified is extracted; each video key frame is input into a preset target detection network for training, and a key frame vector of each video key frame is generated based on each trained video key frame; the key frame vector of each video key frame is input into a preset depth feature video matrix model, which outputs a depth feature video matrix of the video data to be classified; finally, the video data to be classified is classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
Based on the video classification method and system provided by the application, each video is converted into a depth feature video matrix and the videos are then classified and verified, so that the accuracy of video classification can be further improved.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
fig. 1 is a schematic flow chart of a video classification method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a video classification system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a computer-readable storage medium according to an embodiment of the application.
Detailed Description
Video classification methods fall into two categories. The first category comprises classical manual-feature-based methods, which extract traditional global and local features, for example shot counts, mean color histograms, Histograms of Oriented Gradients (HOG) and Histograms of Optical Flow (HOF), combined with classifiers (SVM, KNN, etc.) or clustering algorithms. In this first category, either video-based features or frame-based features are employed. Video-based features include, for example, the average shot length and the average number of faces. Frame-based features are extracted from a sequence of frames (all frames of a video or the key frames of its shots), and the features extracted from the frames are then mapped into one or more vectors that represent the entire video. Frame-based features may be global features, such as color histograms, or local features, such as Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), HOG, and so on.
The second category is a deep learning based approach that uses features extracted from selected key frames as the basis for video classification. Recently, with the advent of deep learning techniques, it has become easier to learn more powerful feature representations.
Fig. 1 is a flowchart illustrating a video classification method according to an embodiment of the present application. As can be seen from fig. 1, a video classification method provided in an embodiment of the present application may include:
step S101: acquiring video data to be classified, and extracting at least one video key frame of the video data to be classified;
step S102: inputting each video key frame into a preset target detection network for training, and generating a key frame vector of each video key frame based on each trained video key frame;
step S103: inputting the key frame vector of each video key frame into a preset depth feature video matrix model, and outputting a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
step S104: and classifying the video data to be classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
The application provides a video classification method. In the method, video data to be classified is first acquired and at least one video key frame of the video data to be classified is extracted; each video key frame is input into a preset target detection network for training, and a key frame vector of each video key frame is generated based on each trained video key frame; the key frame vector of each video key frame is input into a preset depth feature video matrix model, which outputs a depth feature video matrix of the video data to be classified; finally, the video data to be classified is classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
According to the video classification method, the different features of the video are extracted and fused to form the depth feature video matrix, and the video is classified based on the depth feature video matrix, so that various features of the video can be considered during video classification, and the accuracy of video classification is further improved. The following describes steps S101 to S104 in detail.
Step S101 is performed first, video data to be classified is acquired, and at least one key frame is extracted from the video data.
In the embodiment of the application, an experimental data set, namely the dev part of the BlipTv data set, is selected when the video data to be classified is acquired. The data set contains two sets of videos: a training set containing 5288 videos and a test set containing approximately 9000 videos.
After the video data to be classified is acquired from the experimental data set, the key frames in the video data to be classified are extracted. At least one key frame is extracted; the number of extracted key frames may be larger and is not limited in the present invention.
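The disclosure does not prescribe a particular key frame extraction algorithm. As a minimal sketch, assuming key frames are simply sampled at a fixed frame interval with OpenCV (the interval and the file name below are illustrative assumptions, not part of the disclosure):

import cv2

def extract_key_frames(video_path, every_n_frames=30):
    """Sample one frame every `every_n_frames` frames as a simple key-frame proxy."""
    capture = cv2.VideoCapture(video_path)
    key_frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            key_frames.append(frame)  # BGR image as a numpy array
        index += 1
    capture.release()
    return key_frames

# key_frames = extract_key_frames("video_0001.mp4")  # hypothetical file name

In practice a shot-boundary or frame-difference criterion could replace the fixed interval; the patent leaves this choice open.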
After the plurality of key frames are extracted, as described in step S102, the plurality of key frames are trained in a preset target detection network, and a key frame vector of each video key frame is generated based on each trained video key frame.
In an optional embodiment of the present application, when generating a key frame vector, each video key frame may be input into a preset RetinaNet network, and the identification objects included in each video key frame are detected by the RetinaNet network; secondly, the attribute parameters of each identification object are judged, and each identification object is classified, wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box; finally, a key frame vector of each video key frame is generated based on the classified identification objects.
The RetinaNet network is a single network consisting of a backbone network and two task-specific sub-networks: the backbone network is responsible for computing convolutional features over the whole image, the first sub-network performs an image classification task on the output of the backbone network, and the second sub-network is responsible for bounding box regression, so that the problem of unbalanced positive and negative samples in target detection can be alleviated.
In the embodiment of the invention, through the RetinaNet network, all identifiable objects in each video key frame can be identified quickly, and the category, the identification score and/or the bounding box of the identified objects are obtained. Further, each recognition object can be classified by each attribute of the object. And calculating a key frame vector for each video key frame based on the attributes of each recognition object.
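The patent does not name a specific RetinaNet implementation. As one possible sketch, assuming the COCO-pre-trained RetinaNet shipped with torchvision is used, the category, score and bounding box of every detected object in a key frame can be obtained as follows (the score threshold is an illustrative assumption):

import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# COCO-pre-trained RetinaNet: backbone plus classification and box-regression sub-networks.
model = torchvision.models.detection.retinanet_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_objects(frame_bgr, score_threshold=0.5):
    """Return (labels, scores, boxes) for one key frame given as a BGR numpy array."""
    image = to_tensor(frame_bgr[:, :, ::-1].copy())  # BGR -> RGB tensor in [0, 1]
    with torch.no_grad():
        output = model([image])[0]
    keep = output["scores"] > score_threshold
    return output["labels"][keep], output["scores"][keep], output["boxes"][keep]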
Alternatively, the process of calculating the key frame vector for each video key frame may be as follows:
s1, for each video key frame, calculating the occurrence frequency of each identification object in the video key frame, and generating an occurrence vector;
s2, calculating the sum of the identification scores of all the identification objects to generate a score vector;
s3, acquiring the object categories included in the video key frame and the times of each object category, and generating a binary vector;
and S4, acquiring the bounding box of each recognition object, inputting the bounding box into a preset neural network, and generating a ConvPool vector through neural network training.
Based on the identification objects of each video key frame, the key frame vector of each video key frame can thus be accurately determined through the calculation of the above four vectors; a sketch of the first three of these computations follows.
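A minimal sketch of the occurrence, score and binary vector computations, assuming the detector reports 1-based category ids for the 80 categories of the pre-trained model (the helper name and inputs are illustrative, not part of the disclosure; COCO label ids may need remapping to a dense 1-80 range):

import numpy as np

NUM_CLASSES = 80  # categories provided by the pre-trained detection model

def key_frame_vectors(labels, scores):
    """labels: iterable of 1-based category ids; scores: matching detection scores."""
    occurrence = np.zeros(NUM_CLASSES)  # OV: how often each category appears
    score_vec = np.zeros(NUM_CLASSES)   # SV: summed detection scores per category
    for label, score in zip(labels, scores):
        occurrence[int(label) - 1] += 1
        score_vec[int(label) - 1] += float(score)
    binary = (occurrence > 0).astype(np.float32)  # BV: 1 if the category appears at all
    return occurrence, score_vec, binary

The ConvPool vector additionally requires the small CNN described further below and is sketched separately there.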
After the key frame vectors of the video key frames are obtained, step S103 is executed to input the key frame vectors of the video key frames into a preset depth feature video matrix model, and output a depth feature video matrix of the video data to be classified through the depth feature video matrix model.
In an optional embodiment of the present application, the key frame vectors of each video key frame, i.e., the occurrence vector, the score vector, the binary vector, and the ConvPool vector, are input into a preset matrix for fusion to form a depth feature frame matrix of each video key frame; the depth feature frame matrices of the video key frames are then fused to form the depth feature video matrix of the video data to be classified.
Finally, step S104 is executed: the video data to be classified is classified based on the depth feature video matrix, and a classification label of the video data to be classified is generated.
The depth feature video matrix of the video data to be classified is input into a preset video classifier as the final feature vector of the video data to be classified, and the video classifier classifies the video data to be classified by adopting a random forest algorithm to obtain a classification result; finally, a classification label of the video data to be classified is generated based on the classification result.
For example, after the data to be classified is obtained, a plurality of video key frames are extracted from it. Define each video as V_i, where i denotes the video number. For each video, its key frames KF_ij are extracted, i.e. V_i = {KF_ij, j = 1, ..., S_i}, where j denotes the key frame number of video i and S_i is the number of key frames extracted from video V_i.
Then the extracted video key frames are passed through a RetinaNet network pre-trained on the COCO data set, which is used to detect the objects in each key frame KF_ij and classify them into one of the 80 categories provided by the pre-trained model, such as person, car, airplane, and horse. For each detected object, its class, its recognition score, and the bounding box surrounding it are obtained.
For each key frame KF_ij, four vectors are calculated:
(1) Occurrence vector OV_ij
This is an 80-dimensional vector that counts how many times each object category appears in the key frame KF_ij. For example, if two people, a dog, and a cat appear in the key frame, the entry corresponding to person in the 80-dimensional vector is 2, the entry for dog is 1, the entry for cat is 1, and the remaining entries are 0.
(2) Score vector SV_ij
Since each detected object has a detection score, a score vector of dimension 80 is computed, which contains the sums of the detection scores of the objects in the key frame KF_ij.
(3) Binary vector BV_ij
The binary vector is similar to the occurrence vector, but each entry is a binary indicator instead of a count of how many times each of the 80 classes is detected in the key frame. The vector is therefore an 80-dimensional vector; for example, if two people, a dog and a cat appear in the key frame, the entries for person, dog and cat are 1, and the remaining entries are 0.
(4) ConvPool vector convPV_ij
For each key frame KF_ij, the bounding regions of the detected objects are fed through a CNN that generates a 128-dimensional vector named convPV_ij. The network consists of two consecutive convolutional layers, a max pooling layer and an average pooling layer.
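The text does not give the layer sizes of this small CNN. A sketch of a network matching the stated structure (two consecutive convolutional layers, a max pooling layer and an average pooling layer, producing a 128-dimensional vector) is shown below; the kernel sizes, channel counts and input crop size are assumptions:

import torch
import torch.nn as nn

class ConvPoolNet(nn.Module):
    """Two conv layers -> max pooling -> global average pooling -> 128-dim convPV vector."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.max_pool = nn.MaxPool2d(kernel_size=2)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # one value per channel

    def forward(self, box_crop):          # box_crop: (N, 3, H, W) cropped bounding regions
        x = torch.relu(self.conv1(box_crop))
        x = torch.relu(self.conv2(x))
        x = self.max_pool(x)
        x = self.avg_pool(x)
        return x.flatten(1)               # (N, 128) convPV vectors

# conv_pv = ConvPoolNet()(torch.randn(1, 3, 64, 64))  # example with an assumed 64x64 crop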
Next, the four vectors are linked into a matrix named DFFM_ij (depth feature frame matrix), which collects the four vectors of key frame KF_ij:

DFFM_ij = [ OV_ij ; SV_ij ; BV_ij ; convPV_ij ]

Then the DFFM_ij matrices of all key frames KF_ij in each video V_i are fused to form the depth feature video matrix DFVM_i:

DFVM_i = [ DFFM_i1 ; DFFM_i2 ; ... ; DFFM_iS_i ]
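The exact layout of DFFM_ij and DFVM_i depends on the formula drawings, which are not reproduced in this text. One plausible sketch simply zero-pads the four per-frame vectors to a common length, stacks them as rows, and then stacks the per-frame matrices of all key frames (the padding scheme is an assumption):

import numpy as np

def depth_feature_frame_matrix(occurrence, score_vec, binary, conv_pv):
    """Stack OV, SV, BV (80-dim) and convPV (128-dim) into one DFFM matrix (assumed layout)."""
    vectors = [np.asarray(v, dtype=np.float32).ravel() for v in (occurrence, score_vec, binary, conv_pv)]
    width = max(len(v) for v in vectors)
    rows = [np.pad(v, (0, width - len(v))) for v in vectors]
    return np.stack(rows)                            # shape (4, 128) with zero padding

def depth_feature_video_matrix(frame_matrices):
    """Fuse the DFFM matrices of all key frames of one video into the DFVM."""
    return np.concatenate(frame_matrices, axis=0)    # shape (4 * S_i, 128)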
Finally, the depth feature video matrix is taken as the final feature vector of the video, and a random forest algorithm is adopted as the final video classifier to classify the video data to be classified and generate its classification label. According to the video classification method provided by the embodiment of the invention, using the depth feature video matrix as the final feature representation of the video and as the input of the video classifier can effectively improve the accuracy of video classification. The random forest algorithm refers to a classifier that trains on and predicts samples using a plurality of decision trees; classifying according to a feature vector in this way is well known to those skilled in the art, so the detailed description is omitted here.
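A sketch of this final classification step with scikit-learn's random forest; flattening the DFVM into a fixed-length vector (here by padding or truncating to an assumed maximum number of key frames) is an extra assumption, since a random forest expects equal-length inputs:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

MAX_KEY_FRAMES = 32  # assumed cap so that every video yields an equal-length feature vector

def flatten_dfvm(dfvm):
    """Pad or truncate the (4 * S_i, 128) DFVM and flatten it into one feature vector."""
    target_rows = 4 * MAX_KEY_FRAMES
    padded = np.zeros((target_rows, dfvm.shape[1]), dtype=np.float32)
    rows = min(target_rows, dfvm.shape[0])
    padded[:rows] = dfvm[:rows]
    return padded.ravel()

# train_dfvms, train_labels and test_dfvms are hypothetical, prepared as described above.
# classifier = RandomForestClassifier(n_estimators=200)
# classifier.fit([flatten_dfvm(m) for m in train_dfvms], train_labels)
# predicted_labels = classifier.predict([flatten_dfvm(m) for m in test_dfvms])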
The embodiment of the application provides a video classification method aiming at the problem of incomplete fusion of video features in the existing method, and the video classification method is a novel video classification method based on deep learning.
Based on the same inventive concept, as shown in fig. 2, an embodiment of the present application further provides a video classification system, including:
a video key frame extraction module 210 configured to acquire video data to be classified and extract at least one video key frame of the video data to be classified;
a key frame vector generation module 220 configured to input each video key frame into a preset target detection network for training, and generate a key frame vector of each video key frame based on each video key frame after training;
a depth feature video matrix output module 230 configured to input the key frame vector of each video key frame into a preset depth feature video matrix model, and output a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
and a video data classification module 240 configured to classify the video data to be classified based on the depth feature video matrix, and generate a classification label of the video data to be classified.
In another optional embodiment of the present application, the key frame vector generating module 220 may be further configured to:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of each identification object, and classifying each identification object; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
In another optional embodiment of the present application, the key frame vector generating module 220 may be further configured to:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame, and generating an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
In another optional embodiment of the present application, the depth feature video matrix output module 230 may be further configured to:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
In another optional embodiment of the present application, the video data classification module 240 may be further configured to:
inputting a depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
The application provides a video classification method and a video classification system. In the method, video data to be classified is first acquired and at least one video key frame of the video data to be classified is extracted; each video key frame is input into a preset target detection network for training, and a key frame vector of each video key frame is generated based on each trained video key frame; the key frame vector of each video key frame is input into a preset depth feature video matrix model, which outputs a depth feature video matrix of the video data to be classified; finally, the video data to be classified is classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
Based on the video classification method and system, different features of the video are fused to form a depth feature video matrix, which addresses the problem that multiple features of the video are not extracted comprehensively in the prior art and improves the accuracy of video classification.
An embodiment of the present application also provides a computing device. Referring to fig. 3, the computing device comprises a memory 320, a processor 310 and a computer program stored in the memory 320 and executable by the processor 310; the computer program is stored in a space 330 for program code in the memory 320 and, when executed by the processor 310, performs the method steps 331 of any of the methods according to the present invention.
The embodiment of the application also provides a computer readable storage medium. Referring to fig. 4, the computer readable storage medium comprises a storage unit for program code provided with a program 331' for performing the steps of the method according to the invention, which program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions. Which, when run on a computer, causes the computer to carry out the steps of the method according to the invention.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A video classification method, comprising:
acquiring video data to be classified, and extracting at least one video key frame of the video data to be classified;
inputting each video key frame into a preset target detection network for training, and generating a key frame vector of each video key frame based on each trained video key frame;
inputting a key frame vector of each video key frame into a preset depth feature video matrix model, and outputting a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
and classifying the video data to be classified based on the depth feature video matrix to generate a classification label of the video data to be classified.
2. The method of claim 1, wherein inputting each of the video key frames into a pre-defined object detection network for training and generating a key frame vector for each of the video key frames based on each of the trained video key frames comprises:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of the identification objects, and classifying the identification objects; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
3. The method of claim 2, wherein generating a keyframe vector for each of the video keyframes based on the classified identified objects comprises:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame to generate an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
4. The method according to claim 3, wherein the inputting the key frame vector of each video key frame into a preset depth feature video matrix model, and outputting the depth feature video matrix of the video data to be classified through the depth feature video matrix model comprises:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
5. The method according to claim 4, wherein the classifying the video data to be classified based on the depth feature video matrix, and generating a classification label of the video data to be classified comprises:
inputting the depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
6. A video classification system comprising:
the video key frame extraction module is configured to acquire video data to be classified and extract at least one video key frame of the video data to be classified;
a key frame vector generation module configured to input each of the video key frames into a preset target detection network for training, and generate a key frame vector of each of the video key frames based on each of the trained video key frames;
the depth feature video matrix output module is configured to input a key frame vector of each video key frame into a preset depth feature video matrix model, and output a depth feature video matrix of the video data to be classified through the depth feature video matrix model;
a video data classification module configured to classify the video data to be classified based on the depth feature video matrix, and generate a classification label of the video data to be classified.
7. The system of claim 6, wherein the key frame vector generation module is further configured to:
inputting each video key frame into a preset RetinaNet network, and detecting an identification object included in each video key frame through the RetinaNet network;
judging attribute parameters of the identification objects, and classifying the identification objects; wherein the attribute parameters of the identification object comprise an object category, an identification score and/or a bounding box;
and generating a key frame vector of each video key frame based on the classified identification object.
8. The system of claim 7, wherein the key frame vector generation module is further configured to:
for each video key frame, calculating the occurrence frequency of each identification object in the video key frame to generate an occurrence vector;
calculating the sum of the identification scores of all the identification objects to generate a score vector;
acquiring object categories and the times of each object category included in the video key frame, and generating a binary vector;
and acquiring a bounding box of each recognition object, inputting the bounding box into a preset neural network, and training the bounding box through the neural network to generate a ConvPool vector.
9. The system of claim 8, wherein the depth feature video matrix output module is further configured to:
inputting the occurrence vector, the score vector, the binary vector and the ConvPool vector of each video key frame into a preset matrix for fusion to form a depth feature frame matrix of each video key frame;
and fusing the depth feature frame matrices of the video key frames to form a depth feature video matrix of the video data to be classified.
10. The system of claim 9, wherein the video data classification module is further configured to:
inputting the depth feature video matrix of the video data to be classified into a preset video classifier as a final feature vector of the video data to be classified, and classifying the video data to be classified by the video classifier by adopting a random forest algorithm to obtain a classification result;
and generating a classification label of the video data to be classified based on the classification result.
CN202011016801.9A 2020-09-24 2020-09-24 Video classification method and system Active CN112241470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011016801.9A CN112241470B (en) 2020-09-24 2020-09-24 Video classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011016801.9A CN112241470B (en) 2020-09-24 2020-09-24 Video classification method and system

Publications (2)

Publication Number Publication Date
CN112241470A true CN112241470A (en) 2021-01-19
CN112241470B CN112241470B (en) 2024-02-02

Family

ID=74171246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011016801.9A Active CN112241470B (en) 2020-09-24 2020-09-24 Video classification method and system

Country Status (1)

Country Link
CN (1) CN112241470B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160085733A1 (en) * 2005-10-26 2016-03-24 Cortica, Ltd. System and method thereof for dynamically associating a link to an information resource with a multimedia content displayed in a web-page
CN106709511A (en) * 2016-12-08 2017-05-24 华中师范大学 Urban rail transit panoramic monitoring video fault detection method based on depth learning
CN108229336A (en) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 Video identification and training method and device, electronic equipment, program and medium
US20190266409A1 (en) * 2017-12-13 2019-08-29 Beijing Sensetime Technology Development Co., Ltd Methods and apparatuses for recognizing video and training, electronic device and medium
CN108470077A (en) * 2018-05-28 2018-08-31 广东工业大学 A kind of video key frame extracting method, system and equipment and storage medium
CN111368656A (en) * 2020-02-21 2020-07-03 华为技术有限公司 Video content description method and video content description device
CN111680614A (en) * 2020-06-03 2020-09-18 安徽大学 Abnormal behavior detection method based on video monitoring

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
AMAL DANDASHI et al.: "Video Classification Methods: Multimodal Techniques", Recent Trends in Computer Applications, page 33 *
刘亚南: "Research on an event library construction method based on video sample classification", China Masters' Theses Full-text Database, Information Science and Technology, pages 138-1476 *
刘倩兰: "Research on the classification and recognition of moving targets in intelligent video surveillance", 《电脑迷》, page 146 *
赵伟: "System design of video data mining", 《硅谷》, pages 53-54 *
赵泽宇: "Research on image classification and language description methods for emotional semantics", China Masters' Theses Full-text Database, Information Science and Technology, pages 138-2636 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191205A (en) * 2021-04-03 2021-07-30 国家计算机网络与信息安全管理中心 Method for identifying special scene, object, character and noise factor in video
CN113642422A (en) * 2021-07-27 2021-11-12 东北电力大学 Continuous Chinese sign language recognition method

Also Published As

Publication number Publication date
CN112241470B (en) 2024-02-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant