CN110879974B - Video classification method and device

Video classification method and device

Info

Publication number
CN110879974B
Authority
CN
China
Prior art keywords
video
classified
vector
classification
visual
Prior art date
Legal status
Active
Application number
CN201911058829.6A
Other languages
Chinese (zh)
Other versions
CN110879974A (en)
Inventor
邓积杰 (Deng Jijie)
何楠 (He Nan)
林星 (Lin Xing)
白兴安 (Bai Xing'an)
徐扬 (Xu Yang)
Current Assignee
Beijing Weiboyi Technology Co ltd
Original Assignee
Beijing Weiboyi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Weiboyi Technology Co ltd
Priority to CN201911058829.6A
Publication of CN110879974A
Application granted
Publication of CN110879974B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/47 Detecting features for summarising video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content

Abstract

The invention discloses a video classification method and device, relating to the field of data processing, and aims to solve the low efficiency and low accuracy of existing video classification. The technical scheme provided by the embodiments of the invention comprises: obtaining a feature vector of each key frame in a video to be classified according to the key frames in the video; obtaining a visual classification vector of the video from the feature vectors of its key frames; obtaining a text classification vector of the video from the text contained in its image frames; and substituting the visual classification vector and the text classification vector into a preset classification model to obtain the category of the video. The scheme can be applied to fields such as targeted video pushing.

Description

Video classification method and device
Technical Field
The present invention relates to the field of data processing, and in particular, to a video classification method and apparatus.
Background
In recent years, with the rapid development of internet short-video platforms, videos of every kind, such as film, food, science and technology, travel, education and games, have grown explosively. These videos come from a wide range of sources, are cheap to produce, appear in huge daily volumes, and spread extremely quickly, all of which poses great challenges for video classification.
In the prior art, videos are generally classified manually or by extracting keywords from their titles. Manual classification consumes a great deal of manpower and material resources, so its efficiency is low; and because a title may not accurately summarize the content of a video, classification by extracted keywords has low accuracy. Purely visual classification methods, in turn, cannot handle categories that require semantic understanding, such as horoscopes, workplace management and emotion, so their accuracy is also low.
Disclosure of Invention
In view of the above, the main objective of the present invention is to solve the problem of low efficiency and accuracy of the existing video classification method.
In one aspect, a video classification method provided in an embodiment of the present invention includes: acquiring a feature vector of each key frame in the video to be classified according to the key frames in the video to be classified; acquiring a visual classification vector of the video to be classified according to the feature vector of each key frame in the video to be classified; acquiring a text classification vector of the video to be classified according to texts contained in image frames in the video to be classified; and substituting the visual classification vector and the text classification vector into a preset classification model to obtain the category of the video to be classified.
On the other hand, an embodiment of the present invention provides a video classification apparatus, including:
the characteristic acquisition module is used for acquiring a characteristic vector of each key frame in the video to be classified according to the key frames in the video to be classified;
the visual classification module is connected with the feature acquisition module and used for acquiring a visual classification vector of the video to be classified according to the feature vector of each key frame in the video to be classified;
the text classification module is used for acquiring a text classification vector of the video to be classified according to texts contained in image frames in the video to be classified;
and the category acquisition module is respectively connected with the visual classification module and the text classification module and used for substituting the visual classification vector and the text classification vector into a preset classification model to acquire the category of the video to be classified.
In summary, the video classification method and apparatus provided by the present invention achieve video classification by separately obtaining a visual classification vector and a text classification vector and substituting both into a classification model. Because the visual classification vector and the text classification vector are used jointly as inputs for classification, the accuracy of video classification is improved, solving the low efficiency and accuracy of existing video classification methods. In addition, the visual classification vector is derived from the feature vectors of key frames, and the text classification vector carries deeper semantic information, which further improves classification accuracy.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be derived from them without creative effort.
Fig. 1 is a flowchart of a video classification method according to embodiment 1 of the present invention;
fig. 2 is a flowchart of a video classification method according to embodiment 2 of the present invention;
fig. 3 is a first schematic structural diagram of a video classification apparatus according to embodiment 3 of the present invention;
FIG. 4 is a schematic diagram of a visual classification module of the video classification apparatus shown in FIG. 3;
fig. 5 is a schematic structural diagram of a video classification apparatus according to embodiment 3 of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the present invention provides a video classification method, including:
step 101, obtaining a feature vector of each key frame in a video to be classified according to the key frame in the video to be classified.
In this embodiment, the key frames in step 101 are also called I-frames (intra-coded frames): frames in the compressed video that retain complete image data, so that decoding a key frame requires only that frame's own data. Because the similarity between key frames of the video to be classified is low, a handful of key frames can comprehensively represent the video; extracting feature vectors from the key frames therefore improves the accuracy of classifying the video.
Specifically, obtaining the feature vectors through step 101 includes: extracting key frames from the video to be classified according to a preset rule, where the preset rule is one of duration, interval, weight and click rate; and obtaining the feature vector of each key frame in the video to be classified. The feature vector may be obtained in two ways: performing feature extraction on each key frame with a preset image classifier to obtain the feature vector of each key frame; or obtaining feature points of each key frame and deriving the feature vector from those feature points. Methods for determining feature points and feature vectors include: Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), ORB (Oriented FAST and Rotated BRIEF), and neural-network methods such as ResNet (deep residual networks), Xception (depthwise separable convolutions), I3D (Inflated 3D ConvNets), P3D (Pseudo-3D Residual Networks) and TSN (Temporal Segment Networks).
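By way of illustration only, the following Python sketch approximates step 101 under two assumptions the patent leaves open: fixed-interval frame sampling stands in for true I-frame extraction, and a torchvision ResNet-50 with its classification head removed plays the role of the preset image classifier, yielding the (1, 2048) feature vectors used in the worked example of embodiment 3.

```python
# A hedged sketch of step 101: interval sampling as a stand-in for I-frame
# extraction, ResNet-50 features as the preset image classifier's output.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet-50 without the final fc layer yields a 2048-d feature per frame.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def key_frame_features(video_path: str, interval_s: float = 1.0) -> torch.Tensor:
    """Sample one frame every interval_s seconds; return (num_frames, 2048)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(fps * interval_s), 1)
    features, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                feat = extractor(preprocess(rgb).unsqueeze(0))  # (1, 2048, 1, 1)
            features.append(feat.flatten())
        idx += 1
    cap.release()
    return torch.stack(features)
```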
Step 102, acquiring a visual classification vector of the video to be classified according to the feature vector of each key frame in the video to be classified.
In this embodiment, obtaining the visual classification vector through step 102 includes: combining the feature vectors of all key frames in the video to be classified by rows to obtain a feature map; and fusing each column of the feature map, that is, the values of one feature dimension across all key frames, into a single value to obtain the visual classification vector of the video to be classified. The fusion may compute the average of each column or the maximum of each column; other reductions, such as the minimum, can equally be used and are not detailed here.
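A minimal NumPy sketch of this fusion, assuming the feature map is a (num_key_frames, 2048) matrix with one key-frame vector per row, so that fusing each feature dimension reduces along axis 0:

```python
import numpy as np

def visual_classification_vector(feature_map: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Fuse a (num_key_frames, 2048) feature map into one 2048-d vector."""
    if mode == "mean":
        return feature_map.mean(axis=0)  # average of each feature dimension
    if mode == "max":
        return feature_map.max(axis=0)   # maximum of each feature dimension
    raise ValueError(f"unknown fusion mode: {mode}")

# e.g. four key-frame feature vectors stacked by rows -> one (2048,) vector
fused = visual_classification_vector(np.random.randn(4, 2048), mode="mean")
```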
Step 103, acquiring a text classification vector of the video to be classified according to the text contained in the image frame of the video to be classified.
In this embodiment, obtaining the text classification vector in step 103 may be performed after the visual classification vector is obtained (as shown in fig. 1), before it, or simultaneously with it; the order is not limited here. Obtaining the text classification vector through step 103 includes: extracting image frames from the video to be classified; recognizing the characters in the image frames to obtain the texts they contain; and combining the texts contained in all image frames, then classifying the combined text to obtain the text classification vector of the video to be classified. The texts may be combined by splicing them into one long text. Character recognition may use CRNN (Convolutional Recurrent Neural Network) or CTPN (Connectionist Text Proposal Network); text classification may use TextCNN (a convolutional-neural-network text classifier), FastText (a fast text-classification algorithm) or LSTM (Long Short-Term Memory networks).
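Purely as an illustration of these sub-steps, the sketch below uses pytesseract as a stand-in OCR engine (the embodiment names CRNN and CTPN instead), samples one frame per second, and splices the recognized texts into one long text; the function name video_text and the 1 s interval are assumptions.

```python
import cv2
import pytesseract  # a stand-in OCR engine; the patent names CRNN/CTPN

def video_text(video_path: str, interval_s: float = 1.0) -> str:
    """OCR one frame per interval and splice the results into one long text."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(fps * interval_s), 1)
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            texts.append(pytesseract.image_to_string(frame, lang="chi_sim").strip())
        idx += 1
    cap.release()
    return " ".join(t for t in texts if t)
```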
Step 104, substituting the visual classification vector and the text classification vector into a preset classification model to obtain the category of the video to be classified.
In this embodiment, the preset classification model in step 104 may be generated in advance using, for example, a neural network. Obtaining the category of the video to be classified through step 104 may be: splicing the visual classification vector and the text classification vector into a single row vector, the vector to be classified; and substituting that vector into the preset classification model to obtain the category of the video to be classified.
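For illustration, a sketch of step 104 assuming the preset classification model is a small fully connected network over the spliced row vector (the patent allows any pre-generated model); the 2048, 1024 and 20-category dimensions follow the worked example in embodiment 3.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 20  # embodiment 3 trains on 20 categories

# A hypothetical preset classification model over the spliced vector.
classifier = nn.Sequential(
    nn.Linear(2048 + 1024, 512), nn.ReLU(), nn.Linear(512, NUM_CLASSES),
)

def classify(visual_vec: torch.Tensor, text_vec: torch.Tensor) -> int:
    """Splice the two vectors into one row vector; return the class value."""
    row = torch.cat([visual_vec, text_vec]).unsqueeze(0)  # (1, 3072)
    with torch.no_grad():
        return classifier(row).argmax(dim=1).item()
```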
In summary, the video classification method provided by the present invention achieves video classification by separately obtaining a visual classification vector and a text classification vector and substituting both into a classification model. Because the two vectors are used jointly as inputs for classification, the accuracy of video classification is improved, solving the low efficiency and accuracy of existing video classification methods. In addition, the visual classification vector is derived from the feature vectors of key frames, and the text classification vector carries deeper semantic information, which further improves classification accuracy.
Example 2
As shown in fig. 2, an embodiment of the present invention provides a video classification method, including:
step 201 to step 203, obtaining the visual classification vector and the text classification vector, which is similar to step 101 to step 103 shown in fig. 1 and is not repeated here.
Step 204, a plurality of video samples are obtained, and a visual classification vector, a text classification vector and a category value corresponding to each video sample are obtained.
Step 205, training the initial classifier according to the visual classification vector, the text classification vector and the class value corresponding to each video sample, respectively, to obtain a classification model.
In this embodiment, the initial classifier in step 205 may adopt a convolutional neural network model, or may adopt other models, which is not limited herein.
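A minimal PyTorch sketch of steps 204 and 205 under the assumption, consistent with embodiment 3, that the initial classifier is a small fully connected network over the spliced row vectors and that the class values are the integers 0 to 19; the architecture and hyperparameters are illustrative, not prescribed by the patent.

```python
import torch
import torch.nn as nn

def train_classifier(row_vectors: torch.Tensor,   # (num_samples, 3072), float
                     class_values: torch.Tensor,  # (num_samples,), int64 in 0..19
                     num_classes: int = 20, epochs: int = 100) -> nn.Module:
    """Train the initial classifier on spliced (visual + text) row vectors."""
    model = nn.Sequential(nn.Linear(row_vectors.shape[1], 512), nn.ReLU(),
                          nn.Linear(512, num_classes))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(row_vectors), class_values)
        loss.backward()
        optimizer.step()
    return model
```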
Step 206, substituting the visual classification vector and the text classification vector into a preset classification model to obtain the category of the video to be classified. The process is similar to step 104 shown in fig. 1 and is not described in detail here.
In summary, the video classification method provided by the present invention achieves video classification by separately obtaining a visual classification vector and a text classification vector and substituting both into a classification model. Because the two vectors are used jointly as inputs for classification, the accuracy of video classification is improved, solving the low efficiency and accuracy of existing video classification methods. In addition, the visual classification vector is derived from the feature vectors of key frames, and the text classification vector carries deeper semantic information, which further improves classification accuracy.
Example 3
As shown in fig. 3, an embodiment of the present invention provides a video classification apparatus, including:
the feature obtaining module 301 is configured to obtain a feature vector of each key frame in the video to be classified according to the key frame in the video to be classified;
the visual classification module 302 is connected with the feature acquisition module and is used for acquiring a visual classification vector of the video to be classified according to the feature vector of each key frame in the video to be classified;
the text classification module 303 is configured to obtain a text classification vector of the video to be classified according to a text included in an image frame in the video to be classified;
and the category obtaining module 304 is connected to the visual classification module and the text classification module, respectively, and is configured to substitute the visual classification vector and the text classification vector into a preset classification model to obtain a category of the video to be classified.
In this embodiment, the process of classifying videos through the feature obtaining module 301 to the category obtaining module 304 is similar to that provided in the first embodiment of the present invention, and is not repeated here.
Further, as shown in fig. 4, a visual classification module 302 in the video classification apparatus according to the embodiment of the present invention includes:
the vector combination submodule 3021 is configured to combine feature vectors of all key frames in a video to be classified according to rows to obtain a feature map;
and the vector fusion submodule 3022 is connected to the vector combination submodule and is configured to fuse each column of data in the feature map into a single value, so as to obtain the visual classification vector of the video to be classified.
Further, as shown in fig. 5, the video classification apparatus provided in the embodiment of the present invention further includes:
a sample obtaining module 305, configured to obtain a plurality of video samples, and a visual classification vector, a text classification vector, and a category value corresponding to each video sample;
and the training module 306 is connected to the sample acquisition module and the category acquisition module, and is configured to train the initial classifier according to the visual classification vector, the text classification vector and the category value corresponding to each video sample, so as to obtain a classification model.
When the video classification apparatus provided in this embodiment further includes the sample obtaining module 305 and the training module 306, the process of implementing video classification is similar to that provided in the second embodiment of the present invention, and details are not repeated here.
In summary, the video classification apparatus provided by the present invention achieves video classification by separately obtaining a visual classification vector and a text classification vector and substituting both into a classification model. Because the two vectors are used jointly as inputs for classification, the accuracy of video classification is improved, solving the low efficiency and accuracy of existing video classification methods. In addition, the visual classification vector is derived from the feature vectors of key frames, and the text classification vector carries deeper semantic information, which further improves classification accuracy.
Specifically, the video samples used in training correspond to 20 video categories: dance, music, food, makeup, dance, sports, handcraft, pets, mother and baby, drawing, lifestyle, fashion, games, animation, fitness, emotion, horoscope, travel, digital and home furnishing. The class values of these training categories correspond to the integers 0 through 19. Training the classifier may specifically include:
and extracting N frames (N is more than or equal to 3) of key frames from each video sample, wherein N is 4 as an example. And training the video initial classifier through the extracted key frames and the corresponding class values to obtain a video classification model. The video initial classifier may employ models of Resnet50(Residual Network 50, depth Residual Network 50), Resnet101(Residual Network 101, depth Residual Network 101), Xceptance (depth separable convolution), and the like.
Taking one video sample as an example, the process of obtaining the row vector of each video sample includes:
extracting image frames from the video sample; recognizing the characters in the image frames to obtain the texts they contain (the recognition method may be CRNN, CTPN, or the like); and training the initial text classifier on the texts contained in the image frames and the corresponding class values to obtain the text classification model. The initial text classifier may use models such as TextCNN, FastText or LSTM.
Key frames are acquired from the video sample; extracting the following 4 key frames is taken as an example:
key frame 0, dimension (255, 255, 3);
key frame 1, dimension (255, 255, 3);
key frame 2, dimension (255, 255, 3);
key frame 3, dimension (255, 255, 3).
Each key frame of the video sample is substituted into the video classification model to obtain its feature vector; the dimension of each feature vector is (1, 2048):
[-1.4759, -0.6063, 1.2209, …, 0.3973, -0.1676, 2.7899]
[-0.7009, -0.4696, 1.7640, …, 1.1952, 1.3861, 0.2387]
[1.0831, -1.9600, 0.8904, …, 0.3973, -0.1676, 2.7899]
[0.1322, 0.6038, 2.6935, …, 0.3889, 1.4386, 1.0443]
Combining the 4 feature vectors by rows yields a 4×2048 feature map (rendered in the original publication as an image; its rows are the four vectors above).
Each column of the feature map, i.e., each feature dimension, is then fused into a single value. Taking averaging as the fusion, a visual classification vector of dimension (2048, 1) is obtained: [-0.2404, -0.6080, 1.6422, …, 0.4514, 0.6358, 1.1756].
Image frames are extracted from the video, here at equal intervals of 1 s. Recognizing the characters in these frames yields the text contained in the video: in this example, the transcript of a makeup tutorial in which the blogger introduces two domestic-brand highlighters, describing their colors, prices and how to apply them. This text is input into the text classification model, yielding a text classification vector of dimension (1024, 1): [0.0107, 0.2644, 0.4699, …, 0.5430, 0.8514, 0.6103].
The visual classification vector and the text classification vector are then spliced to obtain a row vector of dimension (3072, 1).
The initial classifier is trained on the row vectors and corresponding class values of all video samples to obtain the classification model.
Finally, all videos can be classified using the video classification model, the text classification model and the classification model; the classification process is similar to that of embodiment 1 and is not repeated here.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A method of video classification, comprising:
acquiring a feature vector of each key frame in the video to be classified according to the key frames in the video to be classified;
acquiring a visual classification vector of the video to be classified according to the feature vector of each key frame in the video to be classified;
acquiring a text classification vector of the video to be classified according to texts contained in image frames in the video to be classified;
substituting the visual classification vector and the text classification vector into a preset classification model to obtain the category of the video to be classified;
the obtaining of the visual classification vector of the video to be classified includes:
combining the feature vectors of all key frames in the video to be classified according to rows to obtain a feature map;
fusing each column of data in the feature map into a single value to obtain a visual classification vector of the video to be classified;
the fusing each column of data in the feature map into a single value to obtain the visual classification vector of the video to be classified comprises:
calculating the maximum value of each column of data in the feature map to obtain the visual classification vector of the video to be classified.
2. The method of claim 1, wherein before said substituting the visual classification vector and the text classification vector into a preset classification model, the method further comprises:
acquiring a plurality of video samples, and a visual classification vector, a text classification vector and a category value corresponding to each video sample;
and training the initial classifier according to the visual classification vector, the text classification vector and the class value corresponding to each video sample to obtain the classification model.
3. The method according to any one of claims 1 to 2, wherein the obtaining the feature vector of each key frame in the video to be classified comprises:
performing feature extraction on each key frame in the video to be classified with a preset image classifier to obtain the feature vector of each key frame in the video to be classified; or,
obtaining a feature point of each key frame in the video to be classified, and obtaining the feature vector of each key frame in the video to be classified according to the feature points.
4. The method according to any one of claims 1 to 2, wherein the obtaining the feature vector of each key frame in the video to be classified comprises:
extracting key frames from the video to be classified according to a preset rule; the preset rules include: one of duration, weight, interval and click rate;
and acquiring the feature vector of each key frame in the video to be classified.
5. The video classification method according to any one of claims 1 to 2, wherein obtaining the text classification vector of the video to be classified comprises:
extracting image frames from the video to be classified;
identifying characters in the image frame to obtain texts contained in the image frame;
and combining texts contained in all image frames in the video to be classified, and then classifying the texts to obtain a text classification vector of the video to be classified.
6. A video classification apparatus, comprising:
the characteristic acquisition module is used for acquiring a characteristic vector of each key frame in the video to be classified according to the key frames in the video to be classified;
the visual classification module is connected with the feature acquisition module and used for acquiring a visual classification vector of the video to be classified according to the feature vector of each key frame in the video to be classified;
the text classification module is used for acquiring a text classification vector of the video to be classified according to texts contained in image frames in the video to be classified;
the category acquisition module is respectively connected with the visual classification module and the text classification module and used for substituting the visual classification vector and the text classification vector into a preset classification model to acquire the category of the video to be classified;
the visual classification module comprises:
the vector combination submodule is used for combining the feature vectors of all key frames in the video to be classified according to rows to obtain a feature map;
the vector fusion submodule is connected with the vector combination submodule and is used for fusing each column of data in the feature map into a single value to obtain the visual classification vector of the video to be classified;
wherein the fusing each column of data in the feature map into a single value to obtain the visual classification vector of the video to be classified comprises:
calculating the maximum value of each column of data in the feature map to obtain the visual classification vector of the video to be classified.
7. The video classification apparatus according to claim 6, wherein the apparatus further comprises:
the system comprises a sample acquisition module, a classification module and a classification module, wherein the sample acquisition module is used for acquiring a plurality of video samples and a visual classification vector, a text classification vector and a category value corresponding to each video sample;
and the training module is respectively connected with the sample acquisition module and the category acquisition module and is used for training the initial classifier according to the visual classification vector, the text classification vector and the category value corresponding to each video sample to obtain the classification model.
CN201911058829.6A 2019-11-01 2019-11-01 Video classification method and device Active CN110879974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911058829.6A CN110879974B (en) 2019-11-01 2019-11-01 Video classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911058829.6A CN110879974B (en) 2019-11-01 2019-11-01 Video classification method and device

Publications (2)

Publication Number Publication Date
CN110879974A (en) 2020-03-13
CN110879974B (en) 2020-10-13

Family

ID=69728219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911058829.6A Active CN110879974B (en) 2019-11-01 2019-11-01 Video classification method and device

Country Status (1)

Country Link
CN (1) CN110879974B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488489B * 2020-03-26 2023-10-24 Tencent Technology (Shenzhen) Co., Ltd. Video file classification method, device, medium and electronic equipment
CN111556377A * 2020-04-24 2020-08-18 Zhuhai Hengqin Dianxiang Technology Co., Ltd. Short video labeling method based on machine learning
CN111859024A * 2020-07-15 2020-10-30 Beijing ByteDance Network Technology Co., Ltd. Video classification method and device and electronic equipment
CN114157906B * 2020-09-07 2024-04-02 Beijing Dajia Internet Information Technology Co., Ltd. Video detection method, device, electronic equipment and storage medium
CN113159010B * 2021-03-05 2022-07-22 Beijing Baidu Netcom Science and Technology Co., Ltd. Video classification method, device, equipment and storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN108241729A * 2017-09-28 2018-07-03 Xinhua Zhiyun Technology Co., Ltd. Method and apparatus for screening videos
CN110019950A * 2019-03-22 2019-07-16 Guangzhou Xinshizhan Investment Consulting Co., Ltd. Video recommendation method and device

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN102222101A * 2011-06-22 2011-10-19 North China University of Technology Method for video semantic mining
US9652675B2 * 2014-07-23 2017-05-16 Microsoft Technology Licensing, LLC Identifying presentation styles of educational videos
CN104657468B * 2015-02-12 2018-07-31 Institute of Automation, Chinese Academy of Sciences Rapid classification method for videos based on image and text
CN108763325B * 2018-05-04 2019-10-01 Beijing Dajia Internet Information Technology Co., Ltd. A network object processing method and device
CN109660865B * 2018-12-17 2021-09-21 Hangzhou Youzijie Information Technology Co., Ltd. Method and device for automatically labeling videos, medium and electronic equipment
CN110162669B * 2019-04-04 2021-07-02 Tencent Technology (Shenzhen) Co., Ltd. Video classification processing method and device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN110879974A (en) 2020-03-13

Similar Documents

Publication Publication Date Title
CN110879974B (en) Video classification method and device
Chen et al. Sketchygan: Towards diverse and realistic sketch to image synthesis
Hong et al. Inferring semantic layout for hierarchical text-to-image synthesis
CN110166827B (en) Video clip determination method and device, storage medium and electronic device
Kim et al. Dense relational captioning: Triple-stream networks for relationship-based captioning
Yu et al. Semantic jitter: Dense supervision for visual comparisons via synthetic images
Romero et al. Smit: Stochastic multi-label image-to-image translation
Tian et al. Query-dependent aesthetic model with deep learning for photo quality assessment
CN109635680B (en) Multitask attribute identification method and device, electronic equipment and storage medium
CN110781347A (en) Video processing method, device, equipment and readable storage medium
CN110083729B (en) Image searching method and system
WO2012167568A1 (en) Video advertisement broadcasting method, device and system
Dimitropoulos et al. Classification of multidimensional time-evolving data using histograms of grassmannian points
Huang et al. ReVersion: Diffusion-based relation inversion from images
Wang et al. Early action prediction with generative adversarial networks
CN112102157A (en) Video face changing method, electronic device and computer readable storage medium
Hao et al. Vico: Detail-preserving visual condition for personalized text-to-image generation
Wang et al. From attribute-labels to faces: face generation using a conditional generative adversarial network
WO2023056835A1 (en) Video cover generation method and apparatus, and electronic device and readable medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN114332466B (en) Continuous learning method, system, equipment and storage medium for image semantic segmentation network
More et al. Seamless nudity censorship: an image-to-image translation approach based on adversarial training
Song et al. Hierarchical LSTMs with adaptive attention for visual captioning
Tzelepis et al. Video aesthetic quality assessment using kernel Support Vector Machine with isotropic Gaussian sample uncertainty (KSVM-IGSU)
CN115222858A (en) Method and equipment for training animation reconstruction network and image reconstruction and video reconstruction thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant