CN111400551B - Video classification method, electronic equipment and storage medium - Google Patents

Video classification method, electronic equipment and storage medium

Info

Publication number
CN111400551B
CN111400551B (application CN202010176420.0A)
Authority
CN
China
Prior art keywords
matrix
clustering
target
video
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010176420.0A
Other languages
Chinese (zh)
Other versions
CN111400551A (en)
Inventor
周晓晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and MIGU Culture Technology Co Ltd
Priority to CN202010176420.0A
Publication of CN111400551A
Application granted
Publication of CN111400551B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a video classification method, a video classification apparatus, an electronic device and a storage medium, which classify videos through a video classification model. The video classification model includes a clustering operation layer, which operates on the feature information together with a cluster center matrix used as a training parameter. Through the operation performed by the clustering operation layer, cluster analysis can be performed on the feature information based on the cluster centers represented by the column vectors of the cluster center matrix, features that help determine the category of the video are extracted, and the accuracy of video classification is improved. At the same time, videos are classified automatically by the video classification model, which improves the efficiency of video classification.

Description

Video classification method, electronic equipment and storage medium
Technical Field
The present invention relates to the field of machine learning and video analysis technologies, and in particular, to a video classification method, an electronic device, and a storage medium.
Background
Video classification facilitates the retrieval and management of videos; typically, a video is tagged with a label indicating the category to which it belongs. Traditionally, videos are classified by manual labeling. However, with the development of internet technology, more and more videos, especially short videos, appear on the network, for example short videos uploaded by individual users. These short videos span multiple categories such as animation, movie and television, diet, entertainment, sports, games and so on. Classifying them by manual labeling consumes a large amount of labor, and is easily influenced by personal subjective factors, resulting in incomplete and inaccurate classification.
Therefore, classifying videos by manual labeling is inefficient and easily leads to inaccurate classification.
Disclosure of Invention
The embodiment of the invention provides a video classification method, an electronic device and a storage medium, which are used to solve the prior-art problems that classifying videos by manual labeling is inefficient and easily leads to inaccurate classification.
In view of the above technical problems, in a first aspect, an embodiment of the present invention provides a video classification method, including:
extracting feature information according to constituent elements of a video, wherein the constituent elements include images, audio and/or subtitles of the video;
inputting the characteristic information into a video classification model to obtain classification information output by the video classification model; the classification information is used for representing the category to which the video belongs;
the video classification model is a model for classifying videos, which is obtained by taking characteristic information extracted according to sample videos as input and through machine learning training; the video classification model comprises a clustering operation layer, and the clustering operation layer is used for performing operation according to a clustering center matrix as a training parameter and the characteristic information.
In a second aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the video classification method described above when executing the program.
In a third aspect, an embodiment of the present invention provides a non-transitory readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the video classification method described in any one of the above.
The embodiment of the invention provides a video classification method, an electronic device and a storage medium, which classify videos through a video classification model. The video classification model includes a clustering operation layer, which operates on the feature information together with a cluster center matrix used as a training parameter. Through the operation performed by the clustering operation layer, cluster analysis can be performed on the feature information based on the cluster centers represented by the column vectors of the cluster center matrix, features that help determine the category to which the video belongs are extracted, and the accuracy of video classification is improved. At the same time, videos are classified automatically by the video classification model, which improves classification efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a video classification method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a video classification method according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of an information processing procedure of a clustering sublayer according to another embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an implementation principle of a weight aggregation layer according to another embodiment of the present invention;
fig. 5 is a block diagram of a video classification apparatus according to another embodiment of the present invention;
fig. 6 is a schematic physical structure diagram of an electronic device according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides a video classification method, which is applicable to classifying any video that needs to be classified and can be executed by any device, such as a computer, a server or a mobile phone. For example, a social platform that hosts a large number of short videos needs to label the videos uploaded to the platform by each user. In order to label the videos correctly and efficiently, the category to which a video belongs can be determined through the video classification method provided by the present application, and the video is then labeled with the label corresponding to that category.
Fig. 1 is a schematic flow chart of a video classification method provided in this embodiment, and referring to fig. 1, the method includes the following steps:
step 101: feature information is extracted from constituent elements of a video, wherein the constituent elements include images, audio, and/or subtitles of the video.
The constituent elements may further include a description text and a description voice for describing the video, and the like, which is not specifically limited in this embodiment.
The feature information includes information extracted from a plurality of constituent elements or information extracted from a certain constituent element, for example, the feature information includes a feature matrix extracted from an image of a video and a feature matrix extracted from an audio of the video, or the feature information includes only a feature matrix extracted from an image of a video.
Wherein, the extraction of the feature information includes the following steps: inputting multiple frames of images extracted from the video into an Inception_v3 model, and taking the feature matrix output by the Inception_v3 model as the feature matrix extracted from the images of the video; inputting audio clips acquired from the audio of the video into a VGG model, and taking the feature matrix output by the VGG model as the feature matrix extracted from the audio of the video.
Step 102: inputting the characteristic information into a video classification model to obtain classification information output by the video classification model; the classification information is used for representing the category to which the video belongs; the video classification model is a model for classifying videos, which is obtained by taking characteristic information extracted according to sample videos as input and through machine learning training; the video classification model comprises a clustering operation layer, and the clustering operation layer is used for performing operation according to a clustering center matrix as a training parameter and the characteristic information.
Specifically, the video classification model is a model obtained by training a pre-constructed initial model with sample feature information extracted from a sample video as an input and classification information indicating a category to which the sample video belongs as an expected output. The initial model comprises a clustering operation layer which takes a clustering center matrix as a training parameter.
The category to which the video belongs includes animation, movie, diet, entertainment, sports, games, and the like. The classification information may indicate the probability that the video belongs to each category, and the category corresponding to the highest probability is generally set as the category to which the video belongs.
Each column vector in the cluster center matrix represents a cluster center. After the training process of the model, the clustering operation layer can perform clustering analysis on the characteristic information of the video based on each clustering center, so that the category of the video is accurately determined.
This embodiment provides a video classification method that classifies videos through a video classification model. The video classification model includes a clustering operation layer, which operates on the feature information together with a cluster center matrix used as a training parameter. Through the operation performed by the clustering operation layer, cluster analysis can be performed on the feature information based on the cluster centers represented by the column vectors of the cluster center matrix, features that help determine the category to which the video belongs are extracted, and the accuracy of video classification is improved. At the same time, videos are classified automatically by the video classification model, which improves classification efficiency.
Wherein, in order to label the video, the method further comprises, after the step 102: and determining the category of the video according to the classification information, and marking the video with a label corresponding to the category of the video.
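For illustration, a minimal Python sketch of this labeling step follows; the category names and probability values are hypothetical.

```python
import numpy as np

# Pick the category with the highest probability in the classification information
# and attach the corresponding label (category names are illustrative only).
categories = ["animation", "movie", "diet", "entertainment", "sports", "games"]
classification_info = np.array([0.05, 0.80, 0.06, 0.04, 0.03, 0.02])
video_label = categories[int(np.argmax(classification_info))]
print(video_label)   # 'movie'
```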
Further, on the basis of the foregoing embodiment, the inputting the feature information into a video classification model to obtain classification information output by the video classification model includes:
inputting the characteristic information into the clustering operation layer, and outputting a first matrix by the clustering operation layer;
inputting the first matrix into a weight aggregation layer of the video classification model, and outputting a second matrix by the weight aggregation layer;
inputting the second matrix into a fully connected layer of the video classification model, outputting a prediction vector by the fully connected layer, and taking the prediction vector as the classification information;
wherein the weight aggregation layer comprises at least one convolution sublayer and at least one activation function; the prediction vector includes a probability that a category to which the video belongs is each preset category.
In this embodiment, the video classification model includes a clustering operation layer, a weight aggregation layer and a fully connected layer. The feature information extracted from the video is processed by the clustering operation layer, the weight aggregation layer and the fully connected layer in sequence to output the classification information. The weight aggregation layer is composed of convolution sublayers and activation functions, so the differences between the features that represent the categories can be further strengthened, which helps determine the category of the video accurately.
Fig. 2 is a schematic diagram of the principle of the video classification method according to this embodiment. Referring to fig. 2, after the video features (i.e., the feature matrix extracted from the images of the video) and the audio features (i.e., the feature matrix extracted from the audio of the video) are input to the clustering operation layer 201, a first matrix is determined by the clustering operation layer 201 and input to the weight aggregation layer 202. After the weight aggregation layer 202 determines the second matrix, the second matrix is input into the fully connected layer 203, which outputs the prediction vector corresponding to the preset categories. The probability that the video belongs to each preset category can thus be determined from the prediction vector, the category with the highest probability (or several categories with higher probabilities) is taken as the category to which the video belongs, and a label is added according to that category.
The embodiment further strengthens the difference between the characteristics corresponding to the various categories through the weight aggregation layer, thereby being beneficial to predicting the category to which the video belongs more accurately.
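The overall data flow can be sketched as follows. This is a structural sketch only: the clustering operation layer and the weight aggregation layer are replaced by stand-in modules (their actual operation is detailed in the following sections), the class count is hypothetical, and the sigmoid output is an assumption consistent with the per-category probabilities described above. In the embodiment of fig. 2 the first matrix is 1 × 139264 and the second matrix is 1 × 2048; much smaller sizes are used in the demo call.

```python
import torch
import torch.nn as nn

class VideoClassifierSketch(nn.Module):
    def __init__(self, first_dim, second_dim, num_classes):
        super().__init__()
        self.cluster_layer = nn.Identity()                  # stand-in: yields the first matrix
        self.weight_agg = nn.Linear(first_dim, second_dim)  # stand-in for the weight aggregation layer
        self.fc = nn.Linear(second_dim, num_classes)        # fully connected layer -> prediction vector

    def forward(self, feature_info):
        first_matrix = self.cluster_layer(feature_info)
        second_matrix = self.weight_agg(first_matrix)
        return torch.sigmoid(self.fc(second_matrix))        # per-category probabilities (assumed sigmoid)

model = VideoClassifierSketch(first_dim=512, second_dim=128, num_classes=6)
print(model(torch.randn(1, 512)).shape)                     # torch.Size([1, 6])
```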
Wherein the inputting the first matrix into a weight aggregation layer of the video classification model and outputting a second matrix by the weight aggregation layer comprises:
inputting the first matrix into a first convolution sublayer in the weight aggregation layer to obtain a first convolution result, processing the first convolution result through a first activation function to obtain a first processing result, inputting the first processing result into a second convolution sublayer to obtain a second convolution result, and processing the second convolution result through a second activation function to obtain the second matrix.
Wherein the first activation function is ReLU and the second activation function is Sigmoid.
In this embodiment, a weight aggregation layer composed of a first convolution sublayer, a first activation function, a second convolution sublayer, and a second activation function is constructed, where table 1 is a structure of the weight aggregation layer composed of the first convolution sublayer, the first activation function, the second convolution sublayer, and the second activation function. It is understood that other configurations of weight aggregation layers may be constructed as desired, for example, a weight aggregation layer consisting of three or more convolution sublayers and three or more activation functions may be constructed.
Table 1 structural information of weight aggregation layer
(Table 1 appears as an image in the original publication and is not reproduced here; as described below, it lists the two convolution sublayers with 1 × 1 kernels, the ReLU layer and the Sigmoid activation function.)
This embodiment provides a weight aggregation layer with a simple structure, which further improves the classification effect without increasing the computational complexity.
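A possible PyTorch sketch of such a weight aggregation layer follows. The two 1 × 1 convolution sublayers with ReLU and Sigmoid activations and the element-wise gating P1 = P0 · W2 follow the detailed embodiment later in the description (Table 1 and fig. 4); treating the 1 × 1 convolutions as acting across the channels of a length-1 signal is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class WeightLayer(nn.Module):
    """Sketch of the weight aggregation layer: first convolution sublayer + ReLU,
    second convolution sublayer + Sigmoid, then element-wise gating P1 = P0 * W2."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, hidden_dim, kernel_size=1)     # first 1 x 1 convolution sublayer
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=1) # second 1 x 1 convolution sublayer
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, fv):                   # fv: (batch, in_dim) -- the first matrix
        x = fv.unsqueeze(-1)                 # (batch, in_dim, 1): the 1 x 1 conv mixes channels only
        p0 = self.relu(self.conv1(x))        # first convolution result -> first activation (ReLU)
        w2 = self.sigmoid(self.conv2(p0))    # second convolution result -> second activation (Sigmoid)
        p1 = p0 * w2                         # element-wise gating, P1 = P0 * W2
        return p1.squeeze(-1)                # (batch, hidden_dim) -- the second matrix

# Demo with small sizes; the embodiment uses in_dim = 139264 and hidden_dim = 2048.
layer = WeightLayer(in_dim=512, hidden_dim=64)
print(layer(torch.randn(1, 512)).shape)      # torch.Size([1, 64])
```

With a length-1 input, the 1 × 1 convolutions behave like fully connected layers, so the layer in effect learns a channel-wise gate over the clustered features without adding spatial mixing.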
Further, on the basis of the foregoing embodiments, the inputting the feature information into the clustering operation layer, and outputting a first matrix by the clustering operation layer includes:
taking a feature matrix extracted according to any one component element in the feature information as a target feature matrix, and determining a target clustering sublayer corresponding to the target feature matrix from each clustering sublayer of the clustering operation layer;
inputting the target characteristic matrix into the target clustering sublayer, and outputting a clustering operation matrix by the target clustering sublayer;
acquiring a clustering operation matrix output by a clustering sublayer corresponding to each feature matrix in the feature information, and splicing the acquired clustering operation matrices to obtain the first matrix;
and the target clustering sublayer is used for calculating according to the target feature matrix and the clustering center matrix belonging to the target clustering sublayer.
If the feature information includes feature matrices extracted according to different constituent elements, each feature matrix can be input into a clustering sublayer corresponding to the feature matrix, and finally, clustering operation matrices output by the clustering sublayers are spliced to obtain a first matrix.
It should be noted that each cluster sublayer includes a cluster center matrix, and the cluster center matrices in different cluster sublayers may be different or the same in size. For example, fig. 2 shows a cluster center matrix size of 1024 × 64 in the cluster sub-layer for processing the video features, and a cluster center matrix size of 128 × 32 in the cluster sub-layer for processing the audio features.
In this embodiment, the feature matrices extracted from different types of constituent elements are subjected to cluster analysis separately, which avoids interference between feature matrices of different types. Finally, all the clustering operation matrices are spliced, so that the subsequent classification is performed on the basis of all constituent elements and comprehensively considers the different types of features of the video.
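A minimal sketch of this routing and splicing follows; the per-element cluster sublayers are stubbed so that they only reproduce the output sizes of the later embodiment (1 × 131072 for video, 1 × 8192 for audio), and a full sketch of a cluster sublayer is given after the next subsection.

```python
import numpy as np

# Stand-ins for the per-element cluster sublayers (output sizes from the embodiment).
sublayers = {
    "video": lambda feat: np.zeros((1, 131072)),
    "audio": lambda feat: np.zeros((1, 8192)),
}
feature_info = {
    "video": np.zeros((300, 1024)),   # feature matrix extracted from the images
    "audio": np.zeros((300, 128)),    # feature matrix extracted from the audio
}
clustered = [sublayers[name](feat) for name, feat in feature_info.items()]
first_matrix = np.concatenate(clustered, axis=1)   # spliced clustering operation matrices
print(first_matrix.shape)                          # (1, 139264)
```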
Further, on the basis of the foregoing embodiments, in order to clearly illustrate the operation process of each clustering sublayer, fig. 3 is a schematic diagram of an information processing process of the clustering sublayer provided in this embodiment, and referring to fig. 3, the inputting the target feature matrix into the target clustering sublayer and outputting the clustering operation matrix by the target clustering sublayer includes:
1) Taking the cluster center matrix belonging to the target cluster sublayer as a target cluster center matrix, inputting the target feature matrix into a cluster analysis unit in the target cluster sublayer, and outputting a cluster analysis result by the cluster analysis unit;
2) Inputting the cluster analysis result and the target feature matrix into an intermediate operation unit in the target cluster sublayer, and outputting an intermediate operation result by the intermediate operation unit;
3) Inputting the cluster analysis result and the intermediate operation result into a first operation unit in the target cluster sublayer, and determining a first coding matrix by the first operation unit according to the cluster analysis result, the intermediate operation result and a covariance matrix serving as a training parameter;
4) Inputting the cluster analysis result, the intermediate operation result and the target feature matrix into a second operation unit in the target cluster sublayer, and determining a second coding matrix by the second operation unit according to the cluster analysis result, the intermediate operation result, the target feature matrix and the covariance matrix;
5) And splicing the first coding matrix and the second coding matrix to obtain a clustering operation matrix output by the target clustering sublayer.
The inputting the target feature matrix into the cluster analysis unit in the target cluster sublayer, and outputting a cluster analysis result by the cluster analysis unit in 1) above specifically includes:
inputting the target feature matrix in_put into the cluster analysis unit, determining a first transformation matrix by the cluster analysis unit according to a cross multiplication result of the target feature matrix in_put and the target cluster center matrix Ck, determining a first weight matrix activation according to the first transformation matrix, and transposing the first weight matrix activation to obtain a second weight matrix activation_T;
and determining a characteristic cluster vector a_sum according to the sum of elements in each column vector of the first weight matrix activation, and taking the second weight matrix activation_T and the characteristic cluster vector a_sum as the cluster analysis result.
Wherein the determining a first weight matrix from the first transformation matrix comprises: and activating the first transformation matrix through a softmax function to obtain the first weight matrix.
The step 2) of inputting the cluster analysis result and the target feature matrix into an intermediate operation unit in the target cluster sublayer, and outputting an intermediate operation result by the intermediate operation unit, includes:
inputting the target feature matrix in_put and the second weight matrix activation_T in the cluster analysis result into the intermediate operation unit, determining a second transformation matrix by the intermediate operation unit according to a cross multiplication result of the second weight matrix activation_T and the target feature matrix in_put, transposing the second transformation matrix to obtain a third transformation matrix fv1_1, and taking the third transformation matrix fv1_1 as the intermediate operation result.
The step 3) of inputting the cluster analysis result and the intermediate operation result into the first operation unit in the target cluster sub-layer, wherein the first operation unit determines the first coding matrix according to the cluster analysis result, the intermediate operation result and a covariance matrix as a training parameter, and includes:
inputting the feature cluster vector a_sum in the cluster analysis result and the third transformation matrix fv1_1 into the first operation unit, and determining a fourth transformation matrix a1 by the first operation unit according to a result of multiplying each row vector of the target cluster center matrix Ck by the position element corresponding to the feature cluster vector;
determining a first residual matrix fv1_2 according to a result of subtracting the fourth transformation matrix a1 from the third transformation matrix fv1_1, and determining the first coding matrix fv1_3 according to a result of dividing the first residual matrix fv1_2 by the covariance matrix delta_k.
The step 4) of inputting the cluster analysis result, the intermediate operation result, and the target feature matrix into a second operation unit in the target cluster sublayer, wherein the second operation unit determines a second coding matrix according to the cluster analysis result, the intermediate operation result, the target feature matrix, and the covariance matrix, and the step includes:
inputting the second weight matrix activation_T, the feature clustering vector a_sum, the third transformation matrix fv1_1 and the target feature matrix in_put into the second arithmetic unit, performing a square operation on each element of the target feature matrix by the second arithmetic unit to obtain a second-order feature matrix, determining a fifth transformation matrix according to a cross multiplication result of the second weight matrix activation_T and the second-order feature matrix, and transposing the fifth transformation matrix to obtain a sixth transformation matrix fv2_1;
performing a square operation on each element of the target clustering center matrix Ck to obtain a second-order clustering center matrix, and determining a seventh transformation matrix a2 according to a result of multiplying each row vector of the second-order clustering center matrix by the position element corresponding to the feature clustering vector a_sum;
multiplying each element of the target clustering center matrix Ck by a preset ratio to obtain a transformation clustering center matrix, and determining an eighth transformation matrix b2 according to a result of point multiplication of the third transformation matrix fv1_1 and the transformation clustering center matrix;
performing a square operation on each element of the covariance matrix delta_k to obtain a second-order covariance matrix, adding the sixth transformation matrix fv2_1, the seventh transformation matrix a2 and the eighth transformation matrix b2 to obtain a second residual matrix fv2_2, and determining the second coding matrix fv2_3 according to a result of dividing the second residual matrix fv2_2 by the second-order covariance matrix.
The preset ratio is set manually; for example, the preset ratio is -2.
The target feature matrix in_put is the feature matrix in_video extracted from the images of the video or the feature matrix in_audio extracted from the audio of the video.
In the above 4), the operation is performed according to the target feature matrix subjected to the square operation, the target clustering center matrix and the covariance matrix, so that the nonlinear factors in the process of analyzing the target feature matrix by the clustering sublayer are increased, and the increase of the nonlinear factors is beneficial to improving the accuracy of classification.
Wherein, the splicing the first coding matrix and the second coding matrix in the above 5) to obtain the clustering operation matrix output by the target clustering sublayer includes:
and normalizing the first coding matrix fv1_3, normalizing the second coding matrix fv2_3, and splicing the normalized first coding matrix fv1_3 and the normalized second coding matrix fv2_3 to obtain a clustering operation matrix output by the target clustering sublayer.
Wherein, the purpose of the normalization processing in 5) is to facilitate the subsequent data processing and ensure that the convergence is accelerated when the model runs.
In the embodiment, the clustering analysis of the target feature matrix is realized through the clustering analysis unit, the intermediate operation unit, the first operation unit and the second operation unit in the clustering sublayer, and the video classification according to the analysis is facilitated.
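For concreteness, a minimal NumPy sketch of one cluster sublayer, following steps 1) to 5) above, is given below. It is a sketch under assumptions rather than the exact implementation: the preset ratio is taken as -2 as in the example above, and the normalization in step 5) is assumed to be L2 normalization of the flattened coding matrices (the description only says "normalize"). The usage example uses the sizes of the video sublayer of the later embodiment (300 frames, 1024-dimensional features, 64 cluster centers).

```python
import numpy as np

def softmax(x, axis=1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cluster_sublayer(in_put, Ck, delta_k):
    """in_put: (N, D) target feature matrix; Ck, delta_k: (D, K) cluster center and
    covariance matrices used as training parameters. Returns the (1, 2*D*K) clustering
    operation matrix."""
    # 1) cluster analysis unit
    activation = softmax(in_put @ Ck, axis=1)        # first weight matrix, (N, K)
    activation_T = activation.T                      # second weight matrix, (K, N)
    a_sum = activation.sum(axis=0, keepdims=True)    # feature cluster vector, (1, K)
    # 2) intermediate operation unit
    fv1_1 = (activation_T @ in_put).T                # third transformation matrix, (D, K)
    # 3) first operation unit: first-order coding matrix
    a1 = Ck * a_sum                                  # fourth transformation matrix
    fv1_2 = fv1_1 - a1                               # first residual matrix
    fv1_3 = fv1_2 / delta_k                          # first coding matrix
    # 4) second operation unit: second-order coding matrix
    fv2_1 = (activation_T @ in_put ** 2).T           # sixth transformation matrix
    a2 = (Ck ** 2) * a_sum                           # seventh transformation matrix
    b2 = fv1_1 * (Ck * -2.0)                         # eighth transformation matrix (preset ratio -2)
    fv2_2 = fv2_1 + a2 + b2                          # second residual matrix
    fv2_3 = fv2_2 / (delta_k ** 2)                   # second coding matrix
    # 5) normalize and splice (L2 normalization assumed)
    fv1_4 = fv1_3.reshape(1, -1)
    fv2_4 = fv2_3.reshape(1, -1)
    fv1_4 = fv1_4 / (np.linalg.norm(fv1_4) + 1e-12)
    fv2_4 = fv2_4 / (np.linalg.norm(fv2_4) + 1e-12)
    return np.concatenate([fv1_4, fv2_4], axis=1)    # clustering operation matrix

# e.g. the video sublayer of the embodiment: N=300 frames, D=1024, K=64 centers
rng = np.random.default_rng(0)
out = cluster_sublayer(rng.standard_normal((300, 1024)),
                       rng.standard_normal((1024, 64)),
                       np.ones((1024, 64)))
print(out.shape)  # (1, 131072)
```

The first coding matrix corresponds to an accumulation of first-order residuals about each cluster center, and the second coding matrix accumulates the squared residuals, which is what steps (9) to (13) of the embodiment below compute element by element.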
Wherein, the extracting the feature information according to the constituent elements of the video comprises:
extracting images whose number of frames equals a preset frame number from the video, inputting the extracted images into an Inception_v3 model, and taking the matrix output by the last hidden layer of the Inception_v3 model as the feature matrix extracted from the images of the video;
extracting audio segments whose number equals a preset segment number from the audio of the video, inputting the extracted audio segments into a VGG model, and taking the matrix output by the VGG model as the feature matrix extracted from the audio of the video;
taking a feature matrix extracted from an image of the video and a feature matrix extracted from an audio of the video as the feature information;
the more the number of image frames contained in the video is, the larger the preset number of frames is; the longer the audio duration of the video is, the larger the preset segment number is.
The method provided by the embodiment can be used for classifying videos of any duration, but in order to ensure the efficiency of video classification, the method provided by the embodiment is generally used for classifying short videos. The short video is a video with the video playing time length smaller than the preset playing time length. For example, the preset playing time is 5 minutes, the preset frame number is 300, and the preset number of clips is 300.
The video classification model is obtained by training an initial model, and the following training process of the video classification model is introduced:
as an example, the initial model includes the clustering operation layer, the weight aggregation layer and the fully-connected layer in the above embodiments, wherein the clustering operation layer includes at least one sub-clustering layer. The training parameters in the initial model include the cluster center matrix of each cluster sub-layer and the covariance matrix of each cluster sub-layer. In the training process of the model, sample characteristic information extracted from a sample video is used as input, classification information representing the class to which the sample video belongs is used as a training label, and the video classification model is obtained by continuously adjusting training parameters.
The following provides a specific process of obtaining a video classification model through model training, which uses a feature matrix extracted from the images of the video (in_video) and a feature matrix extracted from the audio of the video (in_audio) as the feature information. The process includes the following 4 steps:
step 1: constructing a sample data set
Acquire a large number of videos shorter than 5 minutes, uniformly sample 300 frames of images from each video, input the 300 sampled images into an existing Inception_v3 model, and take the output of the last hidden layer of the model as a 2048-dimensional vector for each frame. Since 300 images are sampled, an initial image feature matrix of size 300 × 2048 is obtained. PCA dimension reduction is then performed to obtain a 300 × 1024 image feature matrix in_video (the feature matrix extracted from the images).
The audio is uniformly sampled to obtain 300 audio segments, and the 300 segments are input into an existing VGG model to obtain a 300 × 128 audio feature matrix in_audio (the feature matrix extracted from the audio). The preset categories include at least one of: selfie, fun, animation, games, basketball, soccer, art, movies, etc. The video feature matrix, the audio feature matrix and the video labels are integrated to obtain a video label data set.
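For illustration, a shape-level sketch of this sample construction follows. The Inception_v3 and VGG feature extraction is stubbed with random data, the PCA projection is assumed to be pre-fitted (for example on the whole corpus) rather than computed per video, and the eight-category multi-hot label is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_feats = rng.standard_normal((300, 2048))   # 300 sampled frames x 2048-d Inception_v3 features (stub)
pca_mean = np.zeros(2048)                        # assumed pre-fitted PCA mean ...
pca_proj = rng.standard_normal((2048, 1024))     # ... and projection matrix (assumed pre-fitted)
in_video = (frame_feats - pca_mean) @ pca_proj   # (300, 1024) image feature matrix
in_audio = rng.standard_normal((300, 128))       # (300, 128) audio feature matrix from the VGG model (stub)
label = np.array([0, 1, 1, 0, 0, 0, 0, 0])       # hypothetical multi-hot label over 8 preset categories
sample = {"video": in_video, "audio": in_audio, "label": label}
print(sample["video"].shape, sample["audio"].shape)
```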
Step 2: building a deep learning model
A first module: Implementing in_video clustering
(1) Define a cluster center matrix C_k for processing the feature matrix in_video extracted from the video images, where C_k is a 1024 × 64 matrix and k = 64 denotes 64 cluster centers, each of which is 1024-dimensional. Define a covariance matrix delta_k, also a 1024 × 64 matrix corresponding to C_k, which is used to adjust the error magnitude.
(2) Multiply the feature matrix in_video extracted from the video images by the cluster center matrix C_k corresponding to the images to obtain a first transformation matrix of size 300 × 64, and activate the first transformation matrix with a softmax function to obtain a first weight matrix activation of dimension 300 × 64.
(3) Calculate, in turn, the sum of the elements in each column of the obtained first weight matrix activation to obtain a feature cluster vector a_sum of dimension 1 × 64. The feature cluster vector represents the distance from the video image features to each cluster center: the closer a value in the feature cluster vector is to 1, the closer the video image features are to the corresponding cluster center; conversely, the closer a value is to 0, the farther the video image features are from the corresponding cluster center.
(4) Transpose the first weight matrix activation to obtain a second weight matrix activation_T of size 64 × 300.
(5) Multiply the second weight matrix activation_T by the feature matrix in_video to obtain a second transformation matrix of size 64 × 1024, and transpose the second transformation matrix to obtain a third transformation matrix fv1_1 of size 1024 × 64.
(6) Multiply the feature cluster vector a_sum point by point with each row vector of the cluster center matrix C_k to obtain a fourth transformation matrix a1 of size 1024 × 64.
(7) Subtract the fourth transformation matrix a1 from the third transformation matrix fv1_1 to obtain a first residual matrix fv1_2 of size 1024 × 64, which represents the accumulated residual of each 1024-dimensional feature about each cluster center.
(8) Divide the first residual matrix fv1_2 by the covariance matrix delta_k to obtain first-order data of size 1024 × 64, which is used as the first coding matrix fv1_3.
(9) Square each element in the feature matrix in_video to obtain a second-order feature matrix of size 300 × 1024, multiply the second weight matrix activation_T by the second-order feature matrix to obtain a fifth transformation matrix of size 64 × 1024, and transpose the fifth transformation matrix to obtain a sixth transformation matrix fv2_1 of size 1024 × 64.
(10) Square each element in the cluster center matrix C_k to obtain a second-order cluster center matrix of size 1024 × 64, and multiply the feature cluster vector a_sum point by point with each row vector of the second-order cluster center matrix to obtain a seventh transformation matrix a2 of size 1024 × 64.
(11) Scale each element in the cluster center matrix C_k by -2 to obtain a transformed cluster center matrix of size 1024 × 64, and multiply the third transformation matrix fv1_1 point by point with each row vector of the transformed cluster center matrix to obtain an eighth transformation matrix b2 of size 1024 × 64.
(12) And adding the sixth transformation matrix fv2_1 to the seventh transformation matrix a2 and the eighth transformation matrix b2 to obtain a second residual matrix fv2_2 with the size of 1024 × 64, namely, the accumulation of residual squares of 1024-dimensional features with respect to the center of each cluster.
(13) Square each element in the covariance matrix delta_k to obtain a 1024 × 64 matrix delta_k^2, and divide the second residual matrix fv2_2 by the matrix delta_k^2 to obtain second-order data of size 1024 × 64, which is used as the second coding matrix fv2_3.
(14) Normalize the first coding matrix fv1_3 to obtain a matrix fv1_4 of size 1 × 65536, and normalize the second coding matrix fv2_3 to obtain a matrix fv2_4 of size 1 × 65536. Splice the matrices fv1_4 and fv2_4 by columns to obtain an output matrix fv_video of size 1 × 131072 (namely, the clustering operation matrix output by this clustering sublayer).
A second module: Implementing in_audio clustering (the processing of the first module and that of the second module are independent of each other)
(1) Define a 128 × 32 cluster center matrix C_k for processing the feature matrix in_audio extracted from the audio of the video, where k = 32 denotes 32 cluster centers, each of which is 128-dimensional. Define a 128 × 32 matrix as the covariance matrix delta_k, corresponding to C_k, for adjusting the error magnitude.
(2) Multiply the feature matrix in_audio extracted from the audio of the video by the cluster center matrix C_k corresponding to the audio to obtain a first transformation matrix of size 300 × 32, and activate the first transformation matrix with a softmax function to obtain a first weight matrix activation of dimension 300 × 32.
(3) And sequentially calculating the sum of each row of elements of the obtained first weight matrix activation to obtain a feature cluster vector a _ sum with the dimension of 1 × 32. The feature cluster vector represents the distance from the audio of the video to each cluster center, and if the value in the feature cluster vector is closer to 1, the closer the audio feature is to the corresponding cluster center. Conversely, a value closer to 0 indicates that the audio feature is further from the corresponding cluster center.
(4) And performing matrix transposition on the first weight matrix activation to obtain a second weight matrix activation _ T of 32 × 300.
(5) The second weight matrix activation _ T is multiplied by the feature matrix in _ audio to obtain a second transformation matrix of 32 × 128, and the second transformation matrix is transposed to obtain a third transformation matrix fv1_1 of 128 × 32.
(6) And multiplying the feature cluster vector a _ sum with each row vector of the cluster center matrix C _ k point by point to obtain a fourth transformation matrix a1 with the size of 128 x 32.
(7) And subtracting the fourth transformation matrix a1 from the third transformation matrix fv1_1 to obtain a first residual matrix fv1_2 with the size of 128 × 32, wherein the first residual matrix fv1_2 represents the accumulated residual of each 128-dimensional feature about each cluster center.
(8) The first residual matrix fv1_2 is divided by the covariance matrix delta_k to obtain first-order data of size 128 × 32, which is used as the first coding matrix fv1_3.
(9) Squaring each element in the feature matrix in _ audio to obtain a second-order feature matrix with the size of 300 × 128, multiplying the second weight matrix activation _ T by the second-order feature matrix to obtain a fifth transformation matrix with the size of 32 × 128, and transposing the fifth transformation matrix to obtain a sixth transformation matrix fv2_1 of 128 × 32.
(10) And squaring each element in the clustering center matrix C _ k to obtain a second-order clustering center matrix with the size of 128 × 32. And multiplying the characteristic clustering vector a _ sum with each row vector of the second-order clustering center matrix point by point to obtain a seventh transformation matrix a2 with the size of 128 x 32.
(11) And amplifying each element in the clustering center matrix C _ k by-2 times in an equal proportion to obtain a transformation clustering center matrix with the size of 128 x 32, and multiplying the third transformation matrix fv1_1 and each row vector of the transformation center matrix point by point to obtain an eighth transformation matrix b2 with the size of 128 x 32.
(12) The sixth transformation matrix fv2_1 is added to the seventh transformation matrix a2 and the eighth transformation matrix b2 to obtain a second residual matrix fv2_2 with a size of 128 × 32, i.e., the accumulation of the residual squares of the respective 128-dimensional features with respect to each cluster center.
(13) And squaring each element in the covariance matrix delta _ k to obtain a matrix delta _ k ^2 of 128 x 32, and dividing the second residual error matrix fv2_2 by the matrix delta _ k ^2 to obtain second-order data with the size of 128 x 32, wherein the second-order data is used as a second coding matrix fv2_3.
(14) The first encoding matrix fv1_3 is normalized to obtain a matrix fv1_4 with a size of 1 × 4096, and the second encoding matrix fv2_3 is normalized to obtain a matrix fv2_4 with a size of 1 × 4096. And splicing the matrices fv1_4 and fv2_4 according to columns to obtain an output matrix fv _ audio of 1 × 8192.
A third module: weightLayer realizes feature weight aggregation
(1) Splice the video output matrix fv_video and the audio output matrix fv_audio by columns to obtain an output matrix fv of size 1 × 139264. The matrix fv is then processed by the weight aggregation layer:
(2) The matrix fv is input into a convolution group, WeightLayer. As shown in Table 1 above, the WeightLayer includes 2 convolutional layers, 1 ReLU layer and 1 Sigmoid activation function, where the convolution kernel size of each of the 2 convolutional layers is 1 × 1.
Fig. 4 is a schematic diagram of the implementation principle of the weight aggregation layer provided in this embodiment. Referring to fig. 4, for the input matrix fv, an image feature P0 is obtained after the ReLU layer and a corresponding weight W2 is obtained after the Sigmoid activation function, where P0 and W2 are both matrices of size 1 × 2048. Matrix dot multiplication of P0 and W2 yields a feature matrix P1 of size 1 × 2048. Snow, Tree and Ski in fig. 4 are all label words used to represent video categories.
A fourth module: label classification
The feature matrix P1 is passed through a fully connected layer to obtain the output predict (i.e., the prediction vector), which represents the probability of the video in each category: a larger value indicates that the video is closer to the corresponding category, and conversely a smaller value indicates a greater difference from the corresponding category.
Step 3: Training the model
(1) And inputting the sample data into the model constructed in the step 2. A set of sample data contains video features, audio features and label data label.
(2) And performing cross entropy loss calculation on the model output value predict and the actual label by adopting a cross entropy loss function to obtain a loss value. The cross entropy loss function formula is as follows, where y represents the actual value label, x represents the output value prediction, and w is the initialization weight:
loss(x, y) = -w[y log(x) + (1 - y) log(1 - x)]
For example, suppose the preset classification categories of short videos are animation, movie, diet, entertainment, sports and games, and the generated classification output predict is (0.01, 0.98, 0.95, 0.02, 0.01, 0.01). If the actual label of the short video is [animation (0), movie (1), diet (1), entertainment (0), sports (0), games (0)], i.e. (0, 1, 1, 0, 0, 0), where 0 means the video does not belong to the category and 1 means it belongs to the corresponding category, the cross-entropy loss of each category is calculated in turn and the results are accumulated and averaged to obtain the final loss value (a numerical sketch of this computation is given after step (3) below). After the loss value is obtained, the model is trained through a back-propagation algorithm.
(3) And after the training is finished, obtaining a deep learning model.
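Below is a numerical sketch of the loss computation in step (2), assuming w = 1; the prediction and label vectors are those of the example above.

```python
import numpy as np

predict = np.array([0.01, 0.98, 0.95, 0.02, 0.01, 0.01])   # model output x
label   = np.array([0,    1,    1,    0,    0,    0   ])   # actual value y
w = 1.0
# per-category binary cross-entropy: -w[y log(x) + (1 - y) log(1 - x)]
loss = -w * (label * np.log(predict) + (1 - label) * np.log(1 - predict))
print(loss.mean())   # final loss value (mean over the six categories), about 0.020
```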
Step 4: Automatic label classification for short videos
300 frames are extracted from a short video, the video features and audio features of the short video are obtained through feature extraction, and the features are input into the trained short video classification model, which outputs the probability corresponding to each category; the 3 categories with the highest probabilities are extracted and output as the categories of the short video.
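A minimal sketch of this top-3 selection is given below; the category names and probability values are illustrative only.

```python
import numpy as np

categories = ["animation", "movie", "diet", "entertainment", "sports", "games"]
probs = np.array([0.01, 0.98, 0.95, 0.02, 0.01, 0.01])      # per-category probabilities from the model
top3 = [categories[i] for i in np.argsort(probs)[::-1][:3]]  # 3 categories with the highest probability
print(top3)   # ['movie', 'diet', 'entertainment']
```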
The method provided by this embodiment realizes automatic classification of short video labels accurately and efficiently, and simulation shows that the results meet expectations.
Fig. 5 is a block diagram of a video classification apparatus provided in this embodiment, and referring to fig. 5, the video classification apparatus includes an extraction module 501 and a classification module 502, wherein,
an extracting module 501, configured to extract feature information according to constituent elements of a video, where the constituent elements include an image, an audio, and/or a subtitle of the video;
a classification module 502, configured to input the feature information into a video classification model, so as to obtain classification information output by the video classification model; the classification information is used for representing the category to which the video belongs;
the video classification model is a model for classifying videos, which is obtained by taking characteristic information extracted according to sample videos as input and through machine learning training; the video classification model comprises a clustering operation layer, and the clustering operation layer is used for performing operation according to a clustering center matrix as a training parameter and the characteristic information.
The embodiment of the invention provides a video classification apparatus that classifies videos through a video classification model. The video classification model includes a clustering operation layer, which operates on the feature information together with a cluster center matrix used as a training parameter. Through the operation performed by the clustering operation layer, cluster analysis can be performed on the feature information based on the cluster centers represented by the column vectors of the cluster center matrix, features that help determine the category to which the video belongs are extracted, and the accuracy of video classification is improved. At the same time, videos are classified automatically by the video classification model, which improves classification efficiency.
The video classification apparatus provided in this embodiment is suitable for the video classification method provided in the foregoing embodiments, and details are not repeated herein.
Optionally, the inputting the feature information into a video classification model to obtain classification information output by the video classification model includes:
inputting the characteristic information into the clustering operation layer, and outputting a first matrix by the clustering operation layer;
inputting the first matrix into a weight aggregation layer of the video classification model, and outputting a second matrix by the weight aggregation layer;
inputting the second matrix into a fully connected layer of the video classification model, outputting a prediction vector by the fully connected layer, and taking the prediction vector as the classification information;
wherein the weight aggregation layer comprises at least one convolution sublayer and at least one activation function; the prediction vector includes a probability that a category to which the video belongs is each preset category.
Optionally, the inputting the feature information into the clustering operation layer, and outputting a first matrix by the clustering operation layer, includes:
taking a feature matrix extracted according to any one component element in the feature information as a target feature matrix, and determining a target clustering sublayer corresponding to the target feature matrix from each clustering sublayer of the clustering operation layer;
inputting the target characteristic matrix into the target clustering sublayer, and outputting a clustering operation matrix by the target clustering sublayer;
acquiring a clustering operation matrix output by a clustering sublayer corresponding to each feature matrix in the feature information, and splicing the acquired clustering operation matrices to obtain the first matrix;
and the target clustering sublayer is used for calculating according to the target feature matrix and the clustering center matrix belonging to the target clustering sublayer.
Optionally, the inputting the target feature matrix into the target clustering sublayer and outputting a clustering operation matrix by the target clustering sublayer includes:
taking the cluster center matrix belonging to the target cluster sublayer as a target cluster center matrix, inputting the target feature matrix into a cluster analysis unit in the target cluster sublayer, and outputting a cluster analysis result by the cluster analysis unit;
inputting the cluster analysis result and the target feature matrix into an intermediate operation unit in the target cluster sublayer, and outputting an intermediate operation result by the intermediate operation unit;
inputting the cluster analysis result and the intermediate operation result into a first operation unit in the target cluster sublayer, and determining a first coding matrix by the first operation unit according to the cluster analysis result, the intermediate operation result and a covariance matrix serving as a training parameter;
inputting the cluster analysis result, the intermediate operation result and the target feature matrix into a second operation unit in the target cluster sublayer, and determining a second coding matrix by the second operation unit according to the cluster analysis result, the intermediate operation result, the target feature matrix and the covariance matrix;
and splicing the first coding matrix and the second coding matrix to obtain a clustering operation matrix output by the target clustering sublayer.
Optionally, the inputting the target feature matrix into a cluster analysis unit in the target cluster sublayer, and outputting a cluster analysis result by the cluster analysis unit includes:
inputting the target feature matrix into the cluster analysis unit, determining a first transformation matrix by the cluster analysis unit according to a cross multiplication result of the target feature matrix and the target cluster center matrix, determining a first weight matrix according to the first transformation matrix, and transposing the first weight matrix to obtain a second weight matrix;
and determining a characteristic clustering vector according to the sum of elements in each column vector of the first weight matrix, and taking the second weight matrix and the characteristic clustering vector as the clustering analysis result.
Optionally, the inputting the cluster analysis result and the target feature matrix into an intermediate operation unit in the target cluster sublayer, and the outputting an intermediate operation result by the intermediate operation unit includes;
and inputting the target feature matrix and the second weight matrix in the cluster analysis result into the intermediate operation unit, determining a second transformation matrix by the intermediate operation unit according to a result of cross multiplication of the second weight matrix and the target feature matrix, transposing the second transformation matrix to obtain a third transformation matrix, and taking the third transformation matrix as the intermediate operation result.
Optionally, the inputting the cluster analysis result and the intermediate operation result into a first operation unit in the target cluster sublayer, and determining, by the first operation unit, a first coding matrix according to the cluster analysis result, the intermediate operation result, and a covariance matrix as a training parameter includes:
inputting the feature cluster vectors in the cluster analysis result and the third transformation matrix into the first operation unit, and determining a fourth transformation matrix by the first operation unit according to the result of multiplying each row vector of the target cluster center matrix by the corresponding position element of the feature cluster vector;
and determining a first residual matrix according to a result of subtracting the third transformation matrix from the fourth transformation matrix, and determining the first coding matrix according to a result of dividing the first residual matrix by the covariance matrix.
Optionally, the inputting the cluster analysis result, the intermediate operation result, and the target feature matrix into a second operation unit in the target cluster sublayer, and determining, by the second operation unit, a second coding matrix according to the cluster analysis result, the intermediate operation result, the target feature matrix, and the covariance matrix includes:
inputting the second weight matrix, the feature cluster vector, the third transformation matrix and the target feature matrix into the second arithmetic unit, performing square operation on each element of the target feature matrix by the second arithmetic unit to obtain a second-order feature matrix, determining a fifth transformation matrix according to a result of cross multiplication of the second weight matrix and the second-order feature matrix, and transposing the fifth transformation matrix to obtain a sixth transformation matrix;
performing square operation on each element of the target clustering center matrix to obtain a second-order clustering center matrix, and determining a seventh transformation matrix according to the result of multiplying each row vector of the second-order clustering center matrix by the corresponding position element of the feature clustering vector;
multiplying each element of the target clustering center matrix by a preset ratio to obtain a transformation clustering center matrix, and determining an eighth transformation matrix according to a dot multiplication result of the third transformation matrix and the transformation center matrix;
performing a square operation on each element of the covariance matrix to obtain a second-order covariance matrix, adding the sixth transformation matrix, the seventh transformation matrix and the eighth transformation matrix to obtain a second residual matrix, and determining the second coding matrix according to a result of dividing the second residual matrix by the second-order covariance matrix.
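The second operation unit follows the same pattern. The preset ratio is taken as -2 here purely as an illustrative assumption, chosen so that the three summed terms expand a weighted squared residual; the embodiment does not fix its value.

```python
ratio = -2.0        # preset ratio (assumed value)

X2 = X ** 2         # second-order feature matrix (element-wise square)
T5 = A_t @ X2       # fifth transformation matrix (K, D)
T6 = T5.T           # sixth transformation matrix (D, K)

C2 = C ** 2         # second-order clustering center matrix
T7 = C2 * a_sum     # seventh transformation matrix

C_r = ratio * C     # transformation clustering center matrix
T8 = T3 * C_r       # eighth transformation matrix: element-wise (dot) multiplication

R2 = T6 + T7 + T8       # second residual matrix
E2 = R2 / (sigma ** 2)  # second coding matrix: division by the second-order covariance matrix

# The cluster operation matrix output by the sublayer splices the two coding matrices.
cluster_operation_matrix = np.concatenate([E1, E2], axis=1)  # (D, 2K)
```

With that assumed ratio, element (j, k) of the second residual matrix equals the assignment-weighted sum of (X[n, j] - C[j, k])^2 over the frames, so the second coding matrix behaves like a second-order statistic normalised by the second-order covariance matrix.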
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in Fig. 6: a processor (processor) 601, a communication interface (Communications Interface) 602, a memory (memory) 603 and a communication bus 604, wherein the processor 601, the communication interface 602 and the memory 603 communicate with one another via the communication bus 604. The processor 601 may call logic instructions in the memory 603 to perform the following method: extracting feature information according to constituent elements of a video, wherein the constituent elements include images, audio and/or subtitles of the video; inputting the characteristic information into a video classification model to obtain classification information output by the video classification model; the classification information is used for representing the category to which the video belongs; the video classification model is a model for classifying videos, which is obtained by taking characteristic information extracted according to sample videos as input and through machine learning training; the video classification model comprises a clustering operation layer, and the clustering operation layer is used for performing operation according to a clustering center matrix serving as a training parameter and the characteristic information.
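As a self-contained, purely illustrative sketch of how such a processor might run the classification flow end to end, the code below applies one cluster sublayer per constituent element, splices the outputs, applies a weight aggregation step and a fully connected layer. The sizes, the random parameters, the L2 normalisation, the sigmoid gate standing in for the convolution sublayer plus activation, and the final softmax are all assumptions beyond what the embodiment specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
K, num_classes = 64, 20  # hypothetical cluster count and number of preset categories

def cluster_sublayer(X, C, sigma):
    """One cluster sublayer following the steps sketched earlier (softmax weighting assumed)."""
    L = X @ C
    A = np.exp(L - L.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                                # first weight matrix
    a_sum = A.sum(axis=0)                                            # feature clustering vector
    T3 = (A.T @ X).T                                                 # intermediate operation result
    E1 = (C * a_sum - T3) / sigma                                    # first coding matrix
    E2 = ((A.T @ X**2).T + C**2 * a_sum - 2.0 * C * T3) / sigma**2   # second coding matrix
    return np.concatenate([E1, E2], axis=1)                          # cluster operation matrix

# Hypothetical feature matrices extracted from the images, audio and subtitles of one video.
features = {
    "image":    rng.standard_normal((300, 1024)),
    "audio":    rng.standard_normal((300, 128)),
    "subtitle": rng.standard_normal((50, 300)),
}

# One cluster sublayer (with its own center and covariance parameters) per constituent element.
outputs = []
for X in features.values():
    D = X.shape[1]
    C = rng.standard_normal((D, K))
    sigma = rng.uniform(0.5, 1.5, size=(D, K))
    outputs.append(cluster_sublayer(X, C, sigma).ravel())

first_matrix = np.concatenate(outputs)                # spliced cluster operation matrices
first_matrix /= np.linalg.norm(first_matrix) + 1e-12  # L2 normalisation (assumed, for stability)

# Weight aggregation: a sigmoid gate x * sigmoid(x) stands in for "convolution sublayer + activation".
second_matrix = first_matrix / (1.0 + np.exp(-first_matrix))

# Fully connected layer followed by a softmax over the preset categories.
W = 0.01 * rng.standard_normal((num_classes, second_matrix.size))
logits = W @ second_matrix
prediction_vector = np.exp(logits - logits.max())
prediction_vector /= prediction_vector.sum()          # probability for each preset category
```

In an actual implementation the cluster center matrices, covariance matrices, convolution weights and fully connected weights would be learned from the sample videos during training rather than drawn at random.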
In addition, the logic instructions in the memory 603 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program codes.
Further, an embodiment of the present invention discloses a computer program product, the computer program product comprising a computer program stored on a non-transitory readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example, including: extracting feature information according to constituent elements of a video, wherein the constituent elements include images, audio and/or subtitles of the video; inputting the characteristic information into a video classification model to obtain classification information output by the video classification model; the classification information is used for representing the category to which the video belongs; the video classification model is a model for classifying videos, which is obtained by taking characteristic information extracted according to sample videos as input and through machine learning training; the video classification model comprises a clustering operation layer, and the clustering operation layer is used for performing operation according to a clustering center matrix serving as a training parameter and the characteristic information.
In another aspect, an embodiment of the present invention further provides a non-transitory readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, performs the video classification method provided in the foregoing embodiments, for example, the method includes: extracting feature information according to constituent elements of a video, wherein the constituent elements include images, audio and/or subtitles of the video; inputting the characteristic information into a video classification model to obtain classification information output by the video classification model; the classification information is used for representing the category to which the video belongs; the video classification model is a model for classifying videos, which is obtained by taking characteristic information extracted according to sample videos as input and through machine learning training; the video classification model comprises a clustering operation layer, and the clustering operation layer is used for performing operation according to a clustering center matrix as a training parameter and the characteristic information.
The above-described embodiments of the apparatus are merely illustrative; the units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, that is, they may be located in one position or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of video classification, comprising:
extracting feature information according to constituent elements of a video, wherein the constituent elements include images, audio and/or subtitles of the video;
inputting the characteristic information into a video classification model to obtain classification information output by the video classification model; the classification information is used for representing the category to which the video belongs;
the video classification model is a model for classifying videos, which is obtained by taking characteristic information extracted according to sample videos as input and through machine learning training; the video classification model comprises a clustering operation layer, and the clustering operation layer is used for performing operation according to a clustering center matrix as a training parameter and the characteristic information;
the inputting the feature information into a video classification model to obtain classification information output by the video classification model includes:
taking a feature matrix extracted according to any one component element in the feature information as a target feature matrix, and determining a target clustering sublayer corresponding to the target feature matrix from each clustering sublayer of the clustering operation layer;
inputting the target characteristic matrix into the target clustering sublayer, and outputting a clustering operation matrix by the target clustering sublayer;
acquiring a clustering operation matrix output by a clustering sublayer corresponding to each feature matrix in the feature information, and splicing the acquired clustering operation matrices to obtain a first matrix;
the target clustering sublayer is used for calculating according to the target feature matrix and a clustering center matrix belonging to the target clustering sublayer;
inputting the first matrix into a weight aggregation layer of the video classification model, and outputting a second matrix by the weight aggregation layer;
inputting the second matrix into a fully connected layer of the video classification model, outputting a prediction vector by the fully connected layer, and taking the prediction vector as the classification information;
wherein the weight aggregation layer comprises at least one convolution sublayer and at least one activation function; the prediction vector includes, for each preset category, a probability that the category to which the video belongs is that preset category.
2. The video classification method according to claim 1, wherein the inputting the target feature matrix into the target clustering sublayer and the outputting a clustering operation matrix by the target clustering sublayer comprises:
taking the cluster center matrix belonging to the target cluster sublayer as a target cluster center matrix, inputting the target feature matrix into a cluster analysis unit in the target cluster sublayer, and outputting a cluster analysis result by the cluster analysis unit;
inputting the cluster analysis result and the target feature matrix into an intermediate operation unit in the target cluster sublayer, and outputting an intermediate operation result by the intermediate operation unit;
inputting the cluster analysis result and the intermediate operation result into a first operation unit in the target cluster sublayer, and determining a first coding matrix by the first operation unit according to the cluster analysis result, the intermediate operation result and a covariance matrix serving as a training parameter;
inputting the cluster analysis result, the intermediate operation result and the target feature matrix into a second operation unit in the target cluster sublayer, and determining a second coding matrix by the second operation unit according to the cluster analysis result, the intermediate operation result, the target feature matrix and the covariance matrix;
and splicing the first coding matrix and the second coding matrix to obtain a clustering operation matrix output by the target clustering sublayer.
3. The video classification method according to claim 2, wherein the inputting the target feature matrix into a cluster analysis unit in the target cluster sub-layer, and outputting a cluster analysis result by the cluster analysis unit, comprises:
inputting the target feature matrix into the cluster analysis unit, determining a first transformation matrix by the cluster analysis unit according to a cross multiplication result of the target feature matrix and the target cluster center matrix, determining a first weight matrix according to the first transformation matrix, and transposing the first weight matrix to obtain a second weight matrix;
and determining a feature clustering vector according to the sum of the elements in each column vector of the first weight matrix, and taking the second weight matrix and the feature clustering vector as the cluster analysis result.
4. The video classification method according to claim 3, wherein the inputting the cluster analysis result and the target feature matrix into an intermediate operation unit in the target cluster sublayer, and the outputting an intermediate operation result by the intermediate operation unit, comprises:
and inputting the target feature matrix and the second weight matrix in the cluster analysis result into the intermediate operation unit, determining a second transformation matrix by the intermediate operation unit according to a result of cross multiplication of the second weight matrix and the target feature matrix, transposing the second transformation matrix to obtain a third transformation matrix, and taking the third transformation matrix as the intermediate operation result.
5. The video classification method according to claim 4, wherein the inputting the cluster analysis result and the intermediate operation result into a first operation unit in the target cluster sub-layer, and the determining, by the first operation unit, a first coding matrix according to the cluster analysis result, the intermediate operation result and a covariance matrix as a training parameter comprises:
inputting the feature clustering vector in the cluster analysis result and the third transformation matrix into the first operation unit, and determining a fourth transformation matrix by the first operation unit according to the result of multiplying each row vector of the target clustering center matrix by the corresponding position element of the feature clustering vector;
and determining a first residual matrix according to a result of subtracting the third transformation matrix from the fourth transformation matrix, and determining the first coding matrix according to a result of dividing the first residual matrix by the covariance matrix.
6. The video classification method according to claim 4, wherein the inputting the cluster analysis result, the intermediate operation result, and the target feature matrix into a second operation unit in the target cluster sub-layer, and the determining, by the second operation unit, a second coding matrix according to the cluster analysis result, the intermediate operation result, the target feature matrix, and the covariance matrix comprises:
inputting the second weight matrix, the feature clustering vector, the third transformation matrix and the target feature matrix into the second operation unit, performing a square operation on each element of the target feature matrix by the second operation unit to obtain a second-order feature matrix, determining a fifth transformation matrix according to a result of cross multiplication of the second weight matrix and the second-order feature matrix, and transposing the fifth transformation matrix to obtain a sixth transformation matrix;
performing square operation on each element of the target clustering center matrix to obtain a second-order clustering center matrix, and determining a seventh transformation matrix according to the result of multiplying each row vector of the second-order clustering center matrix by the corresponding position element of the feature clustering vector;
multiplying each element of the target clustering center matrix by a preset ratio to obtain a transformation clustering center matrix, and determining an eighth transformation matrix according to a dot multiplication result of the third transformation matrix and the transformation clustering center matrix;
performing a square operation on each element of the covariance matrix to obtain a second-order covariance matrix, adding the sixth transformation matrix, the seventh transformation matrix and the eighth transformation matrix to obtain a second residual matrix, and determining the second coding matrix according to a result of dividing the second residual matrix by the second-order covariance matrix.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the video classification method according to any of claims 1 to 6 are implemented by the processor when executing the program.
8. A non-transitory readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the video classification method according to any one of claims 1 to 6.
CN202010176420.0A 2020-03-13 2020-03-13 Video classification method, electronic equipment and storage medium Active CN111400551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010176420.0A CN111400551B (en) 2020-03-13 2020-03-13 Video classification method, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010176420.0A CN111400551B (en) 2020-03-13 2020-03-13 Video classification method, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111400551A CN111400551A (en) 2020-07-10
CN111400551B (en) 2022-11-15

Family

ID=71428743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010176420.0A Active CN111400551B (en) 2020-03-13 2020-03-13 Video classification method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111400551B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101091B (en) * 2020-07-30 2024-05-07 MIGU Culture Technology Co Ltd Video classification method, electronic device and storage medium
CN112084370A (en) * 2020-09-10 2020-12-15 Vivo Mobile Communication Co Ltd Video processing method and device and electronic equipment
CN113011383A (en) * 2021-04-12 2021-06-22 Beijing Mininglamp Software System Co Ltd Video tag definition model construction method and system, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190482A (en) * 2018-08-06 2019-01-11 Beijing QIYI Century Science and Technology Co Ltd Multi-tag video classification methods and system, systematic training method and device
CN109710800A (en) * 2018-11-08 2019-05-03 Beijing QIYI Century Science and Technology Co Ltd Model generating method, video classification methods, device, terminal and storage medium
CN110162669A (en) * 2019-04-04 2019-08-23 Tencent Technology Shenzhen Co Ltd Visual classification processing method, device, computer equipment and storage medium
CN110569814A (en) * 2019-09-12 2019-12-13 Guangzhou Kugou Computer Technology Co Ltd Video category identification method and device, computer equipment and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190007816A (en) * 2017-07-13 2019-01-23 Samsung Electronics Co Ltd Electronic device for classifying video and operating method thereof

Also Published As

Publication number Publication date
CN111400551A (en) 2020-07-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant