WO2020221278A1

WO2020221278A1 - Video classification method and model training method and apparatus thereof, and electronic device

Info

Publication number: WO2020221278A1
Application number: PCT/CN2020/087690
Authority: WO
Inventors: 苏驰; 李凯; 陈宜航; 刘弘也
Original assignee: 北京金山云网络技术有限公司; 北京金山云科技有限公司
Priority date: 2019-04-29
Filing date: 2020-04-29
Publication date: 2020-11-05
Also published as: CN110070067A; CN110070067B

Abstract

The present application provides a video classification method and a model training method and apparatus thereof, and an electronic device. The training method comprises: extracting initial features of a plurality of video frames by means of a convolutional neural network; extracting final features of the plurality of video frames from the initial features by means of a recurrent neural network; inputting the final features into an output network, and outputting a prediction result of the plurality of video frames; determining a loss value of the prediction result by means of a preset predicted loss function; and training an initial model according to the loss value until a parameter in the initial model converges, and obtaining a video classification model. According to the present application, the convolutional neural network and the recurrent neural network are combined, so that an operation amount can be greatly reduced, thereby improving the model training and recognition efficiency; and meanwhile, association information between the video frames can be considered in a feature extraction process, so that the extracted features can accurately represent the video types, and the accuracy of video classification is improved.

Description

Video classification method and model training method, device and electronic equipment

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on April 29, 2019, the application number is 201910359704.0, and the invention title is "Video classification method and its model training method, device and electronic equipment", and its entire content Incorporated in this application by reference.

Technical field

This application relates to the field of image processing technology, and in particular to a video classification method and model training method, device and electronic equipment.

Background technique

In related technologies, the video can be classified by 3D convolutional neural network, and the spatiotemporal features of the video can be extracted by 3D convolution. However, the network parameters of the 3D convolutional neural network are relatively large, resulting in high computational cost in the network training process and recognition process , The time cost is large; in addition, the number of layers of the three-dimensional convolutional neural network is relatively shallow, and it is difficult to mine high-level semantic features, which makes the video classification accuracy rate low.

Summary of the invention

In view of this, the purpose of this application is to provide a video classification method and model training method, device, and electronic equipment to reduce the amount of calculation, improve model training and recognition efficiency, and at the same time improve the accuracy of video classification.

In the first aspect, an embodiment of the present application provides a method for training a video classification model. The method includes: determining current training data based on a preset training set; the training data includes multiple video frames; and inputting the training data to the initial model ; The initial model includes a convolutional neural network, a recurrent neural network and an output network; the initial features of a multi-frame video frame are extracted through the convolutional neural network; the final feature of a multi-frame video frame is extracted from the initial feature through the recurrent neural network; the final feature Input to the output network and output the prediction results of multiple frames of video frames; determine the loss value of the prediction result through the preset prediction loss function; train the initial model according to the loss value until the parameters in the initial model converge to obtain a video classification model.

In the second aspect, an embodiment of the present application provides a video classification method, which includes: obtaining a video to be classified; obtaining multiple video frames from the video according to a preset sampling interval; and inputting the multiple video frames to the pre-training The completed video classification model outputs the classification results of multi-frame video frames; the video classification model is trained through the training method of the above-mentioned video classification model; the video category is determined according to the classification results of the multi-frame video frames.

In a third aspect, an embodiment of the present application provides a training device for a video classification model. The device includes: a training data determination module configured to determine current training data based on a preset training set; the training data includes multiple video frames; The training data input module is set to input training data to the initial model; the initial model includes a convolutional neural network, a cyclic neural network and an output network; the initial feature extraction module is set to extract the initial features of a multi-frame video frame through the convolutional neural network ; The final feature extraction module is set to extract the final features of the multi-frame video frame from the initial features through the recurrent neural network; the prediction result output module is set to input the final feature to the output network and output the prediction result of the multi-frame video frame; loss The value determination and training module is set to determine the loss value of the prediction result through a preset prediction loss function; the initial model is trained according to the loss value until the parameters in the initial model converge to obtain a video classification model.

In a fourth aspect, an embodiment of the present application provides a video classification device. The device includes: a video acquisition module configured to acquire a video to be classified; a video frame acquisition module configured to acquire a video from the video at a preset sampling interval. Frame video frame; classification module, set to input multi-frame video frames to the pre-trained video classification model, and output the classification result of the multi-frame video frame; the video classification model is trained through the training method of the above-mentioned video classification model; category determination module , Set to determine the category of the video according to the classification result of the multi-frame video frame.

In a fifth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory. The memory stores machine-executable instructions that can be executed by the processor. The processor executes the machine-executable instructions to implement the training of the aforementioned video classification model. Method, or the steps of the above video classification method.

In a sixth aspect, an embodiment of the present application provides a machine-readable storage medium that stores machine-executable instructions. When the machine-executable instructions are called and executed by a processor, the machine-executable instructions prompt The processor implements the training method of the video classification model or the steps of the video classification method.

In a seventh aspect, an embodiment of the present application provides an executable program code, the executable program code is set to be executed to execute any of the above-mentioned video classification model training methods, or the steps of any of the above-mentioned video classification methods .

The embodiments of the application bring the following beneficial effects:

The video classification method and its model training method, device and electronic equipment provided by the embodiments of the application adopt a combination of convolutional neural network and recurrent neural network, and extract features through a combination of two-dimensional convolution and one-dimensional convolution, Compared with three-dimensional convolution, the amount of calculation can be greatly reduced, thereby improving the efficiency of model training and recognition; this method can also consider the correlation information between video frames in the process of extracting features, so the extracted features can be accurate Characterize the video type, thereby improving the accuracy of video classification.

In order to make the above-mentioned objectives, features and advantages of the present application more obvious and understandable, the following is a detailed description of the preferred embodiments together with the accompanying drawings.

Description of the drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application and related technologies, the following briefly introduces the drawings that need to be used in the embodiments and related technologies. Obviously, the drawings in the following description are only of the present application. For some embodiments, those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.

FIG. 1 is a flowchart of a method for training a video classification model provided by an embodiment of the application;

2 is a schematic structural diagram of a convolutional neural network in an initial model provided by an embodiment of this application;

FIG. 3 is a schematic structural diagram of an initial model provided by an embodiment of this application;

4 is a schematic structural diagram of another initial model provided by an embodiment of the application;

FIG. 5 is a flowchart of another video classification model training method provided by an embodiment of the application;

FIG. 6 is a flowchart of a video classification method provided by an embodiment of the application;

FIG. 7 is a schematic structural diagram of a training device for a video classification model provided by an embodiment of the application;

FIG. 8 is a schematic structural diagram of a video classification device provided by an embodiment of this application;

FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application.

Detailed ways

In order to make the purpose, technical solutions, and advantages of the present application clearer, the following further describes the present application in detail with reference to the drawings and embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

Considering the problems of high computational cost, high time overhead, and low video classification accuracy for video classification by a three-dimensional convolutional neural network, embodiments of the present application provide a video classification method and model training method, device and electronic equipment ; This technology can be widely used in the classification of conventional video and short video files in various formats, and can be used in scenes such as video surveillance, video push, and video management.

In order to facilitate the understanding of this embodiment, a method for training a video classification model disclosed in the embodiment of the application is first introduced in detail. As shown in FIG. 1, the method includes the following steps:

Step S102: Determine current training data based on a preset training set; the training data includes multiple video frames.

In the subsequent content, in some cases, the training data needs to be determined multiple times during the training of the initial model; in one embodiment, the current training data can be determined from the preset training set each time; or, other implementations In this way, new training data can also be obtained every time.

Taking the current training data determined from the preset training set as an example, the training set can contain multiple videos or multiple groups of video frames, and each group contains multiple video frames. The multi-frame video frames are collected from the same video. Each video or each group of video frames is pre-labeled with a type tag, which can be divided from multiple angles, such as video theme, scene, action, character attributes, etc., so each video or each group of video frames can be performed from multiple angles classification. For example, the genre tags of video A include TV series, metropolis, crime solving, idol, etc.

When determining the training data, if the training set contains multiple videos, you can select a video from it, and then collect multiple video frames from the video, and determine the collected multiple video frames as the training data; if the training set is It contains multiple sets of video frames, from which a set of video frames can be selected, and the multiple video frames in the set of video frames are determined as training data.

In addition, the above-mentioned training set can also be divided into a training subset and a cross-validation subset according to a preset ratio. During the training process, the current training data can be determined from the training subset. After the training is completed or reaches a certain stage of training, test data can be obtained from the cross-validation subset to verify the performance of the model.

Step S104, input the training data to the initial model; the initial model includes a convolutional neural network, a cyclic neural network and an output network.

Before inputting to the initial model, multiple video frames in the training data can be adjusted to a preset size, such as 512 pixels * 512 pixels, so that the input video frame matches the convolutional neural network.

In step S106, the feature of the multi-frame video frame is extracted through the convolutional neural network as the initial feature.

For example, multiple video frames may be input to the convolutional neural network. In order to distinguish from subsequent content, the features output by the convolutional neural network are referred to as initial features.

The convolutional neural network can be implemented by a multi-layer convolutional layer, of course, it can also include a pooling layer, a fully connected layer, an activation function, and so on. The convolutional neural network performs convolution operations on each input video frame to obtain the feature map corresponding to each video frame. That is, the initial feature includes multiple feature maps, or the initial feature is composed of multiple feature maps. A large feature map composed.

In step S108, the feature of the multi-frame video frame is extracted from the initial feature through the cyclic neural network as the final feature.

For example, the above-mentioned initial features can be input to the recurrent neural network. In order to distinguish from subsequent content, the features output by the recurrent neural network are referred to as final features here.

Since multiple video frames are collected from the same video, the multiple video frames are related to each other in content. The above-mentioned convolutional neural network usually processes each video frame separately, and the extracted feature maps of each video frame are not related to each other. In order to enable the trained model to understand more comprehensively and accurately the content of the video corresponding to the multi-frame video frame, the initial feature can be processed through the cyclic neural network. According to the timing between the multi-frame video frame, the feature processing process is introduced The associated information of the upper and lower video frames makes the final feature more representative of the video type.

Recurrent neural network is a type of recurrent neural network that takes sequence data as input and recursively in the evolution direction of the sequence. Therefore, the use of recurrent neural network to process the initial features can introduce the associated information of the upper and lower video frames.

Step S110, input the final feature to the output network, and output the prediction result of the multi-frame video frame.

The output network can be realized by a fully connected layer. If the final feature is a two-dimensional multilayer feature, the fully connected layer can convert the final feature of the two-dimensional multilayer into a prediction result in the form of a one-dimensional vector. Each element in the prediction result corresponds to a category, and the value of this element represents the possibility that the video belongs to the category. Alternatively, the final feature may also be a feature of other dimensions, which is not specifically limited.

Step S112: Determine the loss value of the prediction result through the preset prediction loss function; train the initial model according to the loss value until the parameters in the initial model converge to obtain the video classification model.

As mentioned above, the multi-frame video frames in the training data are pre-labeled with type labels. In order to facilitate calculations, the type labels can be converted into vector form. In this vector, the probability value corresponding to the category of the video is usually 1. The probability value corresponding to the category not belonging is usually 0. The prediction loss function can compare the difference between the prediction result and the labeled type label. Generally, the greater the difference, the greater the aforementioned loss value. Based on the loss value, the parameters of each part of the above-mentioned initial model can be adjusted to achieve the purpose of training. When each parameter in the model converges, the training ends and the video classification model is obtained.

The training method of the video classification model provided by the embodiment of this application uses a combination of convolutional neural network and recurrent neural network, and extracts features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, The amount of calculation can be greatly reduced, thereby improving the efficiency of model training and recognition; this method can also consider the associated information between video frames in the process of extracting features, so the extracted features can accurately represent the video type, thereby improving the video The accuracy of classification.

In addition, the above model can use multi-frame video frames sampled from the video to process and identify the video category. Compared with the way that the three-dimensional convolutional neural network needs to input video segments, the amount of processed data is small, thus further reducing the amount of calculation , Improve the efficiency of training and recognition.

The embodiment of this application also provides another method for training a video classification model, which is implemented on the basis of the method described in the above embodiment; from the above embodiment, it can be seen that the initial model includes a convolutional neural network, a recurrent neural network, and an output network In this embodiment, the specific structure of the initial model is further described.

Figure 2 shows a schematic diagram of the structure of a convolutional neural network in an initial model. The convolutional neural network includes multiple groups of sub-networks connected in sequence (three groups of sub-networks are taken as an example in Figure 2), a global average pooling layer and Classification fully connected layer; each group of sub-networks includes a batch normalization layer, an activation function layer, a convolution layer, and a pooling layer that are sequentially connected. Among them, the batch normalization layer in each group of sub-networks is used to normalize the data in the input video frame or feature map. This process can speed up the convergence speed of the convolutional neural network and the initial model, and can alleviate The problem of gradient dispersion in the multi-layer convolutional network enables the activation function layer in the convolutional neural network to perform function transformation on the normalized video frame or feature map. This transformation process breaks the linear combination of the convolutional layer input , Can improve the feature expression ability of convolutional neural network. The activation function layer may specifically be Sigmoid function, tanh function, Relu function, etc. The convolution layer is used to perform convolution calculations on the video frame or feature map transformed by the activation function layer, and output the corresponding feature map; the pooling layer can be an average pooling layer (average pooling or mean-pooling), global average pooling Layer (Global Average Pooling), max-pooling layer (max-pooling), etc.; the pooling layer can be used to compress the feature map output by the convolutional layer, retain the main features in the feature map, and delete non-main features to reduce The dimension of the feature map, taking the average pooling layer as an example, the average pooling layer can average the feature point values in the neighborhood of the preset range size of the current feature point, and use the average value as the new feature of the current feature point Point value. In addition, the pooling layer can also help the feature map to maintain some non-deformation, such as rotation invariance, translation invariance, and expansion invariance.

The global average pooling layer connected to the sub-network is used for the feature maps output by the last set of sub-networks, and the feature sub-maps of each layer are averaged to obtain a one-dimensional feature vector to further reduce the dimensionality of the feature map. The classification fully connected layer performs fully connected calculations on the feature vectors output by the global average pooling layer, and normalizes the calculation results through functions such as softmax.

In one embodiment, before executing the training method of the video classification model, the convolutional neural network may be pre-trained through a large number of data sets in advance, so as to obtain the initial parameters of the convolutional neural network. In this way, when the initial model is subsequently trained, the training is started from the initial parameters, which can improve the generalization ability of the model. Specifically, the data set may include an object recognition data set and a scene recognition data set. First, randomly initialize the weights of the convolutional neural network, randomly extract a preset number of training images from the above data set, and input them into the convolutional neural network for training one by one. If the parameters of the trained convolutional neural network cannot all converge , Then continue to randomly extract a preset number of training images from the data set for training until the parameters in the convolutional neural network converge, and the training is completed. As an example, before training the convolutional neural network, the batch size can be set to 256 (that is, the above-mentioned preset number), the momentum is set to 0.9, and the weight attenuation coefficient is set to 0.0001. In the training process, the momentum and weight attenuation coefficients are used to update various parameters in the convolutional neural network through the back propagation algorithm and the stochastic gradient descent method. After the training is completed, each parameter of the convolutional neural network converges, and these parameters can be used as the initial parameters of the convolutional neural network when the training method of the video classification model is executed.

Figure 3 shows a schematic structural diagram of an initial model; the initial model includes a convolutional neural network, a recurrent neural network and an output network, and also includes a global average pooling network; the global average pooling network is set in the convolutional neural network Between the cyclic neural network and the cyclic neural network; through the global average pooling network, the initial feature can be reduced in dimensionality, so that the dimension of the initial feature matches the cyclic neural network. That is, the dimensionality reduction process can be performed on the initial feature through the global average pooling network to obtain the dimensionality reduction feature; the feature of the multi-frame video frame can be extracted from the dimensionality reduction feature through the recurrent neural network , As the final feature.

In one embodiment, the recurrent neural network may specifically be a Long Short Term Memory Network (Long Short Term Memory Network, which may be referred to as an LSTM network for short). The performance of the long and short term memory network is better than that of ordinary recurrent neural networks and can make up for ordinary loops. Defects such as gradient explosion and gradient disappearance of neural network. The LSTM network includes input gates, output gates, and forgetting gates; the input gate is set to lift the features that need to be memorized from the initial features; the output gate is set to read the memory features, and the forgetting gate is set to determine whether to retain the memory features . When the initial features corresponding to multiple video frames are sequentially input into the LSTM network, the opening and closing timing of the input gate, output gate, and forget gate can be trained to complete the training of the cyclic neural network.

Specifically, taking M video frames as an example, the initial feature contains M feature vectors, expressed as z _t ,t∈[1,...,M], and then these M feature vectors can be sent to the LSTM network. The final feature of the multi-frame video frame is obtained, denoted as h _M ; the calculation process of each feature vector by the LSTM network is as follows:

f _t ＝σ(W _f [h _t-1 ,z _t ]+b _f )

i _t =σ(W _i [h _t-1 ,z _t ]+b _i )

o _t =σ(W _o [h _t-1 ,z _t ]+b _o )

h _t ＝o _t *tanh(C _t )

Among them, W _f , W _i , W _C , W _o , b _f , b _i , b _C and b _o are the preset parameters of the LSTM; after the M-th feature vector is input to the LSTM, h _{M is} obtained; the h _M That is, the final feature can be input to the subsequent output network.

In one embodiment, the above-mentioned output network may include a classification fully connected layer; the above-mentioned final features are input to the classification fully connected layer, and the classification result vector may be output. The classification fully connected layer contains multiple neurons, and the classification fully connected layer presets a weight vector; the weight vector contains the weight elements corresponding to each neuron in the classification fully connected layer; for each neuron, the neuron The element is connected with each feature element of the final feature. The neuron multiplies each feature element in the final feature with the corresponding weight element in the weight vector to obtain the predicted value of the neuron; because of the fully connected layer Contains multiple neurons, and the predicted values corresponding to multiple neurons constitute the above classification result vector.

In an embodiment, the above-mentioned initial model may further include a classification function; inputting the classification result vector output by the above-mentioned classification fully connected layer into the classification function can output the classification probability vector corresponding to the classification result vector. The classification function is used to calculate the probability of each element in the classification result vector. The function can be a Softmax function or other probability regression functions.

The above-mentioned initial model uses a combination of convolutional neural network and long and short-term memory network, and extracts features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, the amount of calculation can be greatly reduced, thereby increasing Model training and recognition efficiency, this method can also consider the associated information between video frames in the process of extracting features, so the extracted features can accurately represent the video type; and the long- and short-term memory network can also avoid the deep network level. The problem of gradient explosion and gradient disappearance improves the performance of the model, which is conducive to extracting the deep features of the video frame, thereby further improving the accuracy of video classification.

The embodiment of the present application also provides another method for training a video classification model, which is implemented on the basis of the method described in the foregoing embodiment; this embodiment focuses on the specific content of the output network and the prediction loss function.

First, the prediction loss function includes a classification loss function; the classification loss function can be expressed as:

Among them, ∑ represents the summation operation,

exp represents the exponential function with the natural constant e as the base, log represents the logarithmic operation; p _l is the lth element of the classification probability vector corresponding to the classification result vector in the prediction result; y _l is the pre-labeled multi-frame video frame standard The l-th element of the probability vector; r _l is the proportion of the category corresponding to y _l in the training set; τ is a preset hyperparameter, which can be set to 1.

r _l is the proportion of the category corresponding to y _l in the training set. If the proportion of a category in the training set is low, the r _l value corresponding to the category will be smaller, and the w _l value will be larger, so Play a balanced role, alleviate the problem of uneven distribution of samples in each category, and then improve the training efficiency of the model and the recognition accuracy of the model.

In the above embodiment, it is described that the output network includes a classification fully connected layer. In this embodiment, the output network also includes a threshold fully connected layer, as shown in Figure 4; the final feature is input to the threshold fully connected layer, and the threshold result vector can be output . Similar to the classification fully connected layer, the threshold fully connected layer contains multiple neurons, and the threshold fully connected layer is preset with a weight vector; the weight vector contains the weight elements corresponding to each neuron of the threshold fully connected layer; for each This neuron is connected with each feature element of the final feature. The neuron multiplies each feature element in the final feature with the corresponding weight element in the weight vector to obtain the prediction corresponding to the neuron Value; Since the fully connected layer contains multiple neurons, the predicted values corresponding to multiple neurons constitute the above threshold result vector.

The threshold fully connected layer is set to extract the threshold result of the model for each category learning from the final feature, that is, the threshold result vector. Each category corresponds to its own threshold. The thresholds of each category can be the same or different. Compared with the way of manually setting the threshold, the threshold of model learning is more accurate and reasonable, which is beneficial to improve the classification accuracy of the model.

Based on the threshold result vector output by the threshold fully connected layer, the prediction loss function also includes a threshold loss function to evaluate the accuracy of the threshold result vector; the threshold loss function can be expressed as

Among them, ∑ represents the summation operation, log represents the logarithmic operation; y _l is the lth element of the standard probability vector of the pre-labeled multi-frame video frame; δ _l = σ(p _l -θ _l ); θ _l is the prediction The lth element of the threshold result vector in the result;

e is the natural constant in mathematics.

When the prediction loss function includes the classification loss function and the threshold loss function, in the process of determining the loss value of the prediction result through the prediction loss function, the function value of the classification loss function and the function value of the threshold loss function can be weighted and summed to obtain The loss value of the prediction result, for example, the loss value of the prediction result L=αL1+βL2; where α+β=1, and the values of α and β can be preset.

In the above method, the classification loss function takes into account the proportion of each category in the training set, which alleviates the problem of uneven sample distribution in each category, thereby improving the training efficiency of the model and the recognition accuracy of the model; the output network also sets The threshold fully connected layer, compared with the way of manually setting the threshold, the threshold of the model learning is more accurate and reasonable, which further improves the classification accuracy of the model.

This embodiment of the application also provides another method for training a video classification model, which is implemented on the basis of the method described in the above embodiment; this embodiment focuses on the specific process of training the initial model according to the loss value; as shown in Figure 5 As shown, the method includes the following steps:

Step S502: Determine current training data based on a preset training set; the training data includes multiple video frames;

Step S504, input training data to an initial model; the initial model includes a convolutional neural network, a cyclic neural network, and an output network;

Step S506, extracting features of multiple video frames through a convolutional neural network as initial features;

Step S508, extracting the features of the multi-frame video frame from the initial features through the recurrent neural network as the final feature;

Step S510, input the final feature to the output network, and output the prediction result of the multi-frame video frame;

Step S512: Determine the loss value of the prediction result through a preset prediction loss function;

Step S514, update the parameters in the initial model according to the loss value;

For example, the function mapping relationship can be set in advance, the original parameters and the loss value are input into the function mapping relationship, and the updated parameters can be calculated. The function mapping relationship of different parameters can be the same or different.

Specifically, the parameters to be updated can be determined from the initial model according to preset rules; the parameters to be updated can be all parameters in the initial model, or some parameters can be randomly determined from the initial model; and then calculate the loss value to be updated parameters Derivative of

Among them, L is the loss value of the probability matrix; W is the parameter to be updated;

Represents partial derivative operation; the parameter to be updated can also be called the weight of each neuron. This process can also be called a back-propagation algorithm; if the loss value is large, it means that the output of the current initial model does not match the expected output result, then the derivative of the loss value to the parameter to be updated in the initial model can be calculated. As a basis for adjusting the parameters to be updated.

After obtaining the derivative of each parameter to be updated, update the parameter to be updated to obtain the updated parameter to be updated

Among them, α is the preset coefficient. This process can also be referred to as a stochastic gradient descent algorithm; the derivative of each parameter to be updated can also be understood as the direction in which the loss value drops the fastest based on the current parameter to be updated. By adjusting the parameters in this direction, the loss value can be quickly reduced, so that The parameter converges. In addition, when the initial model is trained once, a loss value is obtained. At this time, one or more parameters can be randomly selected from each parameter in the initial model to perform the above-mentioned update process. The model training time is shorter and the algorithm is faster. ; Of course, the above-mentioned update process can also be performed on all the parameters in the initial model, and the model training in this way is more accurate.

Step S516: Determine whether the updated parameters are all converged; if the updated parameters are all converged, perform step S518; if the updated parameters are not all converged, perform step S502;

If the updated parameters do not all converge, the step of determining the current training data based on the preset training set is continued until the updated parameters all converge.

Step S518: Determine the initial model after parameter update as the video classification model.

In the above method, the combination of convolutional neural network and recurrent neural network is used to extract features through the combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, the amount of calculation can be greatly reduced, thereby increasing Model training and recognition efficiency; this method can also consider the associated information between video frames in the process of extracting features, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.

Based on the above video classification model training method, an embodiment of the present application also provides a video classification method; this method is implemented on the basis of the video classification model training method described in the above embodiment, as shown in FIG. 6, the method includes The following steps:

Step S602: Obtain a video to be classified;

The video can be a regular video or a short video; the specific format of the video can be MPEG (Moving Picture Experts Group), AVI (Audio Video Interleaved, audio video interleaved format), MOV (QuickTime film format) Wait, it is not limited here.

Step S604: Obtain multiple video frames from the video according to a preset sampling interval;

The sampling interval can be preset, for example, the sampling interval is 0.2 seconds, that is, 5 frames are sampled in 1 second.

Step S606: Input the multi-frame video frame to the pre-trained video classification model, and output the classification result of the multi-frame video frame; the video classification model is obtained by training the above-mentioned video classification model training method;

Step S608: Determine the video category according to the classification result of the multiple video frames.

The video classification method provided by the embodiment of the application first obtains multiple video frames from a video to be classified according to a preset sampling interval; inputs the multiple video frames to a pre-trained video classification model, and outputs multiple frames The classification result of the video frame; and the classification of the video is determined according to the classification result of the multi-frame video frame. Since the video classification model uses a combination of convolutional neural networks and recurrent neural networks, the features are extracted through the combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, the amount of calculation can be greatly reduced, thereby increasing Model training and recognition efficiency; this method can also consider the associated information between video frames in the process of extracting features, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.

The classification result of the multi-frame video frame output by the above-mentioned video classification model may include one or more categories, and the classification result of the multi-frame video frame can be directly determined as the video category. In another way, the classification result of the multi-frame video frame includes a classification probability vector and a threshold result vector. In this case, the probability value of each category in the classification probability vector can be compared with the corresponding threshold in the threshold result vector to determine the video category. Specifically, calculate the category vector of the video according to the following formula

Among them, p _l is the l-th element of the classification probability vector; θ _l is the l-th element of the threshold result vector; and in the category vector, the category corresponding to the non-zero element is determined as the category of the video. Since the probability value of the category corresponding to the non-zero element is greater than the corresponding threshold, the category can be regarded as the category of the video.

In the above method, the model not only outputs the classification probability vector, but also the threshold result vector. Based on the comparison result of the two vectors, the video category is finally determined. Compared with the method of manually setting the threshold, the modulus output threshold is more accurate and reasonable. Helps improve the accuracy of video classification. Identifying tags for videos based on the classification result is helpful for users to quickly discover the content they are interested in, and also helpful for recommending videos of interest to users, which improves user experience.

It should be noted that the foregoing method embodiments are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments can be referred to each other.

Corresponding to the foregoing method embodiment, refer to the schematic structural diagram of a training device for a video classification model shown in FIG. 7, and the device includes:

The training data determining module 70 is configured to determine the current training data based on a preset training set; the training data includes multiple video frames;

The training data input module 71 is configured to input training data to the initial model; the initial model includes a convolutional neural network, a recurrent neural network, and an output network;

The initial feature extraction module 72 is configured to extract features of multiple frames of video frames through a convolutional neural network as the initial features;

The final feature extraction module 73 is configured to extract features of multiple video frames from the initial features through a recurrent neural network as the final feature;

The prediction result output module 74 is configured to input the final feature to the output network and output the prediction result of the multi-frame video frame;

The loss value determination and training module 75 is configured to determine the loss value of the prediction result through a preset prediction loss function; the initial model is trained according to the loss value until the parameters in the initial model converge to obtain a video classification model.

The training device for the video classification model provided by the embodiment of the application uses a combination of a convolutional neural network and a recurrent neural network to extract features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, The amount of calculation can be greatly reduced, thereby improving the efficiency of model training and recognition; this method can also consider the associated information between video frames in the process of extracting features, so the extracted features can accurately represent the video type, thereby improving the video The accuracy of classification.

In some embodiments, the above-mentioned convolutional neural network includes multiple groups of sub-networks, a global average pooling layer, and a fully connected layer that are sequentially connected; each group of sub-networks includes a batch normalization layer, an activation function layer, and a volume Multilayer and pooling layer; the initial parameters of the above-mentioned convolutional neural network are obtained by training on a preset data set.

In some embodiments, the above-mentioned initial model further includes a global average pooling network; the global average pooling network is set between the convolutional neural network and the recurrent neural network; the above-mentioned device further includes: a dimensionality reduction module set to pass the global average pooling The dimensionalization network performs dimensionality reduction processing on the initial features to obtain dimensionality reduction features; the final feature extraction module 73 is specifically configured to extract the features of the multi-frame video frame from the dimensionality reduction features through the recurrent neural network as the final feature .

In some embodiments, the above-mentioned cyclic neural network includes a long and short-term memory network.

In some embodiments, the above-mentioned output network includes a fully-connected classification layer; the initial model also includes a classification function; the above-mentioned prediction result output module is configured to: input the final feature into the fully-connected classification layer and output a classification result vector; the above device also includes : Probability vector output module, set to input the classification result vector to the classification function, and output the classification probability vector corresponding to the classification result vector.

In some embodiments, the aforementioned prediction loss function includes a classification loss function; the classification loss function is

Among them, ∑ represents the summation operation,

exp represents the exponential function with the natural constant e as the base, log represents the logarithmic operation; p _l is the lth element of the classification probability vector corresponding to the classification result vector in the prediction result; y _l is the pre-labeled multi-frame video frame standard The l-th element of the probability vector; r _l is the proportion of the category corresponding to y _l in the training set; τ is the preset hyperparameter.

In some embodiments, the aforementioned output network includes a threshold fully connected layer; the aforementioned prediction result output module is configured to: input the final feature to the threshold fully connected layer, and output a threshold result vector.

In some embodiments, the aforementioned prediction loss function includes a threshold loss function; the threshold loss function is

y _l is the l-th element of the standard probability vector of the pre-labeled multi-frame video frame; δ _l = element p _l- θ _l ); θ _l is the l-th element of the threshold result vector in the prediction result;

In some embodiments, the above prediction loss function includes a classification loss function and a threshold loss function; the above loss value determination and training module is configured to: perform a weighted summation on the function value of the classification loss function and the function value of the threshold loss function to obtain The loss value of the prediction result.

In some embodiments, the above-mentioned loss value determination and training module is configured to: update the parameters in the initial model according to the loss value; determine whether the updated parameters are all converged; if the updated parameters are all converged, set the updated initial parameters The model is determined to be a video classification model; if the updated parameters do not all converge, continue to perform the step of determining the current training data based on the preset training set until the updated parameters all converge.

In some embodiments, the aforementioned loss value determination and training module is configured to: determine the parameters to be updated from the initial model according to preset rules; calculate the derivative of the loss value to the parameters to be updated in the initial model

Among them, L is the loss value; W is the parameter to be updated;

Indicates partial derivative operation; update the parameter to be updated, and get the updated parameter to be updated

Among them, α is the preset coefficient.

See FIG. 8 for a schematic structural diagram of a video classification device; the device includes:

The video acquisition module 80 is configured to acquire the video to be classified;

The video frame obtaining module 81 is configured to obtain multiple video frames from the video according to a preset sampling interval;

The classification module 82 is configured to input a multi-frame video frame to the pre-trained video classification model, and output the classification result of the multi-frame video frame; the video classification model is obtained through training of the above-mentioned video classification model training method;

The category determining module 83 is configured to determine the category of the video according to the classification result of the multiple video frames.

In some embodiments, the classification result of the above-mentioned multi-frame video frame includes: a classification probability vector and a threshold result vector; the above-mentioned category determination module is configured to: calculate the category vector of the video

Among them, p _l is the l-th element of the classification probability vector; θ _l is the l-th element of the threshold result vector; in the category vector, the category corresponding to the non-zero element is determined as the category of the video.

The implementation principles and technical effects of the device provided in the embodiment of the application are the same as those of the foregoing method embodiment. For a brief description, for the parts not mentioned in the device embodiment, please refer to the corresponding content in the foregoing method embodiment.

An embodiment of the present application also provides an electronic device. As shown in FIG. 9, the electronic device includes a memory 100 and a processor 101, where the memory 100 is configured to store one or more computer instructions, and one or more computer instructions are The processor 101 executes to implement the training method of the video classification model or the steps of the video classification method.

Further, the electronic device shown in FIG. 9 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.

The memory 100 may include a high-speed random access memory (RAM, Random Access Memory), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. The communication connection between the system network element and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and the Internet, a wide area network, a local network, a metropolitan area network, etc. may be used. The bus 102 may be an ISA bus, PCI bus, EISA bus, or the like. The bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one bidirectional arrow is used in FIG. 9, but it does not mean that there is only one bus or one type of bus.

The processor 101 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 101 or instructions in the form of software. The aforementioned processor 101 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processing, DSP for short). ), Application Specific Integrated Circuit (ASIC for short), Field-Programmable Gate Array (FPGA for short) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components. The methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers. The storage medium is located in the memory 100, and the processor 101 reads information in the memory 100, and completes the steps of the method of the foregoing embodiment in combination with its hardware.

The embodiment of the present application also provides a machine-readable storage medium that stores machine-executable instructions. When the machine-executable instructions are called and executed by a processor, the machine-executable instructions prompt the processor to implement For the training method of the above-mentioned video classification model or the steps of the video classification method, for specific implementation, please refer to the method embodiment, which will not be repeated here.

The video classification method and its model training method, device, and computer program product of the electronic device provided by the embodiments of the present application include a computer-readable storage medium storing program code, and the instructions included in the program code can be set to execute the previous For the specific implementation of the method described in the method embodiment, please refer to the method embodiment, which will not be repeated here.

If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the technical solution of this application essentially or the part that contributes to the related technology or the part of the technical solution can be embodied in the form of a software product. The computer software product is stored in a storage medium, including several The instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program code .

The embodiment of the present application provides an executable program code that is configured to be executed to execute any of the above-mentioned training methods for video classification models or the steps of any of the above-mentioned video classification methods.

Finally, it should be noted that the above-mentioned embodiments are only specific implementations of the application, which are used to illustrate the technical solutions of the application, rather than limit it. The scope of protection of the application is not limited thereto, although referring to the foregoing The examples describe the application in detail, and those of ordinary skill in the art should understand that any person skilled in the art can still modify the technical solutions described in the foregoing examples within the technical scope disclosed in this application. Or it is easy to think of changes, or equivalent replacements of some of the technical features; and these modifications, changes or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the application, and should be covered in this application Within the scope of protection. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

The above are only the preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in this application Within the scope of protection.

Claims

A method for training a video classification model, the method comprising:

Determine current training data; the training data includes multiple video frames;

Input the training data to an initial model; the initial model includes a convolutional neural network, a recurrent neural network, and an output network;

Extracting the feature of the multi-frame video frame by the convolutional neural network as an initial feature;

Extracting the feature of the multi-frame video frame from the initial feature by using the recurrent neural network as the final feature;

Inputting the final feature to the output network, and outputting the prediction result of the multi-frame video frame;

The loss value of the prediction result is determined by a preset prediction loss function; the initial model is trained according to the loss value until the parameters in the initial model converge to obtain a video classification model.
The method according to claim 1, wherein the convolutional neural network includes multiple groups of sub-networks connected in sequence, a global average pooling layer, and a fully connected layer of classification; each group of the sub-networks includes a batch of sub-networks connected in sequence Layer, activation function layer, convolution layer and pooling layer;

The initial parameters of the convolutional neural network are obtained by training on a preset data set.
The method according to claim 1, wherein the initial model further comprises a global average pooling network; the global average pooling network is set between the convolutional neural network and the recurrent neural network;

The extracting the feature of the multi-frame video frame from the initial feature through the recurrent neural network as the final feature includes:

Performing dimensionality reduction processing on the initial features through the global average pooling network to obtain dimensionality reduction features;

The feature of the multi-frame video frame is extracted from the dimensionality reduction feature through the recurrent neural network as the final feature.
The method of claim 1, wherein the recurrent neural network comprises a long and short-term memory network.
The method according to claim 1, wherein the output network includes a classification fully connected layer; the initial model further includes a classification function;

The step of inputting the final feature to the output network and outputting the prediction result of the multi-frame video frame includes: inputting the final feature to the classification fully connected layer, and outputting a classification result vector;

The method further includes: inputting the classification result vector to the classification function, and outputting a classification probability vector corresponding to the classification result vector.
The method according to claim 5, wherein the prediction loss function includes a classification loss function;

The classification loss function is

among them,
p l is the l-th element of the classification probability vector corresponding to the classification result vector in the prediction result; y l is the l-th element of the pre-labeled standard probability vector of the multi-frame video frame; r l is the corresponding y l The proportion of the category in the training set; τ is a preset hyperparameter.
The method of claim 5, wherein the output network includes a threshold fully connected layer;

The step of inputting the final feature to the output network and outputting the prediction result of the multi-frame video frame includes: inputting the final feature to the threshold fully connected layer, and outputting a threshold result vector.
8. The method of claim 7, wherein the predicted loss function comprises a threshold loss function;

The threshold loss function is

y l is the lth element of the pre-marked standard probability vector of the multi-frame video frame; δ l = element p l- θ l ); θ l is the lth element of the threshold result vector in the prediction result;
The method according to claim 1, wherein the prediction loss function includes a classification loss function and a threshold loss function;

The step of determining the loss value of the prediction result through a preset prediction loss function includes:

A weighted summation is performed on the function value of the classification loss function and the function value of the threshold loss function to obtain the loss value of the prediction result.
The method according to claim 1, wherein the step of training the initial model according to the loss value until the parameters in the initial model converge to obtain a video classification model comprises:

Update the parameters in the initial model according to the loss value;

Determine whether the updated parameters are all converged;

If the updated parameters all converge, determining the initial model after the updated parameters as the video classification model;

If the updated parameters do not all converge, continue to perform the step of determining the current training data until the updated parameters all converge.
The method according to claim 10, wherein the step of updating the parameters in the initial model according to the loss value comprises:

According to preset rules, determine the parameters to be updated from the initial model;

Calculate the derivative of the loss value to the parameter to be updated in the initial model
Wherein, L is the loss value; W is the parameter to be updated;

Update the parameter to be updated to obtain the updated parameter to be updated
Among them, α is the preset coefficient.
A video classification method, the method includes:

Get the video to be classified;

Acquiring multiple video frames from the video according to a preset sampling interval;

Input the multi-frame video frame to the pre-trained video classification model, and output the classification result of the multi-frame video frame; the video classification model is trained by the video classification model of any one of claims 1-11 Method training;

The category of the video is determined according to the classification result of the multi-frame video frame.
The method according to claim 12, wherein the classification result of the multi-frame video frame comprises: a classification probability vector and a threshold result vector;

The step of determining the category of the video according to the classification result of the multi-frame video frame includes:

Calculate the category vector of the video
Where p l is the lth element of the classification probability vector; θ l is the lth element of the threshold result vector;

Determine the category corresponding to the non-zero element in the category vector as the category of the video.
A training device for a video classification model, including:

A training data determination module, configured to determine current training data; the training data includes multiple video frames;

A training data input module, configured to input the training data to an initial model; the initial model includes a convolutional neural network, a recurrent neural network, and an output network;

An initial feature extraction module, configured to extract features of the multi-frame video frame through the convolutional neural network as the initial feature;

A final feature extraction module, configured to extract the features of the multi-frame video frame from the initial features through the recurrent neural network as the final feature;

A prediction result output module, configured to input the final feature to the output network and output the prediction result of the multi-frame video frame;

The loss value determination and training module is configured to determine the loss value of the prediction result through a preset prediction loss function; train the initial model according to the loss value until the parameters in the initial model converge to obtain a video Classification model.
The device according to claim 14, wherein the convolutional neural network includes multiple groups of sub-networks connected in sequence, a global average pooling layer, and a fully connected classification layer; each group of the sub-networks includes a batch of sub-networks connected in sequence Layer, activation function layer, convolution layer and pooling layer;

The initial parameters of the convolutional neural network are obtained by training on a preset data set.
The apparatus according to claim 14, wherein the initial model further comprises a global average pooling network; the global average pooling network is set between the convolutional neural network and the recurrent neural network;

The device further includes: a dimensionality reduction module configured to perform dimensionality reduction processing on the initial feature through the global average pooling network to obtain a dimensionality reduction feature;

The final feature extraction module is specifically configured to extract the feature of the multi-frame video frame from the dimensionality reduction feature through the cyclic neural network as the final feature.
The device of claim 14, wherein the recurrent neural network comprises a long and short-term memory network.
The apparatus according to claim 14, wherein the output network includes a fully connected classification layer; the initial model further includes a classification function;

The prediction result output module is configured to: input the final feature to the classification fully connected layer, and output a classification result vector;

The device further includes: a probability vector output module, configured to input the classification result vector to the classification function, and output a classification probability vector corresponding to the classification result vector.
The apparatus according to claim 18, wherein the prediction loss function comprises a classification loss function;

The classification loss function is

among them,
p l is the l-th element of the classification probability vector corresponding to the classification result vector in the prediction result; y l is the l-th element of the pre-labeled standard probability vector of the multi-frame video frame; r l is the corresponding y l The proportion of the category in the training set; τ is a preset hyperparameter.
The apparatus of claim 18, wherein the output network includes a threshold fully connected layer;

The prediction result output module is configured to: input the final feature to the threshold fully connected layer, and output a threshold result vector.
The apparatus according to claim 20, wherein the prediction loss function comprises a threshold loss function;

The threshold loss function is
y l is the lth element of the pre-marked standard probability vector of the multi-frame video frame; δ l = element p l- θ l ); θ l is the lth element of the threshold result vector in the prediction result;
The apparatus according to claim 14, wherein the prediction loss function includes a classification loss function and a threshold loss function;

The loss value determination and training module is configured to: perform a weighted summation on the function value of the classification loss function and the function value of the threshold loss function to obtain the loss value of the prediction result.
The device according to claim 14, wherein the loss value determination and training module is set to:

Update the parameters in the initial model according to the loss value;

Determine whether the updated parameters are all converged;

If the updated parameters all converge, determining the initial model after the updated parameters as the video classification model;

If the updated parameters do not all converge, continue to perform the step of determining the current training data based on the preset training set until the updated parameters all converge.
The device according to claim 23, wherein the loss value determination and training module is set to:

According to preset rules, determine the parameters to be updated from the initial model;

Calculate the derivative of the loss value to the parameter to be updated in the initial model
Wherein, L is the loss value; W is the parameter to be updated;

Update the parameter to be updated to obtain the updated parameter to be updated
Among them, α is the preset coefficient.
A video classification device, the device includes:

The video acquisition module is set to acquire the video to be classified;

A video frame obtaining module, configured to obtain multiple video frames from the video according to a preset sampling interval;

The classification module is configured to input the multi-frame video frame to the pre-trained video classification model, and output the classification result of the multi-frame video frame; the video classification model is according to any one of claims 1-11 The training method of the video classification model is trained;

The category determining module is configured to determine the category of the video according to the classification result of the multi-frame video frame.
The apparatus according to claim 25, wherein the classification result of the multi-frame video frame comprises: a classification probability vector and a threshold result vector;

The category determination module is set to:

Calculate the category vector of the video
Wherein, p l is the lth element of the classification probability vector; θl is the lth element of the threshold result vector;

Determine the category corresponding to the non-zero element in the category vector as the category of the video.
An electronic device comprising a processor and a memory, the memory storing machine executable instructions that can be executed by the processor, and the processor executing the machine executable instructions to implement any one of claims 1 to 11 The training method of the video classification model, or the steps of the video classification method of claim 12 or 13.
A machine-readable storage medium, the machine-readable storage medium stores machine-executable instructions. When the machine-executable instructions are called and executed by a processor, the machine-executable instructions prompt the processor to implement any one of claims 1 to 11 The training method of the video classification model described in item, or the steps of the video classification method described in claim 12 or 13.
An executable program code configured to be run to execute the training method of the video classification model according to any one of claims 1 to 11, or the video classification method according to claim 12 or 13. step.