WO2021082743A1

WO2021082743A1 - Video classification method and apparatus, and electronic device

Info

Publication number: WO2021082743A1
Application number: PCT/CN2020/113860
Authority: WO
Inventors: 李果; 陈熊; 汪贤; 樊鸿飞; 蔡媛
Original assignee: 北京金山云网络技术有限公司; 北京金山云科技有限公司
Priority date: 2019-10-31
Filing date: 2020-09-08
Publication date: 2021-05-06
Also published as: CN110766096A; CN110766096B

Abstract

Provided are a video classification method and apparatus, and an electronic device. The method comprises: acquiring a video to be classified; according to a plurality of target image frames in said video, determining a target image set corresponding to said video, wherein the target image set comprises the plurality of target image frames; inputting the target image set into a target classification model, and obtaining a target video scenario that is output by the target classification model and corresponds to the target image set, wherein the target classification model is used to acquire an image feature corresponding to each target image frame in the target image set, and to determine the target video scenario according to the image feature corresponding to each target image frame; and according to the target video scenario corresponding to the target image set, determining a classification result for said video, wherein the classification result is used to indicate a video scenario of said video. According to the present disclosure, the accuracy of a video classification result can be effectively improved.

Description

Video classification method, device and electronic equipment

This disclosure claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201911059325.6, and the invention title is "Video Classification Method, Apparatus, and Electronic Equipment" on October 31, 2019, the entire content of which is incorporated into this disclosure by reference in.

Technical field

The present disclosure relates to the field of deep learning technology, and in particular to a video classification method, device and electronic equipment.

Background technique

In recent years, with the development of various video applications (Application, APP for short), the number of videos on the Internet has also grown rapidly, and the content is rich and diverse. By classifying videos, it is not only convenient for users to find the videos they need, but also helps Extract the information conveyed in the video. At present, when classifying videos, it is necessary to confirm the category to which the image frames extracted from the video belong, and then calculate the average value of the classification results of the extracted image frames to obtain the final video classification result. The inventor found through research that the accuracy of the method of determining the video classification result by calculating the average of the image frame classification results is not high.

Summary of the invention

In view of this, the purpose of the embodiments of the present disclosure is to provide a video classification method, device, and electronic equipment, which can effectively improve the accuracy of the video classification result.

In a first aspect, embodiments of the present disclosure provide a video classification method, including: acquiring a video to be classified; determining a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein The target image set includes the multiple target image frames; the target image set is input to a target classification model, and the target video scene corresponding to the target image set output by the target classification model is obtained, wherein The target classification model is used to obtain the image characteristics corresponding to each target image frame in the target image set, and determine the target video scene according to the image characteristics corresponding to each target image frame; The target video scene determines the classification result of the video to be classified, wherein the classification result is used to indicate the video scene of the video to be classified.

In an embodiment, the image features include one or more of shallow features, deep features, spatial features, and temporal features; the target classification model includes a feature fusion network, and is connected to the feature fusion network The feature extraction network; wherein the step of inputting the target image set into a target classification model and obtaining the target video scene corresponding to the target image set output by the target classification model includes: transferring the target image Set input to the feature fusion network of the target classification model, and extract the shallow features of each target image frame in the target image set through the feature extraction network; input the shallow features to the feature fusion network of the target classification model , Extracting the deep features, spatial features, and time series features of each target image frame in the target image set based on the shallow features through the feature fusion network, and based on the deep features, the spatial features, and the time sequence features The target video scene corresponding to the target image set is output; the feature level of the deep features is higher than the feature level of the shallow features.

In one embodiment, before the step of inputting the target image set into the feature fusion network of the target classification model, and extracting the shallow features of each target image frame in the target image set through the feature extraction network , The method further includes: obtaining a pre-training model, setting the network parameters of the pre-training model as the initial parameters of the feature extraction network; and training the specified layer of the feature extraction network after the initial parameters are set by back propagation , And use the trained feature extraction network as the feature extraction network in the target classification model.

In one embodiment, the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence; the feature fusion network that inputs the target image set into the target classification model, and extracts the feature extraction network through the feature extraction network. The step of the shallow features of each target image frame in the target image set includes: inputting the target image set into the first feature extraction sub-network in the feature fusion network of the target classification model, and passing the first feature sub-network Perform feature extraction on each target image frame in the target image set; according to the connection sequence of the feature extraction sub-network, input the features extracted by the first feature extraction sub-network to the next feature extraction sub-network, and pass all The next feature sub-network performs feature extraction on the features extracted by the first feature extraction sub-network until the shallow features of each target image frame in the target image set are obtained.

In one embodiment, the feature fusion network extracts the deep features, spatial features, and time series features of each target image frame in the target image set based on the shallow features, and based on the deep features, The step of outputting the target video scene corresponding to the target image set by the spatial feature and the time sequence feature includes: the feature fusion network determines the first probability set corresponding to the target image set according to the deep features, wherein, The first probability set includes multiple first probabilities, each of the first probabilities is used to indicate the probability that the target image set belongs to a video scene; the feature fusion network determines the A second probability set corresponding to the target image set, wherein the second probability set includes a plurality of second probabilities, and each of the second probabilities is used to indicate the probability that the target image set belongs to a kind of video scene; The feature fusion network determines a third probability set corresponding to the target image set according to the time sequence feature, wherein the third probability set includes a plurality of third probabilities, and each third probability is used to indicate the target The probability that the image set belongs to a video scene; performing weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each of the video scenes; The video scene corresponding to the maximum weighted probability is determined as the target video scene corresponding to the target image set.

In one embodiment, the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the pooling layer, the second convolutional layer, and the third convolutional layer The input of is connected to the output of the feature extraction network; the second convolutional layer is a 2D convolutional layer; the third convolutional layer is a 3D convolutional layer; wherein, the feature fusion network is based on the deep layer Feature, the step of determining the first probability set corresponding to the target image set includes: the pooling layer in the feature fusion network determines the first probability corresponding to the target image set according to the deep features The step of determining the second probability set corresponding to the target image set by the feature fusion network according to the spatial feature includes: the second convolutional layer in the feature fusion network according to the spatial feature, Determining the second probability set corresponding to the target image; the feature fusion network determining the third probability set corresponding to the target image set according to the time series feature includes: the first probability set in the feature fusion network The three-convolutional layer determines the third probability set corresponding to the target image according to the time sequence feature.

In one embodiment, before the step of inputting the target image set to the target classification model, the method further includes: obtaining an image training set, and inputting the image training set to the initial classification model; The initial classification model calculates the loss function of the initial classification model based on the classification results output by the image training set; uses a backpropagation algorithm to calculate the derivative of the loss function with respect to the parameters of the initial classification model; uses a gradient descent algorithm And the derivative to update the parameters of the initial classification model to obtain the target classification model.

In a second aspect, embodiments of the present disclosure also provide a video classification device, including: a video acquisition module configured to acquire a video to be classified; an image set determining module configured to obtain multiple target image frames in the video to be classified, Determine the target image set corresponding to the video to be classified, wherein the target image set includes the multiple target image frames; an input module is configured to input the target image set into a target classification model, and obtain the target The target video scene corresponding to the target image set output by the classification model, wherein the target classification model is set to obtain the image feature corresponding to each target image frame in the target image set, and corresponding to each target image frame The target video scene is determined by the image feature; the classification determination module is configured to determine the classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is set to indicate the to-be-classified video The video scene of the classified video.

In a third aspect, embodiments of the present disclosure also provide an electronic device, including a processor and a memory; the memory stores a computer program, and the computer program executes any of the tasks provided in the first aspect when run by the processor. The method described in one item.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and the computer program executes the method provided in the first aspect when the computer program is run by a processor.

The embodiments of the present disclosure bring the following beneficial effects:

In the video classification method, device and electronic equipment provided by the embodiments of the present disclosure, the video to be classified is first obtained, and the target image set corresponding to the video to be classified is determined according to multiple target image frames (including multiple target image frames) in the video to be classified , By inputting the target image set to the target classification model, the target video scene corresponding to the target image set output by the target classification model is obtained, where the target classification model is used to obtain the image features corresponding to each target image frame in the target image set, and according to The image feature corresponding to each target image frame determines the target video scene, and finally, according to the target video scene corresponding to the target image set, the classification result used to indicate the video scene of the video to be classified is determined. Compared with the traditional video classification method, the embodiment of the present disclosure determines the target video scene of the target image set by extracting the image feature corresponding to each image frame by the target classification model, and further determines the video scene of the video to be classified on this basis. The classification results can effectively improve the efficiency and accuracy of video classification.

Other features and advantages of the embodiments of the present disclosure will be described in the following specification, and partly become obvious from the specification, or understood by implementing the embodiments of the present disclosure. The objectives and other advantages of the embodiments of the present disclosure are realized and obtained by the structures specifically pointed out in the specification, claims and drawings.

In order to make the above-mentioned objects, features and advantages of the embodiments of the present disclosure more obvious and understandable, preferred embodiments are specifically described below in conjunction with the accompanying drawings, which are described in detail as follows.

Description of the drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure or related technologies, the following will briefly introduce the accompanying drawings that need to be used in the description of the embodiments or related technologies. Obviously, the accompanying drawings in the following description are only for the present disclosure. For some of the embodiments of the embodiments, for those of ordinary skill in the art, other drawings may be obtained based on these drawings without creative work.

FIG. 1 is a schematic flowchart of a video classification method provided by an embodiment of the disclosure;

2 is a schematic structural diagram of a target classification model provided by an embodiment of the disclosure;

3 is a schematic structural diagram of another target classification model provided by an embodiment of the disclosure;

4 is a schematic structural diagram of a video classification device provided by an embodiment of the disclosure;

FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the disclosure.

Detailed ways

In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described clearly and completely in conjunction with the embodiments. Obviously, the described embodiments are part of the embodiments of the present disclosure, and Not all examples. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the embodiments of the present disclosure.

Considering that the video classification result obtained by obtaining the average value of the image frame classification results has the problem of low accuracy. Based on this, the video classification, device and electronic equipment provided by the embodiments of the present disclosure can be Effectively improve the accuracy of video classification results.

In order to facilitate the understanding of this embodiment, a video classification method disclosed in the embodiment of the present disclosure is first introduced in detail. Refer to the flowchart of a video classification method shown in FIG. 1, the method may include the following steps:

S102: Obtain a video to be classified.

The video to be classified can be understood as a video whose video scene is unknown. Among them, the video scene can include video application scenes and video space scenes, such as sports, variety shows, games, film and television or animation and other video application scenes, indoors, forests or roads. Wait for the video space scene. In some embodiments, the video to be classified may be a video recorded by a user, or a video downloaded from various video apps or video websites.

S104: Determine a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified.

Among them, the target image set includes multiple target image frames. In one embodiment, each image frame in the video to be classified can be used as a target image frame to obtain a target image set containing all image frames of the video, or Extracting multiple target image frames at preset intervals from the video to be classified, and determining the extracted target image frames as target image frames included in the target image set.

S106: Input the target image set into the target classification model, and obtain the target video scene corresponding to the target image set output by the target classification model.

Among them, the target classification model is used to obtain the image characteristics corresponding to each target image frame in the target image set, and to determine the target video scene according to the image characteristics corresponding to each target image frame. The target classification model is obtained by pre-training. In the method, the image training set is obtained, where each image in the image training set carries a classification label, and the image training set is input to the initial classification model, so that the initial classification model learns between each image in the image set and the classification label The mapping relationship of, thus get the target classification model for video classification. In another embodiment, the image training set, image verification set, and image test set are obtained separately, and each image in the image training set, image verification set, and image test set carries a classification label. First, use the image training set to train In the initial classification model, multiple candidate classification models are obtained, and then the image verification set is input to each candidate classification model to select a candidate classification model with better classification effect from each candidate classification model, and finally the image test set is input to the selected In the candidate classification model, if the classification accuracy of the selected candidate classification model for the image test set is higher than the preset threshold, the selected candidate classification model is used as the target classification model.

S108: Determine a classification result of the video to be classified according to the target video scene corresponding to the target image set.

Among them, the classification result is used to indicate the video scene of the video to be classified. In practical applications, the target video scene corresponding to the target image set can be determined as the video scene of the video to be classified, and then the classification result of the video to be classified can be obtained, for example, Assuming that the video classification scene corresponding to the target image set is scene A, the classification result of the video to be classified will indicate that the video scene of the video to be classified is scene A.

The video classification method provided by the embodiments of the present disclosure first obtains the video to be classified, and determines the target image set corresponding to the video to be classified according to multiple target image frames (including multiple target image frames) in the video to be classified. The image set is input to the target classification model, and the target video scene corresponding to the target image set output by the target classification model is obtained. The target classification model is used to obtain the image features corresponding to each target image frame in the target image set, and according to each target image The image feature corresponding to the frame determines the target video scene, and finally, according to the target video scene corresponding to the target image set, the classification result used to indicate the video scene of the video to be classified is determined. Compared with the traditional video classification method, the embodiment of the present disclosure determines the target video scene of the target image set by extracting the image feature corresponding to each image frame by the target classification model, and further determines the video scene of the video to be classified on this basis. The classification results can effectively improve the efficiency and accuracy of video classification.

In order to facilitate the understanding of the video method provided in the above embodiments, the embodiments of the present disclosure also provide a target classification model, where the target classification model includes a feature fusion network and a feature extraction network connected to the feature fusion network, see FIG. 2 A schematic structural diagram of a target classification model is shown. Fig. 2 illustrates that the target classification model includes a feature extraction network and a feature fusion network connected in sequence.

In practical applications, the target classification model can extract the image features corresponding to each target image frame in the target image set, and the image features can include one or more of shallow features, deep features, spatial features, and temporal features. Among them, the shallow features can be understood as the basic features of the target image set, such as edges or contours; the deep features can be understood as the abstract features of the target image set, and the feature level of the deep features is higher than the feature level of the shallow features, for example, if If the target image frame contains a human face, the abstract feature can be the entire face; the spatial feature, that is, the spatial relationship feature, can be used to characterize the mutual position space or relative direction relationship between multiple targets in the image frame, such as multiple The relationship between the targets includes one or more of a connection relationship, an overlap relationship, or an inclusion relationship; the timing characteristics can be understood as characteristics of the timing data of the target image frame.

On the basis of Figure 2, the input of the feature extraction network is the target image set corresponding to the video to be classified, the output of the feature extraction network is the shallow feature corresponding to the target image set; the input of the feature fusion network is the target image set corresponding to the above For shallow features, the output of the feature fusion network is the target video scene corresponding to the target image set. Based on the network structure of the foregoing target classification model, the foregoing step S106 can be performed with reference to the following steps (1) to (2):

(1) Input the target image set into the feature fusion network of the target classification model, and extract the shallow features of each target image frame in the target image set through the feature extraction network.

Among them, the shallow feature of the target image set may be a feature map corresponding to each target image frame in the target image set. For example, the target image set contains N target image frames with a size of 224*224. At this time, the input of the feature extraction network is N images with a size of 224*224, and feature extraction is performed on each target image frame in the target image set. Then, output N feature maps with a size of 7*7, and the N feature maps with a size of 7*7 are the aforementioned shallow features.

In one embodiment, the feature extraction network includes ResNet (Residual Networks, residual network) or VGGNet (Visual Geometry Group Network, visual geometry group network), taking into account the traditional convolutional neural network (CNN, Convolutional Neural Networks) in There is a problem of loss of characteristic information during information transmission. The embodiments of the present disclosure adopt ResNet network or VGG network. Among them, Resnet network and VGG network are not only more suitable for image processing, but also Resnet network can effectively protect by directly transmitting input to output. The integrity of feature information helps to a certain extent alleviate the problem of loss of feature information between frames of images in related technologies.

In addition, the feature extraction network provided by the embodiments of the present disclosure is obtained based on the migration learning algorithm and the fine tune algorithm training, where the fine tune algorithm can be understood as freezing the network weights of some layers in the feature extraction network, and through reverse Modify the network weight of the target layer to the propagation algorithm. In practical applications, before executing the feature fusion network that inputs the target image set into the target classification model, and extracts the shallow features of each target image frame in the target image set through the feature extraction network, first obtain the pre-training model, and The network parameters of the training model are set as the initial parameters of the feature extraction network. Among them, the pre-training model can be trained using the ImageNet data set; then the specified layer of the feature extraction network after setting the initial parameters is trained through backpropagation, and the training is performed The latter feature extraction network is used as the feature extraction network in the target classification model. The embodiment of the present disclosure uses the migration learning algorithm and the finetune algorithm to help improve the training efficiency of the feature extraction network pre-training and reduce the amount of data required in the ImageNet data set. It can also strengthen the generalization of the feature extraction network.

In another embodiment, the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence, and each feature extraction sub-network includes a first convolution layer, a normalization layer, an activation function layer, and a residual that are connected in sequence. Connection layer. Among them, the first convolutional layer is used for convolution processing on the input of the feature extraction sub-network, the normalization layer is used for batch normalization processing on the input of the feature extraction sub-network, and the activation function layer is used for the feature extraction sub-network. The input of the network is processed by the activation function, and the residual connection layer is used to perform the residual connection processing on the input of the feature extraction sub-network.

On this basis, the embodiments of the present disclosure provide a feature fusion network that inputs a target image set into a target classification model, and a specific implementation method for extracting the shallow features of each target image frame in the target image set through the feature extraction network, see The following steps (1) to (2): (1) Input the target image set to the first feature extraction sub-network in the feature fusion network of the target classification model, and use the first feature sub-network to analyze each target image in the target image set Frame feature extraction, where the input of the first feature extraction sub-network is each target image frame in the target image set, and the output is the first layer feature of each target image frame; (2) The connection of the sub-network is extracted according to the feature In order, input the features extracted by the first feature extraction sub-network to the next feature extraction sub-network, and perform feature extraction on the features extracted by the first feature extraction sub-network through the next feature sub-network, until each target image set is obtained The shallow features of the target image frame, for each remaining feature extraction sub-network except the first feature extraction sub-network, the input of the feature extraction sub-network is the feature output of the previous feature extraction sub-network of the feature extraction sub-network , By performing feature extraction on the input features again, and inputting the extracted features into the next feature extraction sub-network of the feature extraction sub-network. For example, the feature extraction network includes 5 feature extraction sub-networks connected in sequence, that is, the feature extraction sub-network is divided into 5 stages, and each stage outputs feature maps of different sizes in turn to obtain the shallow layer corresponding to each image in the image set feature.

(2) Input the shallow features into the feature fusion network of the target classification model, and extract the deep features, spatial features, and time sequence features of each target image frame in the target image set based on the shallow features through the feature fusion network, and based on the deep features, space The feature and timing feature output the target video scene corresponding to the target image set. For ease of understanding, the embodiments of the present disclosure also provide another target classification model. Refer to the schematic structural diagram of another target classification model shown in FIG. 3. FIG. 3 shows that the feature fusion network includes a pooling layer and a second volume. The input of the accumulation layer and the third convolutional layer; the input of the pooling layer, the second convolutional layer and the third convolutional layer are all connected to the output of the feature extraction network.

Based on the network structure of the target classification model described above, the above step (2) can be performed with reference to the following steps 1 to 5:

Step 1. The feature fusion network determines the first probability set corresponding to the target image set according to the deep features. Among them, the first probability set includes multiple first probabilities, and each first probability is used to indicate the probability that the target image set belongs to a video scene. In practical applications, the pooling layer in the feature fusion network can be based on deep features. , Determine the first probability set corresponding to the target image set. Deep features can also be understood as the key features of each image frame in the target image set. For example, the first probability set includes 70% for indicating that the target image set belongs to variety shows, 50% for indicating that the target image set belongs to sports, 20% for indicating that the target image set belongs to animation, and 20%. It indicates that the probability that the target image set belongs to the game is 20%, etc.

Step 2. The feature fusion network determines the second probability set corresponding to the target image set according to the spatial characteristics, where the second probability set includes multiple second probabilities, and each second probability is used to indicate that the target image set belongs to a video scene In practical applications, the second convolutional layer in the feature fusion network determines the second probability set corresponding to the target image according to the spatial features. The second convolutional layer extracts the spatial features of each target image frame in the target image set on the basis of the shallow features, and outputs the second probability set based on the spatial features. Wherein, the spatial feature is a 2-dimensional feature obtained by further extracting on the basis of the above-mentioned shallow feature, and the second convolutional layer is a 2D convolutional layer.

Step 3. The feature fusion network determines a third probability set corresponding to the target image set according to the timing characteristics, where the third probability set includes multiple third probabilities, and each third probability is used to indicate that the target image set belongs to a video scene In a specific implementation, the third convolutional layer in the feature fusion network determines the third probability set corresponding to the target image according to the time sequence feature. The third convolutional layer extracts the temporal features of the image set based on the shallow features, and outputs the third probability set based on the temporal features. Among them, the time series feature is a 3-dimensional feature further extracted on the basis of the above-mentioned shallow feature, and the third convolutional layer is a 3D convolutional layer.

Step 4: Perform a weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each video scene. By weighting and averaging the outputs of the aforementioned pooling layer, the second convolutional layer, and the third convolutional layer, a more accurate probability of all possible categories of the video to be classified can be obtained. For example, the first probability, second probability, and third probability corresponding to the variety show scene are weighted and calculated, and the weighted probability of the variety show scene is 75%, and the first probability, second probability, and third probability corresponding to the game scene are weighted. By calculation, the weighted probability of the game scene is 20%. By weighting the first probability, the second probability, and the third probability corresponding to each video scene, the weighted probability corresponding to each video scene can be obtained.

Step 5: Determine the video scene corresponding to the maximum weighted probability as the target video scene corresponding to the target image set. Assuming that the weighted probability of the variety show scene is the largest, the target video scene corresponding to the target image set is the variety show scene. Compared with the existing video classification method, which ignores the correlation between different frame images, the embodiment of the present disclosure can fully extract the image set through the pooling layer, the second convolutional layer and the third convolutional layer in the feature fusion network. The feature information of different levels and sizes (that is, the above-mentioned depth feature, spatial feature, and temporal feature) can also be used to fuse feature information between frames of images in the image set using a feature fusion network, thereby effectively improving the accuracy of the video classification result.

Before performing the step of inputting the target image set into the target classification model, the embodiment of the present disclosure also provides a training process for training the target classification model as shown in FIG. 3, and the process can be performed with reference to the following steps a to d:

Step a: Obtain the image training set, and input the image training set to the initial classification model. In practical applications, image test sets and image verification sets can also be obtained. Among them, the image training set is used to train the initial classification model. By adjusting the training parameters, multiple candidate classification models with different parameters can be obtained. The training parameters can include the training rate; the image verification set is used to select one from multiple candidate classification models. The candidate classification model with better classification effect; the image test set is used to test the classification ability of the selected candidate classification model. The embodiments of the present disclosure provide a method for obtaining an image training set, an image verification set, and an image test set, including the following steps: (1) Obtain the original video carrying classification tags, considering that there is no public data for video classification. Therefore, a large number of related videos can be obtained by category from the Internet. In order to ensure the generalization of the target classification network, the obtained video categories should be as wide as possible. For example, the data set of the game category can be obtained separately Dozens of related videos of different games; (2) Divide the original video into the first video set, the second video set and the third video set according to the preset ratio; (3) Cut the original video in the first video set into the first video set A first video with a preset duration, and multiple frame images from the first video are extracted to obtain an image training set; (4) Cut the original video in the second video set into a second video with a second preset duration, and Extract multiple frame images in the second video to obtain an image verification set; (5) Cut the original video in the third video set into a third video with a third preset duration, and extract multiple frame images in the third video , Get the image test set. Wherein, the aforementioned first preset duration, second preset duration, and third preset duration may be 5 to 15 seconds to divide the original videos in the first video set, the second video set, and the third video set into different groups. A short video with a length of time, and multiple frames of images are extracted at equal intervals on the obtained short videos respectively to obtain the above-mentioned image training set, image verification set, and image test set. In addition, first divide the original video into the first video set, the second video set, and the third video set, and then cut the original video in the video set to ensure the images in the image training set, image verification set, and image test set. From different original videos, a target classification model with better classification effect can be obtained.

Step b: Calculate the loss function of the initial classification model according to the classification result output by the initial classification model for the image training set. Because each image in the image training set carries a classification label, the initial classification model can learn the mapping relationship between the image and the classification label, and multiple candidate classification models with different weights can be obtained by adjusting the training parameters. In specific implementation, first, the loss function of the initial classification model is calculated according to the classification result output by the initial classification model for the image training set, where the loss function uses cross entropy loss.

Step c: Use the backpropagation algorithm to calculate the derivative of the loss function with respect to the parameters of the initial classification model

Step d, using the gradient descent (Adam) algorithm and derivatives to update the parameters of the initial classification model to obtain the target classification model. In specific implementation, calculate the descent rate α according to the above derivative, and use the descent rate α to update the weight parameters of the initial classification model. When the obtained descent rate α is different, multiple candidate classification models will be obtained, and the descent rate α is calculated according to the above derivative. The formula is as follows:

In order to further determine the target classification model, the image verification set can be input to each candidate classification model, and based on the classification results output by each candidate classification model for the image verification set, a candidate classification model is selected from multiple candidate classification models, and then the image The test set is input to the selected candidate classification model, and based on the classification results of the selected candidate classification model for the image test set output, the classification accuracy of the selected candidate classification model is calculated. If the classification accuracy is higher than the preset threshold, the selected candidate classification model will be selected. The candidate classification model is determined as the target classification model obtained by training.

Considering that unused training parameters will affect the training of the initial classification model, multiple candidate classification models with different parameters will be obtained; in addition, even if the same training parameters are used to train the initial classification model, the model will still exist during subsequent convergence With small fluctuations, multiple candidate classification models with different parameters are obtained. Therefore, the image verification set is required to select a classification model with a better classification effect from the multiple candidate classification models. For example, after selecting a candidate classification model from multiple candidate classification models, use an image test set to test the selected candidate classification model. The images in the image test set are derived from 4 types of videos, including game categories and show scenes. Category, variety show and sports category, and the number of videos in each category is 40. The test results are shown in Table 1 below, and the average accuracy of the classification results has reached more than 90%.

类别category	游戏类别Game category	秀场类别Show category	综艺类别Variety category	体育类别Sports category
精度Precision	97.5％97.5%	80％80%	90％90%	97.5％97.5%

Table 1

On the basis of the above-mentioned embodiments, the embodiments of the present disclosure provide a specific application example of a target classification model. For example, the target classification model is used to implement video coding, and in a specific implementation manner, a segmented video stream is obtained. Input the segmented video stream into the preset first thread and second thread, where the above-mentioned target classification model is deployed in the first thread, and the video corresponding to the segmented video stream is determined by the target classification model in the first thread Scene, and then encode the segmented video stream on the basis of the video scene corresponding to the segmented video stream through the second thread. When there are multiple video frame images, the feature fusion layer fuses the feature parameters of multiple video frame images to obtain the fusion features of multiple video frame images, classifies the fusion features, and obtains the video corresponding to the multiple video frame images as a whole Scene, and the video scene corresponding to the multiple video frame images as a whole is determined as the first video scene of the first segmented video stream. When there are multiple video frame images and the video scenes corresponding to the multiple video frame images are not the same, the video scene is usually expressed as a probability value. For example, the probability that the video scene corresponding to a certain video frame image is an animation is 80%. The probability is 20%. Therefore, the video scene with the highest probability value can be determined as the first video scene of the first segmented video stream; or, the sum of the probabilities of each video scene can be calculated for multiple video frame images, and then the sum of the probabilities can be maximized The video scene of is determined as the first video scene of the first segmented video stream. By using the target classification model provided by the embodiments of the present disclosure to classify the segmented video stream, more accurate classification results can be obtained, and the encoded segmented video stream can be better in the current video scene.

In summary, the embodiments of the present disclosure use the pooling layer, 2D convolutional layer, and 3D convolutional layer in the feature fusion network to more comprehensively extract the feature information of the image concentration, which is ignored compared to the existing video classification methods. For the correlation between different frame images, the embodiment of the present disclosure adopts a feature fusion network to better extract and fuse feature information between different frame images in an image set, and can effectively improve the accuracy of the video classification result.

Regarding the video classification method provided in the foregoing embodiment, an embodiment of the present disclosure also provides a video classification device. Referring to the schematic structural diagram of a video classification device shown in FIG. 4, the device may include the following parts:

The video acquisition module 402 is configured to acquire videos to be classified.

The image set determining module 404 is configured to determine a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein the target image set includes multiple target image frames.

The input module 406 is configured to input the target image set into the target classification model, and obtain the target video scene corresponding to the target image set output by the target classification model, wherein the target classification model is set to obtain the corresponding target image frame in the target image set. Image characteristics, and determine the target video scene according to the image characteristics corresponding to each target image frame.

The classification determining module 408 is configured to determine the classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is set to indicate the video scene of the video to be classified.

Compared with the traditional video classification method, the video classification device provided by the embodiments of the present disclosure determines the target video scene of the target image set by extracting the image features corresponding to each image frame by the target classification model. The above further determines the classification result of the video scene of the video to be classified, which can effectively improve the efficiency and accuracy of video classification.

In an embodiment, the image features include one or more of shallow features, deep features, spatial features, and temporal features; the target classification model includes a feature fusion network, and a feature extraction network connected to the feature fusion network; the foregoing The input module 406 is further configured to: input the target image set into the feature fusion network of the target classification model, extract the shallow features of each target image frame in the target image set through the feature extraction network; and input the shallow features into the features of the target classification model Fusion network, through the feature fusion network based on shallow features to extract the deep features, spatial features and timing features of each target image frame in the target image set, and output the target video scene corresponding to the target image set based on the deep features, spatial features and timing features; The feature level of deep features is higher than that of shallow features.

In one embodiment, the above-mentioned video classification device further includes a first training module configured to: obtain a pre-training model, and set the network parameters of the pre-training model as the initial parameters of the feature extraction network; and set the initial parameters through backpropagation. The specified layer of the subsequent feature extraction network is trained, and the trained feature extraction network is used as the feature extraction network in the target classification model.

In one embodiment, the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence; the above-mentioned input module 406 is further configured to: input the target image set into the first feature extraction sub-network in the feature fusion network of the target classification model , Perform feature extraction on each target image frame in the target image set through the first feature sub-network; according to the connection order of the feature extraction sub-network, input the features extracted by the first feature extraction sub-network to the next feature extraction sub-network, Feature extraction is performed on the features extracted by the first feature extraction sub-network through the next feature sub-network, until the shallow features of each target image frame in the target image set are obtained.

In an embodiment, the above-mentioned input module 406 is further configured to: the feature fusion network determines the first probability set corresponding to the target image set according to the deep features, wherein the first probability set includes a plurality of first probabilities, and each first probability set The probability is set to indicate the probability that the target image set belongs to a video scene; the feature fusion network determines the second probability set corresponding to the target image set according to the spatial characteristics, where the second probability set includes multiple second probabilities, and each second probability set The probability is set to indicate the probability that the target image set belongs to a video scene; the feature fusion network determines the third probability set corresponding to the target image set according to the timing characteristics, where the third probability set includes multiple third probabilities, and each third probability set The probability is set to indicate the probability that the target image set belongs to a video scene; the first probability, the second probability, and the third probability corresponding to the same video scene are weighted and calculated to obtain the weighted probability corresponding to each video scene; the maximum weighted probability corresponds to The video scene of is determined as the target video scene corresponding to the target image set.

In one embodiment, the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the inputs of the pooling layer, the second convolutional layer, and the third convolutional layer are all the same as those of the feature extraction network. The output is connected; the second convolutional layer is a 2D convolutional layer; the third convolutional layer is a 3D convolutional layer; the input module 406 is also set to: the pooling layer in the feature fusion network determines the target image set corresponding to the deep features The first probability set of the feature fusion network according to the spatial characteristics, the step of determining the second probability set corresponding to the target image set includes: the second convolutional layer in the feature fusion network determines the second probability corresponding to the target image according to the spatial characteristics The step of the feature fusion network determining the third probability set corresponding to the target image set according to the time sequence feature includes: the third convolution layer in the feature fusion network determines the third probability set corresponding to the target image according to the time sequence feature.

In an embodiment, the above-mentioned video classification device further includes a second training module configured to: obtain an image training set, and input the image training set to the initial classification model; according to the classification result output by the initial classification model for the image training set, Calculate the loss function of the initial classification model; use the backpropagation algorithm to calculate the derivative of the loss function relative to the parameters of the initial classification model; use the gradient descent algorithm and the derivative to update the parameters of the initial classification model to obtain the target classification model.

The implementation principles and technical effects of the device provided in the embodiments of the present disclosure are the same as those of the foregoing method embodiments. For a brief description, for the parts not mentioned in the device embodiments, please refer to the corresponding content in the foregoing method embodiments.

The device is an electronic device. Specifically, the electronic device includes a processor and a storage device; a computer program is stored on the storage device, and the computer program executes any one of the above-mentioned embodiments when being run by the processor. The method described.

5 is a schematic structural diagram of an electronic device provided by an embodiment of the disclosure. The electronic device 100 includes a processor 50, a memory 51, a bus 52, and a communication interface 53, through which the processor 50, the communication interface 53 and the memory 51 pass The bus 52 is connected; the processor 50 is used to execute an executable module stored in the memory 51, such as a computer program.

The memory 51 may include a high-speed random access memory (RAM, Random Access Memory), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. The communication connection between the system network element and at least one other network element is realized through at least one communication interface 53 (which may be wired or wireless), and the Internet, a wide area network, a local network, a metropolitan area network, etc. may be used.

The bus 52 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one bidirectional arrow is used to indicate in FIG. 5, but it does not mean that there is only one bus or one type of bus.

Wherein, the memory 51 is used to store a program, and the processor 50 executes the program after receiving the execution instruction. The method executed by the flow process definition apparatus disclosed in any of the foregoing embodiments of the present disclosure can be applied to processing In the device 50, or implemented by the processor 50.

The processor 50 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 50 or instructions in the form of software. The above-mentioned processor 50 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processing, DSP for short), etc. ), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components. The methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure can be implemented or executed. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like. The steps of the method disclosed in the embodiments of the present disclosure may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module can be located in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, or electrically erasable programmable memory, registers. The storage medium is located in the memory 51, and the processor 50 reads the information in the memory 51, and completes the steps of the above method in combination with its hardware.

The computer program product of the readable storage medium provided by the embodiments of the present disclosure includes a computer readable storage medium storing program code. The instructions included in the program code can be used to execute the method described in the previous method embodiment, and the specific implementation Please refer to the foregoing method embodiment, which will not be repeated here.

Those skilled in the art can clearly understand that, for the convenience and conciseness of the description, the specific working process of the above-described system, device, and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative, for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or It can be integrated into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application essentially or the part that contributes to the related technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium. It includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disks or optical disks and other media that can store program codes. .

As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: The technical solutions recorded in the embodiments are modified, or some of the technical features are equivalently replaced; and these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application.

Industrial applicability

Based on the video classification method, device and electronic equipment provided by the embodiments of the present disclosure, the image feature corresponding to each image frame can be extracted through the target classification model to determine the target video scene of the target image set, and on this basis, the target video scene to be classified is further determined The classification result of the video scene of the video can effectively improve the efficiency and accuracy of video classification.

Claims

A video classification method, including:

Get the video to be classified;

Determining a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein the target image set includes the multiple target image frames;

The target image set is input to a target classification model, and the target video scene corresponding to the target image set output by the target classification model is obtained, wherein the target classification model is used to obtain each target in the target image set Image features corresponding to the image frames, and determining the target video scene according to the image features corresponding to each target image frame;

Determine the classification result of the video to be classified according to the target video scene corresponding to the target image set, where the classification result is used to indicate the video scene of the video to be classified.
The method according to claim 1, wherein the image features include one or more of shallow features, deep features, spatial features, and temporal features; the target classification model includes a feature fusion network, and the Feature extraction network connected by feature fusion network; among them,

The step of inputting the target image set into a target classification model and obtaining the target video scene corresponding to the target image set output by the target classification model includes:

Inputting the target image set to a feature fusion network of a target classification model, and extracting the shallow features of each target image frame in the target image set through the feature extraction network;

The shallow features are input to the feature fusion network of the target classification model, and the deep features, spatial features, and time series features of each target image frame in the target image set are extracted based on the shallow features through the feature fusion network , And output the target video scene corresponding to the target image set based on the deep feature, the spatial feature, and the time sequence feature; the feature level of the deep feature is higher than the feature level of the shallow feature.
The method according to claim 2, wherein, in the feature fusion network that inputs the target image set to the target classification model, the shallow layer of each target image frame in the target image set is extracted through the feature extraction network Before the step of characterizing, the method further includes:

Acquiring a pre-training model, and setting the network parameters of the pre-training model as the initial parameters of the feature extraction network;

The specified layer of the feature extraction network after the initial parameters is set is trained through back propagation, and the trained feature extraction network is used as the feature extraction network in the target classification model.
The method according to any one of claims 2-3, wherein the feature extraction network comprises a plurality of feature extraction sub-networks connected in sequence;

The step of inputting the target image set into the feature fusion network of the target classification model, and extracting the shallow features of each target image frame in the target image set through the feature extraction network includes:

Input the target image set to the first feature extraction sub-network in the feature fusion network of the target classification model, and perform feature extraction on each target image frame in the target image set through the first feature extraction sub-network;

According to the connection sequence of the feature extraction sub-network, the features extracted by the first feature extraction sub-network are input to the next feature extraction sub-network, and the first feature is extracted through the next feature extraction sub-network. The features extracted by the sub-network are subjected to feature extraction until the shallow features of each target image frame in the target image set are obtained.
The method according to any one of claims 2 to 4, wherein the feature fusion network extracts the deep features, spatial features, and time series features of each target image frame in the target image set based on the shallow features , And outputting the target video scene corresponding to the target image set based on the deep feature, the spatial feature, and the time sequence feature, including:

The feature fusion network determines a first probability set corresponding to the target image set according to the deep features, wherein the first probability set includes a plurality of first probabilities, and each of the first probabilities is set to indicate The probability that the target image set belongs to a video scene;

The feature fusion network determines a second probability set corresponding to the target image set according to the spatial feature, wherein the second probability set includes a plurality of second probabilities, and each of the second probabilities is set to indicate The probability that the target image set belongs to a video scene;

The feature fusion network determines a third probability set corresponding to the target image set according to the time sequence feature, wherein the third probability set includes a plurality of third probabilities, and each of the third probabilities is set to indicate The probability that the target image set belongs to a video scene;

Performing a weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each of the video scenes;

The video scene corresponding to the maximum weighted probability is determined as the target video scene corresponding to the target image set.
The method according to claim 5, wherein the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the pooling layer, the second convolutional layer, and the first convolutional layer; The inputs of the three convolutional layers are all connected to the output of the feature extraction network; the second convolutional layer is a 2D convolutional layer; the third convolutional layer is a 3D convolutional layer; wherein,

The step of the feature fusion network determining the first probability set corresponding to the target image set according to the deep features includes: the pooling layer in the feature fusion network determines the target according to the deep features The first probability set corresponding to the image set;

The step of the feature fusion network determining the second probability set corresponding to the target image set according to the spatial feature includes: the second convolutional layer in the feature fusion network determines the second probability set according to the spatial feature The second probability set corresponding to the target image;

The step of the feature fusion network determining the third probability set corresponding to the target image set according to the time sequence feature includes: the third convolutional layer in the feature fusion network determines the set of probability The third probability set corresponding to the target image.
The method according to any one of claims 1 to 6, wherein, before the step of inputting the target image set into a target classification model, the method further comprises:

Acquiring an image training set, and inputting the image training set to the initial classification model;

Calculating a loss function of the initial classification model according to the classification result output by the initial classification model for the image training set;

Calculating the derivative of the loss function with respect to the parameters of the initial classification model by using a backpropagation algorithm;

The parameters of the initial classification model are updated by using the gradient descent algorithm and the derivative to obtain the target classification model.
A video classification device includes:

The video acquisition module is set to acquire videos to be classified;

An image set determining module, configured to determine a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein the target image set includes the multiple target image frames;

The input module is configured to input the target image set into a target classification model and obtain the target video scene corresponding to the target image set output by the target classification model, wherein the target classification model is set to obtain the target Image feature corresponding to each target image frame in the image set, and determining the target video scene according to the image feature corresponding to each target image frame;

The classification determination module is configured to determine the classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is set to indicate the video scene of the video to be classified.
An electronic device including a processor and a memory;

A computer program is stored on the memory, and the computer program executes the method according to any one of claims 1 to 7 when the computer program is run by the processor.
A computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is run by a processor, the method according to any one of claims 1-7 is executed.