WO2021082743A1 - Video classification method and apparatus, and electronic device - Google Patents

Video classification method and apparatus, and electronic device

Info

Publication number
WO2021082743A1
Authority
WO
WIPO (PCT)
Prior art keywords
target image
target
video
feature
network
Prior art date
Application number
PCT/CN2020/113860
Other languages
French (fr)
Chinese (zh)
Inventor
李果
陈熊
汪贤
樊鸿飞
蔡媛
Original Assignee
北京金山云网络技术有限公司
北京金山云科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京金山云网络技术有限公司, 北京金山云科技有限公司
Publication of WO2021082743A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Definitions

  • the present disclosure relates to the field of deep learning technology, and in particular to a video classification method, device and electronic equipment.
  • the purpose of the embodiments of the present disclosure is to provide a video classification method, device, and electronic equipment, which can effectively improve the accuracy of the video classification result.
  • embodiments of the present disclosure provide a video classification method, including: acquiring a video to be classified; determining a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein the target image set includes the multiple target image frames; inputting the target image set into a target classification model, and obtaining the target video scene corresponding to the target image set output by the target classification model, wherein the target classification model is used to obtain the image features corresponding to each target image frame in the target image set and to determine the target video scene according to the image features corresponding to each target image frame; and determining a classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is used to indicate the video scene of the video to be classified.
  • the image features include one or more of shallow features, deep features, spatial features, and temporal features;
  • the target classification model includes a feature fusion network, and a feature extraction network connected to the feature fusion network; the step of inputting the target image set into the target classification model and obtaining the target video scene corresponding to the target image set output by the target classification model includes: inputting the target image set into the feature fusion network of the target classification model, and extracting the shallow features of each target image frame in the target image set through the feature extraction network; inputting the shallow features into the feature fusion network of the target classification model, extracting, through the feature fusion network and based on the shallow features, the deep features, spatial features, and temporal features of each target image frame in the target image set, and outputting the target video scene corresponding to the target image set based on the deep features, the spatial features, and the temporal features; the feature level of the deep features is higher than the feature level of the shallow features.
  • before the step of inputting the target image set into the feature fusion network of the target classification model and extracting the shallow features of each target image frame in the target image set through the feature extraction network, the method further includes: obtaining a pre-training model, and setting the network parameters of the pre-training model as the initial parameters of the feature extraction network; and training the specified layers of the feature extraction network, after the initial parameters have been set, through back propagation, and using the trained feature extraction network as the feature extraction network in the target classification model.
  • the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence; the step of inputting the target image set into the feature fusion network of the target classification model and extracting the shallow features of each target image frame in the target image set through the feature extraction network includes: inputting the target image set into the first feature extraction sub-network, and performing feature extraction on each target image frame in the target image set through the first feature extraction sub-network; and, according to the connection order of the feature extraction sub-networks, inputting the features extracted by the first feature extraction sub-network into the next feature extraction sub-network, and performing feature extraction on those features through the next feature extraction sub-network, until the shallow features of each target image frame in the target image set are obtained.
  • the step of extracting, through the feature fusion network and based on the shallow features, the deep features, spatial features, and temporal features of each target image frame in the target image set, and outputting the target video scene corresponding to the target image set based on the deep features, the spatial features, and the temporal features includes: determining, by the feature fusion network according to the deep features, a first probability set corresponding to the target image set, wherein the first probability set includes multiple first probabilities, and each first probability is used to indicate the probability that the target image set belongs to one video scene; determining, by the feature fusion network according to the spatial features, a second probability set corresponding to the target image set, wherein the second probability set includes multiple second probabilities, and each second probability is used to indicate the probability that the target image set belongs to one video scene; determining, by the feature fusion network according to the temporal features, a third probability set corresponding to the target image set, wherein the third probability set includes multiple third probabilities, and each third probability is used to indicate the probability that the target image set belongs to one video scene; performing a weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each video scene; and determining the video scene corresponding to the maximum weighted probability as the target video scene corresponding to the target image set.
  • the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the inputs of the pooling layer, the second convolutional layer, and the third convolutional layer are all connected to the output of the feature extraction network; the second convolutional layer is a 2D convolutional layer, and the third convolutional layer is a 3D convolutional layer; the step of determining, by the feature fusion network according to the deep features, the first probability set corresponding to the target image set includes: determining, by the pooling layer in the feature fusion network according to the deep features, the first probability set corresponding to the target image set; the step of determining, by the feature fusion network according to the spatial features, the second probability set corresponding to the target image set includes: determining, by the second convolutional layer in the feature fusion network according to the spatial features, the second probability set corresponding to the target image set; and the step of determining, by the feature fusion network according to the temporal features, the third probability set corresponding to the target image set includes: determining, by the third convolutional layer in the feature fusion network according to the temporal features, the third probability set corresponding to the target image set.
  • before the step of inputting the target image set into the target classification model, the method further includes: obtaining an image training set, and inputting the image training set into an initial classification model; calculating the loss function of the initial classification model according to the classification results output by the initial classification model for the image training set; using a back-propagation algorithm to calculate the derivatives of the loss function with respect to the parameters of the initial classification model; and using a gradient descent algorithm and the derivatives to update the parameters of the initial classification model to obtain the target classification model.
  • embodiments of the present disclosure also provide a video classification device, including: a video acquisition module configured to acquire a video to be classified; an image set determining module configured to determine a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein the target image set includes the multiple target image frames; an input module configured to input the target image set into a target classification model and obtain the target video scene corresponding to the target image set output by the target classification model, wherein the target classification model is set to obtain the image features corresponding to each target image frame in the target image set and to determine the target video scene according to the image features corresponding to each target image frame; and a classification determining module configured to determine the classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is set to indicate the video scene of the video to be classified.
  • embodiments of the present disclosure also provide an electronic device, including a processor and a memory; the memory stores a computer program, and when the computer program is run by the processor, the method provided in any one of the implementations of the first aspect is executed.
  • embodiments of the present disclosure provide a computer-readable storage medium having a computer program stored thereon, and when the computer program is run by a processor, the method provided in the first aspect is executed.
  • the video to be classified is first obtained, and the target image set corresponding to the video to be classified (which includes the multiple target image frames) is determined according to multiple target image frames in the video to be classified; the target image set is then input to the target classification model, and the target video scene corresponding to the target image set output by the target classification model is obtained, where the target classification model is used to obtain the image features corresponding to each target image frame in the target image set and to determine the target video scene according to the image features corresponding to each target image frame; finally, the classification result used to indicate the video scene of the video to be classified is determined according to the target video scene corresponding to the target image set.
  • in the embodiments of the present disclosure, the target classification model extracts the image features corresponding to each image frame to determine the target video scene of the target image set, and the classification result indicating the video scene of the video to be classified is further determined on this basis, which can effectively improve the efficiency and accuracy of video classification.
  • FIG. 1 is a schematic flowchart of a video classification method provided by an embodiment of the disclosure
  • FIG. 2 is a schematic structural diagram of a target classification model provided by an embodiment of the disclosure.
  • FIG. 3 is a schematic structural diagram of another target classification model provided by an embodiment of the disclosure.
  • FIG. 4 is a schematic structural diagram of a video classification device provided by an embodiment of the disclosure.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the disclosure.
  • the video classification result obtained by averaging the image frame classification results has the problem of low accuracy. Based on this, the video classification method, device, and electronic equipment provided by the embodiments of the present disclosure can effectively improve the accuracy of the video classification result.
  • a video classification method disclosed in the embodiment of the present disclosure is first introduced in detail.
  • the method may include the following steps:
  • S102 Acquire a video to be classified. The video to be classified can be understood as a video whose video scene is unknown.
  • the video scene can include video application scenes and video space scenes, for example, video application scenes such as sports, variety shows, games, film and television, or animation, and video space scenes such as indoor, forest, or road scenes.
  • the video to be classified may be a video recorded by a user, or a video downloaded from various video apps or video websites.
  • S104 Determine a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified.
  • the target image set includes multiple target image frames.
  • each image frame in the video to be classified can be used as a target image frame to obtain a target image set containing all image frames of the video; alternatively, multiple target image frames can be extracted at preset intervals from the video to be classified, and the extracted image frames are determined as the target image frames included in the target image set, as illustrated in the sketch below.
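  • The following is a minimal sketch of frame sampling at preset intervals, assuming OpenCV (cv2) is available; the function name and the interval value are illustrative assumptions and are not specified in the disclosure.

```python
import cv2

def sample_target_frames(video_path: str, frame_interval: int = 30):
    """Keep one frame out of every `frame_interval` frames as a target image frame."""
    capture = cv2.VideoCapture(video_path)
    target_image_set = []
    index = 0
    while True:
        success, frame = capture.read()
        if not success:
            break
        if index % frame_interval == 0:
            target_image_set.append(frame)  # add the frame to the target image set
        index += 1
    capture.release()
    return target_image_set
```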
  • S106 Input the target image set into the target classification model, and obtain the target video scene corresponding to the target image set output by the target classification model.
  • the target classification model is used to obtain the image characteristics corresponding to each target image frame in the target image set, and to determine the target video scene according to the image characteristics corresponding to each target image frame.
  • the target classification model is obtained by pre-training.
  • the image training set is obtained, where each image in the image training set carries a classification label, and the image training set is input to the initial classification model, so that the initial classification model learns the mapping relationship between each image in the image set and its classification label, thereby obtaining the target classification model used for video classification.
  • the image training set, image verification set, and image test set are obtained separately, and each image in the image training set, image verification set, and image test set carries a classification label.
  • the image training set is used to train the initial classification model to obtain multiple candidate classification models; the image verification set is then input to each candidate classification model to select, from the candidate classification models, the candidate classification model with the better classification effect; finally, the image test set is input to the selected candidate classification model, and if the classification accuracy of the selected candidate classification model on the image test set is higher than the preset threshold, the selected candidate classification model is used as the target classification model.
  • S108 Determine a classification result of the video to be classified according to the target video scene corresponding to the target image set.
  • the classification result is used to indicate the video scene of the video to be classified.
  • the target video scene corresponding to the target image set can be determined as the video scene of the video to be classified, and the classification result of the video to be classified is thereby obtained; for example, assuming that the video scene corresponding to the target image set is scene A, the classification result of the video to be classified will indicate that the video scene of the video to be classified is scene A.
  • the video classification method provided by the embodiments of the present disclosure first obtains the video to be classified, and determines the target image set corresponding to the video to be classified (which includes the multiple target image frames) according to multiple target image frames in the video to be classified; the target image set is then input to the target classification model, and the target video scene corresponding to the target image set output by the target classification model is obtained, where the target classification model is used to obtain the image features corresponding to each target image frame in the target image set and to determine the target video scene according to the image features corresponding to each target image frame; finally, the classification result used to indicate the video scene of the video to be classified is determined according to the target video scene corresponding to the target image set.
  • in the embodiments of the present disclosure, the target classification model extracts the image features corresponding to each image frame to determine the target video scene of the target image set, and the classification result indicating the video scene of the video to be classified is further determined on this basis, which can effectively improve the efficiency and accuracy of video classification.
  • the embodiments of the present disclosure also provide a target classification model, where the target classification model includes a feature fusion network and a feature extraction network connected to the feature fusion network; see FIG. 2, which shows a schematic structural diagram of a target classification model. FIG. 2 illustrates that the target classification model includes a feature extraction network and a feature fusion network connected in sequence.
  • the target classification model can extract the image features corresponding to each target image frame in the target image set, and the image features can include one or more of shallow features, deep features, spatial features, and temporal features.
  • the shallow features can be understood as the basic features of the target image set, such as edges or contours.
  • the deep features can be understood as the abstract features of the target image set, and the feature level of the deep features is higher than the feature level of the shallow features; for example, if a target image frame contains a human face, the abstract feature can be the entire face.
  • the spatial features, that is, spatial relationship features, can be used to characterize the mutual spatial positions or relative directional relationships between multiple targets in an image frame; for example, the relationship between multiple targets includes one or more of a connection relationship, an overlap relationship, or an inclusion relationship.
  • the temporal features can be understood as features of the time-series data formed by the target image frames.
  • the input of the feature extraction network is the target image set corresponding to the video to be classified, and the output of the feature extraction network is the shallow features corresponding to the target image set; the input of the feature fusion network is the above shallow features, and the output of the feature fusion network is the target video scene corresponding to the target image set.
  • the shallow feature of the target image set may be a feature map corresponding to each target image frame in the target image set.
  • for example, assume the target image set contains N target image frames each with a size of 224*224; the input of the feature extraction network is the N images with a size of 224*224, feature extraction is performed on each target image frame in the target image set, and N feature maps with a size of 7*7 are output; these N 7*7 feature maps are the aforementioned shallow features.
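  • The mapping from 224*224 frames to 7*7 feature maps matches a standard ResNet-style backbone truncated before global pooling; the following is a minimal sketch of such a shallow-feature extractor, assuming PyTorch and a recent torchvision, with the choice of resnet50 being an illustrative assumption rather than a value taken from the disclosure.

```python
import torch
import torchvision

# ResNet backbone truncated before global pooling and the classifier head,
# so N frames of size 224*224 map to N feature maps of size 7*7.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

frames = torch.randn(8, 3, 224, 224)           # N = 8 target image frames
shallow_features = feature_extractor(frames)    # shape: (8, 2048, 7, 7)
```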
  • the feature extraction network includes ResNet (Residual Network) or VGGNet (Visual Geometry Group Network); considering that a traditional convolutional neural network (CNN) suffers from loss of feature information during information transmission, the embodiments of the present disclosure adopt a ResNet network or a VGG network. The ResNet network and the VGG network are not only well suited to image processing, but the ResNet network can also effectively protect the integrity of the feature information by directly passing the input to the output, which helps alleviate, to a certain extent, the problem of feature information loss between image frames in the related art.
  • the feature extraction network provided by the embodiments of the present disclosure is obtained by training based on a transfer learning algorithm and a fine-tune algorithm, where the fine-tune algorithm can be understood as freezing the network weights of some layers in the feature extraction network and modifying the network weights of the target layers through a back-propagation algorithm.
  • the pre-training model can be trained using the ImageNet data set; the specified layers of the feature extraction network, after the initial parameters have been set, are then trained through back propagation, and the trained feature extraction network is used as the feature extraction network in the target classification model.
  • using the transfer learning algorithm and the fine-tune algorithm helps improve the training efficiency of pre-training the feature extraction network, reduces the amount of data required from the ImageNet data set, and can also strengthen the generalization ability of the feature extraction network.
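  • A minimal sketch of the transfer-learning / fine-tune idea described above follows, assuming PyTorch and an ImageNet pre-trained ResNet backbone; which layers are frozen and which layer is designated for training is an illustrative assumption, since the disclosure only refers to "specified layers".

```python
import torch
import torchvision

# Initialize from an ImageNet pre-trained model (transfer learning), freeze most
# layers, and leave only a designated layer trainable via back propagation.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
for parameter in backbone.parameters():
    parameter.requires_grad = False           # freeze the pre-trained weights
for parameter in backbone.layer4.parameters():
    parameter.requires_grad = True            # train only the specified layer

trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```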
  • the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence, and each feature extraction sub-network includes a first convolutional layer, a normalization layer, an activation function layer, and a residual connection layer that are connected in sequence; the first convolutional layer is used to perform convolution processing on the input of the feature extraction sub-network, the normalization layer is used to perform batch normalization processing, the activation function layer is used to apply an activation function, and the residual connection layer is used to perform residual connection processing.
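  • As an illustration of one such sub-network, the following sketch chains a convolution, batch normalization, an activation, and a residual connection; the channel count, kernel size, and the use of ReLU are assumptions made for this sketch.

```python
import torch
from torch import nn

class FeatureExtractionSubNetwork(nn.Module):
    """One sub-network: convolution -> batch norm -> activation -> residual add."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.norm(self.conv(x)))
        return out + x  # residual connection back to the sub-network input
```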
  • the embodiments of the present disclosure provide a specific implementation for inputting the target image set into the target classification model and extracting the shallow features of each target image frame in the target image set through the feature extraction network; see the following steps (1) to (2): (1) Input the target image set into the first feature extraction sub-network, and perform feature extraction on each target image frame in the target image set through the first feature extraction sub-network, where the input of the first feature extraction sub-network is each target image frame in the target image set and the output is the first-layer features of each target image frame; (2) According to the connection order of the feature extraction sub-networks, input the features extracted by the first feature extraction sub-network into the next feature extraction sub-network, and perform feature extraction on those features through the next feature extraction sub-network, until the shallow features of each target image frame in the target image set are obtained; for each remaining feature extraction sub-network other than the first one, its input is the output of the preceding feature extraction sub-network.
  • the embodiments of the present disclosure also provide another target classification model.
  • FIG. 3 shows that the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the inputs of the pooling layer, the second convolutional layer, and the third convolutional layer are all connected to the output of the feature extraction network.
  • step (2) can be performed with reference to the following steps 1 to 5:
  • Step 1: The feature fusion network determines the first probability set corresponding to the target image set according to the deep features. The first probability set includes multiple first probabilities, and each first probability is used to indicate the probability that the target image set belongs to one video scene. In specific implementation, the pooling layer in the feature fusion network can determine the first probability set corresponding to the target image set based on the deep features; the deep features can also be understood as the key features of each image frame in the target image set. For example, the first probability set may include a probability of 70% indicating that the target image set belongs to variety shows, a probability of 50% indicating that it belongs to sports, a probability of 20% indicating that it belongs to animation, a probability of 20% indicating that it belongs to games, and so on.
  • Step 2: The feature fusion network determines the second probability set corresponding to the target image set according to the spatial features, where the second probability set includes multiple second probabilities, and each second probability is used to indicate the probability that the target image set belongs to one video scene. In specific implementation, the second convolutional layer in the feature fusion network determines the second probability set corresponding to the target image set according to the spatial features; the second convolutional layer extracts the spatial features of each target image frame in the target image set on the basis of the shallow features, and outputs the second probability set based on the spatial features. Because the spatial features are 2-dimensional features obtained by further extraction on the basis of the above shallow features, the second convolutional layer is a 2D convolutional layer.
  • Step 3: The feature fusion network determines the third probability set corresponding to the target image set according to the temporal features, where the third probability set includes multiple third probabilities, and each third probability is used to indicate the probability that the target image set belongs to one video scene. In specific implementation, the third convolutional layer in the feature fusion network determines the third probability set corresponding to the target image set according to the temporal features; the third convolutional layer extracts the temporal features of the image set based on the shallow features, and outputs the third probability set based on the temporal features. Because the temporal features are 3-dimensional features further extracted on the basis of the above shallow features, the third convolutional layer is a 3D convolutional layer.
  • Step 4 Perform a weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each video scene.
  • a more accurate probability of all possible categories of the video to be classified can be obtained.
  • for example, the first probability, second probability, and third probability corresponding to the variety show scene are weighted to obtain a weighted probability of 75% for the variety show scene, and the first probability, second probability, and third probability corresponding to the game scene are weighted to obtain a weighted probability of 20% for the game scene; in this way, the weighted probability corresponding to each video scene can be obtained.
  • Step 5 Determine the video scene corresponding to the maximum weighted probability as the target video scene corresponding to the target image set. Assuming that the weighted probability of the variety show scene is the largest, the target video scene corresponding to the target image set is the variety show scene.
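  • To make steps 1 to 5 concrete, the following is a minimal sketch of a fusion head in which a pooling branch, a 2D-convolution branch, and a 3D-convolution branch each produce a probability set over video scenes, the three sets are combined by a weighted sum, and the scene with the maximum weighted probability is returned; the channel sizes, branch weights, and classifier shapes are assumptions made for this sketch, not values from the disclosure.

```python
import torch
from torch import nn

class FeatureFusionNetwork(nn.Module):
    """Pooling, 2D-conv, and 3D-conv branches fused by a weighted sum."""

    def __init__(self, channels=2048, num_scenes=4, weights=(0.4, 0.3, 0.3)):
        super().__init__()
        self.weights = weights
        self.pool_head = nn.Linear(channels, num_scenes)               # deep features
        self.conv2d = nn.Conv2d(channels, num_scenes, kernel_size=3, padding=1)
        self.conv3d = nn.Conv3d(channels, num_scenes, kernel_size=3, padding=1)

    def forward(self, shallow):                 # shallow: (N, C, H, W) feature maps
        # Deep features via global pooling -> first probability set.
        deep = shallow.mean(dim=(2, 3))
        p1 = torch.softmax(self.pool_head(deep).mean(dim=0), dim=0)
        # Spatial features via 2D convolution -> second probability set.
        p2 = torch.softmax(self.conv2d(shallow).mean(dim=(0, 2, 3)), dim=0)
        # Temporal features via 3D convolution over the frame axis -> third set.
        clip = shallow.permute(1, 0, 2, 3).unsqueeze(0)                # (1, C, N, H, W)
        p3 = torch.softmax(self.conv3d(clip).mean(dim=(0, 2, 3, 4)), dim=0)
        weighted = self.weights[0] * p1 + self.weights[1] * p2 + self.weights[2] * p3
        return weighted.argmax()                # index of the target video scene
```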
  • through the pooling layer, the second convolutional layer, and the third convolutional layer in the feature fusion network, the embodiment of the present disclosure can fully extract feature information of different levels and sizes from the image set, that is, the above-mentioned deep features, spatial features, and temporal features.
  • before the step of inputting the target image set into the target classification model is performed, the embodiments of the present disclosure also provide a training process for the target classification model shown in FIG. 3; the process can be performed with reference to the following steps a to d:
  • Step a Obtain the image training set, and input the image training set to the initial classification model.
  • an image test set and an image verification set can also be obtained; the image training set is used to train the initial classification model, and the training parameters can include the training rate; the image verification set is used to select, from multiple candidate classification models, the candidate classification model with the better classification effect; and the image test set is used to test the classification ability of the selected candidate classification model.
  • the embodiments of the present disclosure provide a method for obtaining the image training set, the image verification set, and the image test set, including the following steps: (1) Obtain original videos carrying classification tags; considering that there is no public data set for video classification, the categories of the obtained videos should be as wide as possible; for example, for the game category, dozens of related videos of different games can be obtained separately. (2) Divide the original videos into a first video set, a second video set, and a third video set according to a preset ratio. (3) Cut the original videos in the first video set into first videos with a first preset duration, and extract multiple frame images from the first videos to obtain the image training set. (4) Cut the original videos in the second video set into second videos with a second preset duration, and extract multiple frame images from the second videos to obtain the image verification set. (5) Cut the original videos in the third video set into third videos with a third preset duration, and extract multiple frame images from the third videos to obtain the image test set.
  • the aforementioned first preset duration, second preset duration, and third preset duration may each be 5 to 15 seconds, so that the original videos in the first video set, the second video set, and the third video set are divided into short videos of different durations, and multiple frame images are then extracted at equal intervals from the obtained short videos to obtain the above-mentioned image training set, image verification set, and image test set.
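  • As a concrete illustration of this data preparation, the following sketch splits labelled original videos into the three video sets by a preset ratio and computes equal-interval frame indices for one clip; the ratio, the number of frames per clip, and the helper names are assumptions, not values from the disclosure.

```python
import random

def split_original_videos(video_paths, ratio=(0.8, 0.1, 0.1)):
    # Divide the labelled original videos into first/second/third video sets
    # according to a preset ratio (the ratio itself is an illustrative choice).
    paths = list(video_paths)
    random.shuffle(paths)
    n_first = int(len(paths) * ratio[0])
    n_second = int(len(paths) * ratio[1])
    return (paths[:n_first],
            paths[n_first:n_first + n_second],
            paths[n_first + n_second:])

def equal_interval_frame_indices(clip_frame_count, frames_to_extract=8):
    # Indices of frames sampled at equal intervals from one 5-15 second clip.
    step = max(clip_frame_count // frames_to_extract, 1)
    return list(range(0, clip_frame_count, step))[:frames_to_extract]
```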
  • Step b Calculate the loss function of the initial classification model according to the classification result output by the initial classification model for the image training set. Because each image in the image training set carries a classification label, the initial classification model can learn the mapping relationship between the image and the classification label, and multiple candidate classification models with different weights can be obtained by adjusting the training parameters. In specific implementation, first, the loss function of the initial classification model is calculated according to the classification result output by the initial classification model for the image training set, where the loss function uses cross entropy loss.
  • Step c Use the backpropagation algorithm to calculate the derivative of the loss function with respect to the parameters of the initial classification model
  • Step d Use the gradient descent (Adam) algorithm and the derivatives to update the parameters of the initial classification model to obtain the target classification model.
  • in specific implementation, the descent rate α is calculated according to the above derivatives, and the descent rate α is used to update the weight parameters of the initial classification model; because the calculated descent rate α differs, multiple candidate classification models will be obtained.
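  • The following is a minimal sketch of steps b to d: cross-entropy loss on the image training set, derivatives via back propagation, and an Adam (gradient descent) update; the tiny model and random data are stand-ins for the initial classification model and the image training set.

```python
import torch
from torch import nn

num_scenes = 4
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, num_scenes))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(8, 3, 224, 224)             # stand-in training batch
labels = torch.randint(0, num_scenes, (8,))       # stand-in classification labels

logits = model(images)                            # classification results
loss = criterion(logits, labels)                  # Step b: cross-entropy loss
optimizer.zero_grad()
loss.backward()                                   # Step c: derivatives via back propagation
optimizer.step()                                  # Step d: Adam parameter update
```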
  • further, the image verification set can be input to each candidate classification model, and a candidate classification model is selected from the multiple candidate classification models based on the classification results output by each candidate classification model for the image verification set; the image test set is then input to the selected candidate classification model, and the classification accuracy of the selected candidate classification model is calculated based on the classification results it outputs for the image test set; if the classification accuracy is higher than the preset threshold, the selected candidate classification model is determined as the target classification model obtained by training. In other words, the image verification set is used to select the classification model with the better classification effect from the multiple candidate classification models.
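  • A minimal sketch of this model selection follows; the 0.9 accuracy threshold and the helper names are assumptions made for illustration.

```python
import torch

def evaluate_accuracy(model, images, labels):
    # Fraction of images whose predicted scene matches the classification label.
    with torch.no_grad():
        predictions = model(images).argmax(dim=1)
    return (predictions == labels).float().mean().item()

def select_target_model(candidates, verification_data, test_data, threshold=0.9):
    # Pick the candidate with the best verification accuracy, then accept it as
    # the target classification model only if its test accuracy clears the threshold.
    best = max(candidates, key=lambda m: evaluate_accuracy(m, *verification_data))
    return best if evaluate_accuracy(best, *test_data) >= threshold else None
```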
  • the images in the image test set are derived from 4 types of videos, namely the game category, the film and television category, the variety show category, and the sports category, and the number of videos in each category is 40.
  • the test results are shown in Table 1 below, and the average accuracy of the classification results has reached more than 90%.
  • the embodiments of the present disclosure provide a specific application example of a target classification model.
  • the target classification model is used to implement video coding, and in a specific implementation manner, a segmented video stream is obtained.
  • the feature fusion layer fuses the feature parameters of multiple video frame images to obtain the fusion features of the multiple video frame images, classifies the fusion features, obtains the video scene corresponding to the multiple video frame images as a whole, and determines that video scene as the first video scene of the first segmented video stream. The video scene is usually expressed as a probability value; for example, the probability that the video scene corresponding to a certain video frame image is animation is 80%, and the probability of another video scene is 20%. In this case, the video scene with the highest probability value can be determined as the first video scene of the first segmented video stream; or, the sum of the probabilities of each video scene can be calculated over the multiple video frame images, and the video scene with the largest probability sum is determined as the first video scene of the first segmented video stream.
  • the embodiments of the present disclosure use the pooling layer, the 2D convolutional layer, and the 3D convolutional layer in the feature fusion network to extract the feature information of the image set more comprehensively; compared with existing video classification methods, which ignore the feature information between different frame images, the embodiments of the present disclosure adopt a feature fusion network to better extract and fuse the feature information between different frame images in the image set, and can effectively improve the accuracy of the video classification result.
  • an embodiment of the present disclosure also provides a video classification device.
  • the device may include the following parts:
  • the video acquisition module 402 is configured to acquire videos to be classified.
  • the image set determining module 404 is configured to determine a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein the target image set includes multiple target image frames.
  • the input module 406 is configured to input the target image set into the target classification model and obtain the target video scene corresponding to the target image set output by the target classification model, wherein the target classification model is set to obtain the image features corresponding to each target image frame in the target image set and to determine the target video scene according to the image features corresponding to each target image frame.
  • the classification determining module 408 is configured to determine the classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is set to indicate the video scene of the video to be classified.
  • in the video classification device, the target classification model extracts the image features corresponding to each image frame to determine the target video scene of the target image set, and the classification result indicating the video scene of the video to be classified is further determined on this basis, which can effectively improve the efficiency and accuracy of video classification.
  • the image features include one or more of shallow features, deep features, spatial features, and temporal features;
  • the target classification model includes a feature fusion network, and a feature extraction network connected to the feature fusion network; the foregoing input module 406 is further configured to: input the target image set into the feature fusion network of the target classification model, and extract the shallow features of each target image frame in the target image set through the feature extraction network; and input the shallow features into the feature fusion network of the target classification model, extract, through the feature fusion network and based on the shallow features, the deep features, spatial features, and temporal features of each target image frame in the target image set, and output the target video scene corresponding to the target image set based on the deep features, spatial features, and temporal features;
  • the feature level of deep features is higher than that of shallow features.
  • the above-mentioned video classification device further includes a first training module configured to: obtain a pre-training model, and set the network parameters of the pre-training model as the initial parameters of the feature extraction network; and train the specified layers of the feature extraction network, after the initial parameters have been set, through back propagation, and use the trained feature extraction network as the feature extraction network in the target classification model.
  • the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence; the above-mentioned input module 406 is further configured to: input the target image set into the first feature extraction sub-network in the feature fusion network of the target classification model , Perform feature extraction on each target image frame in the target image set through the first feature sub-network; according to the connection order of the feature extraction sub-network, input the features extracted by the first feature extraction sub-network to the next feature extraction sub-network, Feature extraction is performed on the features extracted by the first feature extraction sub-network through the next feature sub-network, until the shallow features of each target image frame in the target image set are obtained.
  • the above-mentioned input module 406 is further configured to: determine, by the feature fusion network according to the deep features, the first probability set corresponding to the target image set, wherein the first probability set includes a plurality of first probabilities, and each first probability is set to indicate the probability that the target image set belongs to a video scene; determine, by the feature fusion network according to the spatial features, the second probability set corresponding to the target image set, wherein the second probability set includes a plurality of second probabilities, and each second probability is set to indicate the probability that the target image set belongs to a video scene; determine, by the feature fusion network according to the temporal features, the third probability set corresponding to the target image set, wherein the third probability set includes a plurality of third probabilities, and each third probability is set to indicate the probability that the target image set belongs to a video scene; perform a weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each video scene; and determine the video scene corresponding to the maximum weighted probability as the target video scene corresponding to the target image set.
  • the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the inputs of the pooling layer, the second convolutional layer, and the third convolutional layer are all connected to the output of the feature extraction network; the second convolutional layer is a 2D convolutional layer, and the third convolutional layer is a 3D convolutional layer. The input module 406 is further set to: determine, by the pooling layer in the feature fusion network according to the deep features, the first probability set corresponding to the target image set; determine, by the second convolutional layer in the feature fusion network according to the spatial features, the second probability set corresponding to the target image set; and determine, by the third convolutional layer in the feature fusion network according to the temporal features, the third probability set corresponding to the target image set.
  • the above-mentioned video classification device further includes a second training module configured to: obtain an image training set, and input the image training set to the initial classification model; according to the classification result output by the initial classification model for the image training set, Calculate the loss function of the initial classification model; use the backpropagation algorithm to calculate the derivative of the loss function relative to the parameters of the initial classification model; use the gradient descent algorithm and the derivative to update the parameters of the initial classification model to obtain the target classification model.
  • the embodiments of the present disclosure further provide an electronic device; the electronic device includes a processor and a storage device, a computer program is stored on the storage device, and when the computer program is run by the processor, the method described in any one of the above embodiments is executed.
  • the electronic device 100 includes a processor 50, a memory 51, a bus 52, and a communication interface 53; the processor 50, the communication interface 53, and the memory 51 are connected through the bus 52; the processor 50 is used to execute an executable module, such as a computer program, stored in the memory 51.
  • the memory 51 may include a high-speed random access memory (RAM, Random Access Memory), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
  • the communication connection between the system network element and at least one other network element is realized through at least one communication interface 53 (which may be wired or wireless), and the Internet, a wide area network, a local network, a metropolitan area network, etc. may be used.
  • the bus 52 may be an ISA bus, a PCI bus, an EISA bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on; for ease of presentation, only one bidirectional arrow is used in FIG. 5, but this does not mean that there is only one bus or one type of bus.
  • the memory 51 is used to store a program, and the processor 50 executes the program after receiving the execution instruction.
  • the method executed by the apparatus disclosed in any of the foregoing embodiments of the present disclosure can be applied to the processor 50, or implemented by the processor 50.
  • the processor 50 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 50 or instructions in the form of software.
  • the above-mentioned processor 50 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present disclosure may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 51, and the processor 50 reads the information in the memory 51, and completes the steps of the above method in combination with its hardware.
  • the computer program product of the readable storage medium provided by the embodiments of the present disclosure includes a computer-readable storage medium storing program code, and the instructions included in the program code can be used to execute the method described in the foregoing method embodiments; for the specific implementation, please refer to the foregoing method embodiments, which will not be repeated here.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present application essentially or the part that contributes to the related technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium. It includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or other media that can store program code.
  • in summary, the image features corresponding to each image frame can be extracted through the target classification model to determine the target video scene of the target image set, and on this basis the classification result indicating the video scene of the video to be classified is further determined, which can effectively improve the efficiency and accuracy of video classification.

Abstract

Provided are a video classification method and apparatus, and an electronic device. The method comprises: acquiring a video to be classified; according to a plurality of target image frames in said video, determining a target image set corresponding to said video, wherein the target image set comprises the plurality of target image frames; inputting the target image set into a target classification model, and obtaining a target video scenario that is output by the target classification model and corresponds to the target image set, wherein the target classification model is used to acquire an image feature corresponding to each target image frame in the target image set, and to determine the target video scenario according to the image feature corresponding to each target image frame; and according to the target video scenario corresponding to the target image set, determining a classification result for said video, wherein the classification result is used to indicate a video scenario of said video. According to the present disclosure, the accuracy of a video classification result can be effectively improved.

Description

Video classification method, device and electronic equipment
This disclosure claims priority to the Chinese patent application No. 201911059325.6, filed with the Chinese Patent Office on October 31, 2019 and entitled "Video Classification Method, Apparatus, and Electronic Equipment", the entire content of which is incorporated herein by reference.
Technical field
The present disclosure relates to the field of deep learning technology, and in particular to a video classification method, device and electronic equipment.
Background art
In recent years, with the development of various video applications (APPs), the number of videos on the Internet has grown rapidly, and the content is rich and diverse. Classifying videos not only makes it convenient for users to find the videos they need, but also helps extract the information conveyed in the videos. At present, when classifying a video, it is necessary to determine the category to which each image frame extracted from the video belongs, and then calculate the average of the classification results of the extracted image frames to obtain the final video classification result. The inventor found through research that the accuracy of determining the video classification result by averaging the image frame classification results is not high.
Summary of the invention
In view of this, the purpose of the embodiments of the present disclosure is to provide a video classification method, device, and electronic equipment, which can effectively improve the accuracy of the video classification result.
In a first aspect, embodiments of the present disclosure provide a video classification method, including: acquiring a video to be classified; determining a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein the target image set includes the multiple target image frames; inputting the target image set into a target classification model, and obtaining the target video scene corresponding to the target image set output by the target classification model, wherein the target classification model is used to obtain the image features corresponding to each target image frame in the target image set and to determine the target video scene according to the image features corresponding to each target image frame; and determining a classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is used to indicate the video scene of the video to be classified.
In one embodiment, the image features include one or more of shallow features, deep features, spatial features, and temporal features; the target classification model includes a feature fusion network, and a feature extraction network connected to the feature fusion network; the step of inputting the target image set into the target classification model and obtaining the target video scene corresponding to the target image set output by the target classification model includes: inputting the target image set into the feature fusion network of the target classification model, and extracting the shallow features of each target image frame in the target image set through the feature extraction network; inputting the shallow features into the feature fusion network of the target classification model, extracting, through the feature fusion network and based on the shallow features, the deep features, spatial features, and temporal features of each target image frame in the target image set, and outputting the target video scene corresponding to the target image set based on the deep features, the spatial features, and the temporal features; the feature level of the deep features is higher than the feature level of the shallow features.
在一种实施方式中,在所述将所述目标图像集输入至目标分类模型的特征融合网络,通过所述特征提取网络提取所述目标图像集中每个目标图像帧的浅层特征的步骤之前,所述方法还包括:获取预训练模型,将所述预训练模型的网络参数设置为所述特征提取网络的初始参数;通过反向传播对设置初始参数后的特征提取网络的指定层进行训练,并将训练后的特征提取网络作为所述目标分类模型中的特征提取网络。In one embodiment, before the step of inputting the target image set into the feature fusion network of the target classification model, and extracting the shallow features of each target image frame in the target image set through the feature extraction network , The method further includes: obtaining a pre-training model, setting the network parameters of the pre-training model as the initial parameters of the feature extraction network; and training the specified layer of the feature extraction network after the initial parameters are set by back propagation , And use the trained feature extraction network as the feature extraction network in the target classification model.
在一种实施方式中,所述特征提取网络包括依次连接的多个特征提取子网络;所述将所述目标图像集输入至目标分类模型的特征融合网络,通过所述特征提取网络提取所述目标图像集中每个目标图像帧的浅层特征的步骤,包括:将所述目标图像集输入至目标分类模型的特征融合网络中第一个特征提取子网络,通过所述第一个特征子网络对所述目标图像集中每个目标图像帧进行特征提取;按照所述特征提取子网络的连接顺序,将所述第一个特征提取子网络提取的特征输入至下一特征提取子网络,通过所述下一特征子网络对所述第一个特征提取子网络提取的特征进行特征提取,直至得到所述目标图像集中每个目标图像帧的浅层特征。In one embodiment, the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence; the feature fusion network that inputs the target image set into the target classification model, and extracts the feature extraction network through the feature extraction network. The step of the shallow features of each target image frame in the target image set includes: inputting the target image set into the first feature extraction sub-network in the feature fusion network of the target classification model, and passing the first feature sub-network Perform feature extraction on each target image frame in the target image set; according to the connection sequence of the feature extraction sub-network, input the features extracted by the first feature extraction sub-network to the next feature extraction sub-network, and pass all The next feature sub-network performs feature extraction on the features extracted by the first feature extraction sub-network until the shallow features of each target image frame in the target image set are obtained.
在一种实施方式中,所述通过所述特征融合网络基于所述浅层特征提取所述目标图像集中每个目标图像帧的深层特征、空间特征和时序特征,并基于所述深层特征、所述空间特征和所述时序特征输出所述目标图像集对应的目标视频场景的步骤,包括:所述特征融合网络根据所述深层特征,确定所述目标图像集对应的第一概率集,其中,所述第一概率集中包括多个第一概率,每个所 述第一概率用于指示所述目标图像集属于一种视频场景的概率;所述特征融合网络根据所述空间特征,确定所述目标图像集对应的第二概率集,其中,所述第二概率集中包括多个第二概率,每个所述第二概率用于指示所述目标图像集属于一种视频场景的概率;所述特征融合网络根据所述时序特征,确定所述目标图像集对应的第三概率集,其中,所述第三概率集中包括多个第三概率,每个所述第三概率用于指示所述目标图像集属于一种视频场景的概率;对同一所述视频场景对应的所述第一概率、所述第二概率和所述第三概率进行加权计算,得到各个所述视频场景对应的加权概率;将最大加权概率对应的视频场景,确定为所述目标图像集对应的目标视频场景。In one embodiment, the feature fusion network extracts the deep features, spatial features, and time series features of each target image frame in the target image set based on the shallow features, and based on the deep features, The step of outputting the target video scene corresponding to the target image set by the spatial feature and the time sequence feature includes: the feature fusion network determines the first probability set corresponding to the target image set according to the deep features, wherein, The first probability set includes multiple first probabilities, each of the first probabilities is used to indicate the probability that the target image set belongs to a video scene; the feature fusion network determines the A second probability set corresponding to the target image set, wherein the second probability set includes a plurality of second probabilities, and each of the second probabilities is used to indicate the probability that the target image set belongs to a kind of video scene; The feature fusion network determines a third probability set corresponding to the target image set according to the time sequence feature, wherein the third probability set includes a plurality of third probabilities, and each third probability is used to indicate the target The probability that the image set belongs to a video scene; performing weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each of the video scenes; The video scene corresponding to the maximum weighted probability is determined as the target video scene corresponding to the target image set.
在一种实施方式中,所述特征融合网络包括池化层、第二卷积层和第三卷积层;所述池化层、所述第二卷积层和所述第三卷积层的输入均与所述特征提取网络的输出相连;所述第二卷积层为2D卷积层;所述第三卷积层为3D卷积层;其中,所述特征融合网络根据所述深层特征,确定所述目标图像集对应的第一概率集的步骤,包括:所述特征融合网络中的所述池化层根据所述深层特征,确定所述目标图像集对应的所述第一概率集;所述特征融合网络根据所述空间特征,确定所述目标图像集对应的第二概率集的步骤,包括:所述特征融合网络中的所述第二卷积层根据所述空间特征,确定所述目标图像对应的第二概率集;所述特征融合网络根据所述时序特征,确定所述目标图像集对应的第三概率集的步骤,包括:所述特征融合网络中的所述第三卷积层根据所述时序特征,确定所述目标图像对应的第三概率集。In one embodiment, the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the pooling layer, the second convolutional layer, and the third convolutional layer The input of is connected to the output of the feature extraction network; the second convolutional layer is a 2D convolutional layer; the third convolutional layer is a 3D convolutional layer; wherein, the feature fusion network is based on the deep layer Feature, the step of determining the first probability set corresponding to the target image set includes: the pooling layer in the feature fusion network determines the first probability corresponding to the target image set according to the deep features The step of determining the second probability set corresponding to the target image set by the feature fusion network according to the spatial feature includes: the second convolutional layer in the feature fusion network according to the spatial feature, Determining the second probability set corresponding to the target image; the feature fusion network determining the third probability set corresponding to the target image set according to the time series feature includes: the first probability set in the feature fusion network The three-convolutional layer determines the third probability set corresponding to the target image according to the time sequence feature.
在一种实施方式中,在将所述目标图像集输入至目标分类模型的步骤之前,所述方法还包括:获取图像训练集,并将所述图像训练集输入至初始分类模型;根据所述初始分类模型针对所述图像训练集输出的分类结果,计算所述初始分类模型的损失函数;利用反向传播算法计算所述损失函数相对于所述初始分类模型的参数的导数;利用梯度下降算法和所述导数更新所述初始分类模型的参数,得到目标分类模型。In one embodiment, before the step of inputting the target image set to the target classification model, the method further includes: obtaining an image training set, and inputting the image training set to the initial classification model; The initial classification model calculates the loss function of the initial classification model based on the classification results output by the image training set; uses a backpropagation algorithm to calculate the derivative of the loss function with respect to the parameters of the initial classification model; uses a gradient descent algorithm And the derivative to update the parameters of the initial classification model to obtain the target classification model.
第二方面,本公开实施例还提供一种视频分类装置,包括:视频获取模块,设置为获取待分类视频;图像集确定模块,设置为根据所述待分类视频中的多个目标图像帧,确定所述待分类视频对应的目标图像集,其中,所述目标图像 集中包括所述多个目标图像帧;输入模块,设置为将所述目标图像集输入至目标分类模型,并获得所述目标分类模型输出的所述目标图像集对应的目标视频场景,其中,所述目标分类模型设置为获取所述目标图像集中每个目标图像帧对应的图像特征,并根据所述每个目标图像帧对应的图像特征确定所述目标视频场景;分类确定模块,设置为根据所述目标图像集对应的目标视频场景,确定所述待分类视频的分类结果,其中,所述分类结果设置为指示所述待分类视频的视频场景。In a second aspect, embodiments of the present disclosure also provide a video classification device, including: a video acquisition module configured to acquire a video to be classified; an image set determining module configured to obtain multiple target image frames in the video to be classified, Determine the target image set corresponding to the video to be classified, wherein the target image set includes the multiple target image frames; an input module is configured to input the target image set into a target classification model, and obtain the target The target video scene corresponding to the target image set output by the classification model, wherein the target classification model is set to obtain the image feature corresponding to each target image frame in the target image set, and corresponding to each target image frame The target video scene is determined by the image feature; the classification determination module is configured to determine the classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is set to indicate the to-be-classified video The video scene of the classified video.
第三方面,本公开实施例还提供一种电子设备,包括处理器和存储器;所述存储器上存储有计算机程序,所述计算机程序在被所述处理器运行时执行如第一方面提供的任一项所述的方法。In a third aspect, embodiments of the present disclosure also provide an electronic device, including a processor and a memory; the memory stores a computer program, and the computer program executes any of the tasks provided in the first aspect when run by the processor. The method described in one item.
第四方面,本公开实施例提供了一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,所述计算机程序被处理器运行时执行上述第一方面提供的方法。In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and the computer program executes the method provided in the first aspect when the computer program is run by a processor.
本公开实施例带来了以下有益效果:The embodiments of the present disclosure bring the following beneficial effects:
本公开实施例提供的视频分类方法、装置及电子设备,首先获取待分类视频,根据待分类视频中的多个目标图像帧(包括多个目标图像帧),确定待分类视频对应的目标图像集,通过将目标图像集输入至目标分类模型,获得目标分类模型输出的目标图像集对应的目标视频场景,其中,目标分类模型用于获取目标图像集中每个目标图像帧对应的图像特征,并根据每个目标图像帧对应的图像特征确定目标视频场景,最终根据目标图像集对应的目标视频场景,确定用于指示待分类视频的视频场景的分类结果。相比于传统的视频分类方法,本公开实施例通过目标分类模型提取每个图像帧对应的图像特征确定了目标图像集的目标视频场景,并在此基础上进一步确定了待分类视频的视频场景的分类结果,可有效提升视频分类效率和准确率。In the video classification method, device and electronic equipment provided by the embodiments of the present disclosure, the video to be classified is first obtained, and the target image set corresponding to the video to be classified is determined according to multiple target image frames (including multiple target image frames) in the video to be classified , By inputting the target image set to the target classification model, the target video scene corresponding to the target image set output by the target classification model is obtained, where the target classification model is used to obtain the image features corresponding to each target image frame in the target image set, and according to The image feature corresponding to each target image frame determines the target video scene, and finally, according to the target video scene corresponding to the target image set, the classification result used to indicate the video scene of the video to be classified is determined. Compared with the traditional video classification method, the embodiment of the present disclosure determines the target video scene of the target image set by extracting the image feature corresponding to each image frame by the target classification model, and further determines the video scene of the video to be classified on this basis. The classification results can effectively improve the efficiency and accuracy of video classification.
本公开实施例的其他特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本公开实施例而了解。本公开实施例的目的和其他优点在说明书、权利要求书以及附图中所特别指出的结构来实现和获得。Other features and advantages of the embodiments of the present disclosure will be described in the following specification, and partly become obvious from the specification, or understood by implementing the embodiments of the present disclosure. The objectives and other advantages of the embodiments of the present disclosure are realized and obtained by the structures specifically pointed out in the specification, claims and drawings.
为使本公开实施例的上述目的、特征和优点能更明显易懂,下文特举较佳实施例,并配合所附附图,作详细说明如下。In order to make the above-mentioned objects, features and advantages of the embodiments of the present disclosure more obvious and understandable, preferred embodiments are specifically described below in conjunction with the accompanying drawings, which are described in detail as follows.
附图说明Description of the drawings
为了更清楚的说明本公开实施例或相关技术的技术方案,下面将对实施例或相关技术描述中所需要使用的附图作简单的介绍,显而易见地,下面描述中的附图仅仅是本公开实施例的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure or related technologies, the following will briefly introduce the accompanying drawings that need to be used in the description of the embodiments or related technologies. Obviously, the accompanying drawings in the following description are only for the present disclosure. For some of the embodiments of the embodiments, for those of ordinary skill in the art, other drawings may be obtained based on these drawings without creative work.
图1为本公开实施例提供的一种视频分类方法的流程示意图;FIG. 1 is a schematic flowchart of a video classification method provided by an embodiment of the disclosure;
图2为本公开实施例提供的一种目标分类模型的结构示意图;2 is a schematic structural diagram of a target classification model provided by an embodiment of the disclosure;
图3为本公开实施例提供的另一种目标分类模型的结构示意图;3 is a schematic structural diagram of another target classification model provided by an embodiment of the disclosure;
图4为本公开实施例提供的一种视频分类装置的结构示意图;4 is a schematic structural diagram of a video classification device provided by an embodiment of the disclosure;
图5为本公开实施例提供的一种电子设备的结构示意图。FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the disclosure.
具体实施方式Detailed ways
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合实施例对本公开实施例的技术方案进行清楚、完整地描述,显然,所描述的实施例是本公开一部分实施例,而不是全部的实施例。基于本公开中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本公开实施例保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure will be described clearly and completely in conjunction with the embodiments. Obviously, the described embodiments are part of the embodiments of the present disclosure, and Not all examples. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the embodiments of the present disclosure.
考虑到通过求取图像帧分类结果的平均值,并根据平均值得到的视频分类结果存在准确度较低的问题,基于此,本公开实施例提供的一种视频分类、装置及电子设备,可以有效提高视频分类结果的准确度。Considering that the video classification result obtained by obtaining the average value of the image frame classification results has the problem of low accuracy. Based on this, the video classification, device and electronic equipment provided by the embodiments of the present disclosure can be Effectively improve the accuracy of video classification results.
为便于对本实施例进行理解,首先对本公开实施例所公开的一种视频分类方法进行详细介绍,参见图1所示的一种视频分类方法的流程示意图,该方法可以包括以下步骤:In order to facilitate the understanding of this embodiment, a video classification method disclosed in the embodiment of the present disclosure is first introduced in detail. Refer to the flowchart of a video classification method shown in FIG. 1, the method may include the following steps:
S102,获取待分类视频。S102: Obtain a video to be classified.
待分类视频可以理解为视频场景未知的视频，其中，视频场景可以包括视频应用场景和视频空间场景等多种类别，例如体育、综艺、游戏、影视或动漫等视频应用场景，室内、森林或马路等视频空间场景。在一些实施方式中，待分类视频可以为用户录制的视频，也可以为从各类视频APP或视频网站中下载的视频。The video to be classified can be understood as a video whose video scene is unknown. The video scene may fall into multiple categories, including video application scenes such as sports, variety shows, games, film and television, or animation, and video spatial scenes such as indoor, forest, or road scenes. In some embodiments, the video to be classified may be a video recorded by a user, or a video downloaded from various video apps or video websites.
S104,根据待分类视频中的多个目标图像帧,确定待分类视频对应的目标图像集。S104: Determine a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified.
其中，目标图像集中包括多个目标图像帧，在一种实施方式中，可以将待分类视频中的每个图像帧均作为目标图像帧，得到包含有视频所有图像帧的目标图像集，也可以从待分类视频中按照预设间隔抽取多张目标图像帧，并将抽取的目标图像帧确定为目标图像集中包括的目标图像帧。The target image set includes multiple target image frames. In one embodiment, each image frame in the video to be classified can be used as a target image frame, so as to obtain a target image set containing all image frames of the video; alternatively, multiple target image frames can be extracted at preset intervals from the video to be classified, and the extracted target image frames are determined as the target image frames included in the target image set.
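As an illustration of this frame extraction, the following is a minimal sketch assuming OpenCV is available; the function name, sampling interval and frame size are illustrative assumptions and not part of the original disclosure:

```python
import cv2

def sample_target_frames(video_path, interval=30, size=(224, 224)):
    """Sample one target image frame every `interval` frames and resize it,
    returning the target image set as a list of frames."""
    capture = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % interval == 0:
            frames.append(cv2.resize(frame, size))
        index += 1
    capture.release()
    return frames
```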
S106,将目标图像集输入至目标分类模型,并获得目标分类模型输出的目标图像集对应的目标视频场景。S106: Input the target image set into the target classification model, and obtain the target video scene corresponding to the target image set output by the target classification model.
其中，目标分类模型用于获取目标图像集中每个目标图像帧对应的图像特征，并根据每个目标图像帧对应的图像特征确定目标视频场景，目标分类模型是预先训练得到的，在一种实施方式中，获取图像训练集，其中，图像训练集中的每张图像均携带有分类标签，将该图像训练集输入至初始分类模型，以使初始分类模型学习图像集中每张图像与分类标签之间的映射关系，从而得到用于视频分类的目标分类模型。在另一种实施方式中，分别获取图像训练集、图像验证集和图像测试集，且图像训练集、图像验证集和图像测试集中的每张图像均携带有分类标签，首先利用图像训练集训练初始分类模型，得到多个候选分类模型，再将图像验证集输入至各候选分类模型，以从各候选分类模型中选取分类效果较佳的一个候选分类模型，最后将图像测试集输入至选取的候选分类模型中，若选取的候选分类模型针对图像测试集的分类准确率高于预设阈值，则将选取的候选分类模型作为目标分类模型。The target classification model is used to obtain the image features corresponding to each target image frame in the target image set, and to determine the target video scene according to the image features corresponding to each target image frame. The target classification model is obtained by pre-training. In one embodiment, an image training set is obtained, where each image in the image training set carries a classification label, and the image training set is input to the initial classification model, so that the initial classification model learns the mapping relationship between each image in the image set and its classification label, thereby obtaining the target classification model used for video classification. In another embodiment, an image training set, an image verification set and an image test set are obtained separately, and each image in the image training set, image verification set and image test set carries a classification label. First, the initial classification model is trained with the image training set to obtain multiple candidate classification models; then the image verification set is input to each candidate classification model to select the candidate classification model with the better classification effect; finally, the image test set is input to the selected candidate classification model, and if the classification accuracy of the selected candidate classification model on the image test set is higher than a preset threshold, the selected candidate classification model is used as the target classification model.
S108,根据目标图像集对应的目标视频场景,确定待分类视频的分类结果。S108: Determine a classification result of the video to be classified according to the target video scene corresponding to the target image set.
其中,分类结果用于指示待分类视频的视频场景,在实际应用中,可以将目标图像集对应的目标视频场景确定为待分类视频的视频场景,进而可以得到待分类视频的分类结果,例如,假设目标图像集对应的视频分类场景为场景A,则待分类视频的分类结果将指示待分类视频的视频场景为场景A。Among them, the classification result is used to indicate the video scene of the video to be classified. In practical applications, the target video scene corresponding to the target image set can be determined as the video scene of the video to be classified, and then the classification result of the video to be classified can be obtained, for example, Assuming that the video classification scene corresponding to the target image set is scene A, the classification result of the video to be classified will indicate that the video scene of the video to be classified is scene A.
本公开实施例提供的上述视频分类方法,首先获取待分类视频,根据待分类视频中的多个目标图像帧(包括多个目标图像帧),确定待分类视频对应的目标图像集,通过将目标图像集输入至目标分类模型,获得目标分类模型输出 的目标图像集对应的目标视频场景,其中,目标分类模型用于获取目标图像集中每个目标图像帧对应的图像特征,并根据每个目标图像帧对应的图像特征确定目标视频场景,最终根据目标图像集对应的目标视频场景,确定用于指示待分类视频的视频场景的分类结果。相比于传统的视频分类方法,本公开实施例通过目标分类模型提取每个图像帧对应的图像特征确定了目标图像集的目标视频场景,并在此基础上进一步确定了待分类视频的视频场景的分类结果,可有效提升视频分类效率和准确率。The video classification method provided by the embodiments of the present disclosure first obtains the video to be classified, and determines the target image set corresponding to the video to be classified according to multiple target image frames (including multiple target image frames) in the video to be classified. The image set is input to the target classification model, and the target video scene corresponding to the target image set output by the target classification model is obtained. The target classification model is used to obtain the image features corresponding to each target image frame in the target image set, and according to each target image The image feature corresponding to the frame determines the target video scene, and finally, according to the target video scene corresponding to the target image set, the classification result used to indicate the video scene of the video to be classified is determined. Compared with the traditional video classification method, the embodiment of the present disclosure determines the target video scene of the target image set by extracting the image feature corresponding to each image frame by the target classification model, and further determines the video scene of the video to be classified on this basis. The classification results can effectively improve the efficiency and accuracy of video classification.
为便于对上述实施例提供的视频方法进行理解,本公开实施例还提供了一种目标分类模型,其中,目标分类模型包括特征融合网络,以及与特征融合网络连接的特征提取网络,参见图2所示的一种目标分类模型的结构示意图,图2示意出目标分类模型包括依次连接的特征提取网络和特征融合网络。In order to facilitate the understanding of the video method provided in the above embodiments, the embodiments of the present disclosure also provide a target classification model, where the target classification model includes a feature fusion network and a feature extraction network connected to the feature fusion network, see FIG. 2 A schematic structural diagram of a target classification model is shown. Fig. 2 illustrates that the target classification model includes a feature extraction network and a feature fusion network connected in sequence.
在实际应用中,目标分类模型可以提取目标图像集中每个目标图像帧对应的图像特征,图像特征又可以包括浅层特征、深层特征、空间特征和时序特征中的一种或多种。其中,浅层特征可以理解为目标图像集的基础特征,诸如边缘或轮廓等;深层特征可以理解为目标图像集的抽象特征,深层特征的特征层次高于浅层特征的特征层次,例如,若目标图像帧中包含有人脸,则抽象特征可以为整个脸型;空间特征也即空间关系特征,可以用于表征图像帧中多个目标之间的相互的位置空间或相对方向关系等,例如多个目标之间的关系包括连接关系、交叠关系或包含关系中的一种或多种;时序特征可以理解为目标图像帧的时序数据的特征。In practical applications, the target classification model can extract the image features corresponding to each target image frame in the target image set, and the image features can include one or more of shallow features, deep features, spatial features, and temporal features. Among them, the shallow features can be understood as the basic features of the target image set, such as edges or contours; the deep features can be understood as the abstract features of the target image set, and the feature level of the deep features is higher than the feature level of the shallow features, for example, if If the target image frame contains a human face, the abstract feature can be the entire face; the spatial feature, that is, the spatial relationship feature, can be used to characterize the mutual position space or relative direction relationship between multiple targets in the image frame, such as multiple The relationship between the targets includes one or more of a connection relationship, an overlap relationship, or an inclusion relationship; the timing characteristics can be understood as characteristics of the timing data of the target image frame.
在图2的基础上,上述特征提取网络的输入为待分类视频对应的目标图像集,特征提取网络的输出为目标图像集对应的浅层特征;特征融合网络的输入为上述目标图像集对应的浅层特征,特征融合网络的输出为目标图像集对应的目标视频场景。基于上述目标分类模型的网络结构,上述步骤S106可以参照如下步骤(一)至(二)执行:On the basis of Figure 2, the input of the feature extraction network is the target image set corresponding to the video to be classified, the output of the feature extraction network is the shallow feature corresponding to the target image set; the input of the feature fusion network is the target image set corresponding to the above For shallow features, the output of the feature fusion network is the target video scene corresponding to the target image set. Based on the network structure of the foregoing target classification model, the foregoing step S106 can be performed with reference to the following steps (1) to (2):
(一)将目标图像集输入至目标分类模型的特征融合网络,通过特征提取网络提取目标图像集中每个目标图像帧的浅层特征。(1) Input the target image set into the feature fusion network of the target classification model, and extract the shallow features of each target image frame in the target image set through the feature extraction network.
其中,目标图像集的浅层特征可以为目标图像集中每个目标图像帧对应的特征图。例如,目标图像集中包含有N张尺寸为224*224的目标图像帧,此 时特征提取网络的输入为N张尺寸为224*224的图像,对目标图像集中的每个目标图像帧进行特征提取后,输出N张尺寸为7*7的特征图,该N张尺寸为7*7的特征图即为前述浅层特征。Among them, the shallow feature of the target image set may be a feature map corresponding to each target image frame in the target image set. For example, the target image set contains N target image frames with a size of 224*224. At this time, the input of the feature extraction network is N images with a size of 224*224, and feature extraction is performed on each target image frame in the target image set. Then, output N feature maps with a size of 7*7, and the N feature maps with a size of 7*7 are the aforementioned shallow features.
在一种实施方式中,特征提取网络包括ResNet(Residual Networks,残差网络)或VGGNet(Visual Geometry Group Network,视觉几何组网络),考虑到传统的卷积神经网络(CNN,Convolutional Neural Networks)在信息传递时存在特征信息丢失的问题,本公开实施例采用ResNet网络或VGG网络,其中,Resnet网络和VGG网络不仅更为适合进行图像处理,而且Resnet网络通过直接将输入传输至输出,可以有效保护特征信息的完整性,在一定程度上有助于缓解相关技术中损失各帧图像之间特征信息的问题。In one embodiment, the feature extraction network includes ResNet (Residual Networks, residual network) or VGGNet (Visual Geometry Group Network, visual geometry group network), taking into account the traditional convolutional neural network (CNN, Convolutional Neural Networks) in There is a problem of loss of characteristic information during information transmission. The embodiments of the present disclosure adopt ResNet network or VGG network. Among them, Resnet network and VGG network are not only more suitable for image processing, but also Resnet network can effectively protect by directly transmitting input to output. The integrity of feature information helps to a certain extent alleviate the problem of loss of feature information between frames of images in related technologies.
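For illustration only, a shallow-feature extractor of the kind described above can be sketched with a torchvision ResNet backbone whose classification head is removed, so that N frames of size 224*224 yield N feature maps of spatial size 7*7. The choice of ResNet-50 and the tensor sizes below are assumptions for the example, not a statement of the disclosed implementation:

```python
import torch
from torchvision import models  # torchvision >= 0.13 for the `weights` argument

# ResNet backbone with the average-pooling and fully-connected layers removed,
# so the output keeps a 7x7 spatial resolution for 224x224 inputs.
resnet = models.resnet50(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])

frames = torch.randn(8, 3, 224, 224)          # N = 8 target image frames
shallow_features = feature_extractor(frames)  # shape: (8, 2048, 7, 7)
```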
另外,本公开实施例提供的特征提取网络是基于迁移学习算法和fine tune算法训练得到的,其中,fine tune算法可以理解为将特征提取网络中的部分层的网络权值进行冻结,并通过反向传播算法修改目标层的网络权值。在实际应用中,在执行将目标图像集输入至目标分类模型的特征融合网络,通过特征提取网络提取目标图像集中每个目标图像帧的浅层特征的步骤之前,首先获取预训练模型,将预训练模型的网络参数设置为特征提取网络的初始参数,其中,预训练模型可以采用ImageNet数据集训练得到;然后通过反向传播对设置初始参数后的特征提取网络的指定层进行训练,并将训练后的特征提取网络作为目标分类模型中的特征提取网络,本公开实施例利用迁移学习算法和finetune算法有助于提高特征提取网络预训练的训练效率,并减少ImageNet数据集中所需的数据量,还可以加强特征提取网络的泛化性。In addition, the feature extraction network provided by the embodiments of the present disclosure is obtained based on the migration learning algorithm and the fine tune algorithm training, where the fine tune algorithm can be understood as freezing the network weights of some layers in the feature extraction network, and through reverse Modify the network weight of the target layer to the propagation algorithm. In practical applications, before executing the feature fusion network that inputs the target image set into the target classification model, and extracts the shallow features of each target image frame in the target image set through the feature extraction network, first obtain the pre-training model, and The network parameters of the training model are set as the initial parameters of the feature extraction network. Among them, the pre-training model can be trained using the ImageNet data set; then the specified layer of the feature extraction network after setting the initial parameters is trained through backpropagation, and the training is performed The latter feature extraction network is used as the feature extraction network in the target classification model. The embodiment of the present disclosure uses the migration learning algorithm and the finetune algorithm to help improve the training efficiency of the feature extraction network pre-training and reduce the amount of data required in the ImageNet data set. It can also strengthen the generalization of the feature extraction network.
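A hedged sketch of this fine tune strategy, continuing the ResNet example above; which layers count as the "specified layer" is an assumption, and here only the last residual stage is unfrozen:

```python
# Freeze all pretrained layers first, then unfreeze only the specified layer(s)
# so that back-propagation updates just those weights.
for param in feature_extractor.parameters():
    param.requires_grad = False

for name, param in feature_extractor.named_parameters():
    if name.startswith("7."):  # index 7 is layer4 in the Sequential built above
        param.requires_grad = True

optimizer = torch.optim.Adam(
    [p for p in feature_extractor.parameters() if p.requires_grad], lr=1e-4)
```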
在另一种实施方式中,特征提取网络包括依次连接的多个特征提取子网络,且各特征提取子网络均包括依次连接的第一卷积层、归一化层、激活函数层和残差连接层。其中,第一卷积层用于对特征提取子网络的输入进行卷积处理,归一化层用于对特征提取子网络的输入进行批归一化处理,激活函数层用于对特征提取子网络的输入进行激活函数处理,残差连接层用于对特征提取子网络的输入进行残差连接处理。In another embodiment, the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence, and each feature extraction sub-network includes a first convolution layer, a normalization layer, an activation function layer, and a residual that are connected in sequence. Connection layer. Among them, the first convolutional layer is used for convolution processing on the input of the feature extraction sub-network, the normalization layer is used for batch normalization processing on the input of the feature extraction sub-network, and the activation function layer is used for the feature extraction sub-network. The input of the network is processed by the activation function, and the residual connection layer is used to perform the residual connection processing on the input of the feature extraction sub-network.
在此基础上，本公开实施例提供了一种将目标图像集输入至目标分类模型的特征融合网络，通过特征提取网络提取目标图像集中每个目标图像帧的浅层特征的具体实现方式，参见如下步骤（1）至（2）：（1）将目标图像集输入至目标分类模型的特征融合网络中第一个特征提取子网络，通过第一个特征子网络对目标图像集中每个目标图像帧进行特征提取，其中，第一个特征提取子网络的输入为目标图像集中的每个目标图像帧，输出为每个目标图像帧的第一层特征；（2）按照特征提取子网络的连接顺序，将第一个特征提取子网络提取的特征输入至下一特征提取子网络，通过下一特征子网络对第一个特征提取子网络提取的特征进行特征提取，直至得到目标图像集中每个目标图像帧的浅层特征，对于除第一个特征提取子网络外剩余的每个特征提取子网络，该特征提取子网络的输入为该特征提取子网络的前一特征提取子网络输出的特征，通过对输入的特征再次进行特征提取，并将提取得到的特征输入至该特征提取子网络的下一特征提取子网络。例如，特征提取网络包括依次连接的5个特征提取子网络，也即特征提取子网络分为5个阶段，每个阶段依次输出不同尺寸的特征图，以得到图像集中每张图像对应的浅层特征。On this basis, the embodiments of the present disclosure provide a specific implementation of inputting the target image set into the feature fusion network of the target classification model and extracting the shallow features of each target image frame in the target image set through the feature extraction network; see the following steps (1) to (2). (1) Input the target image set into the first feature extraction sub-network in the feature fusion network of the target classification model, and perform feature extraction on each target image frame in the target image set through the first feature extraction sub-network, where the input of the first feature extraction sub-network is each target image frame in the target image set and the output is the first-layer feature of each target image frame. (2) According to the connection order of the feature extraction sub-networks, input the features extracted by the first feature extraction sub-network into the next feature extraction sub-network, and perform feature extraction on them through that sub-network, until the shallow features of each target image frame in the target image set are obtained. For each remaining feature extraction sub-network other than the first one, its input is the features output by the preceding feature extraction sub-network; it performs feature extraction on the input features again and feeds the extracted features into the next feature extraction sub-network. For example, the feature extraction network includes 5 feature extraction sub-networks connected in sequence, that is, the feature extraction sub-network is divided into 5 stages, and each stage in turn outputs feature maps of different sizes, so as to obtain the shallow features corresponding to each image in the image set.
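One feature extraction sub-network of the kind just described (first convolutional layer, normalization layer, activation function layer and residual connection layer connected in sequence) can be sketched as follows; the channel count and kernel size are illustrative assumptions:

```python
import torch.nn as nn

class FeatureExtractionSubNetwork(nn.Module):
    """Conv -> batch norm -> activation, with a residual connection back to the input."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.norm(self.conv(x)))
        return out + x  # residual connection layer
```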
(二)将浅层特征输入至目标分类模型的特征融合网络,通过特征融合网络基于浅层特征提取目标图像集中每个目标图像帧的深层特征、空间特征和时序特征,并基于深层特征、空间特征和时序特征输出目标图像集对应的目标视频场景。为便于理解,本公开实施例还提供了另一种目标分类模型,参见图3所示的另一种目标分类模型的结构示意图,图3示意出了特征融合网络包括池化层、第二卷积层和第三卷积层;池化层、第二卷积层和第三卷积层的输入均与特征提取网络的输出相连。(2) Input the shallow features into the feature fusion network of the target classification model, and extract the deep features, spatial features, and time sequence features of each target image frame in the target image set based on the shallow features through the feature fusion network, and based on the deep features, space The feature and timing feature output the target video scene corresponding to the target image set. For ease of understanding, the embodiments of the present disclosure also provide another target classification model. Refer to the schematic structural diagram of another target classification model shown in FIG. 3. FIG. 3 shows that the feature fusion network includes a pooling layer and a second volume. The input of the accumulation layer and the third convolutional layer; the input of the pooling layer, the second convolutional layer and the third convolutional layer are all connected to the output of the feature extraction network.
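Before walking through the individual steps below, the three branches of the fusion network in Fig. 3 can be sketched as follows. This is a non-authoritative sketch: the input channel count, the number of scenes and the exact layer shapes are assumptions made for the example:

```python
import torch
import torch.nn as nn

class FeatureFusionNetwork(nn.Module):
    """Pooling branch, 2D-convolution branch and 3D-convolution branch, each fed
    with the shallow features output by the feature extraction network."""
    def __init__(self, in_channels=2048, num_scenes=4):
        super().__init__()
        self.pool_head = nn.Sequential(        # deep features -> first probability set
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_channels, num_scenes))
        self.conv2d_head = nn.Sequential(      # spatial features -> second probability set
            nn.Conv2d(in_channels, num_scenes, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.conv3d_head = nn.Sequential(      # temporal features -> third probability set
            nn.Conv3d(in_channels, num_scenes, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())

    def forward(self, shallow):                # shallow: (N, C, 7, 7) for N frames
        p1 = self.pool_head(shallow).softmax(dim=1).mean(dim=0)
        p2 = self.conv2d_head(shallow).softmax(dim=1).mean(dim=0)
        clip = shallow.permute(1, 0, 2, 3).unsqueeze(0)   # (1, C, N, 7, 7) for the 3D branch
        p3 = self.conv3d_head(clip).softmax(dim=1).squeeze(0)
        return p1, p2, p3                      # one probability per video scene, per branch
```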
基于如上所述的目标分类模型的网络结构,上述步骤(2)可以参照如下步骤1至步骤5执行:Based on the network structure of the target classification model described above, the above step (2) can be performed with reference to the following steps 1 to 5:
步骤1,特征融合网络根据深层特征,确定目标图像集对应的第一概率集。其中,第一概率集中包括多个第一概率,每个第一概率用于指示目标图像集属于一种视频场景的概率,在实际应用中,可以通过特征融合网络中的池化层根据深层特征,确定目标图像集对应的第一概率集。深层特征也可以理解为目标图像集中各图像帧的重点特征。例如,第一概率集包括用于指示目标图像集属于综艺的概率为70%、用于指示目标图像集属于体育的概率为50%、用于指示目标图像集属于动漫的概率为20%和用于指示目标图像集属于游戏的概率 为20%等。Step 1. The feature fusion network determines the first probability set corresponding to the target image set according to the deep features. Among them, the first probability set includes multiple first probabilities, and each first probability is used to indicate the probability that the target image set belongs to a video scene. In practical applications, the pooling layer in the feature fusion network can be based on deep features. , Determine the first probability set corresponding to the target image set. Deep features can also be understood as the key features of each image frame in the target image set. For example, the first probability set includes 70% for indicating that the target image set belongs to variety shows, 50% for indicating that the target image set belongs to sports, 20% for indicating that the target image set belongs to animation, and 20%. It indicates that the probability that the target image set belongs to the game is 20%, etc.
步骤2,特征融合网络根据空间特征,确定目标图像集对应的第二概率集,其中,第二概率集中包括多个第二概率,每个第二概率用于指示目标图像集属于一种视频场景的概率,在实际应用中,特征融合网络中的第二卷积层根据空间特征,确定目标图像对应的第二概率集。通过第二卷积层在浅层特征的基础上提取目标图像集中每个目标图像帧的空间特征,并基于空间特征输出第二概率集。其中,空间特征是在上述浅层特征的基础上进一步提取得到的2维特征,第二卷积层为2D卷积层。Step 2. The feature fusion network determines the second probability set corresponding to the target image set according to the spatial characteristics, where the second probability set includes multiple second probabilities, and each second probability is used to indicate that the target image set belongs to a video scene In practical applications, the second convolutional layer in the feature fusion network determines the second probability set corresponding to the target image according to the spatial features. The second convolutional layer extracts the spatial features of each target image frame in the target image set on the basis of the shallow features, and outputs the second probability set based on the spatial features. Wherein, the spatial feature is a 2-dimensional feature obtained by further extracting on the basis of the above-mentioned shallow feature, and the second convolutional layer is a 2D convolutional layer.
步骤3,特征融合网络根据时序特征,确定目标图像集对应的第三概率集,其中,第三概率集中包括多个第三概率,每个第三概率用于指示目标图像集属于一种视频场景的概率,在一种具体的实施方式中,特征融合网络中的第三卷积层根据时序特征,确定目标图像对应的第三概率集。通过第三卷积层在浅层特征的基础上提取图像集的时序特征,并基于时序特征输出第三概率集。其中,时序特征是在上述浅层特征的基础上进一步提取得到的3维特征,第三卷积层为3D卷积层。Step 3. The feature fusion network determines a third probability set corresponding to the target image set according to the timing characteristics, where the third probability set includes multiple third probabilities, and each third probability is used to indicate that the target image set belongs to a video scene In a specific implementation, the third convolutional layer in the feature fusion network determines the third probability set corresponding to the target image according to the time sequence feature. The third convolutional layer extracts the temporal features of the image set based on the shallow features, and outputs the third probability set based on the temporal features. Among them, the time series feature is a 3-dimensional feature further extracted on the basis of the above-mentioned shallow feature, and the third convolutional layer is a 3D convolutional layer.
步骤4,对同一视频场景对应的第一概率、第二概率和第三概率进行加权计算,得到各个视频场景对应的加权概率。通过对上述池化层、第二卷积层和第三卷积层的输出进行加权平均,可以得到更为准确的待分类视频所有可能类别的概率。例如,对于综艺场景对应的第一概率、第二概率和第三概率进行加权计算,得到综艺场景的加权概率为75%,对于游戏场景对应的第一概率、第二概率和第三概率进行加权计算,得到游戏场景的加权概率为20%,通过对每个视频场景对应的第一概率、第二概率和第三概率进行加权计算,既可以得到每个视频场景对应的加权概率。Step 4: Perform a weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each video scene. By weighting and averaging the outputs of the aforementioned pooling layer, the second convolutional layer, and the third convolutional layer, a more accurate probability of all possible categories of the video to be classified can be obtained. For example, the first probability, second probability, and third probability corresponding to the variety show scene are weighted and calculated, and the weighted probability of the variety show scene is 75%, and the first probability, second probability, and third probability corresponding to the game scene are weighted. By calculation, the weighted probability of the game scene is 20%. By weighting the first probability, the second probability, and the third probability corresponding to each video scene, the weighted probability corresponding to each video scene can be obtained.
步骤5,将最大加权概率对应的视频场景,确定为目标图像集对应的目标视频场景。假设综艺场景的加权概率最大,则目标图像集对应的目标视频场景即为综艺场景。相较于现有的视频分类方式忽略了不同帧图像之间的关联性,本公开实施例通过特征融合网络中的池化层、第二卷积层和第三卷积层可以充分提取图像集中不同级别不同尺寸的特征信息(也即,上述深度特征、空间特征和时间特征),还可以利用特征融合网络融合图像集中各帧图像之间的特征 信息,进而有效提高视频分类结果的准确度。Step 5: Determine the video scene corresponding to the maximum weighted probability as the target video scene corresponding to the target image set. Assuming that the weighted probability of the variety show scene is the largest, the target video scene corresponding to the target image set is the variety show scene. Compared with the existing video classification method, which ignores the correlation between different frame images, the embodiment of the present disclosure can fully extract the image set through the pooling layer, the second convolutional layer and the third convolutional layer in the feature fusion network. The feature information of different levels and sizes (that is, the above-mentioned depth feature, spatial feature, and temporal feature) can also be used to fuse feature information between frames of images in the image set using a feature fusion network, thereby effectively improving the accuracy of the video classification result.
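Continuing the sketches above, steps 4 and 5 reduce to a weighted sum of the three probability sets followed by an argmax; the branch weights are an assumption, since the disclosure does not fix their values:

```python
fusion = FeatureFusionNetwork()
p1, p2, p3 = fusion(shallow_features)          # shallow features from the extraction sketch above

weights = (0.4, 0.3, 0.3)                      # illustrative weights for the three branches
weighted = weights[0] * p1 + weights[1] * p2 + weights[2] * p3
target_scene = int(weighted.argmax())          # scene with the maximum weighted probability
```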
在执行将目标图像集输入至目标分类模型的步骤之前,本公开实施例还提供了一种训练如图3所示的目标分类模型的训练过程,该过程可以参照如下步骤a至步骤d执行:Before performing the step of inputting the target image set into the target classification model, the embodiment of the present disclosure also provides a training process for training the target classification model as shown in FIG. 3, and the process can be performed with reference to the following steps a to d:
步骤a,获取图像训练集,并将图像训练集输入至初始分类模型。在实际应用中,还可以获取图像测试集和图像验证集。其中,图像训练集用于训练初始分类模型,通过调节训练参数可以得到多个不同参数的多个候选分类模型,训练参数可以包括训练速率;图像验证集用于从多个候选分类模型中选取一个分类效果较佳的候选分类模型;图像测试集用于测试选取的候选分类模型的分类能力。本公开实施例提供了一种获取图像训练集、图像验证集和图像测试集的方法,包括如下步骤:(1)获取携带有分类标签的原始视频,考虑到目前尚无用于视频分类的公开数据集(也即,前述原始视频),所以可从互联网上按类别获取大量相关视频,为了保证目标分类网络的泛化性,获取的视频类别应尽量广泛,例如游戏类别的数据集,可分别获取数十种不同游戏的相关视频;(2)按照预设比例将原始视频划分为第一视频集、第二视频集和第三视频集;(3)将第一视频集中的原始视频切割为第一预设时长的第一视频,并抽取第一视频中的多张帧图像,得到图像训练集;(4)将第二视频集中的原始视频切割为第二预设时长的第二视频,并抽取第二视频中的多张帧图像,得到图像验证集;(5)将第三视频集中的原始视频切割为第三预设时长的第三视频,并抽取第三视频中的多张帧图像,得到图像测试集。其中,上述第一预设时长、第二预设时长和第三预设时长可以为5至15秒,以将第一视频集、第二视频集和第三视频集中的原始视频切分为不同时长的短视频,并分别对得到的短视频进行等间隔抽取多张帧图像,即可得到上述图像训练集、图像验证集和图像测试集。另外,先将原始视频划分为第一视频集、第二视频集和第三视频集,再对视频集内的原始视频进行切割,可以保证图像训练集、图像验证集和图像测试集内的图像来源于不同原始视频,进而可以得到分类效果更佳的目标分类模型。Step a: Obtain the image training set, and input the image training set to the initial classification model. In practical applications, image test sets and image verification sets can also be obtained. Among them, the image training set is used to train the initial classification model. By adjusting the training parameters, multiple candidate classification models with different parameters can be obtained. The training parameters can include the training rate; the image verification set is used to select one from multiple candidate classification models. The candidate classification model with better classification effect; the image test set is used to test the classification ability of the selected candidate classification model. The embodiments of the present disclosure provide a method for obtaining an image training set, an image verification set, and an image test set, including the following steps: (1) Obtain the original video carrying classification tags, considering that there is no public data for video classification. Therefore, a large number of related videos can be obtained by category from the Internet. In order to ensure the generalization of the target classification network, the obtained video categories should be as wide as possible. For example, the data set of the game category can be obtained separately Dozens of related videos of different games; (2) Divide the original video into the first video set, the second video set and the third video set according to the preset ratio; (3) Cut the original video in the first video set into the first video set A first video with a preset duration, and multiple frame images from the first video are extracted to obtain an image training set; (4) Cut the original video in the second video set into a second video with a second preset duration, and Extract multiple frame images in the second video to obtain an image verification set; (5) Cut the original video in the third video set into a third video with a third preset duration, and extract multiple frame images in the third video , Get the image test set. Wherein, the aforementioned first preset duration, second preset duration, and third preset duration may be 5 to 15 seconds to divide the original videos in the first video set, the second video set, and the third video set into different groups. A short video with a length of time, and multiple frames of images are extracted at equal intervals on the obtained short videos respectively to obtain the above-mentioned image training set, image verification set, and image test set. 
In addition, first divide the original video into the first video set, the second video set, and the third video set, and then cut the original video in the video set to ensure the images in the image training set, image verification set, and image test set. From different original videos, a target classification model with better classification effect can be obtained.
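A minimal sketch of this data preparation, assuming each labelled raw video has already been decoded into a frame list; the split ratio, clip length (here 300 frames, roughly 10 s at 30 fps) and frames per clip are illustrative assumptions:

```python
import random

def build_image_sets(raw_videos, ratios=(0.8, 0.1, 0.1),
                     clip_len=300, frames_per_clip=8):
    """raw_videos: list of (frames, label) pairs, one per labelled raw video.
    Raw videos are split into three video sets first, each video is then cut
    into short clips, and frames are sampled at equal intervals from every clip."""
    random.shuffle(raw_videos)
    n = len(raw_videos)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    video_sets = (raw_videos[:n_train],
                  raw_videos[n_train:n_train + n_val],
                  raw_videos[n_train + n_val:])
    image_sets = []
    for video_set in video_sets:
        images = []
        for frames, label in video_set:
            for start in range(0, len(frames), clip_len):      # cut into short clips
                clip = frames[start:start + clip_len]
                step = max(1, len(clip) // frames_per_clip)    # equal-interval sampling
                images.extend((frame, label) for frame in clip[::step][:frames_per_clip])
        image_sets.append(images)
    return image_sets  # (image training set, image verification set, image test set)
```

Splitting at the raw-video level before cutting clips keeps the training, verification and test images from coming out of the same original video, as described above.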
步骤b,根据初始分类模型针对图像训练集输出的分类结果,计算初始分类模型的损失函数。因为图像训练集中的每张图像均携带有分类标签,可以使 初始分类模型学习图像与分类标签之间的映射关系,通过调节训练参数得到多个不同权重的候选分类模型。具体实施时,首先根据初始分类模型针对图像训练集输出的分类结果,计算初始分类模型的损失函数,其中,损失函数使用交叉熵loss。Step b: Calculate the loss function of the initial classification model according to the classification result output by the initial classification model for the image training set. Because each image in the image training set carries a classification label, the initial classification model can learn the mapping relationship between the image and the classification label, and multiple candidate classification models with different weights can be obtained by adjusting the training parameters. In specific implementation, first, the loss function of the initial classification model is calculated according to the classification result output by the initial classification model for the image training set, where the loss function uses cross entropy loss.
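Steps b to d can be condensed into a standard training step; the snippet below is only a sketch, where the initial classification model, the batch of training images with their labels, and the optimizer are passed in by the caller:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                      # the cross-entropy loss of step b

def training_step(initial_model, optimizer, batch_images, batch_labels):
    """One illustrative training step covering steps b-d below."""
    logits = initial_model(batch_images)               # classification results for the batch
    loss = criterion(logits, batch_labels)             # loss function of the initial model
    optimizer.zero_grad()
    loss.backward()                                    # step c: back-propagated derivatives
    optimizer.step()                                   # step d: gradient-descent (Adam) update
    return loss.item()
```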
步骤c，利用反向传播算法计算损失函数相对于初始分类模型的参数的导数∂Loss/∂θ（原文中该导数公式以附图形式给出）。Step c: Use the back-propagation algorithm to calculate the derivative of the loss function with respect to the parameters of the initial classification model, ∂Loss/∂θ (the derivative formula is given as an image in the original filing).
步骤d，利用梯度下降（Adam）算法和导数更新初始分类模型的参数，得到目标分类模型。具体实施时，根据上述导数计算下降速率α，并利用下降速率α更新初始分类模型的权重参数，当得到的下降速率α不同时，将得到多个候选分类模型，其中根据上述导数计算下降速率α的公式在原文中以附图形式给出。Step d: Use the gradient descent (Adam) algorithm and the derivatives to update the parameters of the initial classification model to obtain the target classification model. In a specific implementation, the descent rate α is calculated from the above derivatives, and the weight parameters of the initial classification model are updated with the descent rate α; different descent rates α yield multiple candidate classification models. The formula for calculating the descent rate α from the above derivatives is given as an image in the original filing.
为进一步确定目标分类模型，可以将图像验证集输入至各候选分类模型，并基于各候选分类模型针对图像验证集输出的分类结果，从多个候选分类模型中选取一个候选分类模型，再将图像测试集输入至选取的候选分类模型，并基于选取的候选分类模型针对图像测试集输出的分类结果，计算选取的候选分类模型的分类准确率，如果分类准确率高于预设阈值，将选取的候选分类模型确定为训练得到的目标分类模型。To further determine the target classification model, the image verification set may be input to each candidate classification model, and one candidate classification model is selected from the multiple candidate classification models based on the classification results each candidate classification model outputs for the image verification set. The image test set is then input to the selected candidate classification model, and the classification accuracy of the selected candidate classification model is calculated based on the classification results it outputs for the image test set. If the classification accuracy is higher than a preset threshold, the selected candidate classification model is determined as the trained target classification model.
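The two formulas referenced above survive only as image placeholders in this text. For reference, the textbook Adam update, which the passage appears to describe, derives the per-parameter step from the back-propagated derivative as follows (this is the standard form, not necessarily the exact expression in the original filing):

```latex
g_t = \nabla_\theta \mathcal{L}(\theta_{t-1}), \qquad
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
```

```latex
\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```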
考虑到不同的训练参数会对初始分类模型的训练产生影响，会得到多个不同参数的候选分类模型；另外，即使采用相同的训练参数对初始分类模型进行训练，在后续收敛时模型也会存在小幅度的波动，得到多个不同参数的候选分类模型，因此需要图像验证集从多个候选分类模型中选取出一个分类效果较佳的分类模型。例如，从多个候选分类模型中选取一个候选分类模型后，利用图像测试集对选取的候选分类模型进行测试，其中，图像测试集中的图像来源于4种类型的视频，包括游戏类别、秀场类别、综艺类别和体育类别，且每类视频的个数为40个。测试结果如下表1所示，分类结果的平均精度已达到90%以上。Considering that different training parameters affect the training of the initial classification model, multiple candidate classification models with different parameters will be obtained; in addition, even if the same training parameters are used to train the initial classification model, the model will still fluctuate slightly during subsequent convergence, also yielding multiple candidate classification models with different parameters. Therefore, the image verification set is needed to select a classification model with a better classification effect from the multiple candidate classification models. For example, after a candidate classification model is selected from the multiple candidate classification models, it is tested with the image test set, where the images in the image test set are derived from 4 types of videos, namely the game, show, variety and sports categories, with 40 videos per category. The test results are shown in Table 1 below; the average accuracy of the classification results exceeds 90%.
类别 Category | 游戏类别 Game category | 秀场类别 Show category | 综艺类别 Variety category | 体育类别 Sports category
精度 Precision | 97.5% | 80% | 90% | 97.5%
表1 Table 1
在上述实施例的基础上，本公开实施例提供了一种目标分类模型的具体应用实例，例如，利用该目标分类模型实现视频编码，在一种具体的实施方式中，获取分段视频流，将该分段视频流输入至预设的第一线程和第二线程中，其中，第一线程中部署有上述目标分类模型，通过第一线程中的目标分类模型确定分段视频流对应的视频场景，进而通过第二线程在分段视频流对应的视频场景的基础上对分段视频流进行编码。当视频帧图像为多张时，特征融合层对多张视频帧图像的特征参数进行融合，得到多张视频帧图像的融合特征，对融合特征进行分类，得到多张视频帧图像整体对应的视频场景，并将多张视频帧图像整体对应的视频场景确定为第一分段视频流的第一视频场景。当视频帧图像为多张且多张视频帧图像对应的视频场景不相同时，由于视频场景通常表示为概率值，比如某一张视频帧图像对应的视频场景为动漫的概率为80%，游戏的概率为20%。因此，可以将概率值最高的视频场景确定为第一分段视频流的第一视频场景；或者，还可以先针对多张视频帧图像计算每一种视频场景的概率总和，然后将概率总和最大的视频场景确定为第一分段视频流的第一视频场景。通过利用本公开实施例提供的目标分类模型对分段视频流进行分类，可以得到更为准确的分类结果，进而可以使编码后的分段视频流更好地适应当前的视频场景。On the basis of the above embodiments, the embodiments of the present disclosure provide a specific application example of the target classification model, for example, using the target classification model to implement video encoding. In a specific implementation, a segmented video stream is obtained and input into a preset first thread and a preset second thread, where the above target classification model is deployed in the first thread. The target classification model in the first thread determines the video scene corresponding to the segmented video stream, and the second thread then encodes the segmented video stream on the basis of that video scene. When there are multiple video frame images, the feature fusion layer fuses the feature parameters of the multiple video frame images to obtain fusion features of the multiple video frame images, classifies the fusion features to obtain the video scene corresponding to the multiple video frame images as a whole, and determines that video scene as the first video scene of the first segmented video stream. When there are multiple video frame images and the video scenes corresponding to them differ, the video scene is usually expressed as a probability value; for example, for a certain video frame image the probability that its video scene is animation may be 80% and the probability that it is a game may be 20%. Therefore, the video scene with the highest probability value can be determined as the first video scene of the first segmented video stream; alternatively, the probability sum of each video scene can first be calculated over the multiple video frame images, and the video scene with the largest probability sum is then determined as the first video scene of the first segmented video stream. By classifying the segmented video stream with the target classification model provided in the embodiments of the present disclosure, a more accurate classification result can be obtained, so that the encoded segmented video stream is better adapted to the current video scene.
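A small sketch of the probability-sum variant described above; the variable names are assumptions, and `per_frame_probs` holds one probability vector per video frame image of the first segmented video stream:

```python
import torch

def first_video_scene(per_frame_probs):
    """Sum the probability of each video scene over all frame images and return
    the scene with the largest probability sum."""
    totals = torch.stack(per_frame_probs).sum(dim=0)   # shape: (num_scenes,)
    return int(totals.argmax())
```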
综上所述,本公开实施例利用特征融合网络中的池化层、2D卷积层和3D卷积层可以更为全面的提取图像集中特征信息,相较于现有的视频分类方法忽略了不同帧图像之间的关联性,本公开实施例采用特征融合网络能够较好地提取并融合图像集中不同帧图像之间的特征信息,可以有效提高视频分类结果的准确度。In summary, the embodiments of the present disclosure use the pooling layer, 2D convolutional layer, and 3D convolutional layer in the feature fusion network to more comprehensively extract the feature information of the image concentration, which is ignored compared to the existing video classification methods. For the correlation between different frame images, the embodiment of the present disclosure adopts a feature fusion network to better extract and fuse feature information between different frame images in an image set, and can effectively improve the accuracy of the video classification result.
对于上述实施例提供的视频分类方法,本公开实施例还提供了一种视频分类装置,参见图4所示的一种视频分类装置的结构示意图,该装置可以包括以下部分:Regarding the video classification method provided in the foregoing embodiment, an embodiment of the present disclosure also provides a video classification device. Referring to the schematic structural diagram of a video classification device shown in FIG. 4, the device may include the following parts:
视频获取模块402,设置为获取待分类视频。The video acquisition module 402 is configured to acquire videos to be classified.
图像集确定模块404,设置为根据待分类视频中的多个目标图像帧,确定待分类视频对应的目标图像集,其中,目标图像集中包括多个目标图像帧。The image set determining module 404 is configured to determine a target image set corresponding to the video to be classified according to multiple target image frames in the video to be classified, wherein the target image set includes multiple target image frames.
输入模块406,设置为将目标图像集输入至目标分类模型,并获得目标分类模型输出的目标图像集对应的目标视频场景,其中,目标分类模型设置为获取目标图像集中每个目标图像帧对应的图像特征,并根据每个目标图像帧对应的图像特征确定目标视频场景。The input module 406 is configured to input the target image set into the target classification model, and obtain the target video scene corresponding to the target image set output by the target classification model, wherein the target classification model is set to obtain the corresponding target image frame in the target image set. Image characteristics, and determine the target video scene according to the image characteristics corresponding to each target image frame.
分类确定模块408,设置为根据目标图像集对应的目标视频场景,确定待分类视频的分类结果,其中,分类结果设置为指示待分类视频的视频场景。The classification determining module 408 is configured to determine the classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is set to indicate the video scene of the video to be classified.
本公开实施例提供的视频分类装置,相比于传统的视频分类方法,本公开实施例通过目标分类模型提取每个图像帧对应的图像特征确定了目标图像集的目标视频场景,并在此基础上进一步确定了待分类视频的视频场景的分类结果,可有效提升视频分类效率和准确率。Compared with the traditional video classification method, the video classification device provided by the embodiments of the present disclosure determines the target video scene of the target image set by extracting the image features corresponding to each image frame by the target classification model. The above further determines the classification result of the video scene of the video to be classified, which can effectively improve the efficiency and accuracy of video classification.
在一种实施方式中,图像特征包括浅层特征、深层特征、空间特征和时序特征中的一种或多种;目标分类模型包括特征融合网络,以及与特征融合网络连接的特征提取网络;上述输入模块406还设置为:将目标图像集输入至目标分类模型的特征融合网络,通过特征提取网络提取目标图像集中每个目标图像帧的浅层特征;将浅层特征输入至目标分类模型的特征融合网络,通过特征融合网络基于浅层特征提取目标图像集中每个目标图像帧的深层特征、空间特征和时序特征,并基于深层特征、空间特征和时序特征输出目标图像集对应的目标视频场景;深层特征的特征层次高于浅层特征的特征层次。In an embodiment, the image features include one or more of shallow features, deep features, spatial features, and temporal features; the target classification model includes a feature fusion network, and a feature extraction network connected to the feature fusion network; the foregoing The input module 406 is further configured to: input the target image set into the feature fusion network of the target classification model, extract the shallow features of each target image frame in the target image set through the feature extraction network; and input the shallow features into the features of the target classification model Fusion network, through the feature fusion network based on shallow features to extract the deep features, spatial features and timing features of each target image frame in the target image set, and output the target video scene corresponding to the target image set based on the deep features, spatial features and timing features; The feature level of deep features is higher than that of shallow features.
在一种实施方式中,上述视频分类装置还包括第一训练模块,设置为:获取预训练模型,将预训练模型的网络参数设置为特征提取网络的初始参数;通过反向传播对设置初始参数后的特征提取网络的指定层进行训练,并将训练后的特征提取网络作为目标分类模型中的特征提取网络。In one embodiment, the above-mentioned video classification device further includes a first training module configured to: obtain a pre-training model, and set the network parameters of the pre-training model as the initial parameters of the feature extraction network; and set the initial parameters through backpropagation. The specified layer of the subsequent feature extraction network is trained, and the trained feature extraction network is used as the feature extraction network in the target classification model.
在一种实施方式中,特征提取网络包括依次连接的多个特征提取子网络;上述输入模块406还设置为:将目标图像集输入至目标分类模型的特征融合网络中第一个特征提取子网络,通过第一个特征子网络对目标图像集中每个目标图像帧进行特征提取;按照特征提取子网络的连接顺序,将第一个特征提取子网络提取的特征输入至下一特征提取子网络,通过下一特征子网络对第一个特 征提取子网络提取的特征进行特征提取,直至得到目标图像集中每个目标图像帧的浅层特征。In one embodiment, the feature extraction network includes a plurality of feature extraction sub-networks connected in sequence; the above-mentioned input module 406 is further configured to: input the target image set into the first feature extraction sub-network in the feature fusion network of the target classification model , Perform feature extraction on each target image frame in the target image set through the first feature sub-network; according to the connection order of the feature extraction sub-network, input the features extracted by the first feature extraction sub-network to the next feature extraction sub-network, Feature extraction is performed on the features extracted by the first feature extraction sub-network through the next feature sub-network, until the shallow features of each target image frame in the target image set are obtained.
In one embodiment, the input module 406 is further configured such that: the feature fusion network determines, according to the deep features, a first probability set corresponding to the target image set, wherein the first probability set includes multiple first probabilities, each of which indicates the probability that the target image set belongs to one video scene; the feature fusion network determines, according to the spatial features, a second probability set corresponding to the target image set, wherein the second probability set includes multiple second probabilities, each of which indicates the probability that the target image set belongs to one video scene; the feature fusion network determines, according to the temporal features, a third probability set corresponding to the target image set, wherein the third probability set includes multiple third probabilities, each of which indicates the probability that the target image set belongs to one video scene; a weighted calculation is performed on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each video scene; and the video scene corresponding to the maximum weighted probability is determined as the target video scene corresponding to the target image set.
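As a concrete illustration of the weighted fusion, the sketch below combines three assumed probability sets with assumed weights and selects the scene with the maximum weighted probability; the scene labels, probability values, and fusion weights are examples only and are not fixed by the present disclosure.

```python
import torch

scenes = ["game", "sports", "news", "animation"]          # example scene labels
p1 = torch.tensor([0.10, 0.60, 0.20, 0.10])               # first probability set (from deep features)
p2 = torch.tensor([0.20, 0.50, 0.20, 0.10])               # second probability set (from spatial features)
p3 = torch.tensor([0.15, 0.55, 0.15, 0.15])               # third probability set (from temporal features)

w1, w2, w3 = 0.4, 0.3, 0.3                                 # assumed fusion weights
weighted = w1 * p1 + w2 * p2 + w3 * p3                     # weighted probability of each video scene
target_scene = scenes[int(torch.argmax(weighted))]         # scene with the maximum weighted probability
print(target_scene)                                        # -> "sports"
```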
In one embodiment, the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the inputs of the pooling layer, the second convolutional layer, and the third convolutional layer are all connected to the output of the feature extraction network; the second convolutional layer is a 2D convolutional layer, and the third convolutional layer is a 3D convolutional layer. The input module 406 is further configured such that: the pooling layer in the feature fusion network determines, according to the deep features, the first probability set corresponding to the target image set; the step in which the feature fusion network determines the second probability set corresponding to the target image set according to the spatial features includes: the second convolutional layer in the feature fusion network determines, according to the spatial features, the second probability set corresponding to the target image; and the step in which the feature fusion network determines the third probability set corresponding to the target image set according to the temporal features includes: the third convolutional layer in the feature fusion network determines, according to the temporal features, the third probability set corresponding to the target image.
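A minimal sketch of such a three-head fusion network is given below, with a pooling head, a 2D convolutional head, and a 3D convolutional head each producing one probability set; the channel sizes, input shapes, and the softmax used to turn the head outputs into probabilities are assumptions of this sketch.

```python
import torch
import torch.nn as nn

num_scenes, C = 4, 64

pool_head = nn.Sequential(                                  # pooling layer head for the deep features
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(C, num_scenes))
conv2d_head = nn.Sequential(                                # second (2D) convolutional layer for spatial features
    nn.Conv2d(C, num_scenes, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
conv3d_head = nn.Sequential(                                # third (3D) convolutional layer for temporal features
    nn.Conv3d(C, num_scenes, 3, padding=1), nn.AdaptiveAvgPool3d(1), nn.Flatten())

deep = torch.randn(1, C, 28, 28)         # deep features of one target image set
spatial = torch.randn(1, C, 28, 28)      # spatial features
temporal = torch.randn(1, C, 8, 28, 28)  # temporal features across 8 frames

p1 = pool_head(deep).softmax(dim=1)       # first probability set
p2 = conv2d_head(spatial).softmax(dim=1)  # second probability set
p3 = conv3d_head(temporal).softmax(dim=1) # third probability set
```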
In one embodiment, the video classification apparatus further includes a second training module configured to: obtain an image training set and input the image training set into an initial classification model; calculate the loss function of the initial classification model according to the classification result output by the initial classification model for the image training set; calculate, using a back propagation algorithm, the derivatives of the loss function with respect to the parameters of the initial classification model; and update the parameters of the initial classification model using a gradient descent algorithm and the derivatives, to obtain the target classification model.
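The procedure of the second training module could be sketched as follows, with a placeholder model and random data standing in for the image training set; cross-entropy is assumed as the loss function, which the present disclosure does not specify.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))   # initial classification model (placeholder)
criterion = nn.CrossEntropyLoss()                                # assumed loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)         # gradient descent

images = torch.randn(16, 3, 32, 32)        # image training set (one mini-batch, random stand-in)
labels = torch.randint(0, 4, (16,))        # scene labels

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # loss from the model's classification result
    loss.backward()                          # derivatives of the loss w.r.t. the parameters (back propagation)
    optimizer.step()                         # gradient-descent update -> target classification model
```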
The implementation principles and technical effects of the apparatus provided by the embodiments of the present disclosure are the same as those of the foregoing method embodiments. For brevity, where a detail is not mentioned in the apparatus embodiments, reference may be made to the corresponding content in the foregoing method embodiments.
The device is an electronic device. Specifically, the electronic device includes a processor and a storage device; a computer program is stored on the storage device, and the computer program, when run by the processor, performs the method of any one of the above embodiments.
FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. The electronic device 100 includes a processor 50, a memory 51, a bus 52, and a communication interface 53; the processor 50, the communication interface 53, and the memory 51 are connected through the bus 52; the processor 50 is configured to execute executable modules, such as computer programs, stored in the memory 51.
The memory 51 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, for example at least one disk memory. The communication connection between this system network element and at least one other network element is implemented through at least one communication interface 53 (which may be wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network, or the like may be used.
The bus 52 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one bidirectional arrow is shown in FIG. 5, but this does not mean that there is only one bus or only one type of bus.
The memory 51 is configured to store a program, and the processor 50 executes the program after receiving an execution instruction. The method performed by the apparatus defined by the flow process disclosed in any of the foregoing embodiments of the present disclosure may be applied to, or implemented by, the processor 50.
The processor 50 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 50 or by instructions in the form of software. The above processor 50 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present disclosure may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 51, and the processor 50 reads the information in the memory 51 and completes the steps of the above method in combination with its hardware.
The computer program product of the readable storage medium provided by the embodiments of the present disclosure includes a computer-readable storage medium storing program code. The instructions included in the program code may be used to execute the method described in the foregoing method embodiments; for specific implementation, reference may be made to the foregoing method embodiments, which will not be repeated here.
Those skilled in the art can clearly understand that, for convenience and conciseness of description, the specific working processes of the system, apparatus, and units described above may refer to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the various embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements for some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Industrial Applicability
Based on the video classification method, apparatus, and electronic device provided by the embodiments of the present disclosure, the target video scene of the target image set can be determined by extracting, through the target classification model, the image feature corresponding to each image frame, and on this basis the classification result of the video scene of the video to be classified is further determined, which can effectively improve the efficiency and accuracy of video classification.

Claims (10)

  1. A video classification method, comprising:
    acquiring a video to be classified;
    determining, according to multiple target image frames in the video to be classified, a target image set corresponding to the video to be classified, wherein the target image set includes the multiple target image frames;
    inputting the target image set into a target classification model and obtaining a target video scene, output by the target classification model, corresponding to the target image set, wherein the target classification model is used to obtain the image feature corresponding to each target image frame in the target image set and determine the target video scene according to the image feature corresponding to each target image frame; and
    determining a classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is used to indicate the video scene of the video to be classified.
  2. The method according to claim 1, wherein the image features include one or more of shallow features, deep features, spatial features, and temporal features; the target classification model includes a feature fusion network and a feature extraction network connected to the feature fusion network; and
    the step of inputting the target image set into the target classification model and obtaining the target video scene, output by the target classification model, corresponding to the target image set comprises:
    inputting the target image set into the feature fusion network of the target classification model, and extracting the shallow features of each target image frame in the target image set through the feature extraction network; and
    inputting the shallow features into the feature fusion network of the target classification model, extracting, through the feature fusion network and based on the shallow features, the deep features, spatial features, and temporal features of each target image frame in the target image set, and outputting, based on the deep features, the spatial features, and the temporal features, the target video scene corresponding to the target image set, wherein the feature level of the deep features is higher than the feature level of the shallow features.
  3. The method according to claim 2, wherein, before the step of inputting the target image set into the feature fusion network of the target classification model and extracting the shallow features of each target image frame in the target image set through the feature extraction network, the method further comprises:
    acquiring a pre-trained model, and setting the network parameters of the pre-trained model as the initial parameters of the feature extraction network; and
    training, through back propagation, the specified layers of the feature extraction network whose initial parameters have been set, and using the trained feature extraction network as the feature extraction network in the target classification model.
  4. The method according to any one of claims 2-3, wherein the feature extraction network includes multiple feature extraction sub-networks connected in sequence; and
    the step of inputting the target image set into the feature fusion network of the target classification model and extracting the shallow features of each target image frame in the target image set through the feature extraction network comprises:
    inputting the target image set into the first feature extraction sub-network in the feature fusion network of the target classification model, and performing feature extraction on each target image frame in the target image set through the first feature extraction sub-network; and
    according to the connection order of the feature extraction sub-networks, inputting the features extracted by the first feature extraction sub-network into the next feature extraction sub-network, and performing feature extraction on the features extracted by the first feature extraction sub-network through the next feature extraction sub-network, until the shallow features of each target image frame in the target image set are obtained.
  5. The method according to any one of claims 2-4, wherein the step of extracting, through the feature fusion network and based on the shallow features, the deep features, spatial features, and temporal features of each target image frame in the target image set and outputting, based on the deep features, the spatial features, and the temporal features, the target video scene corresponding to the target image set comprises:
    determining, by the feature fusion network according to the deep features, a first probability set corresponding to the target image set, wherein the first probability set includes multiple first probabilities, and each first probability is set to indicate the probability that the target image set belongs to one video scene;
    determining, by the feature fusion network according to the spatial features, a second probability set corresponding to the target image set, wherein the second probability set includes multiple second probabilities, and each second probability is set to indicate the probability that the target image set belongs to one video scene;
    determining, by the feature fusion network according to the temporal features, a third probability set corresponding to the target image set, wherein the third probability set includes multiple third probabilities, and each third probability is set to indicate the probability that the target image set belongs to one video scene;
    performing a weighted calculation on the first probability, the second probability, and the third probability corresponding to the same video scene to obtain the weighted probability corresponding to each video scene; and
    determining the video scene corresponding to the maximum weighted probability as the target video scene corresponding to the target image set.
  6. The method according to claim 5, wherein the feature fusion network includes a pooling layer, a second convolutional layer, and a third convolutional layer; the inputs of the pooling layer, the second convolutional layer, and the third convolutional layer are all connected to the output of the feature extraction network; the second convolutional layer is a 2D convolutional layer; and the third convolutional layer is a 3D convolutional layer; wherein
    the step of determining, by the feature fusion network according to the deep features, the first probability set corresponding to the target image set comprises: determining, by the pooling layer in the feature fusion network according to the deep features, the first probability set corresponding to the target image set;
    the step of determining, by the feature fusion network according to the spatial features, the second probability set corresponding to the target image set comprises: determining, by the second convolutional layer in the feature fusion network according to the spatial features, the second probability set corresponding to the target image; and
    the step of determining, by the feature fusion network according to the temporal features, the third probability set corresponding to the target image set comprises: determining, by the third convolutional layer in the feature fusion network according to the temporal features, the third probability set corresponding to the target image.
  7. The method according to any one of claims 1-6, wherein, before the step of inputting the target image set into the target classification model, the method further comprises:
    acquiring an image training set and inputting the image training set into an initial classification model;
    calculating a loss function of the initial classification model according to the classification result output by the initial classification model for the image training set;
    calculating, by using a back propagation algorithm, the derivatives of the loss function with respect to the parameters of the initial classification model; and
    updating the parameters of the initial classification model by using a gradient descent algorithm and the derivatives, to obtain the target classification model.
  8. A video classification apparatus, comprising:
    a video acquisition module, configured to acquire a video to be classified;
    an image set determination module, configured to determine, according to multiple target image frames in the video to be classified, a target image set corresponding to the video to be classified, wherein the target image set includes the multiple target image frames;
    an input module, configured to input the target image set into a target classification model and obtain a target video scene, output by the target classification model, corresponding to the target image set, wherein the target classification model is set to obtain the image feature corresponding to each target image frame in the target image set and determine the target video scene according to the image feature corresponding to each target image frame; and
    a classification determination module, configured to determine a classification result of the video to be classified according to the target video scene corresponding to the target image set, wherein the classification result is set to indicate the video scene of the video to be classified.
  9. An electronic device, comprising a processor and a memory;
    wherein a computer program is stored on the memory, and the computer program, when run by the processor, performs the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when run by a processor, performs the method according to any one of claims 1-7.
PCT/CN2020/113860 2019-10-31 2020-09-08 Video classification method and apparatus, and electronic device WO2021082743A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911059325.6A CN110766096B (en) 2019-10-31 2019-10-31 Video classification method and device and electronic equipment
CN201911059325.6 2019-10-31

Publications (1)

Publication Number Publication Date
WO2021082743A1 true WO2021082743A1 (en) 2021-05-06

Family

ID=69335278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113860 WO2021082743A1 (en) 2019-10-31 2020-09-08 Video classification method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN110766096B (en)
WO (1) WO2021082743A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110766096B (en) * 2019-10-31 2022-09-23 北京金山云网络技术有限公司 Video classification method and device and electronic equipment
CN111241864A (en) * 2020-02-17 2020-06-05 重庆忽米网络科技有限公司 Code scanning-free identification analysis method and system based on 5G communication technology
CN111291692B (en) * 2020-02-17 2023-10-20 咪咕文化科技有限公司 Video scene recognition method and device, electronic equipment and storage medium
CN111444819B (en) * 2020-03-24 2024-01-23 北京百度网讯科技有限公司 Cut frame determining method, network training method, device, equipment and storage medium
CN113497978B (en) * 2020-04-07 2023-11-28 北京达佳互联信息技术有限公司 Video scene classification method, device, server and storage medium
CN113497953A (en) * 2020-04-07 2021-10-12 北京达佳互联信息技术有限公司 Music scene recognition method, device, server and storage medium
CN113158710A (en) * 2020-05-22 2021-07-23 西安天和防务技术股份有限公司 Video classification method, device, terminal and storage medium
CN111859023A (en) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 Video classification method, device, equipment and computer readable storage medium
CN111797912B (en) * 2020-06-23 2023-09-22 山东浪潮超高清视频产业有限公司 System and method for identifying film age type and construction method of identification model
CN115082930A (en) * 2021-03-11 2022-09-20 腾讯科技(深圳)有限公司 Image classification method and device, electronic equipment and storage medium
CN112800278B (en) * 2021-03-30 2021-07-09 腾讯科技(深圳)有限公司 Video type determination method and device and electronic equipment
CN113095194A (en) * 2021-04-02 2021-07-09 北京车和家信息技术有限公司 Image classification method and device, storage medium and electronic equipment
CN113221690A (en) * 2021-04-28 2021-08-06 上海哔哩哔哩科技有限公司 Video classification method and device
CN113591647B (en) * 2021-07-22 2023-08-15 中广核工程有限公司 Human motion recognition method, device, computer equipment and storage medium
CN113473628B (en) * 2021-08-05 2022-08-09 深圳市虎瑞科技有限公司 Communication method and system of intelligent platform
CN114612712A (en) * 2022-03-03 2022-06-10 北京百度网讯科技有限公司 Object classification method, device, equipment and storage medium
CN117714712A (en) * 2024-02-01 2024-03-15 浙江华创视讯科技有限公司 Data steganography method, equipment and storage medium for video conference

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7639840B2 (en) * 2004-07-28 2009-12-29 Sarnoff Corporation Method and apparatus for improved video surveillance through classification of detected objects
US9076043B2 (en) * 2012-08-03 2015-07-07 Kodak Alaris Inc. Video summarization using group sparsity analysis
US9171213B2 (en) * 2013-03-15 2015-10-27 Xerox Corporation Two-dimensional and three-dimensional sliding window-based methods and systems for detecting vehicles
CN106778584B (en) * 2016-12-08 2019-07-16 南京邮电大学 A kind of face age estimation method based on further feature Yu shallow-layer Fusion Features
CN106709568B (en) * 2016-12-16 2019-03-22 北京工业大学 The object detection and semantic segmentation method of RGB-D image based on deep layer convolutional network
CN107067011B (en) * 2017-03-20 2019-05-03 北京邮电大学 A kind of vehicle color identification method and device based on deep learning
CN108229523B (en) * 2017-04-13 2021-04-06 深圳市商汤科技有限公司 Image detection method, neural network training method, device and electronic equipment
CN107292247A (en) * 2017-06-05 2017-10-24 浙江理工大学 A kind of Human bodys' response method and device based on residual error network
CN107992819B (en) * 2017-11-29 2020-07-10 青岛海信网络科技股份有限公司 Method and device for determining vehicle attribute structural features
CN110147700B (en) * 2018-05-18 2023-06-27 腾讯科技(深圳)有限公司 Video classification method, device, storage medium and equipment
CN109145840B (en) * 2018-08-29 2022-06-24 北京字节跳动网络技术有限公司 Video scene classification method, device, equipment and storage medium
CN109886951A (en) * 2019-02-22 2019-06-14 北京旷视科技有限公司 Method for processing video frequency, device and electronic equipment
CN110147711B (en) * 2019-02-27 2023-11-14 腾讯科技(深圳)有限公司 Video scene recognition method and device, storage medium and electronic device
CN110070067B (en) * 2019-04-29 2021-11-12 北京金山云网络技术有限公司 Video classification method, training method and device of video classification method model and electronic equipment
CN110163188B (en) * 2019-06-10 2023-08-08 腾讯科技(深圳)有限公司 Video processing and method, device and equipment for embedding target object in video
CN110363204A (en) * 2019-06-24 2019-10-22 杭州电子科技大学 A kind of object expression method based on multitask feature learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138830A1 (en) * 2015-01-09 2019-05-09 Irvine Sensors Corp. Methods and Devices for Cognitive-based Image Data Analytics in Real Time Comprising Convolutional Neural Network
CN105550699A (en) * 2015-12-08 2016-05-04 北京工业大学 CNN-based video identification and classification method through time-space significant information fusion
CN110032926A (en) * 2019-02-22 2019-07-19 哈尔滨工业大学(深圳) A kind of video classification methods and equipment based on deep learning
CN110163115A (en) * 2019-04-26 2019-08-23 腾讯科技(深圳)有限公司 A kind of method for processing video frequency, device and computer readable storage medium
CN110691246A (en) * 2019-10-31 2020-01-14 北京金山云网络技术有限公司 Video coding method and device and electronic equipment
CN110766096A (en) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 Video classification method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG LIMIN; LI WEI; VAN GOOL LUC: "Appearance-and-Relation Networks for Video Classification", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, IEEE, 18 June 2018 (2018-06-18), pages 1430 - 1439, XP033476106, DOI: 10.1109/CVPR.2018.00155 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449148A (en) * 2021-06-24 2021-09-28 北京百度网讯科技有限公司 Video classification method and device, electronic equipment and storage medium
CN113449148B (en) * 2021-06-24 2023-10-20 北京百度网讯科技有限公司 Video classification method, device, electronic equipment and storage medium
CN113691863A (en) * 2021-07-05 2021-11-23 浙江工业大学 Lightweight method for extracting video key frames
CN113691863B (en) * 2021-07-05 2023-06-20 浙江工业大学 Lightweight method for extracting video key frames
CN113569684A (en) * 2021-07-20 2021-10-29 上海明略人工智能(集团)有限公司 Short video scene classification method and system, electronic equipment and storage medium
CN114611396A (en) * 2022-03-15 2022-06-10 国网安徽省电力有限公司蚌埠供电公司 Line loss analysis method based on big data
CN114782797A (en) * 2022-06-21 2022-07-22 深圳市万物云科技有限公司 House scene classification method, device and equipment and readable storage medium
CN114782797B (en) * 2022-06-21 2022-09-20 深圳市万物云科技有限公司 House scene classification method, device and equipment and readable storage medium
CN115035462A (en) * 2022-08-09 2022-09-09 阿里巴巴(中国)有限公司 Video identification method, device, equipment and storage medium
CN115035462B (en) * 2022-08-09 2023-01-24 阿里巴巴(中国)有限公司 Video identification method, device, equipment and storage medium
CN115410048A (en) * 2022-09-29 2022-11-29 昆仑芯(北京)科技有限公司 Training method, device, equipment and medium of image classification model and image classification method, device and equipment
CN115410048B (en) * 2022-09-29 2024-03-19 昆仑芯(北京)科技有限公司 Training of image classification model, image classification method, device, equipment and medium

Also Published As

Publication number Publication date
CN110766096B (en) 2022-09-23
CN110766096A (en) 2020-02-07

Similar Documents

Publication Publication Date Title
WO2021082743A1 (en) Video classification method and apparatus, and electronic device
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
WO2020221278A1 (en) Video classification method and model training method and apparatus thereof, and electronic device
WO2020119527A1 (en) Human action recognition method and apparatus, and terminal device and storage medium
CN110147711B (en) Video scene recognition method and device, storage medium and electronic device
JP6536058B2 (en) Method, computer system, and program for estimating demographic characteristics of a user
WO2017166586A1 (en) Image identification method and system based on convolutional neural network, and electronic device
WO2019144892A1 (en) Data processing method, device, storage medium and electronic device
US20200175062A1 (en) Image retrieval method and apparatus, and electronic device
WO2022111069A1 (en) Image processing method and apparatus, electronic device and storage medium
EP3757874B1 (en) Action recognition method and apparatus
CN105005593B (en) The scene recognition method and device of multi-user shared equipment
WO2017092623A1 (en) Method and device for representing text as vector
CN112950581A (en) Quality evaluation method and device and electronic equipment
WO2020114108A1 (en) Clustering result interpretation method and device
CN108960314B (en) Training method and device based on difficult samples and electronic equipment
CN111768457B (en) Image data compression method, device, electronic equipment and storage medium
Kim et al. Deep blind image quality assessment by employing FR-IQA
CN111860353A (en) Video behavior prediction method, device and medium based on double-flow neural network
WO2021103474A1 (en) Image processing method and apparatus, storage medium and electronic apparatus
WO2023124278A1 (en) Image processing model training method and apparatus, and image classification method and apparatus
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN112529068A (en) Multi-view image classification method, system, computer equipment and storage medium
CN115098732B (en) Data processing method and related device
CN112541469B (en) Crowd counting method and system based on self-adaptive classification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20882486

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20882486

Country of ref document: EP

Kind code of ref document: A1