WO2020221278A1 - Video classification method and model training method and apparatus thereof, and electronic device - Google Patents

Video classification method and model training method and apparatus thereof, and electronic device

Info

Publication number
WO2020221278A1
WO2020221278A1 (PCT/CN2020/087690)
Authority
WO
WIPO (PCT)
Prior art keywords
classification
video
frame
training
neural network
Prior art date
Application number
PCT/CN2020/087690
Other languages
French (fr)
Chinese (zh)
Inventor
苏驰
李凯
陈宜航
刘弘也
Original Assignee
北京金山云网络技术有限公司
北京金山云科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京金山云网络技术有限公司 and 北京金山云科技有限公司
Publication of WO2020221278A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • This application relates to the field of image processing technology, and in particular to a video classification method and model training method, device and electronic equipment.
  • In related technologies, a video can be classified by a three-dimensional (3D) convolutional neural network, with the spatiotemporal features of the video extracted through 3D convolution.
  • However, a 3D convolutional neural network has a large number of network parameters, which makes both the network training process and the recognition process computationally expensive and time-consuming; in addition, 3D convolutional neural networks have relatively few layers, making it difficult to mine high-level semantic features, which keeps the video classification accuracy low.
  • In view of this, the purpose of this application is to provide a video classification method and a corresponding model training method, device, and electronic equipment, so as to reduce the amount of computation, improve model training and recognition efficiency, and improve the accuracy of video classification.
  • an embodiment of the present application provides a method for training a video classification model.
  • the method includes: determining current training data based on a preset training set, the training data including multiple video frames; and inputting the training data to an initial model;
  • the initial model includes a convolutional neural network, a recurrent neural network and an output network; the initial features of the multiple video frames are extracted through the convolutional neural network; the final features of the multiple video frames are extracted from the initial features through the recurrent neural network; the final features are input to the output network, which outputs the prediction result of the multiple video frames; the loss value of the prediction result is determined through a preset prediction loss function; and the initial model is trained according to the loss value until the parameters in the initial model converge, obtaining the video classification model.
  • an embodiment of the present application provides a video classification method, which includes: obtaining a video to be classified; obtaining multiple video frames from the video according to a preset sampling interval; inputting the multiple video frames to a pre-trained video classification model, which outputs the classification results of the multiple video frames, the video classification model being trained through the training method of the above-mentioned video classification model; and determining the category of the video according to the classification results of the multiple video frames.
  • an embodiment of the present application provides a training device for a video classification model.
  • the device includes: a training data determination module, configured to determine current training data based on a preset training set, the training data including multiple video frames;
  • a training data input module, configured to input the training data to an initial model;
  • the initial model includes a convolutional neural network, a recurrent neural network and an output network;
  • an initial feature extraction module, configured to extract the initial features of the multiple video frames through the convolutional neural network;
  • a final feature extraction module, configured to extract the final features of the multiple video frames from the initial features through the recurrent neural network;
  • a prediction result output module, configured to input the final features to the output network and output the prediction result of the multiple video frames;
  • a loss value determination and training module, configured to determine the loss value of the prediction result through a preset prediction loss function, and to train the initial model according to the loss value until the parameters in the initial model converge, obtaining the video classification model.
  • an embodiment of the present application provides a video classification device.
  • the device includes: a video acquisition module, configured to acquire a video to be classified; a video frame acquisition module, configured to acquire multiple video frames from the video at a preset sampling interval; a classification module, configured to input the multiple video frames to a pre-trained video classification model and output the classification results of the multiple video frames, the video classification model being trained through the training method of the above-mentioned video classification model; and a category determination module, configured to determine the category of the video according to the classification results of the multiple video frames.
  • an embodiment of the present application provides an electronic device, including a processor and a memory.
  • the memory stores machine-executable instructions that can be executed by the processor.
  • the processor executes the machine-executable instructions to implement the aforementioned training method of the video classification model, or the steps of the aforementioned video classification method.
  • an embodiment of the present application provides a machine-readable storage medium that stores machine-executable instructions.
  • when the machine-executable instructions are called and executed by a processor, they prompt the processor to implement the training method of the video classification model, or the steps of the video classification method.
  • an embodiment of the present application provides executable program code, which is configured to be run so as to execute any of the above-mentioned training methods for a video classification model, or the steps of any of the above-mentioned video classification methods.
  • the video classification method, its model training method and device, and the electronic equipment provided by the embodiments of this application combine a convolutional neural network with a recurrent neural network and extract features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving the efficiency of model training and recognition; the method also takes the association information between video frames into account while extracting features, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.
  • FIG. 1 is a flowchart of a method for training a video classification model provided by an embodiment of the application;
  • FIG. 2 is a schematic structural diagram of a convolutional neural network in an initial model provided by an embodiment of this application;
  • FIG. 3 is a schematic structural diagram of an initial model provided by an embodiment of this application.
  • FIG. 4 is a schematic structural diagram of another initial model provided by an embodiment of the application.
  • FIG. 5 is a flowchart of another video classification model training method provided by an embodiment of the application.
  • FIG. 6 is a flowchart of a video classification method provided by an embodiment of the application.
  • FIG. 7 is a schematic structural diagram of a training device for a video classification model provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a video classification device provided by an embodiment of this application.
  • FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • embodiments of the present application provide a video classification method, a model training method and device, and electronic equipment; this technology can be widely applied to the classification of conventional videos and short-video files in various formats, and can be used in scenarios such as video surveillance, video push, and video management.
  • the method includes the following steps:
  • Step S102 Determine current training data based on a preset training set; the training data includes multiple video frames.
  • the training data needs to be determined multiple times during the training of the initial model; in one embodiment, the current training data can be determined from the preset training set each time; in other implementations, new training data can also be obtained each time.
  • the training set can contain multiple videos or multiple groups of video frames, and each group contains multiple video frames.
  • the multi-frame video frames are collected from the same video.
  • Each video or each group of video frames is pre-labeled with type tags, which can be assigned from multiple angles, such as video theme, scene, action, and character attributes, so each video or each group of video frames can be classified from multiple angles.
  • For example, the type tags of video A may include TV series, metropolis, crime-solving, idol, and so on.
  • if the training set contains multiple videos, a video can be selected from it, multiple video frames can then be collected from that video, and the collected video frames are determined as the training data; if the training set contains multiple groups of video frames, a group of video frames can be selected from it, and the multiple video frames in that group are determined as the training data.
  • the above-mentioned training set can also be divided into a training subset and a cross-validation subset according to a preset ratio.
  • the current training data can be determined from the training subset.
  • test data can be obtained from the cross-validation subset to verify the performance of the model.
  • Step S104 input the training data to the initial model;
  • the initial model includes a convolutional neural network, a recurrent neural network and an output network.
  • multiple video frames in the training data can be adjusted to a preset size, such as 512 × 512 pixels, so that the input video frames match the convolutional neural network.
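  • As a hedged illustration (PyTorch/torchvision is an assumption; the text only specifies a preset size such as 512 × 512), this resizing step might look like:

```python
# A minimal preprocessing sketch. The torchvision pipeline and the helper
# name frames_to_batch are assumptions; only the 512x512 preset size comes
# from the embodiment above.
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),   # preset size from the embodiment
    transforms.ToTensor(),           # HWC uint8 -> CHW float in [0, 1]
])

def frames_to_batch(pil_frames):
    """Stack T PIL frames into a (T, 3, 512, 512) tensor for the initial model."""
    return torch.stack([preprocess(f) for f in pil_frames])
```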
  • In step S106, the features of the multiple video frames are extracted through the convolutional neural network as the initial features.
  • multiple video frames may be input to the convolutional neural network.
  • the features output by the convolutional neural network are referred to as initial features.
  • the convolutional neural network can be implemented with multiple convolutional layers, and can of course also include pooling layers, fully connected layers, activation functions, and so on.
  • the convolutional neural network performs convolution operations on each input video frame to obtain the feature map corresponding to each video frame; that is, the initial feature includes multiple feature maps, or can be regarded as one large feature map composed of multiple feature maps.
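  • A minimal sketch of this per-frame processing: treating the T sampled frames as a batch, a 2D CNN yields one feature map per frame. The backbone layers here are illustrative assumptions:

```python
# Apply a 2D CNN to every sampled frame at once by treating the T frames as
# a batch; each frame yields its own feature map (the "initial features").
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

frames = torch.randn(8, 3, 512, 512)   # T = 8 resized video frames
initial_features = backbone(frames)    # (8, 128, 128, 128): one map per frame
```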
  • In step S108, the features of the multiple video frames are extracted from the initial features through the recurrent neural network as the final features.
  • the above-mentioned initial features can be input to the recurrent neural network.
  • the features output by the recurrent neural network are referred to as final features here.
  • the multiple video frames are related to each other in content.
  • the above-mentioned convolutional neural network usually processes each video frame separately, and the extracted feature maps of each video frame are not related to each other.
  • the initial features can therefore be processed through the recurrent neural network: based on the temporal order of the multiple video frames, the association information of preceding and following video frames is introduced into the feature processing, so that the final features better represent the video type.
  • A recurrent neural network is a class of neural networks that takes sequence data as input and recurses along the evolution direction of the sequence. Therefore, using a recurrent neural network to process the initial features can introduce the association information of preceding and following video frames.
  • Step S110 input the final feature to the output network, and output the prediction result of the multi-frame video frame.
  • the output network can be realized by a fully connected layer. If the final feature is a two-dimensional multilayer feature, the fully connected layer can convert it into a prediction result in the form of a one-dimensional vector; each element of the prediction result corresponds to a category, and the value of the element represents the probability that the video belongs to that category. Alternatively, the final feature may have other dimensions, which is not specifically limited here.
  • Step S112 Determine the loss value of the prediction result through the preset prediction loss function; train the initial model according to the loss value until the parameters in the initial model converge to obtain the video classification model.
  • the multi-frame video frames in the training data are pre-labeled with type labels.
  • the type labels can be converted into vector form.
  • in that vector, the probability value corresponding to a category the video belongs to is usually 1,
  • and the probability value corresponding to a category the video does not belong to is usually 0.
  • the prediction loss function can compare the difference between the prediction result and the labeled type label. Generally, the greater the difference, the greater the aforementioned loss value.
  • the parameters of each part of the above-mentioned initial model can be adjusted to achieve the purpose of training. When each parameter in the model converges, the training ends and the video classification model is obtained.
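  • As a hedged, self-contained sketch of steps S102 to S112 (toy layer sizes, stand-in data, and cross-entropy standing in for the preset prediction loss are all assumptions), the training loop might look like:

```python
# End-to-end sketch: CNN extracts per-frame initial features, an LSTM turns
# the sequence into a final feature, and an output layer yields the
# prediction; the loss then drives parameter updates until convergence.
import torch
import torch.nn as nn

cnn = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())   # initial features
rnn = nn.LSTM(16, 32, batch_first=True)                      # final features
out = nn.Linear(32, 5)                                       # output network
params = list(cnn.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.CrossEntropyLoss()                              # stand-in prediction loss

for step in range(1000):                 # in practice: until parameters converge
    frames = torch.randn(8, 3, 64, 64)   # stand-in for T = 8 sampled video frames
    label = torch.tensor([2])            # stand-in type label
    feats = cnn(frames).unsqueeze(0)     # (1, T, 16) sequence of initial features
    _, (h_n, _) = rnn(feats)
    pred = out(h_n[-1])                  # (1, 5) prediction result
    loss = loss_fn(pred, label)          # loss value of the prediction result
    opt.zero_grad(); loss.backward(); opt.step()
```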
  • the above training method for a video classification model combines a convolutional neural network with a recurrent neural network and extracts features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving the efficiency of model training and recognition; the method also takes the association information between video frames into account while extracting features, so the extracted features can accurately represent the video type, thereby improving the accuracy of video classification.
  • the above model can process multiple video frames sampled from the video to identify the video category;
  • the amount of processed data is small, which further reduces the amount of computation and improves the efficiency of training and recognition.
  • the embodiment of this application also provides another method for training a video classification model, implemented on the basis of the method described in the above embodiment. As described above, the initial model includes a convolutional neural network, a recurrent neural network, and an output network; this embodiment further describes the specific structure of the initial model.
  • Figure 2 shows a schematic diagram of the structure of a convolutional neural network in an initial model.
  • the convolutional neural network includes multiple groups of sub-networks connected in sequence (three groups are taken as an example in Figure 2), a global average pooling layer, and a classification fully connected layer; each group of sub-networks includes a batch normalization layer, an activation function layer, a convolution layer, and a pooling layer connected in sequence.
  • the batch normalization layer in each group of sub-networks is used to normalize the data of the input video frame or feature map.
  • This process can speed up the convergence of the convolutional neural network and the initial model, and can alleviate the problem of gradient dispersion in multi-layer convolutional networks. The activation function layer in the convolutional neural network then performs a function transformation on the normalized video frame or feature map.
  • This transformation breaks the linearity of the convolutional layer's input combinations and can improve the feature expression ability of the convolutional neural network.
  • the activation function layer may specifically use a Sigmoid function, a tanh function, a ReLU function, and so on.
  • the convolution layer is used to perform convolution calculations on the video frame or feature map transformed by the activation function layer, and output the corresponding feature map;
  • the pooling layer can be an average pooling layer (mean-pooling), a global average pooling layer, a max pooling layer (max-pooling), etc.;
  • the pooling layer can be used to compress the feature map output by the convolutional layer, retaining the main features and discarding the non-main ones, so as to reduce the dimension of the feature map.
  • Taking the average pooling layer as an example, it averages the feature point values in a neighborhood of preset size around the current feature point, and uses that average as the new value of the current feature point.
  • the pooling layer can also help the feature map maintain certain invariances, such as rotation invariance, translation invariance, and scale invariance.
  • the global average pooling layer that follows the sub-networks averages each layer of the feature maps output by the last group of sub-networks, obtaining a one-dimensional feature vector and further reducing the dimensionality of the feature map.
  • the classification fully connected layer performs fully connected calculations on the feature vectors output by the global average pooling layer, and normalizes the calculation results through functions such as softmax.
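  • A hedged sketch of the Figure 2 topology follows: three sub-networks, each batch normalization → activation → convolution → pooling, followed by global average pooling and a classification fully connected layer. The channel widths, kernel sizes, and choice of ReLU/max pooling are illustrative assumptions:

```python
# Sketch of the described convolutional neural network; all sizes assumed.
import torch
import torch.nn as nn

def subnetwork(in_ch, out_ch):
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),                   # batch normalization layer
        nn.ReLU(),                               # activation function layer (Sigmoid/tanh also possible)
        nn.Conv2d(in_ch, out_ch, 3, padding=1),  # convolution layer
        nn.MaxPool2d(2),                         # pooling layer (average pooling also possible)
    )

class ConvNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.blocks = nn.Sequential(
            subnetwork(3, 64), subnetwork(64, 128), subnetwork(128, 256),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)       # global average pooling layer
        self.fc = nn.Linear(256, num_classes)    # classification fully connected layer

    def forward(self, x):                        # x: (N, 3, H, W)
        x = self.gap(self.blocks(x)).flatten(1)  # one-dimensional feature vector
        return self.fc(x)
```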
  • before the training method of the video classification model is executed, the convolutional neural network may be pre-trained on large data sets in advance, so as to obtain the initial parameters of the convolutional neural network.
  • the data set may include an object recognition data set and a scene recognition data set.
  • during this pre-training, the batch size can be set to 256 (that is, the aforementioned preset number), the momentum to 0.9, and the weight decay coefficient to 0.0001.
  • the momentum and weight decay coefficients are used when updating the parameters of the convolutional neural network through the back-propagation algorithm and stochastic gradient descent.
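  • A minimal sketch of this pre-training setup (the stand-in network and the learning rate of 0.1 are assumptions; only batch size 256, momentum 0.9, and weight decay 0.0001 come from the text):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(64, 1000))       # stand-in convolutional network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,   # lr is an assumption
                            momentum=0.9, weight_decay=1e-4)
# The batch size of 256 (the preset number) would be set on the data loader:
# torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)
```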
  • Figure 3 shows a schematic structural diagram of an initial model
  • the initial model includes a convolutional neural network, a recurrent neural network and an output network, and further includes a global average pooling network arranged between the convolutional neural network and the recurrent neural network. The global average pooling network reduces the dimensionality of the initial features so that their dimensions match the recurrent neural network: the initial features are subjected to dimensionality reduction through the global average pooling network to obtain dimension-reduced features, and the features of the multiple video frames are then extracted from the dimension-reduced features through the recurrent neural network as the final features.
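  • A sketch of this intermediate step (shapes are assumptions): each per-frame feature map (C, H, W) is averaged over its spatial dimensions, giving one feature vector per frame for the recurrent network:

```python
import torch

initial_features = torch.randn(8, 256, 16, 16)  # (T, C, H, W) from the CNN
reduced = initial_features.mean(dim=(2, 3))     # (T, C) dimension-reduced features
rnn_input = reduced.unsqueeze(0)                # (1, T, C) sequence for the RNN
```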
  • the recurrent neural network may specifically be a long short-term memory (LSTM) network.
  • An LSTM network performs better than an ordinary recurrent neural network and can make up for defects of ordinary recurrent neural networks such as gradient explosion and gradient vanishing.
  • the LSTM network includes input gates, output gates, and forget gates; the input gate is configured to extract the features that need to be memorized from the initial features; the output gate is configured to read the memorized features; and the forget gate is configured to determine whether to retain the memorized features.
  • the opening and closing behavior of the input, output, and forget gates can be trained, thereby completing the training of the recurrent neural network.
  • assuming the initial features contain M feature vectors, denoted z_t, t ∈ [1, ..., M], these M feature vectors can be fed to the LSTM network in sequence.
  • the final feature of the multiple video frames, denoted h_M, is then obtained; the LSTM network processes each feature vector z_t as follows:
    f_t = σ(W_f·[h_(t-1), z_t] + b_f)
    i_t = σ(W_i·[h_(t-1), z_t] + b_i)
    C̃_t = tanh(W_C·[h_(t-1), z_t] + b_C)
    C_t = f_t ⊙ C_(t-1) + i_t ⊙ C̃_t
    o_t = σ(W_o·[h_(t-1), z_t] + b_o)
    h_t = o_t ⊙ tanh(C_t)
  • where W_f, W_i, W_C, W_o, b_f, b_i, b_C and b_o are preset parameters of the LSTM, σ is the Sigmoid function, and ⊙ denotes element-wise multiplication; after the M-th feature vector is input to the LSTM, h_M is obtained; this h_M is the final feature, which is input to the subsequent output network.
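  • A minimal sketch of this step (the feature and hidden sizes are assumptions): the M per-frame feature vectors are fed in order and the last hidden state serves as h_M:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=256, hidden_size=512, batch_first=True)
z = torch.randn(1, 10, 256)    # (batch, M, feature dim): M = 10 feature vectors
outputs, (h_n, c_n) = lstm(z)
h_M = h_n[-1]                  # (1, 512): final feature passed to the output network
```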
  • the above-mentioned output network may include a classification fully connected layer; the above-mentioned final features are input to the classification fully connected layer, and the classification result vector may be output.
  • the classification fully connected layer contains multiple neurons and is preset with a weight vector; the weight vector contains the weight elements corresponding to each neuron of the classification fully connected layer; each neuron is connected with every feature element of the final feature.
  • each neuron multiplies the feature elements of the final feature by the corresponding weight elements in its weight vector to obtain the predicted value of that neuron; since the classification fully connected layer contains multiple neurons, the predicted values of the multiple neurons constitute the above classification result vector.
  • the above-mentioned initial model may further include a classification function; inputting the classification result vector output by the above-mentioned classification fully connected layer into the classification function can output the classification probability vector corresponding to the classification result vector.
  • the classification function is used to calculate the probability of each element in the classification result vector.
  • the function can be a Softmax function or other probability regression functions.
  • the above initial model combines a convolutional neural network with a long short-term memory network and extracts features through a combination of two-dimensional convolution and one-dimensional convolution.
  • the method also takes the association information between video frames into account while extracting features, so the extracted features can accurately represent the video type; moreover, the long short-term memory network avoids the problems of gradient explosion and gradient vanishing when the network is deep, improving the performance of the model, which is conducive to extracting deep features of the video frames and thereby further improving the accuracy of video classification.
  • the embodiment of the present application also provides another method for training a video classification model, which is implemented on the basis of the method described in the foregoing embodiment; this embodiment focuses on the specific content of the output network and the prediction loss function.
  • the prediction loss function includes a classification loss function;
  • the classification loss function is a class-weighted cross-entropy over the classification probability vector, in which:
  • Σ represents the summation operation;
  • exp represents the exponential function with the natural constant e as its base;
  • log represents the logarithm operation;
  • p_l is the l-th element of the classification probability vector corresponding to the classification result vector in the prediction result;
  • y_l is the l-th element of the standard probability vector pre-labeled for the multiple video frames;
  • r_l is the proportion of the category corresponding to y_l in the training set;
  • the weighting exponent is a preset hyperparameter, which can be set to 1.
  • since r_l is the proportion of the category corresponding to y_l in the training set, if a category makes up a low proportion of the training set, its r_l value is smaller and the corresponding weight w_l is larger; this plays a balancing role, alleviates the problem of uneven sample distribution across categories, and in turn improves the training efficiency and recognition accuracy of the model.
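  • A hedged sketch of a loss in this spirit; the exact weighting formula is not reproduced in this text, so the form w = (1 / r) ** gamma (larger when the class proportion r is smaller, gamma preset to 1) is an assumption:

```python
import torch

def weighted_classification_loss(p, y, r, gamma=1.0):
    """p, y, r: (num_classes,) probability, label, and class-proportion vectors."""
    w = (1.0 / r) ** gamma                        # assumed balancing weight w_l
    return -(w * y * torch.log(p + 1e-12)).sum()  # class-weighted cross-entropy

loss = weighted_classification_loss(
    p=torch.tensor([0.7, 0.2, 0.1]),
    y=torch.tensor([1.0, 0.0, 0.0]),
    r=torch.tensor([0.5, 0.3, 0.2]),
)
```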
  • the output network includes a classification fully connected layer.
  • in this embodiment, the output network also includes a threshold fully connected layer, as shown in Figure 4; the final feature is input to the threshold fully connected layer, which outputs a threshold result vector.
  • the threshold fully connected layer contains multiple neurons and is preset with a weight vector; the weight vector contains the weight elements corresponding to each neuron of the threshold fully connected layer; each neuron is connected with every feature element of the final feature.
  • each neuron multiplies the feature elements of the final feature by the corresponding weight elements in its weight vector to obtain the predicted value of that neuron; since the threshold fully connected layer contains multiple neurons, the predicted values of the multiple neurons constitute the above threshold result vector.
  • the threshold fully connected layer is configured to extract from the final feature the threshold that the model has learned for each category, that is, the threshold result vector.
  • Each category corresponds to its own threshold.
  • the thresholds of each category can be the same or different. Compared with the way of manually setting the threshold, the threshold of model learning is more accurate and reasonable, which is beneficial to improve the classification accuracy of the model.
  • the prediction loss function also includes a threshold loss function to evaluate the accuracy of the threshold result vector;
  • the function value of the classification loss function and the function value of the threshold loss function can be weighted and summed to obtain the loss value of the prediction result.
  • the classification loss function takes the proportion of each category in the training set into account, which alleviates the problem of uneven sample distribution across categories and thereby improves the training efficiency and recognition accuracy of the model; the output network is also provided with a threshold fully connected layer, and compared with manually set thresholds, the thresholds learned by the model are more accurate and reasonable, further improving the classification accuracy of the model.
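  • A sketch of the Figure 4 output network: two fully connected heads over the final feature h_M, one producing the classification result vector and one producing the learned per-class thresholds. The sizes and the weighted-sum coefficient lambda_thr are assumptions:

```python
import torch
import torch.nn as nn

class OutputNetwork(nn.Module):
    def __init__(self, feat_dim=512, num_classes=10):
        super().__init__()
        self.cls_fc = nn.Linear(feat_dim, num_classes)  # classification fully connected layer
        self.thr_fc = nn.Linear(feat_dim, num_classes)  # threshold fully connected layer

    def forward(self, h):
        return self.cls_fc(h), self.thr_fc(h)           # classification / threshold result vectors

h_M = torch.randn(1, 512)                 # final feature from the LSTM
cls_vec, thr_vec = OutputNetwork()(h_M)
lambda_thr = 1.0                          # assumed weight for the sum
# total_loss = cls_loss + lambda_thr * thr_loss   # weighted sum of the two losses
```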
  • This embodiment of the application also provides another method for training a video classification model, implemented on the basis of the method described in the above embodiment; this embodiment focuses on the specific process of training the initial model according to the loss value. As shown in Figure 5, the method includes the following steps:
  • Step S502 Determine current training data based on a preset training set; the training data includes multiple video frames;
  • Step S504 input training data to an initial model;
  • the initial model includes a convolutional neural network, a recurrent neural network, and an output network;
  • Step S506 extracting features of multiple video frames through a convolutional neural network as initial features
  • Step S508 extracting the features of the multi-frame video frame from the initial features through the recurrent neural network as the final feature;
  • Step S510 input the final feature to the output network, and output the prediction result of the multi-frame video frame;
  • Step S512 Determine the loss value of the prediction result through a preset prediction loss function
  • Step S514 Update the parameters in the initial model according to the loss value;
  • the function mapping relationship can be set in advance; the original parameters and the loss value are input into the function mapping relationship, and the updated parameters are calculated from it.
  • the function mapping relationship of different parameters can be the same or different.
  • the parameters to be updated can be determined from the initial model according to preset rules; they can be all parameters in the initial model, or some parameters randomly selected from it. The derivative of the loss value with respect to each parameter to be updated, ∂L/∂W, is then calculated, where L is the loss value, W is the parameter to be updated, and ∂ represents the partial derivative operation; a parameter to be updated can also be called the weight of a neuron. This process can also be called the back-propagation algorithm: if the loss value is large, the output of the current initial model does not match the expected output, so the derivative of the loss value with respect to each parameter to be updated is computed as the basis for adjusting that parameter.
  • each parameter to be updated is then adjusted along its derivative, giving the updated parameter W' = W − α·∂L/∂W, where α is a preset coefficient.
  • This process can also be referred to as stochastic gradient descent; the derivative of each parameter to be updated can be understood as the direction in which the loss value drops fastest from the current parameter value. Adjusting the parameter in this direction quickly reduces the loss value, so that the parameter converges.
  • each time the initial model is trained, a loss value is obtained. One or more parameters can then be randomly selected from the parameters of the initial model for the above update process, which shortens model training time and speeds up the algorithm; of course, the above update process can also be performed on all the parameters in the initial model, in which case the model is trained more accurately.
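  • A self-contained sketch of this update rule W' = W − α·∂L/∂W applied to a randomly selected subset of parameters; the stand-in model and loss are assumptions:

```python
import random
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()   # stand-in loss value
loss.backward()                                  # backpropagation: fills p.grad with dL/dW

alpha = 0.01                                     # preset coefficient
params = list(model.parameters())
with torch.no_grad():
    for p in random.sample(params, k=max(1, len(params) // 2)):  # random subset
        if p.grad is not None:
            p -= alpha * p.grad                  # W' = W - alpha * dL/dW
```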
  • Step S516 Determine whether the updated parameters are all converged; if the updated parameters are all converged, perform step S518; if the updated parameters are not all converged, perform step S502;
  • the step of determining the current training data based on the preset training set is continued until the updated parameters all converge.
  • Step S518 Determine the initial model after parameter update as the video classification model.
  • in this way, a convolutional neural network and a recurrent neural network are combined, and features are extracted through a combination of two-dimensional convolution and one-dimensional convolution.
  • Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving model training and recognition efficiency; the method also takes the association information between video frames into account while extracting features, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.
  • an embodiment of the present application also provides a video classification method; this method is implemented on the basis of the video classification model training method described in the above embodiment, as shown in FIG. 6, the method includes The following steps:
  • Step S602 Obtain a video to be classified
  • the video can be a regular video or a short video; the specific format of the video can be MPEG (Moving Picture Experts Group), AVI (Audio Video Interleave), MOV (QuickTime movie format), etc., which is not limited here.
  • Step S604 Obtain multiple video frames from the video according to a preset sampling interval
  • the sampling interval can be preset; for example, a sampling interval of 0.2 seconds means 5 frames are sampled per second.
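  • A hedged sampling sketch with OpenCV (the library choice and fallback FPS are assumptions; only the 0.2-second interval comes from the example):

```python
import cv2

def sample_frames(path, interval_s=0.2):
    """Grab one frame every interval_s seconds (5 frames/second at 0.2 s)."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # assumed fallback if FPS is unknown
    step = max(1, round(fps * interval_s))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```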
  • Step S606 Input the multi-frame video frame to the pre-trained video classification model, and output the classification result of the multi-frame video frame; the video classification model is obtained by training the above-mentioned video classification model training method;
  • Step S608 Determine the video category according to the classification result of the multiple video frames.
  • the video classification method provided by this embodiment of the application first obtains multiple video frames from the video to be classified according to a preset sampling interval; the multiple video frames are input to a pre-trained video classification model, which outputs their classification results; and the category of the video is determined according to those classification results. Since the video classification model combines a convolutional neural network with a recurrent neural network and extracts features through a combination of two-dimensional convolution and one-dimensional convolution, the amount of computation is greatly reduced.
  • the method also takes the association information between video frames into account while extracting features, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.
  • the classification result of the multi-frame video frame output by the above-mentioned video classification model may include one or more categories, and the classification result of the multi-frame video frame can be directly determined as the video category.
  • the classification result of the multi-frame video frame includes a classification probability vector and a threshold result vector.
  • the probability value of each category in the classification probability vector can be compared with the corresponding threshold in the threshold result vector to determine the video category.
  • specifically, the category vector of the video can be calculated element by element, with the l-th element non-zero when p_l is greater than θ_l and zero otherwise, where:
  • p_l is the l-th element of the classification probability vector;
  • θ_l is the l-th element of the threshold result vector;
  • in the category vector, the category corresponding to a non-zero element is determined as a category of the video: since the probability value of that category is greater than the corresponding threshold, the category can be regarded as a category of the video.
  • in this way, the model outputs not only the classification probability vector but also the threshold result vector, and the category of the video is finally determined based on the comparison of the two vectors.
  • compared with manually set thresholds, the thresholds output by the model are more accurate and reasonable, which helps improve the accuracy of video classification. Tagging videos based on the classification results helps users quickly discover content they are interested in, and also helps recommend videos of interest to users, improving user experience.
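  • A minimal sketch of this category decision (the example values are illustrative): an element of the category vector is non-zero exactly when the class probability exceeds its learned threshold:

```python
import torch

p = torch.tensor([0.90, 0.20, 0.65])      # classification probability vector
theta = torch.tensor([0.50, 0.40, 0.70])  # threshold result vector
category_vector = (p > theta).int()       # tensor([1, 0, 0]): the video belongs to class 0
```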
  • the training data determining module 70 is configured to determine the current training data based on a preset training set; the training data includes multiple video frames;
  • the training data input module 71 is configured to input training data to the initial model;
  • the initial model includes a convolutional neural network, a recurrent neural network, and an output network;
  • the initial feature extraction module 72 is configured to extract features of multiple frames of video frames through a convolutional neural network as the initial features
  • the final feature extraction module 73 is configured to extract features of multiple video frames from the initial features through a recurrent neural network as the final feature;
  • the prediction result output module 74 is configured to input the final feature to the output network and output the prediction result of the multi-frame video frame;
  • the loss value determination and training module 75 is configured to determine the loss value of the prediction result through a preset prediction loss function; the initial model is trained according to the loss value until the parameters in the initial model converge to obtain a video classification model.
  • the above training device for a video classification model combines a convolutional neural network with a recurrent neural network and extracts features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving the efficiency of model training and recognition; the device also takes the association information between video frames into account while extracting features, so the extracted features can accurately represent the video type, thereby improving the accuracy of video classification.
  • the above convolutional neural network includes multiple groups of sub-networks, a global average pooling layer, and a classification fully connected layer connected in sequence; each group of sub-networks includes a batch normalization layer, an activation function layer, a convolution layer, and a pooling layer; the initial parameters of the convolutional neural network are obtained by training on a preset data set.
  • the above initial model further includes a global average pooling network arranged between the convolutional neural network and the recurrent neural network; the above device further includes a dimensionality reduction module, configured to perform dimensionality reduction on the initial features through the global average pooling network to obtain dimension-reduced features; the final feature extraction module 73 is specifically configured to extract the features of the multiple video frames from the dimension-reduced features through the recurrent neural network as the final features.
  • the above recurrent neural network includes a long short-term memory network.
  • the above output network includes a classification fully connected layer, and the initial model also includes a classification function; the above prediction result output module is configured to input the final feature into the classification fully connected layer and output a classification result vector; the above device also includes a probability vector output module, configured to input the classification result vector to the classification function and output the classification probability vector corresponding to the classification result vector.
  • the aforementioned prediction loss function includes a classification loss function;
  • in the classification loss function, Σ represents the summation operation, exp represents the exponential function with the natural constant e as its base, and log represents the logarithm operation;
  • p_l is the l-th element of the classification probability vector corresponding to the classification result vector in the prediction result;
  • y_l is the l-th element of the standard probability vector pre-labeled for the multiple video frames;
  • r_l is the proportion of the category corresponding to y_l in the training set;
  • the weighting exponent is a preset hyperparameter.
  • the aforementioned output network includes a threshold fully connected layer; the aforementioned prediction result output module is configured to: input the final feature to the threshold fully connected layer, and output a threshold result vector.
  • the aforementioned prediction loss function includes a threshold loss function;
  • the threshold loss function is y l is the l-th element of the standard probability vector of the pre-labeled multi-frame video frame;
  • ⁇ l element p l- ⁇ l );
  • ⁇ l is the l-th element of the threshold result vector in the prediction result;
  • the above prediction loss function includes a classification loss function and a threshold loss function; the above loss value determination and training module is configured to perform a weighted summation of the function value of the classification loss function and the function value of the threshold loss function to obtain the loss value of the prediction result.
  • the above loss value determination and training module is configured to: update the parameters in the initial model according to the loss value; determine whether the updated parameters have all converged; if the updated parameters have all converged, determine the initial model after the parameter update as the video classification model; if the updated parameters have not all converged, continue to perform the step of determining the current training data based on the preset training set until the updated parameters all converge.
  • the aforementioned loss value determination and training module is configured to: determine the parameters to be updated from the initial model according to preset rules; calculate the derivative ∂L/∂W of the loss value with respect to each parameter to be updated, where L is the loss value, W is the parameter to be updated, and ∂ represents the partial derivative operation; and update each parameter to be updated, obtaining the updated parameter W' = W − α·∂L/∂W, where α is a preset coefficient.
  • See FIG. 8 for a schematic structural diagram of a video classification device; the device includes:
  • the video acquisition module 80 is configured to acquire the video to be classified
  • the video frame obtaining module 81 is configured to obtain multiple video frames from the video according to a preset sampling interval
  • the classification module 82 is configured to input a multi-frame video frame to the pre-trained video classification model, and output the classification result of the multi-frame video frame; the video classification model is obtained through training of the above-mentioned video classification model training method;
  • the category determining module 83 is configured to determine the category of the video according to the classification result of the multiple video frames.
  • the classification result of the above multiple video frames includes a classification probability vector and a threshold result vector; the above category determination module is configured to calculate the category vector of the video element by element, with the l-th element non-zero when p_l is greater than θ_l, where p_l is the l-th element of the classification probability vector and θ_l is the l-th element of the threshold result vector; in the category vector, the category corresponding to a non-zero element is determined as a category of the video.
  • the electronic device includes a memory 100 and a processor 101, where the memory 100 is configured to store one or more computer instructions, and one or more computer instructions are The processor 101 executes to implement the training method of the video classification model or the steps of the video classification method.
  • the electronic device shown in FIG. 9 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
  • the memory 100 may include a high-speed random access memory (RAM, Random Access Memory), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
  • the communication connection between the system network element and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), using the Internet, a wide area network, a local area network, a metropolitan area network, etc.
  • the bus 102 may be an ISA bus, PCI bus, EISA bus, or the like.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one bidirectional arrow is used in FIG. 9, but it does not mean that there is only one bus or one type of bus.
  • the processor 101 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the foregoing method can be completed by an integrated logic circuit of hardware in the processor 101 or instructions in the form of software.
  • the aforementioned processor 101 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory 100, and the processor 101 reads information in the memory 100, and completes the steps of the method of the foregoing embodiment in combination with its hardware.
  • the embodiment of the present application also provides a machine-readable storage medium that stores machine-executable instructions.
  • when the machine-executable instructions are called and executed by a processor, they prompt the processor to implement the training method of the above video classification model, or the steps of the above video classification method; for details, please refer to the method embodiments, which will not be repeated here.
  • the computer program product of the video classification method, its model training method and device, and the electronic device provided by the embodiments of the present application includes a computer-readable storage medium storing program code, and the instructions included in the program code can be configured to execute the methods described in the preceding method embodiments.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.
  • the embodiment of the present application provides an executable program code that is configured to be executed to execute any of the above-mentioned training methods for video classification models or the steps of any of the above-mentioned video classification methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a video classification method, a model training method and apparatus therefor, and an electronic device. The training method comprises: extracting initial features of a plurality of video frames by means of a convolutional neural network; extracting final features of the plurality of video frames from the initial features by means of a recurrent neural network; inputting the final features into an output network and outputting a prediction result for the plurality of video frames; determining a loss value of the prediction result by means of a preset prediction loss function; and training an initial model according to the loss value until the parameters in the initial model converge, obtaining a video classification model. According to the present application, the convolutional neural network and the recurrent neural network are combined, so that the amount of computation can be greatly reduced, thereby improving model training and recognition efficiency; meanwhile, association information between the video frames can be taken into account during feature extraction, so that the extracted features accurately represent the video type, improving the accuracy of video classification.

Description

Video classification method and model training method, device and electronic equipment
This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on April 29, 2019, with application number 201910359704.0 and invention title "Video classification method and model training method, device and electronic equipment", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of image processing technology, and in particular to a video classification method and a corresponding model training method, device and electronic equipment.
Background Art
In related technologies, a video can be classified by a three-dimensional (3D) convolutional neural network, with the spatiotemporal features of the video extracted through 3D convolution. However, a 3D convolutional neural network has a large number of network parameters, which makes both the network training process and the recognition process computationally expensive and time-consuming; in addition, 3D convolutional neural networks have relatively few layers, making it difficult to mine high-level semantic features, which keeps the video classification accuracy low.
Summary of the Invention
In view of this, the purpose of this application is to provide a video classification method and a corresponding model training method, device, and electronic equipment, so as to reduce the amount of computation, improve model training and recognition efficiency, and improve the accuracy of video classification.
In the first aspect, an embodiment of the present application provides a method for training a video classification model. The method includes: determining current training data based on a preset training set, the training data including multiple video frames; inputting the training data to an initial model, the initial model including a convolutional neural network, a recurrent neural network and an output network; extracting the initial features of the multiple video frames through the convolutional neural network; extracting the final features of the multiple video frames from the initial features through the recurrent neural network; inputting the final features to the output network and outputting the prediction result of the multiple video frames; determining the loss value of the prediction result through a preset prediction loss function; and training the initial model according to the loss value until the parameters in the initial model converge, obtaining the video classification model.
In the second aspect, an embodiment of the present application provides a video classification method, which includes: obtaining a video to be classified; obtaining multiple video frames from the video according to a preset sampling interval; inputting the multiple video frames to a pre-trained video classification model, which outputs the classification results of the multiple video frames, the video classification model being trained through the training method of the above video classification model; and determining the category of the video according to the classification results of the multiple video frames.
In a third aspect, an embodiment of this application provides a training apparatus for a video classification model. The apparatus includes: a training data determination module, configured to determine current training data based on a preset training set, the training data including multiple video frames; a training data input module, configured to input the training data into an initial model, the initial model including a convolutional neural network, a recurrent neural network, and an output network; an initial feature extraction module, configured to extract initial features of the multiple video frames through the convolutional neural network; a final feature extraction module, configured to extract final features of the multiple video frames from the initial features through the recurrent neural network; a prediction result output module, configured to input the final features into the output network and output a prediction result for the multiple video frames; and a loss value determination and training module, configured to determine a loss value of the prediction result through a preset prediction loss function, and to train the initial model according to the loss value until the parameters in the initial model converge, to obtain the video classification model.
In a fourth aspect, an embodiment of this application provides a video classification apparatus. The apparatus includes: a video obtaining module, configured to obtain a video to be classified; a video frame obtaining module, configured to obtain multiple video frames from the video at a preset sampling interval; a classification module, configured to input the multiple video frames into a pre-trained video classification model and output a classification result for the multiple video frames, the video classification model being trained by the above training method for a video classification model; and a category determination module, configured to determine the category of the video according to the classification result of the multiple video frames.
In a fifth aspect, an embodiment of this application provides an electronic device including a processor and a memory. The memory stores machine-executable instructions that can be executed by the processor, and the processor executes the machine-executable instructions to implement the above training method for a video classification model, or the steps of the above video classification method.
In a sixth aspect, an embodiment of this application provides a machine-readable storage medium storing machine-executable instructions. When called and executed by a processor, the machine-executable instructions cause the processor to implement the above training method for a video classification model, or the steps of the above video classification method.
In a seventh aspect, an embodiment of this application provides executable program code, the executable program code being configured to be run so as to perform any one of the above training methods for a video classification model, or the steps of any one of the above video classification methods.
The embodiments of this application bring the following beneficial effects:
The video classification method, the training method and apparatus for the model thereof, and the electronic device provided by the embodiments of this application combine a convolutional neural network with a recurrent neural network, extracting features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving the efficiency of model training and recognition. This approach also takes the association information between video frames into account during feature extraction, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.
To make the above objectives, features, and advantages of this application more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application and the related art more clearly, the drawings needed in the embodiments and the related art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application; those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1 is a flowchart of a training method for a video classification model provided by an embodiment of this application;
FIG. 2 is a schematic structural diagram of a convolutional neural network in an initial model provided by an embodiment of this application;
FIG. 3 is a schematic structural diagram of an initial model provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of another initial model provided by an embodiment of this application;
FIG. 5 is a flowchart of another training method for a video classification model provided by an embodiment of this application;
FIG. 6 is a flowchart of a video classification method provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of a training apparatus for a video classification model provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of a video classification apparatus provided by an embodiment of this application;
FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, this application is further described in detail below with reference to the drawings and embodiments. Obviously, the described embodiments are only some of the embodiments of this application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
Considering that classifying videos with a three-dimensional convolutional neural network incurs high computational cost and time overhead while yielding low classification accuracy, embodiments of this application provide a video classification method, a training method and apparatus for a model thereof, and an electronic device. The technology can be widely applied to classifying conventional videos and short videos in various formats, and can be used in scenarios such as video surveillance, video push, and video management.
To facilitate the understanding of this embodiment, a training method for a video classification model disclosed in an embodiment of this application is first introduced in detail. As shown in FIG. 1, the method includes the following steps:
Step S102: determine current training data based on a preset training set, the training data including multiple video frames.
In some cases, the training data needs to be determined multiple times during the training of the initial model. In one implementation, the current training data can be determined from the preset training set each time; in other implementations, new training data can also be freshly obtained each time.
Taking determining the current training data from a preset training set as an example, the training set may contain multiple videos or multiple groups of video frames, where each group contains multiple video frames collected from the same video. Each video or each group of video frames is pre-labeled with type tags; the type tags can be assigned from multiple angles, such as video theme, scene, action, and character attributes, so each video or each group of video frames can be classified from multiple angles. For example, the type tags of video A include TV series, urban, crime solving, idol, and so on.
When determining the training data, if the training set contains multiple videos, one video can be selected from it, multiple video frames can be collected from that video, and the collected frames are determined as the training data; if the training set contains multiple groups of video frames, one group can be selected, and the multiple video frames in that group are determined as the training data.
In addition, the training set can be divided into a training subset and a cross-validation subset according to a preset ratio. During training, the current training data can be determined from the training subset. After training is completed, or when training reaches a certain stage, test data can be obtained from the cross-validation subset to verify the performance of the model.
Step S104: input the training data into an initial model, the initial model including a convolutional neural network, a recurrent neural network, and an output network.
Before being input into the initial model, the multiple video frames in the training data can each be resized to a preset size, such as 512 pixels by 512 pixels, so that the input video frames match the convolutional neural network.
Step S106: extract features of the multiple video frames through the convolutional neural network as initial features.
For example, the multiple video frames can be input into the convolutional neural network. To distinguish them from later content, the features output by the convolutional neural network are referred to here as initial features.
The convolutional neural network can be implemented with multiple convolutional layers, and may of course also contain pooling layers, fully connected layers, activation functions, and so on. The convolutional neural network performs a convolution operation on each input video frame separately to obtain a feature map corresponding to each frame; that is, the initial features contain multiple feature maps, or the initial features form one large feature map composed of multiple feature maps.
Step S108: extract features of the multiple video frames from the initial features through the recurrent neural network as final features.
For example, the initial features can be input into the recurrent neural network. To distinguish them from later content, the features output by the recurrent neural network are referred to here as final features.
Since the multiple video frames are collected from the same video, they are related to one another in content. However, the convolutional neural network usually processes each video frame separately, so the extracted feature maps of the individual frames are unrelated to one another. To enable the trained model to understand the content of the video corresponding to the multiple frames more comprehensively and accurately, the initial features can be further processed through the recurrent neural network: according to the temporal order of the multiple frames, the association information between preceding and following frames is introduced during feature processing, so that the final features better characterize the video type.
A recurrent neural network is a class of neural networks that takes sequence data as input and recurses along the direction of the sequence's evolution. Therefore, processing the initial features with a recurrent neural network can introduce the association information between preceding and following video frames.
Step S110: input the final features into the output network, and output a prediction result for the multiple video frames.
The output network can be implemented with a fully connected layer. If the final features are two-dimensional multi-layer features, the fully connected layer can convert them into a prediction result in the form of a one-dimensional vector. Each element in the prediction result corresponds to a category, and the value of the element represents the possibility that the video belongs to that category. Alternatively, the final features may have other dimensions, which is not specifically limited here.
Step S112: determine a loss value of the prediction result through a preset prediction loss function; train the initial model according to the loss value until the parameters in the initial model converge, to obtain the video classification model.
As described above, the multiple video frames in the training data are pre-labeled with type tags. To facilitate computation, the type tags can be converted into vector form; in this vector, the probability value corresponding to each category the video belongs to is usually 1, and the probability value corresponding to each category the video does not belong to is usually 0. The prediction loss function compares the difference between the prediction result and the labeled type tags; generally, the greater the difference, the greater the loss value. Based on the loss value, the parameters of each part of the initial model can be adjusted to achieve the purpose of training. When all parameters in the model converge, the training ends and the video classification model is obtained.
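As an illustration of this label-to-vector conversion, the following is a minimal multi-hot encoding sketch in Python (PyTorch assumed); the label vocabulary, its size, and the tag names are hypothetical and only mirror the example tags given above.

```python
import torch

NUM_CLASSES = 5  # hypothetical size of the label vocabulary
LABEL_INDEX = {"tv_series": 0, "urban": 1, "crime_solving": 2, "idol": 3, "variety": 4}

def encode_labels(tags):
    """Convert a video's type tags into a standard probability vector y,
    where y_l = 1 for categories the video belongs to and 0 otherwise."""
    y = torch.zeros(NUM_CLASSES)
    for tag in tags:
        y[LABEL_INDEX[tag]] = 1.0
    return y

y = encode_labels(["tv_series", "urban", "crime_solving", "idol"])  # tags of video A
```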
The training method for a video classification model provided by the embodiments of this application combines a convolutional neural network with a recurrent neural network, extracting features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving the efficiency of model training and recognition. This approach also takes the association information between video frames into account during feature extraction, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.
In addition, the model can process multiple video frames sampled from a video to identify the video category. Compared with a three-dimensional convolutional neural network, which requires whole video segments as input, the amount of data processed is smaller, which further reduces the amount of computation and improves training and recognition efficiency.
An embodiment of this application also provides another training method for a video classification model, implemented on the basis of the method described in the above embodiment. As can be seen from the above embodiment, the initial model includes a convolutional neural network, a recurrent neural network, and an output network; this embodiment further describes the specific structure of the initial model.
FIG. 2 is a schematic structural diagram of the convolutional neural network in an initial model. The convolutional neural network includes multiple groups of sub-networks connected in sequence (FIG. 2 takes three groups as an example), a global average pooling layer, and a classification fully connected layer. Each group of sub-networks includes a batch normalization layer, an activation function layer, a convolutional layer, and a pooling layer connected in sequence. The batch normalization layer in each group normalizes the data in the input video frames or feature maps; this speeds up the convergence of the convolutional neural network and of the initial model, and alleviates the problem of gradient dispersion in multi-layer convolutional networks. The activation function layer applies a function transformation to the normalized video frames or feature maps; this transformation breaks the linear combination of the convolutional layer's inputs and improves the feature expression ability of the network. The activation function may specifically be a Sigmoid function, a tanh function, a ReLU function, or the like. The convolutional layer performs convolution on the video frames or feature maps transformed by the activation function layer, and outputs the corresponding feature maps. The pooling layer may be an average pooling layer (average pooling or mean-pooling), a global average pooling layer, a max pooling layer (max-pooling), or the like; it compresses the feature maps output by the convolutional layer, retaining the main features and discarding non-essential ones so as to reduce the dimensionality of the feature maps. Taking the average pooling layer as an example, it averages the feature point values within a neighborhood of a preset size around the current feature point, and uses the average as the new value of that feature point. In addition, the pooling layer helps the feature maps retain certain invariances, such as rotation invariance, translation invariance, and scaling invariance.
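For illustration, the following is a minimal sketch of this structure in Python with PyTorch, applied to one frame at a time; the channel counts, kernel size, pooling choice, and class count are illustrative assumptions rather than values fixed by this application.

```python
import torch
import torch.nn as nn

class SubNetwork(nn.Sequential):
    """One sub-network group: batch normalization -> activation -> convolution -> pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.MaxPool2d(2),
        )

class FrameCNN(nn.Module):
    """Three sub-network groups, then global average pooling and a
    classification fully connected layer, applied to each frame independently."""
    def __init__(self, num_classes, channels=(3, 32, 64, 128)):
        super().__init__()
        self.blocks = nn.Sequential(*[
            SubNetwork(channels[i], channels[i + 1]) for i in range(3)
        ])
        self.gap = nn.AdaptiveAvgPool2d(1)     # global average pooling
        self.fc = nn.Linear(channels[-1], num_classes)

    def forward(self, x):                      # x: (batch, 3, H, W)
        f = self.gap(self.blocks(x)).flatten(1)
        return self.fc(f)
```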
The global average pooling layer connected to the sub-networks averages each feature sub-map of the feature maps output by the last group of sub-networks, obtaining a one-dimensional feature vector and further reducing the dimensionality of the feature maps. The classification fully connected layer performs a fully connected computation on the feature vector output by the global average pooling layer, and normalizes the result through a function such as softmax.
In one implementation, before the above training method for the video classification model is executed, the convolutional neural network can be pre-trained on a large data set to obtain its initial parameters. In this way, when the initial model is subsequently trained, training starts from these initial parameters, which improves the generalization ability of the model. Specifically, the data set may include an object recognition data set and a scene recognition data set. First, the weights of the convolutional neural network are randomly initialized; a preset number of training images are randomly drawn from the data set and input into the network for training one by one. If the parameters of the trained network have not all converged, a preset number of training images continue to be randomly drawn from the data set for training, until all parameters of the network converge and training is complete. As an example, before training the convolutional neural network, the batch size can be set to 256 (that is, the above preset number), the momentum to 0.9, and the weight decay coefficient to 0.0001. During training, the momentum and weight decay coefficients are used to update the parameters of the network through the backpropagation algorithm and stochastic gradient descent. After training, all parameters of the convolutional neural network have converged, and these parameters can serve as the initial parameters of the convolutional neural network when the above training method for the video classification model is executed.
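A minimal pre-training configuration sketch, reusing the FrameCNN sketch above and assuming PyTorch; the learning rate, class count, image size, and random stand-in data set are illustrative, while the batch size (256), momentum (0.9), and weight decay (0.0001) follow the example values given here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

cnn = FrameCNN(num_classes=365)            # class count illustrative
optimizer = torch.optim.SGD(cnn.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# random stand-in for an object/scene recognition data set
dataset = TensorDataset(torch.randn(1024, 3, 64, 64), torch.randint(0, 365, (1024,)))
for images, targets in DataLoader(dataset, batch_size=256, shuffle=True):
    optimizer.zero_grad()
    loss = criterion(cnn(images), targets)
    loss.backward()                        # backpropagation
    optimizer.step()                       # stochastic gradient descent update
```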
FIG. 3 is a schematic structural diagram of an initial model. The initial model includes a convolutional neural network, a recurrent neural network, and an output network, and further includes a global average pooling network arranged between the convolutional neural network and the recurrent neural network. Through this global average pooling network, dimensionality reduction can be performed on the initial features so that their dimensions match the recurrent neural network. That is, dimensionality reduction can be performed on the initial features through the global average pooling network to obtain dimensionality-reduced features, and the features of the multiple video frames are then extracted from the dimensionality-reduced features through the recurrent neural network as the final features.
In one implementation, the recurrent neural network may specifically be a long short-term memory network (Long Short-Term Memory, abbreviated LSTM). A long short-term memory network outperforms an ordinary recurrent neural network and can remedy defects of the latter such as gradient explosion and gradient vanishing. An LSTM network contains an input gate, an output gate, and a forget gate: the input gate is configured to pick out, from the initial features, the features to be memorized; the output gate is configured to read the memorized features; and the forget gate is configured to determine whether to retain the features in memory. When the initial features corresponding to the multiple video frames are input into the LSTM network in sequence, the opening and closing timing of the input gate, output gate, and forget gate can be trained, thereby completing the training of the recurrent neural network.
Specifically, taking M video frames as an example, the initial features contain M feature vectors, denoted z_t, t ∈ [1, …, M]. These M feature vectors are fed into the LSTM network to obtain the final feature of the multiple video frames, denoted h_M. The LSTM network computes on each feature vector as follows:
f_t = σ(W_f · [h_{t−1}, z_t] + b_f)
i_t = σ(W_i · [h_{t−1}, z_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, z_t] + b_C)
C_t = f_t * C_{t−1} + i_t * C̃_t
o_t = σ(W_o · [h_{t−1}, z_t] + b_o)
h_t = o_t * tanh(C_t)
where σ denotes the sigmoid function, and W_f, W_i, W_C, W_o, b_f, b_i, b_C, and b_o are preset parameters of the LSTM. After the M-th feature vector is input into the LSTM, h_M is obtained; h_M is the final feature, which can be input into the subsequent output network.
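For illustration, a minimal sketch of this recurrent stage using PyTorch's nn.LSTM; the batch size, M, and the feature and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

M, feat_dim, hidden_dim = 16, 128, 256
z = torch.randn(1, M, feat_dim)            # (batch, M, feature): the M initial feature vectors
lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)
outputs, (h_n, c_n) = lstm(z)              # processes z_1 ... z_M in temporal order
h_M = outputs[:, -1, :]                    # final feature fed to the output network
```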
In one implementation, the output network may include a classification fully connected layer; inputting the final feature into the classification fully connected layer outputs a classification result vector. The classification fully connected layer contains multiple neurons and is preset with a weight vector containing the weight elements corresponding to each of its neurons. Each neuron is connected to every feature element of the final feature; the neuron multiplies each feature element of the final feature by the corresponding weight element in the weight vector to obtain the predicted value for that neuron. Since the fully connected layer contains multiple neurons, the predicted values of the multiple neurons form the classification result vector.
In one implementation, the initial model may further include a classification function; inputting the classification result vector output by the classification fully connected layer into the classification function outputs the classification probability vector corresponding to the classification result vector. The classification function is used to compute the probability of each element in the classification result vector; it may specifically be a Softmax function or another probability regression function.
The above initial model combines a convolutional neural network with a long short-term memory network, extracting features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving the efficiency of model training and recognition; the approach also takes the association information between video frames into account during feature extraction, so the extracted features can accurately characterize the video type. Moreover, the long short-term memory network avoids the problems of gradient explosion and gradient vanishing in deep networks, improving the performance of the model and facilitating the extraction of deep features of video frames, thereby further improving the accuracy of video classification.
An embodiment of this application also provides another training method for a video classification model, implemented on the basis of the method described in the above embodiments; this embodiment focuses on the specific content of the output network and the prediction loss function.
First, the prediction loss function includes a classification loss function, which can be expressed as:
L1 = −∑_l w_l [y_l · log(p_l) + (1 − y_l) · log(1 − p_l)], with w_l = exp(−r_l / τ)
where ∑ denotes the summation operation; exp denotes the exponential function with the natural constant e as the base; log denotes the logarithmic operation; p_l is the l-th element of the classification probability vector corresponding to the classification result vector in the prediction result; y_l is the l-th element of the pre-labeled standard probability vector of the multiple video frames; r_l is the proportion, in the training set, of the category corresponding to y_l; and τ is a preset hyperparameter, which can be set to 1.
Here r_l is the proportion of the category corresponding to y_l in the training set. If a category makes up a low proportion of the training set, its r_l value is small and its w_l value is correspondingly large, which plays a balancing role and alleviates the problem of uneven sample distribution across categories, thereby improving the training efficiency of the model and its recognition accuracy.
The above embodiments describe an output network that includes a classification fully connected layer. In this embodiment, the output network further includes a threshold fully connected layer, as shown in FIG. 4; inputting the final feature into the threshold fully connected layer outputs a threshold result vector. Similar to the classification fully connected layer, the threshold fully connected layer contains multiple neurons and is preset with a weight vector containing the weight elements corresponding to each of its neurons. Each neuron is connected to every feature element of the final feature; the neuron multiplies each feature element of the final feature by the corresponding weight element in the weight vector to obtain the predicted value for that neuron. Since the fully connected layer contains multiple neurons, the predicted values of the multiple neurons form the threshold result vector.
The threshold fully connected layer is configured to extract from the final feature the threshold results that the model learns for each category, that is, the threshold result vector; each category has its own threshold, and the thresholds of the categories may be equal to or different from one another. Compared with setting thresholds manually, the thresholds learned by the model are more accurate and reasonable, which helps improve the classification accuracy of the model.
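A minimal sketch of such a two-headed output network, assuming PyTorch; the dimensions, the stand-in final feature, and the use of a sigmoid as the probability regression function are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OutputNetwork(nn.Module):
    """Two parallel fully connected heads over the final feature."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.cls_fc = nn.Linear(hidden_dim, num_classes)  # classification fully connected layer
        self.thr_fc = nn.Linear(hidden_dim, num_classes)  # threshold fully connected layer

    def forward(self, h):
        return self.cls_fc(h), self.thr_fc(h)

head = OutputNetwork(hidden_dim=256, num_classes=5)
h_M = torch.randn(1, 256)                  # stand-in final feature from the LSTM stage
cls_vec, thr_vec = head(h_M)               # classification result vector, threshold result vector
p = torch.sigmoid(cls_vec)                 # one choice of probability regression function
```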
Based on the threshold result vector output by the threshold fully connected layer, the prediction loss function further includes a threshold loss function used to evaluate the accuracy of the threshold result vector. The threshold loss function can be expressed as:
L2 = −∑_l [y_l · log(δ_l) + (1 − y_l) · log(1 − δ_l)]
where ∑ denotes the summation operation and log denotes the logarithmic operation; y_l is the l-th element of the pre-labeled standard probability vector of the multiple video frames; δ_l = σ(p_l − θ_l); θ_l is the l-th element of the threshold result vector in the prediction result; σ(x) = 1 / (1 + e^(−x)); and e is the natural constant in mathematics.
When the prediction loss function includes both the classification loss function and the threshold loss function, in determining the loss value of the prediction result through the prediction loss function, the function value of the classification loss function and the function value of the threshold loss function can be weighted and summed to obtain the loss value of the prediction result, for example L = αL1 + βL2, where α + β = 1 and the values of α and β can be preset.
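For illustration, a sketch of this combined loss under stated assumptions: L1 is taken as the class-balanced cross-entropy described above with the balancing weight assumed to be w_l = exp(−r_l / τ) (the original formula is rendered as an image and is not reproduced here, so this specific form is an assumption), and L2 follows δ_l = σ(p_l − θ_l).

```python
import torch

def total_loss(p, theta, y, r, tau=1.0, alpha=0.5, beta=0.5):
    """p: classification probability vector; theta: threshold result vector;
    y: standard probability vector; r: per-category proportions in the training set."""
    eps = 1e-7
    w = torch.exp(-r / tau)                                   # assumed balancing weight w_l
    l1 = -(w * (y * torch.log(p + eps)
                + (1 - y) * torch.log(1 - p + eps))).sum()    # classification loss L1
    delta = torch.sigmoid(p - theta)                          # delta_l = sigma(p_l - theta_l)
    l2 = -(y * torch.log(delta + eps)
           + (1 - y) * torch.log(1 - delta + eps)).sum()      # threshold loss L2
    return alpha * l1 + beta * l2                             # weighted sum, alpha + beta = 1
```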
In the above approach, the classification loss function takes into account the proportion of each category in the training set, alleviating the problem of uneven sample distribution across categories and thereby improving the training efficiency of the model and its recognition accuracy. The output network is also provided with a threshold fully connected layer; compared with setting thresholds manually, the thresholds learned by the model are more accurate and reasonable, further improving the classification accuracy of the model.
An embodiment of this application also provides another training method for a video classification model, implemented on the basis of the method described in the above embodiments; this embodiment focuses on the specific process of training the initial model according to the loss value. As shown in FIG. 5, the method includes the following steps:
Step S502: determine current training data based on a preset training set, the training data including multiple video frames;
Step S504: input the training data into an initial model, the initial model including a convolutional neural network, a recurrent neural network, and an output network;
Step S506: extract features of the multiple video frames through the convolutional neural network as initial features;
Step S508: extract features of the multiple video frames from the initial features through the recurrent neural network as final features;
Step S510: input the final features into the output network, and output a prediction result for the multiple video frames;
Step S512: determine a loss value of the prediction result through a preset prediction loss function;
Step S514: update the parameters in the initial model according to the loss value;
For example, a function mapping can be set in advance; inputting the original parameters and the loss value into this mapping yields the updated parameters. The function mappings for different parameters may be the same or different.
Specifically, the parameters to be updated can be determined from the initial model according to a preset rule; the parameters to be updated may be all the parameters in the initial model, or some parameters randomly determined from the initial model. The derivative of the loss value with respect to each parameter to be updated, ∂L/∂W, is then computed, where L is the loss value, W is a parameter to be updated, and ∂ denotes the partial derivative operation; the parameters to be updated may also be called the weights of the neurons. This process may also be called the backpropagation algorithm: if the loss value is large, the output of the current initial model does not match the expected output, so the derivative of the loss value with respect to each parameter to be updated in the initial model is computed, and this derivative serves as the basis for adjusting that parameter.
After the derivative of each parameter to be updated is obtained, the parameter is updated to give the updated parameter W′ = W − α · ∂L/∂W, where α is a preset coefficient. This process may also be called the stochastic gradient descent algorithm: the derivative of each parameter to be updated can be understood as the direction, at the current parameter value, in which the loss value decreases fastest; adjusting the parameter along this direction quickly reduces the loss value and makes the parameter converge. In addition, when the initial model has been trained once and a loss value obtained, one or more parameters can be randomly selected from the parameters of the initial model to undergo the above update process, which shortens model training time and speeds up the algorithm; of course, the above update process can also be performed on all parameters of the initial model, which makes model training more accurate.
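A minimal sketch of this update rule in PyTorch with a stand-in model and loss; the preset coefficient α, the model, and the loss are illustrative.

```python
import torch

model = torch.nn.Linear(8, 3)                 # stand-in for the initial model
loss = model(torch.randn(4, 8)).pow(2).sum()  # stand-in loss value L
alpha = 0.01                                  # preset coefficient

loss.backward()                               # backpropagation: computes dL/dW for each W
with torch.no_grad():
    for W in model.parameters():              # all parameters, or a randomly chosen subset
        W -= alpha * W.grad                   # W' = W - alpha * dL/dW
        W.grad = None
```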
Step S516: determine whether all the updated parameters have converged; if they have, perform step S518; if they have not, perform step S502.
If the updated parameters have not all converged, continue to perform the step of determining the current training data based on the preset training set, until all the updated parameters converge.
Step S518: determine the initial model with updated parameters as the video classification model.
In the above approach, a convolutional neural network and a recurrent neural network are combined, extracting features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving the efficiency of model training and recognition. This approach also takes the association information between video frames into account during feature extraction, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.
Based on the above training method for a video classification model, an embodiment of this application also provides a video classification method, implemented on the basis of the training method described in the above embodiments. As shown in FIG. 6, the method includes the following steps:
Step S602: obtain a video to be classified;
The video may be a conventional video or a short video; the specific format of the video may be MPEG (Moving Picture Experts Group), AVI (Audio Video Interleaved), MOV (the QuickTime movie format), or the like, which is not limited here.
Step S604: obtain multiple video frames from the video at a preset sampling interval;
The sampling interval can be preset; for example, a sampling interval of 0.2 seconds means sampling 5 frames per second.
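For illustration, a frame-sampling sketch using OpenCV's VideoCapture; the function name and the fallback frame rate are illustrative assumptions.

```python
import cv2

def sample_frames(path, interval_s=0.2):
    """Grab one frame every interval_s seconds, e.g. 5 frames per second at 0.2 s."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if the container reports no FPS
    step = max(int(round(fps * interval_s)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```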
Step S606: input the multiple video frames into a pre-trained video classification model, and output a classification result for the multiple video frames; the video classification model is trained by the above training method for a video classification model;
Step S608: determine the category of the video according to the classification result of the multiple video frames.
In the video classification method provided by this embodiment of this application, multiple video frames are first obtained from the video to be classified at a preset sampling interval; the multiple video frames are input into a pre-trained video classification model, which outputs a classification result for them; and the category of the video is then determined according to that classification result. Since the video classification model combines a convolutional neural network with a recurrent neural network and extracts features through a combination of two-dimensional convolution and one-dimensional convolution, the amount of computation is greatly reduced compared with three-dimensional convolution, improving the efficiency of model training and recognition. The approach also takes the association information between video frames into account during feature extraction, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.
The classification result of the multiple video frames output by the video classification model may contain one or more categories, and the classification result can be directly determined as the category of the video. In another approach, the classification result of the multiple video frames includes a classification probability vector and a threshold result vector; in this case, the probability value of each category in the classification probability vector can be compared with the corresponding threshold in the threshold result vector to determine the category of the video. Specifically, the category vector v of the video is computed element-wise according to the following formula:
v_l = 1 if p_l > θ_l; otherwise v_l = 0
where p_l is the l-th element of the classification probability vector and θ_l is the l-th element of the threshold result vector. The categories corresponding to the non-zero elements of the category vector are then determined as the categories of the video. Since the probability value of a category corresponding to a non-zero element is greater than the corresponding threshold, that category can be taken as a category of the video.
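A minimal sketch of this category decision, assuming PyTorch; the probability and threshold values are illustrative.

```python
import torch

p = torch.tensor([0.82, 0.10, 0.55, 0.30])      # classification probability vector
theta = torch.tensor([0.50, 0.40, 0.60, 0.45])  # threshold result vector
category_vector = (p > theta).int()             # v_l = 1 if p_l > theta_l else 0
categories = category_vector.nonzero().flatten().tolist()  # indices of the video's categories
```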
In the above approach, the model outputs not only the classification probability vector but also the threshold result vector, and the category of the video is finally determined based on the comparison of the two vectors. Compared with setting thresholds manually, the thresholds output by the model are more accurate and reasonable, which helps improve the accuracy of video classification. Tagging videos based on this classification result helps users quickly discover content they are interested in, and also helps recommend videos of interest to users, improving the user experience.
It should be noted that the above method embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference can be made to one another.
Corresponding to the above method embodiments, FIG. 7 is a schematic structural diagram of a training apparatus for a video classification model. The apparatus includes:
a training data determination module 70, configured to determine current training data based on a preset training set, the training data including multiple video frames;
a training data input module 71, configured to input the training data into an initial model, the initial model including a convolutional neural network, a recurrent neural network, and an output network;
an initial feature extraction module 72, configured to extract features of the multiple video frames through the convolutional neural network as initial features;
a final feature extraction module 73, configured to extract features of the multiple video frames from the initial features through the recurrent neural network as final features;
a prediction result output module 74, configured to input the final features into the output network and output a prediction result for the multiple video frames;
a loss value determination and training module 75, configured to determine a loss value of the prediction result through a preset prediction loss function, and to train the initial model according to the loss value until the parameters in the initial model converge, to obtain the video classification model.
The training apparatus for a video classification model provided by this embodiment of this application combines a convolutional neural network with a recurrent neural network, extracting features through a combination of two-dimensional convolution and one-dimensional convolution. Compared with three-dimensional convolution, this greatly reduces the amount of computation, thereby improving the efficiency of model training and recognition. The approach also takes the association information between video frames into account during feature extraction, so the extracted features can accurately characterize the video type, thereby improving the accuracy of video classification.
In some embodiments, the convolutional neural network includes multiple groups of sub-networks, a global average pooling layer, and a classification fully connected layer connected in sequence; each group of sub-networks includes a batch normalization layer, an activation function layer, a convolutional layer, and a pooling layer connected in sequence; the initial parameters of the convolutional neural network are obtained by training on a preset data set.
In some embodiments, the initial model further includes a global average pooling network arranged between the convolutional neural network and the recurrent neural network; the apparatus further includes a dimensionality reduction module, configured to perform dimensionality reduction on the initial features through the global average pooling network to obtain dimensionality-reduced features; the final feature extraction module 73 is specifically configured to extract the features of the multiple video frames from the dimensionality-reduced features through the recurrent neural network as the final features.
In some embodiments, the recurrent neural network includes a long short-term memory network.
In some embodiments, the output network includes a classification fully connected layer, and the initial model further includes a classification function; the prediction result output module is configured to input the final features into the classification fully connected layer and output a classification result vector; the apparatus further includes a probability vector output module, configured to input the classification result vector into the classification function and output the classification probability vector corresponding to the classification result vector.
In some embodiments, the prediction loss function includes a classification loss function; the classification loss function is L1 = −∑_l w_l [y_l · log(p_l) + (1 − y_l) · log(1 − p_l)], with w_l = exp(−r_l / τ), where ∑ denotes the summation operation; exp denotes the exponential function with the natural constant e as the base; log denotes the logarithmic operation; p_l is the l-th element of the classification probability vector corresponding to the classification result vector in the prediction result; y_l is the l-th element of the pre-labeled standard probability vector of the multiple video frames; r_l is the proportion, in the training set, of the category corresponding to y_l; and τ is a preset hyperparameter.
In some embodiments, the output network includes a threshold fully connected layer; the prediction result output module is configured to input the final features into the threshold fully connected layer and output a threshold result vector.
In some embodiments, the prediction loss function includes a threshold loss function; the threshold loss function is L2 = −∑_l [y_l · log(δ_l) + (1 − y_l) · log(1 − δ_l)], where y_l is the l-th element of the pre-labeled standard probability vector of the multiple video frames; δ_l = σ(p_l − θ_l); θ_l is the l-th element of the threshold result vector in the prediction result; and σ(x) = 1 / (1 + e^(−x)).
In some embodiments, the prediction loss function includes a classification loss function and a threshold loss function; the loss value determination and training module is configured to perform a weighted summation of the function value of the classification loss function and the function value of the threshold loss function to obtain the loss value of the prediction result.
In some embodiments, the loss value determination and training module is configured to: update the parameters in the initial model according to the loss value; determine whether all the updated parameters have converged; if they have, determine the initial model with updated parameters as the video classification model; if they have not, continue to perform the step of determining the current training data based on the preset training set, until all the updated parameters converge.
In some embodiments, the loss value determination and training module is configured to: determine, according to a preset rule, the parameter to be updated from the initial model; calculate the derivative ∂L/∂W of the loss value with respect to the parameter to be updated in the initial model, where L is the loss value, W is the parameter to be updated, and ∂ denotes the partial derivative operation; and update the parameter to be updated to obtain the updated parameter W − α·∂L/∂W, where α is a preset coefficient.
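Putting the update rule and the convergence test together, a minimal sketch of the described training loop follows: each parameter is updated as W ← W − α·∂L/∂W, and training stops once all updates are negligibly small. The tolerance-based convergence test and all names are illustrative assumptions.

```python
import torch

ALPHA = 0.01   # preset coefficient (step size)
TOL = 1e-6     # assumed tolerance for declaring a parameter converged

def train_until_converged(model, get_training_data, compute_loss):
    converged = False
    while not converged:
        frames, labels = get_training_data()        # determine the current training data
        loss = compute_loss(model, frames, labels)  # loss value of the prediction result
        model.zero_grad()
        loss.backward()                             # derivative of the loss w.r.t. each parameter
        converged = True
        with torch.no_grad():
            for w in model.parameters():            # parameters to be updated
                if w.grad is None:
                    continue
                step = ALPHA * w.grad
                w -= step                           # W <- W - alpha * dL/dW
                if step.abs().max() > TOL:
                    converged = False               # some parameter still moving; keep training
    return model                                    # parameter-updated model = video classification model
```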
Referring to FIG. 8, a schematic structural diagram of a video classification apparatus is shown; the apparatus includes:
a video acquisition module 80, configured to acquire a video to be classified;
a video frame acquisition module 81, configured to acquire multiple video frames from the video at a preset sampling interval;
a classification module 82, configured to input the multiple video frames into a pre-trained video classification model and output the classification result of the multiple video frames, the video classification model being trained by the above training method for a video classification model; and
a category determination module 83, configured to determine the category of the video according to the classification result of the multiple video frames.
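As a sketch of this inference pipeline under assumed names (`classify_video`, a model returning both output vectors, a default interval of 8 frames):

```python
import torch

def classify_video(frames, model, interval=8):
    """Sample frames at a preset interval and classify them with a pre-trained model.

    frames: (T, C, H, W) tensor of decoded video frames
    model:  pre-trained video classification model, assumed to return
            (classification probability vector, threshold result vector)
    """
    sampled = frames[::interval]                      # multiple frames at the preset sampling interval
    probs, thresholds = model(sampled.unsqueeze(0))   # classification result of the sampled frames
    return probs.squeeze(0), thresholds.squeeze(0)
```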
In some embodiments, the classification result of the multiple video frames includes a classification probability vector and a threshold result vector, and the category determination module is configured to: compute the category vector of the video according to the following expression (which appears only as an equation image in the source):

[Equation image: Figure PCTCN2020087690-appb-000022]

where p_l is the l-th element of the classification probability vector and θ_l is the l-th element of the threshold result vector; and determine the categories corresponding to the non-zero elements of the category vector as the category of the video.
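Under the stated decision rule, the l-th element of the category vector is non-zero exactly when p_l exceeds θ_l; the indicator form below is an assumption consistent with that rule, since the expression itself is only an image in the source.

```python
import torch

def determine_categories(probs, thresholds):
    """Return the category indices whose probability exceeds the learned threshold."""
    category_vector = (probs > thresholds).int()  # assumed: 1 where p_l > theta_l, else 0
    return torch.nonzero(category_vector).flatten().tolist()
```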
The implementation principles and resulting technical effects of the apparatus provided in the embodiments of the present application are the same as those of the foregoing method embodiments. For brevity, where the apparatus embodiments do not mention a detail, refer to the corresponding content in the foregoing method embodiments.
An embodiment of the present application further provides an electronic device. As shown in FIG. 9, the electronic device includes a memory 100 and a processor 101, where the memory 100 is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor 101 to implement the above training method for a video classification model, or the steps of the video classification method.
Further, the electronic device shown in FIG. 9 also includes a bus 102 and a communication interface 103; the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
The memory 100 may include high-speed random access memory (RAM), and may also include non-volatile memory, for example at least one disk memory. The communication connection between this system network element and at least one other network element is implemented through at least one communication interface 103 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one bidirectional arrow is used in FIG. 9, but this does not mean that there is only one bus or one type of bus.
The processor 101 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The above processor 101 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and it may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of the hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 100, and the processor 101 reads the information in the memory 100 and completes the steps of the methods of the foregoing embodiments in combination with its hardware.
An embodiment of the present application further provides a machine-readable storage medium storing machine-executable instructions; when called and executed by a processor, the machine-executable instructions cause the processor to implement the above training method for a video classification model, or the steps of the video classification method. For the specific implementation, refer to the method embodiments, which will not be repeated here.
The computer program product of the video classification method, its model training method and apparatus, and the electronic device provided by the embodiments of the present application includes a computer-readable storage medium storing program code; the instructions included in the program code may be configured to execute the methods described in the foregoing method embodiments. For the specific implementation, refer to the method embodiments, which will not be repeated here.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part contributing to the related art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the present application provides executable program code, the executable program code being configured to be run to execute any of the above training methods for a video classification model, or the steps of any of the above video classification methods.
Finally, it should be noted that the above embodiments are only specific implementations of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person familiar with the technical field may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The above are only preferred embodiments of the present application and are not intended to limit the present application; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (29)

  1. A training method for a video classification model, the method comprising:
    determining current training data, the training data comprising multiple video frames;
    inputting the training data into an initial model, the initial model comprising a convolutional neural network, a recurrent neural network, and an output network;
    extracting features of the multiple video frames through the convolutional neural network as initial features;
    extracting features of the multiple video frames from the initial features through the recurrent neural network as final features;
    inputting the final features into the output network and outputting a prediction result of the multiple video frames; and
    determining a loss value of the prediction result through a preset prediction loss function, and training the initial model according to the loss value until the parameters in the initial model converge, to obtain the video classification model.
  2. The method according to claim 1, wherein the convolutional neural network comprises multiple groups of sub-networks, a global average pooling layer, and a fully connected classification layer connected in sequence, each group of sub-networks comprising a batch normalization layer, an activation function layer, a convolutional layer, and a pooling layer connected in sequence; and
    the initial parameters of the convolutional neural network are obtained by training on a preset data set.
  3. The method according to claim 1, wherein the initial model further comprises a global average pooling network disposed between the convolutional neural network and the recurrent neural network; and
    extracting the features of the multiple video frames from the initial features through the recurrent neural network as the final features comprises:
    performing dimensionality reduction on the initial features through the global average pooling network to obtain dimensionality-reduced features; and
    extracting the features of the multiple video frames from the dimensionality-reduced features through the recurrent neural network as the final features.
  4. The method according to claim 1, wherein the recurrent neural network comprises a long short-term memory network.
  5. The method according to claim 1, wherein the output network comprises a fully connected classification layer, and the initial model further comprises a classification function;
    the step of inputting the final features into the output network and outputting the prediction result of the multiple video frames comprises: inputting the final features into the fully connected classification layer, and outputting a classification result vector; and
    the method further comprises: inputting the classification result vector into the classification function, and outputting a classification probability vector corresponding to the classification result vector.
  6. The method according to claim 5, wherein the prediction loss function comprises a classification loss function, the classification loss function being given by the following expression (available only as an equation image in the source):
    [Equation image: Figure PCTCN2020087690-appb-100001]
    wherein the weighting term is defined by:
    [Equation image: Figure PCTCN2020087690-appb-100002]
    p_l is the l-th element of the classification probability vector corresponding to the classification result vector in the prediction result; y_l is the l-th element of the pre-labeled standard probability vector of the multiple video frames; r_l is the proportion, in the training set, of the category corresponding to y_l; and τ is a preset hyperparameter.
  7. The method according to claim 5, wherein the output network comprises a threshold fully connected layer; and
    the step of inputting the final features into the output network and outputting the prediction result of the multiple video frames comprises: inputting the final features into the threshold fully connected layer, and outputting a threshold result vector.
  8. The method according to claim 7, wherein the prediction loss function comprises a threshold loss function, the threshold loss function being given by the following expression (available only as an equation image in the source):
    [Equation image: Figure PCTCN2020087690-appb-100003]
    wherein y_l is the l-th element of the pre-labeled standard probability vector of the multiple video frames; δ_l is a function of (p_l − θ_l); θ_l is the l-th element of the threshold result vector in the prediction result; and the function defining δ_l is given by:
    [Equation image: Figure PCTCN2020087690-appb-100004]
  9. The method according to claim 1, wherein the prediction loss function comprises a classification loss function and a threshold loss function; and
    the step of determining the loss value of the prediction result through the preset prediction loss function comprises:
    performing a weighted summation of the function value of the classification loss function and the function value of the threshold loss function to obtain the loss value of the prediction result.
  10. The method according to claim 1, wherein the step of training the initial model according to the loss value until the parameters in the initial model converge to obtain the video classification model comprises:
    updating the parameters in the initial model according to the loss value;
    determining whether the updated parameters have all converged;
    if the updated parameters have all converged, determining the parameter-updated initial model as the video classification model; and
    if the updated parameters have not all converged, continuing to perform the step of determining the current training data until the updated parameters all converge.
  11. The method according to claim 10, wherein the step of updating the parameters in the initial model according to the loss value comprises:
    determining, according to a preset rule, a parameter to be updated from the initial model;
    calculating the derivative ∂L/∂W of the loss value with respect to the parameter to be updated in the initial model, wherein L is the loss value and W is the parameter to be updated; and
    updating the parameter to be updated to obtain the updated parameter W − α·∂L/∂W, wherein α is a preset coefficient.
  12. A video classification method, the method comprising:
    acquiring a video to be classified;
    acquiring multiple video frames from the video at a preset sampling interval;
    inputting the multiple video frames into a pre-trained video classification model, and outputting a classification result of the multiple video frames, the video classification model being trained by the training method for a video classification model according to any one of claims 1-11; and
    determining the category of the video according to the classification result of the multiple video frames.
  13. The method according to claim 12, wherein the classification result of the multiple video frames comprises a classification probability vector and a threshold result vector; and
    the step of determining the category of the video according to the classification result of the multiple video frames comprises:
    computing the category vector of the video according to the following expression (available only as an equation image in the source):
    [Equation image: Figure PCTCN2020087690-appb-100007]
    wherein p_l is the l-th element of the classification probability vector and θ_l is the l-th element of the threshold result vector; and
    determining the categories corresponding to the non-zero elements of the category vector as the category of the video.
  14. A training apparatus for a video classification model, comprising:
    a training data determination module, configured to determine current training data, the training data comprising multiple video frames;
    a training data input module, configured to input the training data into an initial model, the initial model comprising a convolutional neural network, a recurrent neural network, and an output network;
    an initial feature extraction module, configured to extract features of the multiple video frames through the convolutional neural network as initial features;
    a final feature extraction module, configured to extract features of the multiple video frames from the initial features through the recurrent neural network as final features;
    a prediction result output module, configured to input the final features into the output network and output a prediction result of the multiple video frames; and
    a loss value determination and training module, configured to determine a loss value of the prediction result through a preset prediction loss function, and to train the initial model according to the loss value until the parameters in the initial model converge, to obtain the video classification model.
  15. The apparatus according to claim 14, wherein the convolutional neural network comprises multiple groups of sub-networks, a global average pooling layer, and a fully connected classification layer connected in sequence, each group of sub-networks comprising a batch normalization layer, an activation function layer, a convolutional layer, and a pooling layer connected in sequence; and
    the initial parameters of the convolutional neural network are obtained by training on a preset data set.
  16. The apparatus according to claim 14, wherein the initial model further comprises a global average pooling network disposed between the convolutional neural network and the recurrent neural network;
    the apparatus further comprises a dimensionality reduction module, configured to perform dimensionality reduction on the initial features through the global average pooling network to obtain dimensionality-reduced features; and
    the final feature extraction module is configured to extract the features of the multiple video frames from the dimensionality-reduced features through the recurrent neural network as the final features.
  17. The apparatus according to claim 14, wherein the recurrent neural network comprises a long short-term memory network.
  18. The apparatus according to claim 14, wherein the output network comprises a fully connected classification layer, and the initial model further comprises a classification function;
    the prediction result output module is configured to: input the final features into the fully connected classification layer, and output a classification result vector; and
    the apparatus further comprises a probability vector output module, configured to input the classification result vector into the classification function and output a classification probability vector corresponding to the classification result vector.
  19. The apparatus according to claim 18, wherein the prediction loss function comprises a classification loss function, the classification loss function being given by the following expression (available only as an equation image in the source):
    [Equation image: Figure PCTCN2020087690-appb-100008]
    wherein the weighting term is defined by:
    [Equation image: Figure PCTCN2020087690-appb-100009]
    p_l is the l-th element of the classification probability vector corresponding to the classification result vector in the prediction result; y_l is the l-th element of the pre-labeled standard probability vector of the multiple video frames; r_l is the proportion, in the training set, of the category corresponding to y_l; and τ is a preset hyperparameter.
  20. The apparatus according to claim 18, wherein the output network comprises a threshold fully connected layer; and
    the prediction result output module is configured to: input the final features into the threshold fully connected layer, and output a threshold result vector.
  21. The apparatus according to claim 20, wherein the prediction loss function comprises a threshold loss function, the threshold loss function being given by the following expression (available only as an equation image in the source):
    [Equation image: Figure PCTCN2020087690-appb-100010]
    wherein y_l is the l-th element of the pre-labeled standard probability vector of the multiple video frames; δ_l is a function of (p_l − θ_l); θ_l is the l-th element of the threshold result vector in the prediction result; and the function defining δ_l is given by:
    [Equation image: Figure PCTCN2020087690-appb-100011]
  22. The apparatus according to claim 14, wherein the prediction loss function comprises a classification loss function and a threshold loss function; and
    the loss value determination and training module is configured to: perform a weighted summation of the function value of the classification loss function and the function value of the threshold loss function to obtain the loss value of the prediction result.
  23. The apparatus according to claim 14, wherein the loss value determination and training module is configured to:
    update the parameters in the initial model according to the loss value;
    determine whether the updated parameters have all converged;
    if the updated parameters have all converged, determine the parameter-updated initial model as the video classification model; and
    if the updated parameters have not all converged, continue to perform the step of determining the current training data based on the preset training set until the updated parameters all converge.
  24. The apparatus according to claim 23, wherein the loss value determination and training module is configured to:
    determine, according to a preset rule, a parameter to be updated from the initial model;
    calculate the derivative ∂L/∂W of the loss value with respect to the parameter to be updated in the initial model, wherein L is the loss value and W is the parameter to be updated; and
    update the parameter to be updated to obtain the updated parameter W − α·∂L/∂W, wherein α is a preset coefficient.
  25. A video classification apparatus, the apparatus comprising:
    a video acquisition module, configured to acquire a video to be classified;
    a video frame acquisition module, configured to acquire multiple video frames from the video at a preset sampling interval;
    a classification module, configured to input the multiple video frames into a pre-trained video classification model and output a classification result of the multiple video frames, the video classification model being trained by the training method for a video classification model according to any one of claims 1-11; and
    a category determination module, configured to determine the category of the video according to the classification result of the multiple video frames.
  26. The apparatus according to claim 25, wherein the classification result of the multiple video frames comprises a classification probability vector and a threshold result vector; and
    the category determination module is configured to:
    compute the category vector of the video according to the following expression (available only as an equation image in the source):
    [Equation image: Figure PCTCN2020087690-appb-100014]
    wherein p_l is the l-th element of the classification probability vector and θ_l is the l-th element of the threshold result vector; and
    determine the categories corresponding to the non-zero elements of the category vector as the category of the video.
  27. An electronic device, comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor, the processor executing the machine-executable instructions to implement the training method for a video classification model according to any one of claims 1 to 11, or the steps of the video classification method according to claim 12 or 13.
  28. A machine-readable storage medium storing machine-executable instructions, wherein, when called and executed by a processor, the machine-executable instructions cause the processor to implement the training method for a video classification model according to any one of claims 1 to 11, or the steps of the video classification method according to claim 12 or 13.
  29. Executable program code, the executable program code being configured to be run to execute the training method for a video classification model according to any one of claims 1 to 11, or the steps of the video classification method according to claim 12 or 13.
PCT/CN2020/087690 2019-04-29 2020-04-29 Video classification method and model training method and apparatus thereof, and electronic device WO2020221278A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910359704.0A CN110070067B (en) 2019-04-29 2019-04-29 Video classification method, training method and device of video classification method model and electronic equipment
CN201910359704.0 2019-04-29

Publications (1)

Publication Number Publication Date
WO2020221278A1 (en)



Also Published As

Publication number Publication date
CN110070067A (en) 2019-07-30
CN110070067B (en) 2021-11-12

Similar Documents

Publication Title
WO2020221278A1 (en) Video classification method and model training method and apparatus thereof, and electronic device
CN109359636B (en) Video classification method, device and server
CN110175580B (en) Video behavior identification method based on time sequence causal convolutional network
CN109543714B (en) Data feature acquisition method and device, electronic equipment and storage medium
CN110147711B (en) Video scene recognition method and device, storage medium and electronic device
WO2020088216A1 (en) Audio and video processing method and device, apparatus, and medium
CN109344884B (en) Media information classification method, method and device for training picture classification model
WO2020114378A1 (en) Video watermark identification method and apparatus, device, and storage medium
US11042798B2 (en) Regularized iterative collaborative feature learning from web and user behavior data
CN110633669B (en) Mobile terminal face attribute identification method based on deep learning in home environment
WO2019052301A1 (en) Video classification method, information processing method and server
CN111738357B (en) Junk picture identification method, device and equipment
CN111522996B (en) Video clip retrieval method and device
CN111062871A (en) Image processing method and device, computer equipment and readable storage medium
WO2020108396A1 (en) Video classification method, and server
WO2020238353A1 (en) Data processing method and apparatus, storage medium, and electronic apparatus
CN113850162B (en) Video auditing method and device and electronic equipment
JP7089045B2 (en) Media processing methods, related equipment and computer programs
CN113469289B (en) Video self-supervised representation learning method and device, computer equipment and medium
WO2021138855A1 (en) Model training method, video processing method and apparatus, storage medium and electronic device
US11948359B2 (en) Video processing method and apparatus, computing device and medium
WO2023123923A1 (en) Person re-identification method, person re-identification device, computer device, and medium
CN114282047A (en) Small sample action recognition model training method and device, electronic equipment and storage medium
CN112581355A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN115705706A (en) Video processing method, video processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20799318

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 180222)

122 Ep: PCT application non-entry in European phase

Ref document number: 20799318

Country of ref document: EP

Kind code of ref document: A1