WO2021138855A1 - Model training method, video processing method, device, storage medium, and electronic device


Info

Publication number
WO2021138855A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
model
feature
classification
audio
Prior art date
Application number
PCT/CN2020/071021
Other languages
English (en)
French (fr)
Inventor
郭子亮
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to PCT/CN2020/071021 (WO2021138855A1)
Priority to CN202080084487.XA (CN114787844A)
Publication of WO2021138855A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management

Definitions

  • This application relates to the field of machine learning, and in particular to a model training method, video processing method, device, storage medium, and electronic equipment.
  • Video tags are hierarchical classification tags formed through multi-dimensional analysis of a video, such as scene classification, character recognition, speech recognition, and text recognition. The process of obtaining video tags can be called video marking.
  • The content of videos is classified through video marking, and the resulting tags can serve as a basis both for users searching for videos they are interested in and for videos recommended by businesses or platforms.
  • At present, video marking is done manually, that is, each video has to be marked by hand.
  • This manual marking method suffers from low efficiency.
  • The embodiments of the present application provide a model training method, a video processing method, a device, a storage medium, and an electronic device, which can improve the efficiency of video marking through a trained model.
  • an embodiment of the present application provides a model training method, including:
  • the basic model including an image feature extraction model, an audio feature extraction model, and a classification model
  • the parameters of the image feature extraction model, the audio feature extraction model, and the classification model are adjusted according to the difference between the predicted label and the classification label until the basic model converges, and the converged basic model is used as a video classification model for video classification.
  • an embodiment of the present application provides a video processing method, including:
  • the video classification model is obtained by training using the model training method provided in this embodiment.
  • an embodiment of the present application provides a model training device, including:
  • the first acquisition module is used to acquire video samples and classification labels corresponding to the video samples, and divide the video samples into image samples and audio samples;
  • the building module is used to build a basic model, which includes an image feature extraction model, an audio feature extraction model, and a classification model;
  • the extraction module is used to extract the image features of the image samples through the image feature extraction model, and extract the audio features of the audio samples through the audio feature extraction model;
  • the classification module is used to input the image features and audio features into the classification model for classification, and obtain the predicted label of the corresponding video sample;
  • the adjustment module is used to adjust the parameters of the image feature extraction model, the audio feature extraction model, and the classification model according to the difference between the predicted label and the classification label, until the basic model converges, and to use the converged basic model as the video classification model for video classification.
  • an embodiment of the present application provides a video processing device, including:
  • the receiving module is used to receive video processing requests
  • the second acquisition module is used to acquire the target video that needs to be classified according to the video processing request, and divide the target video into a target image and a target audio;
  • the calling module is used to call the pre-trained video classification model
  • the prediction module is used to input the target image and the target audio into the video classification model for classification, and obtain the classification label of the target video;
  • the video classification model is obtained by training using the model training method provided in this embodiment.
  • an embodiment of the present application provides a storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the model training method or the video processing method provided in this embodiment.
  • an embodiment of the present application provides an electronic device including a memory and a processor, the memory stores a computer program, and the processor invokes the computer program stored in the memory to execute:
  • the image feature of the image sample is extracted through the image feature extraction model, and the audio feature of the audio sample is extracted through the audio feature extraction model;
  • the parameters of the image feature extraction model, the audio feature extraction model, and the classification model are adjusted according to the difference between the predicted label and the classification label, until the basic model converges, and the converged basic model is used as the video classification model for video classification.
  • an embodiment of the present application provides an electronic device, including a memory and a processor, and a computer program is stored in the memory, and the processor is configured to execute:
  • the video classification model is obtained by training using the model training method provided in this embodiment.
  • FIG. 1 is a schematic diagram of the first flow of a model training method provided by an embodiment of the present application.
  • Fig. 2 is a schematic diagram of the principle of a model training method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the second flow of the model training method provided by the embodiment of the present application.
  • FIG. 4 is a schematic diagram of a third process of the model training method provided by an embodiment of the present application.
  • Fig. 5 is a schematic flowchart of a video processing method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the first structure of a model training device provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the second structure of a model training device provided by an embodiment of the present application.
  • Fig. 8 is a schematic structural diagram of a video processing device provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the first structure of an electronic device provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a second structure of an electronic device provided by an embodiment of the present application.
  • the embodiment of the present application provides a model training method.
  • the execution subject of the model training method may be the model training device provided in the embodiment of the present application, or an electronic device integrated with the model training device.
  • the model training device can be implemented in hardware or software, and the electronic device can be a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer that is equipped with a processor and has processing capabilities.
  • in the following, the electronic device is taken as an example of the execution subject of the model training method.
  • FIG. 1 is a schematic diagram of the first flow of the model training method provided by an embodiment of the present application.
  • the process of the model training method can include:
  • the electronic device can obtain video samples through a wired connection or a wireless connection.
  • the tag can reflect the content of the video sample, and the same video sample can have multiple corresponding tags.
  • for example, for a video sample of a girl skateboarding on a street, the labels may be girl, skateboard, and street, while unimportant surrounding pedestrians, vehicles, buildings, and so on are ignored.
  • the process of obtaining video tags can be called video marking.
  • the content of the video is classified through video marking, which can be used as a basis for users to find videos that they are interested in and videos recommended by certain businesses or platforms.
  • the classification label may be a label that is manually set (that is, set by manual marking).
  • in some embodiments, a video segment is clipped from the video sample, and the video segment is divided into image samples and audio samples.
  • the clipped video segment may be a single segment or multiple segments.
  • if there is a single video segment, it is divided into an image sample and an audio sample.
  • if there are multiple video segments, they are divided into multiple image samples and multiple audio samples.
  • the image sample is the visual observation of the content of the video sample, for example, the image sample shows that a person is playing the piano;
  • the audio sample is the auditory observation of the content of the video sample, for example, the sound of the piano.
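To make the splitting step above concrete, the following is a minimal sketch (not taken from the patent) of dividing a video file into image samples and an audio sample; OpenCV, ffmpeg, the file paths, the 16 kHz mono audio format, and the one-frame-per-second sampling are all illustrative assumptions.

```python
# Sketch: divide a video sample into image samples (sampled frames) and an audio sample.
import subprocess
import cv2  # OpenCV

def split_video(video_path: str, wav_path: str, frame_interval_s: float = 1.0):
    """Return sampled frames (as numpy arrays) and write the audio track to wav_path."""
    # Image samples: keep one frame every `frame_interval_s` seconds.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * frame_interval_s)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()

    # Audio sample: extract the sound track with ffmpeg as 16 kHz mono WAV.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    return frames, wav_path
```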
  • a basic model which includes an image feature extraction model, an audio feature extraction model, and a classification model.
  • building a basic model includes building an image feature extraction model, an audio feature extraction model, and a classification model in the basic model.
  • the basic model can be constructed so that the trained basic model can be applied to electronic devices such as smart phones, and then the video of the smart phone can be classified.
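As a rough illustration of this construction step, the sketch below assembles a basic model from three placeholder submodules in PyTorch; the class name, argument names, and the framework choice are assumptions, and the concrete submodules are discussed in the following paragraphs.

```python
import torch
import torch.nn as nn

class BasicModel(nn.Module):
    """Basic model = image feature extraction model + audio feature extraction model
    + classification model, mirroring the construction step in the text. The concrete
    submodule architectures are sketched separately below."""

    def __init__(self, image_extractor: nn.Module, audio_extractor: nn.Module,
                 classifier: nn.Module):
        super().__init__()
        self.image_extractor = image_extractor
        self.audio_extractor = audio_extractor
        self.classifier = classifier

    def forward(self, image_frames: torch.Tensor, audio_frames: torch.Tensor) -> torch.Tensor:
        image_features = self.image_extractor(image_frames)     # per-frame image features
        audio_features = self.audio_extractor(audio_frames)     # per-frame audio features
        return self.classifier(image_features, audio_features)  # predicted labels
```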
  • the image feature extraction model may adopt the ResNet-101 model.
  • the ResNet-101 model is a CNN (Convolutional Neural Network) model whose network has 101 hidden layers.
  • the ResNet-101 model uses multiple parameterized layers to learn the residual representation between input and output, instead of using parameterized layers like general CNN networks to directly try to learn the mapping between input and output.
  • Learning residuals with parameterized layers is much easier than directly learning the mapping between input and output (faster convergence) and more effective (higher classification accuracy can be achieved by using more layers).
  • the audio feature extraction model may adopt a VGG (Oxford Visual Geometry Group) deep network model.
  • the VGG deep network model is composed of five groups of convolutional layers, three fully connected layers, and a softmax output layer, with the groups separated by max-pooling. The increased depth of the VGG deep network model and its use of small convolution kernels have a large effect on the final audio feature extraction quality.
  • in some embodiments, the image feature extraction model, the audio feature extraction model, and the classification model are pre-trained separately.
  • then, the image features of the image sample are extracted through the pre-trained image feature extraction model, and the audio features of the audio sample are extracted through the pre-trained audio feature extraction model.
  • the image samples are input into the image feature extraction model, such as the pre-trained ResNet-101 model, for image feature extraction to obtain the image features of the image samples.
  • specifically, the image samples can be input into the pre-trained ResNet-101 model, and the features before the last fully connected layer of the 101-layer network (that is, the features output by the penultimate fully connected layer) are used as the image features of the image samples.
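For instance, with torchvision (an assumption; the patent does not name a framework), the penultimate-layer features of ResNet-101 could be taken roughly as follows; the frame count and input size are illustrative.

```python
import torch
import torchvision.models as models

# Load an ImageNet pre-trained ResNet-101 and drop its final fully connected layer,
# so that the pooled 2048-dimensional output is used as the per-frame image feature.
resnet101 = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
image_feature_extractor = torch.nn.Sequential(*list(resnet101.children())[:-1])
image_feature_extractor.eval()

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)              # 8 sampled frames (illustrative)
    image_features = image_feature_extractor(frames)  # shape: (8, 2048, 1, 1)
    image_features = image_features.flatten(1)        # shape: (8, 2048), one feature per frame
```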
  • the audio samples are input into the audio feature extraction model, such as the pre-trained VGG deep network model, for audio feature extraction to obtain the audio features of the audio samples.
  • specifically, the audio samples are input into the pre-trained VGG deep network model, and the features before the last layer of the VGG deep network model (that is, the features output by the penultimate fully connected layer) are used as the audio features of the audio samples.
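A VGG-style audio extractor in the spirit of the description might look like the sketch below; the log-spectrogram input shape (96x64), the channel widths, the 128-dimensional embedding, and the number of pre-training classes are assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class VGGStyleAudioNet(nn.Module):
    """VGG-style audio network: five groups of small-kernel convolutions separated by
    max-pooling, followed by three fully connected layers (the last feeding a softmax
    during pre-training). The 128-d output of the penultimate fully connected layer is
    used as the audio feature. Input: log spectrogram patches of shape (batch, 1, 96, 64),
    an assumed size."""

    def __init__(self, num_pretrain_classes: int = 527):
        super().__init__()

        def block(cin, cout, n=1):
            layers = []
            for i in range(n):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.ReLU(inplace=True)]
            return layers + [nn.MaxPool2d(2)]

        self.features = nn.Sequential(
            *block(1, 64), *block(64, 128), *block(128, 256, 2),
            *block(256, 512, 2), *block(512, 512, 2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(512 * 3 * 2, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 128), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(128, num_pretrain_classes)  # used only for pre-training

    def forward(self, x: torch.Tensor, return_embedding: bool = True) -> torch.Tensor:
        embedding = self.fc(self.features(x))                   # 128-d audio feature
        return embedding if return_embedding else self.classifier(embedding)
```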
  • the extracted image features and audio features can reflect the features of the video sample.
  • the extracted image features and audio features together reflect the features of the video segment.
  • the classification model includes two modules, a feature fusion module and a feature classification module.
  • the step of inputting the image features and the audio features into the classification model for classification, and obtaining the predicted label of the corresponding video sample, includes:
  • inputting the image features and the audio features into the feature fusion module for feature fusion to obtain the video features of the video sample, and inputting the video features of the video sample into the feature classification module for classification to obtain the predicted label of the corresponding video sample.
  • after the image feature extraction model outputs the image features of the image samples and the audio feature extraction model outputs the audio features of the audio samples, these image features and audio features are not input to any algorithm outside the basic model; instead, they pass directly from the two feature extraction models into the classification model inside the basic model.
  • the classification model receives the image features extracted by the image feature extraction model and the audio features extracted by the audio feature extraction model, and obtains the predicted label of the video sample by fusing and classifying these image features and audio features.
  • the feature fusion module can fuse multiple frame features into one feature, for example, can fuse multiple frame image features into one image feature, and fuse multiple frames of audio features into one audio feature.
  • the NeXtVLAD algorithm can be used in the feature fusion module.
  • the multi-frame image features are input into the NeXtVLAD algorithm as a variable x, where x can be x_1, x_2, x_3, and so on.
  • the cluster center C, obtained by the image feature extraction module through pre-training, is also obtained.
  • the specific aggregation can be written as w_1·(C − x_1) + w_2·(C − x_2) + w_3·(C − x_3) + …, that is, the difference between C and each x_i multiplied by the weight w_i corresponding to x_i, summed over all frames.
  • in this way, the visual feature fused from the image features and the sound feature fused from the audio features are obtained by weighted summation and normalization.
  • the visual feature and the sound feature are then combined to form the video feature of the video sample. If the image sample and the audio sample were divided from a video segment clipped from the video sample, the visual feature and the sound feature obtained by the NeXtVLAD algorithm correspond to the same video segment.
  • in that case, the combined video feature is also the video feature of that video segment.
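The following is a simplified, hypothetical stand-in for the fusion step: a weighted sum of residuals to learned cluster centers followed by normalization. It omits NeXtVLAD's grouping and dimensionality-reduction details, and the learned soft-assignment weighting and cluster count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleVLADPool(nn.Module):
    """Fuse per-frame features into one feature: a learned soft assignment produces a
    weight for each frame/cluster pair, the weighted residuals to the cluster centers
    are summed over frames, and the result is normalized."""

    def __init__(self, feature_dim: int, num_clusters: int = 8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, feature_dim))
        self.assign = nn.Linear(feature_dim, num_clusters)  # soft-assignment weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, feature_dim) per-frame image or audio features
        w = F.softmax(self.assign(x), dim=-1)                   # (frames, clusters)
        residuals = self.centers.unsqueeze(0) - x.unsqueeze(1)  # (C - x_i) per frame/cluster
        fused = (w.unsqueeze(-1) * residuals).sum(dim=0)        # weighted sum over frames
        return F.normalize(fused.flatten(), dim=0)              # (clusters * feature_dim,)
```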
  • the feature classification module includes a weight distribution unit and a weight weighting unit.
  • the weight allocation unit can use SE Context Gate (SE context gate, a layer in the neural network), and the weight weighting unit can use the MoE (Mixture of Experts, mixed expert) model.
  • SE Context Gate is used to suppress unimportant information in video features and highlight important information. For example, if a girl skates on the road in the video, the skateboard and girl are important information, and pedestrians and cars are unimportant information.
  • FIG. 2 is a schematic diagram of the principle of the model training method provided by an embodiment of the application, in which the structure of the SE Context Gate is shown.
  • the video feature X is input to a fully connected layer, batch-normalized, and then passed to a ReLU activation function. The output of the ReLU activation function is input to the next fully connected layer and batch-normalized again, and the result is passed to a Sigmoid activation function.
  • the Sigmoid activation function yields the feature weight of the video feature X, and this feature weight is multiplied by the video feature X to obtain the output Y, from which the video feature and its corresponding feature weight can be obtained.
  • when x < 0, ReLU is hard-saturated (its output is zero), and when x > 0 there is no saturation problem, so ReLU can keep the gradient from decaying when x > 0.
  • the Sigmoid activation function has an exponential shape, which is close to biological neurons in a physical sense.
  • the output of the Sigmoid activation function can also be used to express the probability or to normalize the input.
  • the video features are input into the SE Context Gate, and the feature weights of the video features are calculated. The feature weight of each video feature is different: for important information, the feature weight of the corresponding video feature is large, while for unimportant information, the feature weight of the corresponding video feature is small. Finally, the SE Context Gate outputs the video feature together with its corresponding feature weight, which are then input to the MoE model.
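A minimal PyTorch sketch of the gating structure just described (fully connected layer, batch normalization, ReLU, a second fully connected layer, batch normalization, Sigmoid, and multiplication back onto the input) is given below; the bottleneck reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class SEContextGate(nn.Module):
    """FC -> BatchNorm -> ReLU -> FC -> BatchNorm -> Sigmoid; the sigmoid output is the
    feature weight, which is multiplied element-wise with the video feature X to give Y."""

    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)
        self.bn1 = nn.BatchNorm1d(dim // reduction)
        self.fc2 = nn.Linear(dim // reduction, dim)
        self.bn2 = nn.BatchNorm1d(dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, dim) video features
        w = torch.relu(self.bn1(self.fc1(x)))
        w = torch.sigmoid(self.bn2(self.fc2(w)))  # feature weights, large for important info
        return x * w, w                           # gated feature Y and the weights
```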
  • the MoE model receives the video features and the corresponding weights from the SE Context Gate, and the classification algorithm in the MoE model feeds these video features and weights into multiple softmax classifiers; the classification results of the multiple softmax classifiers are then combined by weighted voting to obtain the final result.
  • the classification algorithm in the MoE model can be written as p(h|x) = Σ_{h'∈H} g(h'|x)·p(h|h', x), where category h and category h' both represent one of the categories in the total category set H, p(h|h', x) is the classification result of a single softmax classifier, g(h'|x) is the corresponding weight, and p(h|x) is the weighted summation of the classification results of the multiple softmax classifiers in the MoE model with their corresponding weights, from which the category of the video sample is obtained.
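The sketch below shows one way such a mixture of softmax experts could be implemented; the number of experts and the use of a learned gating layer for the expert weights are assumptions consistent with, but not specified by, the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEClassifier(nn.Module):
    """Mixture of experts: each expert is a softmax classifier over the label set, a gating
    layer assigns a weight to each expert, and the final prediction is the weighted sum
    (weighted voting) of the expert outputs."""

    def __init__(self, dim: int, num_classes: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.Linear(dim, num_classes * num_experts)
        self.gate = nn.Linear(dim, num_experts)
        self.num_classes, self.num_experts = num_classes, num_experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) video features already weighted by the SE Context Gate
        logits = self.experts(x).view(-1, self.num_experts, self.num_classes)
        expert_probs = F.softmax(logits, dim=-1)                 # per-expert class probabilities
        gate_probs = F.softmax(self.gate(x), dim=-1)             # weight of each expert
        return (gate_probs.unsqueeze(-1) * expert_probs).sum(1)  # final class probabilities
```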
  • in some embodiments, a loss function is used to calculate the gap between the predicted label and the classification label: the predicted label and the classification label are substituted into the loss function to obtain a loss value, and if the loss value meets a preset condition, the basic model is considered to have converged. For example, as the basic model is trained, the loss value output by the loss function becomes smaller and smaller.
  • a loss value threshold is set according to requirements. When the loss value is less than the loss value threshold, the basic model is considered to have converged and its training result meets expectations, and this converged basic model is used as the video classification model for video classification.
  • alternatively, convergence can be judged from the change of the model weights: a weight threshold is preset, and when the weight change between two successive iterations of the basic model is already small, i.e. less than the preset weight threshold, the basic model is considered to have converged.
  • in that case the training result of the basic model meets expectations, and the converged basic model is used as the video classification model for video classification.
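Putting the pieces together, a hypothetical joint training loop with a loss-threshold convergence test might look as follows; the BCE loss (matching the multi-label setting), the threshold values, the optimizer, and the data-loader format are assumptions, and `model` stands for the basic model sketched earlier.

```python
import torch
import torch.nn as nn

def train_until_converged(model: nn.Module, loader, loss_threshold: float = 0.05,
                          max_epochs: int = 50, lr: float = 1e-4):
    """End-to-end joint training of the basic model; training stops when the average
    epoch loss falls below a preset threshold (one of the convergence criteria
    described; a weight-change test could be used instead)."""
    criterion = nn.BCELoss()                      # multi-label loss on predicted probabilities
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for frames, audio, labels in loader:      # labels: multi-hot classification labels
            optimizer.zero_grad()
            predictions = model(frames, audio)    # predicted label probabilities
            loss = criterion(predictions, labels.float())
            loss.backward()                       # gradients flow into all three sub-models
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / max(1, len(loader)) < loss_threshold:
            break                                 # basic model considered converged
    return model
```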
  • in summary, the model training method obtains video samples and the classification labels corresponding to the video samples, and divides the video samples into image samples and audio samples; constructs a basic model that includes an image feature extraction model, an audio feature extraction model, and a classification model; extracts the image features of the image samples through the image feature extraction model and the audio features of the audio samples through the audio feature extraction model; and inputs the image features and audio features into the classification model for classification to obtain the predicted labels of the corresponding video samples.
  • according to the difference between the predicted label and the classification label, the parameters of the image feature extraction model, the audio feature extraction model, and the classification model are adjusted until the basic model converges, and the converged basic model is used as the video classification model for video classification.
  • the basic model can be trained through the loss function to obtain a video classification model with higher accuracy, which improves the accuracy of video classification.
  • FIG. 3 is a schematic diagram of the second flow of the model training method provided by an embodiment of the present application.
  • the process of the model training method can include:
  • the electronic device can obtain video samples through a wired connection or a wireless connection.
  • the video sample time can vary from a few seconds to tens of hours.
  • the tag can reflect the content of the video sample, and the same video sample can have multiple corresponding tags.
  • for example, for a video sample of a girl skateboarding on a street, the labels may be girl, skateboard, and street, while unimportant surrounding pedestrians, vehicles, buildings, and so on are ignored.
  • the process of obtaining video tags can be called video marking.
  • the content of the video is classified through video marking, which can be used as a basis for users to find videos that they are interested in and videos recommended by certain businesses or platforms.
  • the classification label may be a label that is manually set (that is, set by manual marking).
  • in some embodiments, video segments are clipped from the video samples, and the video segments are divided into image samples and audio samples.
  • a clipped video segment may be a single segment or multiple segments.
  • if there is a single video segment, it is divided into an image sample and an audio sample.
  • if there are multiple video segments, they are divided into multiple image samples and multiple audio samples.
  • the image sample is the visual observation of the content of the video sample, for example, the image sample shows that a person is playing the piano;
  • the audio sample is the auditory observation of the content of the video sample, for example, the sound of the piano.
  • Construct a basic model which includes an image feature extraction model, an audio feature extraction model, and a classification model.
  • building a basic model includes building an image feature extraction model, an audio feature extraction model, and a classification model in the basic model.
  • the basic model can be constructed so that the trained basic model can be applied to electronic devices such as smart phones, and then the video of the smart phone can be classified.
  • the image feature extraction model may adopt the ResNet-101 model.
  • the ResNet-101 model is a CNN (Convolutional Neural Network) model whose network has 101 hidden layers.
  • the ResNet-101 model uses multiple parameterized layers to learn the residual representation between input and output, instead of using parameterized layers like general CNN networks to directly try to learn the mapping between input and output.
  • Learning residuals with parameterized layers is much easier than directly learning the mapping between input and output (faster convergence) and more effective (higher classification accuracy can be achieved by using more layers).
  • the audio feature extraction model may adopt a VGG (Oxford Visual Geometry Group) deep network model.
  • the VGG deep network model is composed of five groups of convolutional layers, three fully connected layers, and a softmax output layer, with the groups separated by max-pooling. The increased depth of the VGG deep network model and its use of small convolution kernels have a large effect on the final audio feature extraction quality.
  • the first data set may be the ImageNet data set.
  • ImageNet is a large-scale visualization database for the research of visual object recognition software. It contains more than 14 million images, of which 1.2 million images are divided into 1000 categories (about 1 million images contain bounding boxes and annotations).
  • the pre-trained image feature extraction model can be used to extract image features from image samples.
  • at the end of this pre-training, a ResNet-101 model with better model parameters is obtained, and this pre-trained ResNet-101 model is used as the image feature extraction model for feature extraction.
  • in some embodiments, before pre-training the image feature extraction model according to the first data set, the method further includes: preprocessing the first data set.
  • the step of pre-training the image feature extraction model according to the first data set includes:
  • the image feature extraction model is pre-trained according to the pre-processed first data set.
  • the first data set may be preprocessed by performing data gain on the first data set.
  • the method of data gain includes randomly changing the initial images in the first data set, for example, performing one or more of horizontal mirror flipping, vertical mirror flipping, cropping, brightness adjustment, saturation adjustment, and hue adjustment on the initial images in the first data set. The preprocessed first data set is then input into the image feature extraction model, and the parameters of the image feature extraction model are adjusted according to the difference between the image features output by the image feature extraction model and the original image features of the original images in the first data set, so as to achieve the purpose of pre-training the image feature extraction model.
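With torchvision transforms (an assumed toolchain), the data gain described above could be expressed roughly as follows; the specific crop size, flip probabilities, and jitter strengths are illustrative.

```python
from torchvision import transforms

# Data gain: randomly change the initial images in the first data set before pre-training.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                                # cropping
    transforms.RandomHorizontalFlip(p=0.5),                           # horizontal mirror flip
    transforms.RandomVerticalFlip(p=0.1),                             # vertical mirror flip
    transforms.ColorJitter(brightness=0.4, saturation=0.4, hue=0.1),  # brightness/saturation/hue
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```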
  • the second data set may be an AudioSet data set.
  • the AudioSet data set contains multiple types of audio categories and a large number of manually marked sound clips, covering a wide range of human and animal sounds, sounds of musical instruments and music genres, and daily environmental sounds.
  • in some embodiments, before pre-training the audio feature extraction model according to the second data set, the method further includes: preprocessing the second data set.
  • Pre-training the audio feature extraction model according to the second data set includes:
  • the audio feature extraction model is pre-trained according to the pre-processed second data set.
  • the second data set may be preprocessed by performing short-time Fourier transform on the sound segments in the second data set.
  • the short-time Fourier transform is a time-spectrum analysis method, which expresses the signal characteristics at a certain moment through a segment of the signal in the time window.
  • the length of the window determines the time resolution and frequency resolution of the spectrogram: the longer the window, the longer the intercepted signal, and the higher the frequency resolution after the Fourier transform.
  • the short-time Fourier transform first multiplies the signal by a window function and then performs a one-dimensional Fourier transform on the windowed segment.
  • a series of Fourier transform results are obtained through the sliding of the window function, and these results are arranged to obtain a two-dimensional representation.
  • the sound signal can be discretized by short-time Fourier transform.
  • then, the parameters of the audio feature extraction model are adjusted according to the difference between the audio features output by the audio feature extraction model and the original audio features of the sound segments in the second data set, so as to achieve the purpose of pre-training the audio feature extraction model.
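The short-time Fourier transform preprocessing described above might be sketched with librosa as follows; the sampling rate, window length, hop length, and log-magnitude step are assumptions, since the patent only names the transform itself.

```python
import numpy as np
import librosa

def stft_spectrogram(wav_path: str, sr: int = 16000,
                     win_length: int = 400, hop_length: int = 160) -> np.ndarray:
    """Multiply the signal by a sliding window function and apply a one-dimensional
    Fourier transform to each windowed segment; arranging the results over time gives
    a two-dimensional time-frequency representation."""
    signal, _ = librosa.load(wav_path, sr=sr, mono=True)
    spectrum = librosa.stft(signal, n_fft=512, win_length=win_length,
                            hop_length=hop_length, window="hann")
    return np.log(np.abs(spectrum) + 1e-6)   # log-magnitude spectrogram for the audio model
```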
  • the image feature extraction model, the audio feature extraction model, and the classification model are pre-trained.
  • the image feature extraction model and the audio feature extraction model are respectively pre-trained, and then the classification model is pre-trained according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
  • pre-training the classification model according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model includes:
  • the parameters of the classification model are adjusted until the classification model converges.
  • the difference between the predicted label and the classification label can be reflected by the first BCE Loss (loss function). Input the predicted label and the classification label into the first loss function to obtain the first loss value.
  • the first loss value reflects the difference between the predicted label and the classification label.
  • when the first loss value meets a preset condition, the classification model converges; for example, when the first loss value is less than a first preset threshold, it is determined that the classification model has converged, and the pre-training of the classification model is completed.
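A hypothetical sketch of this classification-model pre-training step is shown below: the two pre-trained extractors are frozen, the predicted labels and classification labels are fed to a first BCE loss, and only the classifier is updated until the first loss value falls below a first preset threshold; the threshold, optimizer, and loop structure are assumptions.

```python
import torch
import torch.nn as nn

def pretrain_classifier(image_extractor, audio_extractor, classifier, loader,
                        first_threshold: float = 0.08, max_epochs: int = 20):
    """Pre-train only the classification model; the two pre-trained extractors stay frozen."""
    for p in list(image_extractor.parameters()) + list(audio_extractor.parameters()):
        p.requires_grad = False                          # extractors are already pre-trained
    first_loss = nn.BCELoss()                            # "first" loss: predicted vs. classification labels
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        for frames, audio, labels in loader:
            with torch.no_grad():
                image_features = image_extractor(frames)
                audio_features = audio_extractor(audio)
            predictions = classifier(image_features, audio_features)
            loss = first_loss(predictions, labels.float())   # first loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < first_threshold:            # below the first preset threshold
                return classifier                        # classification model has converged
    return classifier
```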
  • the pre-trained image feature extraction model, audio feature extraction model, and classification model are subjected to end-to-end joint training.
  • image samples and audio samples are input to the image feature extraction model and audio feature extraction model in the basic model for feature extraction.
  • the outputs of the image feature extraction model and the audio feature extraction model are used as the input of the classification model, and finally the classification model outputs the predicted label of the video sample.
  • the entire training process is completed inside the basic model, and no algorithms other than the basic model are used in the joint training process.
  • the image samples are input into the image feature extraction model, such as the ResNet-101 model, for image feature extraction to obtain the image features of the image samples, and the audio samples are input into the audio feature extraction model, such as the VGG deep network model, for audio feature extraction to obtain the audio features of the audio samples.
  • specifically, the image samples are input into the ResNet-101 model, and the features before the last fully connected layer of the 101-layer network (the features output by the penultimate fully connected layer) are used as the image features of the image samples.
  • the audio samples are input into the VGG deep network model, and the features before the last fully connected layer of the VGG deep network model (the features output by the penultimate fully connected layer) are used as the audio features of the audio samples.
  • the image feature can be a multi-frame image feature
  • the audio feature can be a multi-frame audio feature.
  • the extracted image feature and audio feature together represent the feature of the same video segment.
  • the steps of inputting image features and audio features into the feature fusion module to perform feature fusion to obtain the video features of the video sample include:
  • the image features are input into the feature fusion module for feature fusion to obtain the visual features of the video sample.
  • the audio features are input into the feature fusion module for feature fusion to obtain the sound features of the video sample, and the visual features and the sound features are combined into the video features of the video sample.
  • the feature fusion module can fuse multiple frame features into one feature, for example, fuse multiple frame image features into one image feature, and fuse multiple frame audio features into one audio feature.
  • for example, the NeXtVLAD algorithm can be used in the feature fusion module.
  • the multi-frame image features are input into the NeXtVLAD algorithm as a variable x, where x can be x_1, x_2, x_3, and so on.
  • the specific aggregation can be written as w_1·(C − x_1) + w_2·(C − x_2) + w_3·(C − x_3) + …, where C is the cluster center obtained through pre-training and w_i is the weight corresponding to x_i.
  • in this way, the visual feature formed by fusing the multiple frames of image features and the sound feature formed by fusing the multiple frames of audio features are obtained through weighted summation and normalization.
  • the visual feature and the sound feature are then combined to form the video feature of the video sample. If the image sample and the audio sample were divided from a video segment clipped from the video sample, the visual feature and the sound feature obtained by the NeXtVLAD algorithm correspond to the same video segment.
  • in that case, the combined video feature is also the video feature of that video segment.
  • the classification model includes a feature fusion module and a classification module
  • the classification module includes a weight distribution unit and a weight weighting unit.
  • the step of inputting the video features of the video sample into the feature classification module for classification, and obtaining the predicted label of the corresponding video sample, includes:
  • inputting the video feature into the weight distribution unit to calculate the feature weight corresponding to the video feature, and then inputting the video feature and the corresponding feature weight into the weight weighting unit to calculate a weighted sum to obtain the predicted label of the corresponding video sample.
  • in the weight allocation unit of the classification module, the SE Context Gate can be used. The SE Context Gate is used to suppress unimportant information in the video features and highlight important information. For example, if a girl skates on the road in the video, the skateboard and the girl are important information, while pedestrians and cars are unimportant information.
  • FIG. 2 is a schematic diagram of the principle of the model training method provided by an embodiment of the application, in which the structure of the SE Context Gate is shown.
  • the video feature X is input to a fully connected layer, batch-normalized, and then passed to a ReLU activation function. The output of the ReLU activation function is input to the next fully connected layer and batch-normalized again, and the result is passed to a Sigmoid activation function.
  • the Sigmoid activation function yields the feature weight of the video feature X, and this feature weight is multiplied by the video feature X to obtain the output Y, from which the video feature and its corresponding feature weight can be obtained.
  • when x < 0, ReLU is hard-saturated (its output is zero), and when x > 0 there is no saturation problem, so ReLU can keep the gradient from decaying when x > 0.
  • the Sigmoid activation function has an exponential shape, which is close to biological neurons in a physical sense.
  • the output of the Sigmoid activation function can also be used to express the probability or to normalize the input.
  • the video features are input into the SE Context Gate, and the feature weights of the video features are calculated. The feature weight of each video feature is different: for important information, the feature weight of the corresponding video feature is large, while for unimportant information, the feature weight of the corresponding video feature is small. Finally, the SE Context Gate outputs the video feature together with its corresponding feature weight, which are then input to the MoE model.
  • the weight weighting unit includes a plurality of preset classifiers
  • the step of inputting the video feature and the corresponding feature weight into the weight weighting unit to obtain the predicted label of the corresponding video sample includes:
  • inputting the video feature and the corresponding feature weight into the multiple preset classifiers to obtain multiple classification results and corresponding weights, and calculating a weighted sum according to the multiple classification results and the weights corresponding to the multiple classification results to obtain the predicted label of the corresponding video sample.
  • the MoE model can be used.
  • the classification algorithm of the MoE model includes multiple softmax classifiers.
  • the MoE model receives the video features and the corresponding weights from the SE Context Gate, and the classification algorithm in the MoE model feeds these video features and weights into the multiple softmax classifiers; the classification results of the multiple softmax classifiers are then combined by weighted voting to obtain the final result.
  • the classification algorithm in the MoE model can be written as p(h|x) = Σ_{h'∈H} g(h'|x)·p(h|h', x), where category h and category h' both represent one of the categories in the total category set H, p(h|h', x) is the classification result of a single softmax classifier, g(h'|x) is the corresponding weight, and p(h|x) is the weighted summation of the classification results of the multiple softmax classifiers in the MoE model with their corresponding weights, from which the category of the video sample is obtained.
  • in the joint training, the difference between the predicted label and the classification label is used not only to adjust the parameters of the classification model, but also to adjust the parameters of all three models: the image feature extraction model, the audio feature extraction model, and the classification model.
  • in some embodiments, a second loss function is used to calculate the gap between the predicted label and the classification label. The predicted label and the classification label are substituted into the second loss function to obtain a second loss value, and if the second loss value meets a preset condition, it is determined that the basic model has converged. For example, as the basic model is trained, the second loss value output by the second loss function becomes smaller and smaller; a second preset threshold is set according to requirements, and when the second loss value is less than the second preset threshold, the basic model is considered to have converged, its training result meets expectations, and the converged basic model is used as the video classification model for video classification.
  • alternatively, a third preset threshold is set in advance, and when the weight change between two successive iterations is already very small, i.e. less than the third preset threshold, it is considered that the basic model has converged, the training result of the basic model meets expectations, and the converged basic model is used as the video classification model for video classification.
  • FIG. 4 is a schematic diagram of a third process of the model training method provided by an embodiment of the present application.
  • first, a video segment is clipped from a manually marked video sample, image frame sampling and audio sampling are performed on the video segment, the video segment is divided into image samples and audio samples, and the image samples and the audio samples are preprocessed separately.
  • the preprocessing of the image samples includes scaling of the image size; the preprocessing of the audio samples includes performing short-time Fourier transform on the audio signal of the audio samples.
  • a basic model which includes image feature extraction model, audio feature extraction model, and classification model.
  • the image features output by the image feature extraction model and the audio features output by the audio feature extraction model are input into the classification model.
  • the input image features are fused through the image part of the feature fusion model in the classification model to obtain the visual features of the video sample, and the input audio features are fused through the audio part of the feature fusion model to obtain the sound features of the video sample; the visual features and the sound features are combined into the video features of the video sample, which are input to the weight distribution unit to calculate the weights corresponding to the video features.
  • the video feature and the corresponding weight are input to the weight weighting unit to obtain the prediction result of the video sample, that is, the prediction label.
  • the bolded modules are the modules participating in the training, that is, the modules that adjust the parameters according to the difference between the predicted label and the classification label in the joint training process.
  • in summary, the model training method obtains video samples and the classification labels corresponding to the video samples, and divides the video samples into image samples and audio samples; constructs a basic model that includes an image feature extraction model, an audio feature extraction model, and a classification model; extracts the image features of the image samples through the image feature extraction model and the audio features of the audio samples through the audio feature extraction model; and inputs the image features and audio features into the classification model for classification to obtain the predicted labels of the corresponding video samples.
  • according to the difference between the predicted label and the classification label, the parameters of the image feature extraction model, the audio feature extraction model, and the classification model are adjusted until the basic model converges, and the converged basic model is used as the video classification model for video classification.
  • the basic model can be trained through the loss function to obtain a video classification model with higher accuracy, which improves the accuracy of video classification.
  • FIG. 5 is a schematic flowchart of a video processing method provided by an embodiment of the present application.
  • the flow of the video processing method may include:
  • for example, when the electronic device receives a touch operation on a target component, a preset voice operation, or a start instruction of a preset target application, the generation of a video processing request is triggered.
  • the electronic device can also automatically trigger the generation of the video processing request at a preset time interval or based on a certain trigger rule. For example, when the electronic device detects that the current display interface includes a video, such as detecting that the electronic device starts a browser to browse the video, it can automatically trigger the generation of a video processing request, and classify the current target video according to the video classification model.
  • the electronic device can automatically generate the predicted label of the target video through the machine learning algorithm.
  • the target video may be a video stored in an electronic device.
  • in some embodiments, the video processing request includes path information indicating where the target video is stored, and the electronic device can use the path information to obtain the target video that needs label prediction.
  • the electronic device may obtain the target video that needs to be classified through a wired connection or a wireless connection according to the video processing request.
  • one or more video segments are clipped from the target video.
  • if there is a single video segment, it is divided into a target image and a target audio.
  • if there are multiple video segments, they are divided into multiple target images and multiple target audios.
  • the target image and the target audio correspond to the same video segment in the target video.
  • the video classification model is obtained by training using the model training method provided in this embodiment.
  • model training process please refer to the related description of the above-mentioned embodiment, which will not be repeated here.
  • the target image and the target audio are input into the video classification model for classification, so as to obtain the classification label corresponding to the target video.
  • the category label may represent the category of the target video.
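At inference time, applying the trained video classification model to a target video might look like the sketch below; `split_video` and `stft_spectrogram` are the hypothetical helpers introduced in the earlier sketches, and the 0.5 probability threshold for emitting a label is an assumption.

```python
import torch
import torch.nn.functional as F

def classify_target_video(model, video_path: str, label_names, threshold: float = 0.5):
    """Divide the target video into a target image part and a target audio part, run the
    trained video classification model, and return the predicted classification labels."""
    frames, wav_path = split_video(video_path, "target_audio.wav")   # hypothetical helper
    images = torch.stack([torch.as_tensor(f).permute(2, 0, 1).float() / 255 for f in frames])
    images = F.interpolate(images, size=(224, 224))                  # match the extractor's input size
    audio = torch.as_tensor(stft_spectrogram(wav_path)).unsqueeze(0).unsqueeze(0)

    model.eval()
    with torch.no_grad():
        probs = model(images, audio.float()).squeeze(0)              # per-class probabilities
    return [name for name, p in zip(label_names, probs.tolist()) if p > threshold]
```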
  • in summary, the video processing method receives a video processing request; obtains the target video that needs to be classified according to the video processing request and divides the target video into a target image and a target audio; calls a pre-trained video classification model; and inputs the target image and the target audio into the video classification model for classification to obtain the classification label of the target video. In this way, the target video is classified through the video classification model.
  • the model training device may include a first acquisition module 401, a construction module 402, an extraction module 403, a classification module 404, and an adjustment module 405:
  • the first obtaining module 401 is configured to obtain video samples and classification labels corresponding to the video samples, and divide the video samples into image samples and audio samples;
  • the construction module 402 is used to construct a basic model, the basic model includes an image feature extraction model, an audio feature extraction model, and a classification model;
  • the extraction module 403 is used to extract the image features of the image samples through the image feature extraction model, and extract the audio features of the audio samples through the audio feature extraction model;
  • the classification module 404 is configured to input the image features and audio features into the classification model for classification, and obtain the predicted label of the corresponding video sample;
  • the adjustment module 405 is used to adjust the parameters of the image feature extraction model, the audio feature extraction model, and the classification model according to the difference between the predicted label and the classification label, until the basic model converges, and use the converged basic model as a video for video classification Classification model.
  • the first acquisition module 401 is specifically configured to intercept video clips from video samples; divide the video clips into image samples and audio samples.
  • the classification module 404 is specifically configured to input the image features and audio features into the feature fusion module for feature fusion to obtain the video features of the video samples; input the video features of the video samples into the feature classification module for classification to obtain Corresponding to the predicted label of the video sample.
  • the classification module 404 is specifically configured to input image features into the feature fusion module for feature fusion to obtain visual features of the video sample; input audio features into the feature fusion module for feature fusion to obtain the sound features of the video sample ; Combine visual features and sound features into video features of video samples.
  • the feature classification module includes a weight distribution unit and a weight weighting unit.
  • the classification module 404 is specifically configured to input the video feature into the weight distribution unit to calculate the weight to obtain the feature weight corresponding to the video feature;
  • and the video feature and the corresponding feature weight are input into the weight weighting unit to calculate the weighted sum to obtain the predicted label of the corresponding video sample.
  • the weight weighting unit includes multiple preset classifiers
  • the classification module 404 is specifically configured to input the video features and the corresponding feature weights into the multiple preset classifiers to obtain multiple classification results and corresponding weights, and to calculate a weighted sum according to the multiple classification results and corresponding weights to obtain the predicted label of the corresponding video sample.
  • FIG. 7 is a schematic diagram of the second structure of the model training device provided by an embodiment of the application.
  • the model training device provided in the embodiment of the present application further includes a pre-training module 406:
  • the pre-training module 406 is used for pre-training the image feature extraction model according to the first data set, pre-training the audio feature extraction model according to the second data set,
  • and pre-training the classification model according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
  • when the pre-training module 406 pre-trains the image feature extraction model according to the first data set, it first preprocesses the first data set and then pre-trains the image feature extraction model according to the preprocessed first data set.
  • in summary, the model training device obtains video samples and the classification labels corresponding to the video samples through the first acquisition module 401, which divides the video samples into image samples and audio samples; the construction module 402 constructs a basic model that includes an image feature extraction model, an audio feature extraction model, and a classification model; the extraction module 403 extracts the image features of the image samples through the image feature extraction model and the audio features of the audio samples through the audio feature extraction model; the classification module 404 inputs the image features and the audio features into the classification model for classification to obtain the predicted labels of the corresponding video samples; and the adjustment module 405 adjusts the parameters of the image feature extraction model, the audio feature extraction model, and the classification model according to the difference between the predicted label and the classification label until the basic model converges, and uses the converged basic model as the video classification model for video classification.
  • in this way, the basic model can be trained through the loss function to obtain a video classification model with higher accuracy, which improves the accuracy of video classification.
  • the model training device provided in this embodiment of the application belongs to the same concept as the model training method in the above embodiments. Any method provided in the model training method embodiments can be run on the model training device; for details of its specific implementation process, refer to the embodiments of the model training method, which will not be repeated here.
  • FIG. 8 is a schematic structural diagram of a video processing apparatus provided by an embodiment of the application.
  • the video processing device may include a receiving module 501, a second acquisition module 502, a calling module 503, and a prediction module 504.
  • the receiving module 501 is configured to receive a video processing request
  • the second obtaining module 502 is configured to obtain a target video that needs to be classified according to a video processing request, and divide the target video into a target image and a target audio;
  • the calling module 503 is used to call a pre-trained video classification model
  • the prediction module 504 is configured to input the target image and the target audio into the video classification model for label prediction, and obtain the target label of the target video;
  • the video classification model is obtained by training using the model training method provided in the embodiment of the present application.
  • in summary, the video processing device receives a video processing request through the receiving module 501; the second acquisition module 502 acquires the target video that needs to be classified according to the video processing request and divides the target video into a target image and a target audio; the calling module 503 calls a pre-trained video classification model; and the prediction module 504 inputs the target image and the target audio into the video classification model for classification to obtain the classification label of the target video. In this way, the target video is classified through the video classification model.
  • the video processing device provided in this embodiment of the application belongs to the same concept as the video processing method in the above embodiments. Any method provided in the video processing method embodiments can be run on the video processing device; for details of its specific implementation process, refer to the embodiments of the video processing method, which will not be repeated here.
  • the embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed on a computer, the computer is caused to execute the model training method or the video processing method provided in the embodiments of the present application.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM), etc.
  • An embodiment of the application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory.
  • the processor calls the computer program stored in the memory to execute the model training method or the video processing method provided in the embodiments of the application.
  • the above-mentioned electronic device may be a mobile terminal such as a tablet computer or a smart phone.
  • FIG. 9 is a schematic diagram of the first structure of an electronic device provided by an embodiment of this application.
  • the electronic device 600 may include components such as a memory 601 and a processor 602. Those skilled in the art can understand that the structure of the electronic device shown in FIG. 9 does not constitute a limitation on the electronic device, and may include more or fewer components than shown in the figure, or a combination of certain components, or different component arrangements.
  • the memory 601 may be used to store software programs and modules.
  • the processor 602 executes various functional applications and data processing by running the computer programs and modules stored in the memory 601.
  • the memory 601 may mainly include a storage program area and a storage data area.
  • the storage program area may store an operating system, a computer program required by at least one function (such as a sound playback function, an image playback function, etc.), and so on; the storage data area may store data created by the use of the electronic device, etc.
  • the processor 602 is the control center of the electronic device. It uses various interfaces and lines to connect the various parts of the entire electronic device, and by running or executing the application programs stored in the memory 601 and calling the data stored in the memory 601, it executes the various functions of the electronic device and processes data, thereby monitoring the electronic device as a whole.
  • the memory 601 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the memory 601 may further include a memory controller to provide the processor 602 with access to the memory 601.
  • in this embodiment, the processor 602 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 601 according to the following instructions, and the processor 602 runs the application programs stored in the memory 601, so as to implement the following process:
  • the image feature of the image sample is extracted through the image feature extraction model, and the audio feature of the audio sample is extracted through the audio feature extraction model;
  • the parameters of the image feature extraction model, the audio feature extraction model, and the classification model are adjusted according to the difference between the predicted label and the classification label, until the basic model converges, and the converged basic model is used as the video classification model for video classification.
  • the classification model includes a feature fusion module and a feature classification module.
  • when the processor 602 inputs the image features and the audio features into the classification model for classification to obtain the predicted label of the corresponding video sample, it may execute:
  • the video features of the video samples are input into the feature classification module for classification, and the predicted labels of the corresponding video samples are obtained.
  • when the processor 602 performs feature fusion by inputting the image features and the audio features into the feature fusion module to obtain the video feature of the video sample, it may execute:
  • the visual features and sound features are combined into the video features of the video samples.
  • the feature classification module includes a weight distribution unit and a weight weighting unit.
  • when the processor 602 inputs the video feature of the video sample into the feature classification module for classification to obtain the predicted label of the corresponding video sample, it may execute:
  • the video feature and the corresponding feature weight are input into the weight weighting unit to calculate the weighted sum to obtain the predicted label of the corresponding video sample.
  • the weight weighting unit includes a plurality of preset classifiers;
  • when the processor 602 inputs the video feature and the corresponding feature weight into the weight weighting unit to obtain the predicted label of the corresponding video sample, it may execute:
  • a weighted sum is calculated according to multiple classification results and corresponding weights to obtain the predicted label of the corresponding video sample.
  • when the processor 602 divides the video sample into an image sample and an audio sample, it may execute:
  • the video segment is divided into image samples and audio samples.
  • before the processor 602 extracts the image feature of the image sample through the image feature extraction model and extracts the audio feature of the audio sample through the audio feature extraction model, it may execute:
  • the classification model is pre-trained according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
  • before the processor 602 pre-trains the image feature extraction model according to the first data set, it may perform:
  • the step of pre-training the image feature extraction model according to the first data set includes:
  • the image feature extraction model is pre-trained according to the pre-processed first data set.
  • the processor 602 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 601 according to the following instructions, and the processor 602 runs the application programs stored in the memory 601, so as to realize the following process:
  • the video classification model is obtained by training using the model training method provided in the embodiment of the present application.
  • FIG. 10 is a schematic diagram of a second structure of an electronic device provided by an embodiment of the application.
  • the electronic device further includes a display 603, a radio frequency circuit 604, an audio circuit 605, and a power supply 606.
  • the display 603, the radio frequency circuit 604, the audio circuit 605, and the power supply 606 are electrically connected to the processor 602, respectively.
  • the display 603 can be used to display information input by the user or information provided to the user, and various graphical user interfaces. These graphical user interfaces can be composed of graphics, text, icons, videos, and any combination thereof.
  • the display 603 may include a display panel.
  • the display panel may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED).
  • the radio frequency circuit 604 may be used to transmit and receive radio frequency signals to establish wireless communication with network equipment or other electronic equipment through wireless communication, and to transmit and receive signals with the network equipment or other electronic equipment.
  • the audio circuit 605 can be used to provide an audio interface between the user and the electronic device through a speaker or a microphone.
  • the power supply 606 can be used to power various components of the electronic device 600.
  • the power supply 606 may be logically connected to the processor 602 through a power management system, so that functions such as management of charging, discharging, and power consumption management can be realized through the power management system.
  • the electronic device 600 may also include a camera component, a Bluetooth module, etc.
  • the camera component may include an image processing circuit, which may be implemented by hardware and/or software components, and may include various processing units that define an image signal processing (Image Signal Processing) pipeline.
  • the image processing circuit may at least include: multiple cameras, an image signal processor (Image Signal Processor, ISP processor), a control logic, an image memory, a display, and the like.
  • Each camera may include at least one or more lenses and image sensors.
  • the image sensor may include a color filter array (such as a Bayer filter). The image sensor can acquire the light intensity and wavelength information captured with each imaging pixel of the image sensor, and provide a set of raw image data that can be processed by the image signal processor.
  • the model training apparatus/video processing apparatus provided by the embodiments of this application belongs to the same concept as the model training method/video processing method in the above embodiments, and any method provided in the model training method/video processing method embodiments can be run on the model training apparatus/video processing apparatus.
  • for the specific implementation process, please refer to the model training method/video processing method embodiments, which will not be repeated here.
  • for the model training method/video processing method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the model training method/video processing method can be completed by a computer program controlling the related hardware.
  • the computer program can be stored in a computer-readable storage medium, for example stored in a memory, and executed by at least one processor.
  • the execution process may include the flow of the embodiments of the model training method/video processing method.
  • the storage medium may be a magnetic disk, an optical disc, a read only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), etc.
  • for the model training apparatus/video processing apparatus of the embodiments, each functional module may be integrated in one processing chip, each module may exist alone physically, or two or more modules may be integrated in one module.
  • the above-mentioned integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

一种模型训练方法、视频处理方法、装置、存储介质及电子设备。模型训练方法包括:获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本(101);构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型(102);通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征(103);将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签(104);根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型(105)。

Description

模型训练方法、视频处理方法、装置、存储介质及电子设备 技术领域
本申请涉及机器学习领域,特别涉及一种模型训练方法、视频处理方法、装置、存储介质及电子设备。
背景技术
随着移动互联网的快速发展和智能手机的快速普及,图像和视频等视觉内容数据与日俱增,随之衍生出视频标签。视频标签是通过对视频进行场景分类、人物识别、语音识别、文字识别等多维度分析,形成的层次化的分类标签。其中,获取视频标签的过程可称为视频打标,通过视频打标对视频的内容进行分类,可作为用户寻找自己感兴趣的视频及某些商家或者平台推荐视频的依据。
目前,视频打标的方式为人工打标,需要依靠人力对视频打标。但是,人工打标的方式存在效率低的问题。
发明内容
本申请实施例提供一种模型训练方法、视频处理方法、装置、存储介质及电子设备,可以通过训练模型提高视频打标的效率。
第一方面,本申请实施例提供一种模型训练方法,包括:
获取视频样本以及对应所述视频样本的分类标签,将所述视频样本划分为图像样本和音频样本;
构建基础模型,所述基础模型包括图像特征提取模型、音频特征提取模型和分类模型;
通过所述图像特征提取模型提取得到所述图像样本的图像特征,以及通过所述音频特征提取模型提取得到所述音频样本的音频特征;
将所述图像特征以及所述音频特征输入所述分类模型进行分类,得到对应所述视频样本的预测标签;
根据所述预测标签与所述分类标签的差异对所述图像特征提取模型、所述音频特征提取模型以及所述分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
第二方面,本申请实施例提供一种视频处理方法,包括:
接收视频处理请求;
根据视频处理请求获取需要进行分类的目标视频,并将目标视频划分为目标图像和目标音频;
调用预先训练的视频分类模型;
将目标图像与目标音频输入视频分类模型进行分类,获得目标视频的分类标签;
其中,所述视频分类模型采用本实施例提供的模型训练方法训练得到。
第三方面,本申请实施例提供一种模型训练装置,包括:
第一获取模块,用于获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本;
构建模块,用于构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型;
提取模块,用于通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征;
分类模块,用于将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签;
调整模块,用于根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
第四方面,本申请实施例提供一种视频处理装置,包括:
接收模块,用于接收视频处理请求;
第二获取模块,用于根据视频处理请求获取需要进行分类的目标视频,并将目标视频划分为目标图像和目标音频;
调用模块,用于调用预先训练的视频分类模型;
预测模块,用于将目标图像与目标音频输入视频分类模型进行分类,获得目标视频的分类标签;
其中,视频分类模型采用本实施例提供的模型训练方法训练得到。
第五方面,本申请实施例提供一种存储介质,其上存储有计算机程序,其中,当所述计算机程序在计算机上执行时,使得所述计算机执行本实施例提供的模型训练方法或视频处理方法。
第六方面,本申请实施例提供一种电子设备,包括存储器和处理器,所述存储器中存储有计算机程序,所述处理器通过调用所述存储器中存储的所述计算机程序,用于执行:
获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本;
构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型;
通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征;
将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签;
根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
第七方面,本申请实施例提供一种电子设备,包括存储器和处理器,存储器中存储有 计算机程序,处理器通过调用存储器中存储的计算机程序,用于执行:
接收视频处理请求;
根据视频处理请求获取需要进行分类的目标视频,并将目标视频划分为目标图像和目标音频;
调用预先训练的视频分类模型;
将目标图像与目标音频输入视频分类模型进行分类,获得目标视频的分类标签;
其中,所述视频分类模型采用本实施例提供的模型训练方法训练得到。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单的介绍。显而易见的,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的模型训练方法的第一种流程示意图。
图2是本申请实施例提供的模型训练方法的原理示意图。
图3是本申请实施例提供的模型训练方法的第二种流程示意图。
图4是本申请实施例提供的模型训练方法的第三种流程示意图。
图5是本申请实施例提供的视频处理方法的流程示意图。
图6是本申请实施例提供的模型训练装置的第一种结构示意图。
图7是本申请实施例提供的模型训练装置的第二种结构示意图。
图8是本申请实施例提供的视频处理装置的结构示意图。
图9是本申请实施例提供的电子设备的第一种结构示意图。
图10是本申请实施例提供的电子设备的第二种结构示意图。
具体实施方式
请参照图示,其中相同的组件符号代表相同的组件,本申请的原理是以实施在一适当的运算环境中来举例说明。以下的说明是基于所例示的本申请具体实施例,其不应被视为限制本申请未在此详述的其它具体实施例。
本申请实施例提供一种模型训练方法。其中,该模型训练方法的执行主体可以是本申请实施例提供的模型训练装置,或者集成了该模型训练装置的电子设备。该模型训练装置可以采用硬件或者软件的方式实现,电子设备可以是智能手机、平板电脑、掌上电脑、笔记本电脑、或者台式电脑等配置有处理器而具有处理能力的设备。为了便于描述,以下将以模型训练方法的执行主体为电子设备进行举例说明。
请参阅图1,图1是本申请实施例提供的模型训练方法的第一种流程示意图。该模型训练方法的流程可以包括:
101、获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本。
其中,电子设备可以通过有线连接或无线连接的方式获取视频样本。其中,标签能够体现视频样本的内容,同一个视频样本可以有多个对应的标签。
例如,视频样本的内容为女孩在街道上滑滑板,则该视频样本的标签可以为女孩、滑板、街道,而忽略周边不重要的行人、车辆、建筑等。获取视频标签的过程可称为视频打标,通过视频打标对视频的内容进行分类,可作为用户寻找自己感兴趣的视频及某些商家或者平台推荐视频的依据。其中,分类标签可以为人工设置(即通过人工打标的方式设置)的标签。
在一实施方式中,在视频样本中截取视频片段,将视频片段划分为图像样本和音频样本。其中,视频片段可以为一段,也可以为多段。当从视频中截取一段视频片段时,将这一段视频片段划分为一个图像样本和一个音频样本。当从视频中截取多段视频片段时,将多段视频片段划分为多个图像样本和多个音频样本。
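A rough sketch of this splitting step is given below, assuming ffmpeg is available on the system path; the frame rate, sample rate, and file names are illustrative assumptions rather than values fixed by the embodiment.

```python
import os
import subprocess

def split_video(video_path, frame_dir, audio_path, fps=1, sample_rate=16000):
    """Sample image frames and extract the audio track from one video clip."""
    os.makedirs(frame_dir, exist_ok=True)
    # Sample `fps` frames per second as JPEG images (the image samples).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}",
         f"{frame_dir}/frame_%05d.jpg"],
        check=True,
    )
    # Drop the video stream and keep a mono 16 kHz WAV track (the audio sample).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", str(sample_rate), audio_path],
        check=True,
    )

# Example: split_video("clip.mp4", "frames", "clip.wav")
```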
其中,图像样本为从视觉上观察视频样本的内容,例如,图像样本中显示人在弹钢琴;音频样本为从听觉上观察视频样本的内容,例如,钢琴响起的声音。
102、构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型。
其中,构建基础模型包括构建基础模型中的图像特征提取模型、音频特征提取模型和分类模型。可以通过构建基础模型,使得训练完成的基础模型能应用于电子设备如智能手机中,进而对智能手机的视频进行分类。
在一实施方式中,图像特征提取模型可以采用ResNet-101模型。ResNet-101模型是一种CNN(Convolution Neural Network,卷积神经网络)模型,具有101层网络隐藏层。ResNet-101模型通过使用多个有参层来学习输入输出之间的残差表示,而非像一般CNN网络那样使用有参层来直接尝试学习输入、输出之间的映射。使用有参层来直接学习残差比直接学习输入、输出间映射要容易得多(收敛速度更快),也有效得多(可通过使用更多的层来达到更高的分类精度)。
在一实施方式中,音频特征提取模型可以采用VGG(Oxford Visual Geometry Group)深度网络模型。VGG深度网络模型由5层卷积层、3层全连接层、softmax输出层构成,层与层之间使用max-pooling(最大池化)分开。其中,VGG深度网络模型的深度增加和小卷积核的使用对最终的音频特征提取效果有很大的作用。
103、通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征。
在一实施方式中,通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征之前,先分别对图像特征提取模型、音频特征提取模型和分类模型进行预训练,然后,通过预训练后的图像特征提取模型提取得到图像样本的图像特征,以及通过预训练后的音频特征提取模型提取得到音频样本的音频特征。
在一实施方式中,将图像样本输入至图像特征提取模型如预训练后的ResNet-101模型 中进行图像特征提取,得到图像样本的图像特征。例如,可以将图像样本输入至预训练后的ResNet-101模型,将ResNet-101模型的101层网络隐藏层中最后一层全连接前的特征(即,倒数第二层全连接层输出的特征)作为图像样本的图像特征。
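A minimal sketch of taking the feature ahead of the last fully connected layer, using torchvision's ResNet-101 as a stand-in for the pre-trained image feature extraction model; weight loading and input preprocessing are simplified assumptions.

```python
import torch
import torchvision

# ResNet-101 backbone; replacing the final fully connected layer with Identity
# exposes the pooled 2048-dimensional feature that precedes it, i.e. the
# "feature before the last fully connected layer" used as the image feature.
backbone = torchvision.models.resnet101()
backbone.fc = torch.nn.Identity()
backbone.eval()

frames = torch.randn(8, 3, 224, 224)      # 8 sampled frames from one video clip
with torch.no_grad():
    image_features = backbone(frames)     # shape: (8, 2048), one feature per frame
```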
在一实施方式中,将音频样本输入至音频特征提取模型如预训练后的VGG深度网络模型中进行音频特征提取,得到音频样本的音频特征。例如,将音频样本输入至预训练后的VGG深度网络模型,将VGG深度网络模型中最后一层全连接前的特征(即,倒数第二层全连接层输出的特征)作为音频样本的音频特征。
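A sketch of the audio branch, under the assumption that the audio sample is first converted to a log-mel spectrogram and passed through a VGG-style convolutional stack whose penultimate fully connected output is taken as the audio feature; the layer sizes here are illustrative and not the exact VGG configuration of the embodiment.

```python
import torch
import torch.nn as nn
import torchaudio

# Log-mel spectrogram of a mono waveform (assumed 16 kHz here).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=400,
                                           hop_length=160, n_mels=64)
waveform = torch.randn(1, 16000)              # 1 second of audio
spec = torch.log(mel(waveform) + 1e-6)        # shape: (1, 64, T)

# Small VGG-style stack; the feature before the last fully connected layer
# plays the same role as in the image branch.
audio_net = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 128), nn.ReLU(),           # penultimate fully connected layer
)
audio_feature = audio_net(spec.unsqueeze(0))  # shape: (1, 128)
```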
其中,提取出的图像特征和音频特征能够反映视频样本的特征。当视频样本只有一段视频片段时,提取出的图像特征和音频特征一起反映该段视频片段的特征。
104、将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签。
在一实施例中,分类模型包括两个模块,特征融合模块及特征分类模块。将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签的步骤,包括:
将图像特征及音频特征输入特征融合模块中进行特征融合,得到视频样本的视频特征;
将视频样本的视频特征输入特征分类模块中进行分类,得到对应视频样本的预测标签。
图像特征提取模型输出图像样本的图像特征以及音频特征提取模型输出音频样本的音频特征后,这些图像特征和音频特征不输入基础模型以外的其他任何算法,而是由基础模型中的图像特征提取模型和音频特征提取模型输出后,直接进入基础模型中的分类模型。分类模型接收到由图像特征提取模型提取的图像特征和由音频特征提取模型提取的音频特征,通过对这些图像特征和音频特征进行融合和分类,得到视频样本的预测标签。
其中,特征融合模块可以将多帧特征融合为一个特征,例如,可以将多帧图像特征融合为一个图像特征,将多帧音频特征融合为一个音频特征。
其中，在特征融合模块，可以采用NeXtVLAD算法。将多帧图像特征作为变量x输入NeXtVLAD算法中，x可以为x1、x2、x3等等。获取图像特征提取模块通过预训练得到的聚类中心C，具体的算法可以为：C与x1相减后的差值乘以x1对应的权重+C与x2相减后的差值乘以x2对应的权重+C与x3相减后的差值乘以x3对应的权重+…。以此，通过加权和与归一化的方式得到图像特征融合而成的视觉特征以及音频特征融合而成的声音特征。视觉特征和声音特征结合后，即形成该视频样本的视频特征。若图像样本和音频样本为视频样本中截取的一段视频片段划分而来，则通过NeXtVLAD算法得到的视觉特征和声音特征为对应于同一段视频片段的视觉特征和声音特征，视觉特征和声音特征结合形成的视频特征也即该段视频片段的视频特征。
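The aggregation described above can be sketched as a simplified VLAD-style pooling: each frame feature is softly assigned to learned cluster centres and the weighted residuals are summed and normalized. This is an assumption-laden stand-in for the full NeXtVLAD module (the grouping and dimension-reduction steps are omitted), and all class and parameter names are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleVLAD(nn.Module):
    """Aggregate per-frame features into one clip-level feature."""
    def __init__(self, dim, num_clusters):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))  # cluster centres C
        self.assign = nn.Linear(dim, num_clusters)                   # soft-assignment weights

    def forward(self, x):                                 # x: (frames, dim)
        w = F.softmax(self.assign(x), dim=1)              # (frames, clusters)
        residual = x.unsqueeze(1) - self.centers          # (frames, clusters, dim)
        vlad = (w.unsqueeze(-1) * residual).sum(0)        # weighted sum over frames
        return F.normalize(vlad.flatten(), dim=0)         # normalized clip feature

visual = SimpleVLAD(dim=2048, num_clusters=8)(torch.randn(8, 2048))  # fused image features
audio = SimpleVLAD(dim=128, num_clusters=8)(torch.randn(8, 128))     # fused audio features
video_feature = torch.cat([visual, audio])    # visual + sound combined as the video feature
```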
在一实施方式中,特征分类模块包括权重分配单元和权重加权单元。权重分配单元可使用SE Context Gate(SE上下文门,神经网络中的一个层),权重加权单元可使用MoE(Mixture of Experts,混合专家)模型。
其中,SE Context Gate用于压制视频特征中不重要的信息,凸显重要的信息,例如, 视频中女孩在路上滑滑板,则滑板与女孩是重要信息,行人与汽车为不重要信息。
请参阅图2,图2为本申请实施例提供的模型训练方法的原理示意图,其中为SE Context Gate的结构示意。
在一实施例中,将视频特征X输入至全连接层,对视频特征X进行批量归一化后,输入至ReLU激活函数。将ReLU激活函数的输出值输入至下一个全连接层再次进行批量归一化后,将再次批量归一化的结果输入至Sigmoid激活函数,由Sigmoid激活函数计算出视频特征X的特征权重,并将特征权重与视频特征X相乘后得到输出Y。从输出Y中可以得到视频特征及对应的特征权重。
其中,ReLU激活函数的表达式为:
f(x) = max(0, x)
当x<0时,ReLU硬饱和,而当x>0时,则不存在饱和问题。所以,ReLU能够在x>0时保持梯度不衰减。
其中,Sigmoid激活函数的表达式为:
f(x) = 1 / (1 + e^(-x))
Sigmoid激活函数具有指数函数形状,它在物理意义上接近生物神经元。此外,由于Sigmoid激活函数的值始终位于区间(0,1)中,因而Sigmoid激活函数的输出还可以用于表示概率,或用于输入的归一化。
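A minimal sketch of the gating structure described here (a fully connected layer, batch normalization, ReLU, a second fully connected layer, batch normalization, Sigmoid, then an element-wise product with the input feature); the hidden width and feature dimension are assumed hyper-parameters.

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """FC -> BN -> ReLU -> FC -> BN -> Sigmoid, then reweight the input feature."""
    def __init__(self, dim, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.bn2 = nn.BatchNorm1d(dim)

    def forward(self, x):                            # x: (batch, dim) video feature X
        h = torch.relu(self.bn1(self.fc1(x)))
        gate = torch.sigmoid(self.bn2(self.fc2(h)))  # feature weights in (0, 1)
        return gate * x                              # output Y = feature weights * X

gated = ContextGate(dim=1024)(torch.randn(4, 1024))  # 4 reweighted video features
```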
在一实施方式中,将视频特征输入SE Context Gate,计算得到视频特征的特征权重,其中,各项视频特征的特征权重不同,对于重要的信息,对应视频特征的特征权重大,而对于不重要的信息,对应视频特征的特征权重小。最终通过SE Context Gate将视频特征和视频特征对应的特征权重一起输出,并输入至MoE模型。
MoE模型接收到SE Context Gate传来的视频特征及对应的权重,利用MoE模型中的分类算法,将这些视频特征及对应的权重输入多个softmax分类器,对多个softmax分类器的分类结果进行加权投票,得到最终结果。
MoE模型中的分类算法可以为:
p(e|x) = Σ_{h∈He} g(h|x)·p(h|x)
g(h|x) = exp(w_h·x) / Σ_{h′∈He} exp(w_{h′}·x)
其中，类别h和类别h′均表示总类别He中的其中一个类别，p(h|x)为单个softmax分类器的分类结果，g(h|x)为分类结果p(h|x)对应的权重，p(e|x)为对MoE模型中多个softmax分类器的分类结果及对应的权重进行加权求和，最终得到的视频样本的类别。
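Read together, the formulas above amount to a mixture-of-experts layer: several softmax classifiers plus a softmax gate whose outputs weight the expert predictions. A compact sketch follows, with the number of experts and classes chosen purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfExperts(nn.Module):
    def __init__(self, dim, num_classes, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                                  # x: (batch, dim)
        gate = F.softmax(self.gate(x), dim=-1)             # weight of each expert
        preds = torch.stack([F.softmax(e(x), dim=-1) for e in self.experts], dim=1)
        # Weighted sum of the expert class distributions, i.e. p(e|x)
        return (gate.unsqueeze(-1) * preds).sum(dim=1)     # (batch, num_classes)

probs = MixtureOfExperts(dim=1024, num_classes=10)(torch.randn(4, 1024))
```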
105、根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视 频分类模型。
在一实施例中,使用损失函数计算预测标签与分类标签之间的差距。将预测标签与分类标签代入损失函数,得到损失值,若损失值满足预设条件则认为基础模型收敛。例如,随着基础模型的训练,损失函数输出的损失值越来越小,根据需求设定损失值阈值,当损失值小于损失值阈值时,认为基础模型收敛,基础模型的训练结果符合预期,将该收敛的基础模型作为用于视频分类的视频分类模型。
或者,在基础模型的迭代过程中,两次迭代之间的权值变化已经很小,预先设置权值阈值,当基础模型两次迭代之间的权值变化小于预设权值阈值时,认为基础模型收敛,基础模型的训练结果符合预期,将该收敛的基础模型作为用于视频分类的视频分类模型。
又或者,预先设置基础模型的迭代次数,当基础模型的迭代次数超过预设迭代次数时,停止迭代,并认为基础模型收敛,基础模型的训练结果符合预期,将该收敛的基础模型作为用于视频分类的视频分类模型。
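The three stopping rules just described (loss below a threshold, negligible weight change between iterations, or a maximum iteration count) can be folded into one training loop. The sketch below assumes a model taking (images, audios) and placeholder thresholds; it is not the exact training procedure of the embodiment.

```python
import torch

def train_until_converged(model, loader, loss_fn, optimizer,
                          loss_threshold=0.05, weight_eps=1e-6, max_epochs=100):
    prev = torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()
    for _ in range(max_epochs):                             # rule 3: iteration cap
        total = 0.0
        for images, audios, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images, audios), labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
        cur = torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()
        if total / max(len(loader), 1) < loss_threshold:    # rule 1: loss threshold
            break
        if (cur - prev).abs().max() < weight_eps:           # rule 2: weight change
            break
        prev = cur
    return model
```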
由上可知,本申请实施例提供的模型训练方法,通过获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本;构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型;通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征;将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签;根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。以此可以通过损失函数对基础模型进行训练,以得到准确率更高的视频分类模型,提升了视频分类的准确率。
请参阅图3,图3是本申请实施例提供的模型训练方法的第二种流程示意图。该模型训练方法的流程可以包括:
201、获取视频样本以及对应视频样本的分类标签。
其中,电子设备可以通过有线连接或无线连接的方式获取视频样本。视频样本的时间可以从几秒钟到几十小时不等。其中,标签能够体现视频样本的内容,同一个视频样本可以有多个对应的标签。
例如,视频样本的内容为女孩在街道上滑滑板,则该视频样本的标签可以为女孩、滑板、街道,而忽略周边不重要的行人、车辆、建筑等。获取视频标签的过程可称为视频打标,通过视频打标对视频的内容进行分类,可作为用户寻找自己感兴趣的视频及某些商家或者平台推荐视频的依据。其中,分类标签可以为人工设置(即通过人工打标的方式设置)的标签。
202、从视频样本中截取视频片段。
203、将视频片段划分为图像样本和音频样本。
在一实施方式中，在视频样本中截取视频片段，将视频片段划分为图像样本和音频样本。其中，视频片段可以为一段，也可以为多段。当从视频中截取一段视频片段时，将这一段视频片段划分为一个图像样本和一个音频样本。当从视频中截取多段视频片段时，将多段视频片段划分为多个图像样本和多个音频样本。
其中,图像样本为从视觉上观察视频样本的内容,例如,图像样本中显示人在弹钢琴;音频样本为从听觉上观察视频样本的内容,例如,钢琴响起的声音。
204、构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型。
其中,构建基础模型包括构建基础模型中的图像特征提取模型、音频特征提取模型和分类模型。可以通过构建基础模型,使得训练完成的基础模型能应用于电子设备如智能手机中,进而对智能手机的视频进行分类。
在一实施方式中,图像特征提取模型可以采用ResNet-101模型。ResNet-101模型是一种CNN(Convolution Neural Network,卷积神经网络)模型,具有101层网络隐藏层。ResNet-101模型通过使用多个有参层来学习输入输出之间的残差表示,而非像一般CNN网络那样使用有参层来直接尝试学习输入、输出之间的映射。使用有参层来直接学习残差比直接学习输入、输出间映射要容易得多(收敛速度更快),也有效得多(可通过使用更多的层来达到更高的分类精度)。
在一实施方式中,音频特征提取模型可以采用VGG(Oxford Visual Geometry Group)深度网络模型。VGG深度网络模型由5层卷积层、3层全连接层、softmax输出层构成,层与层之间使用max-pooling(最大池化)分开。其中,VGG深度网络模型的深度增加和小卷积核的使用对最终的音频特征提取效果有很大的作用。
205、根据第一数据集对图像特征提取模型进行预训练。
其中,第一数据集可以为ImageNet数据集。ImageNet是一个用于视觉对象识别软件研究的大型可视化数据库,包含超过1400万个图像,其中120万个图像分为1000个类别(大约100万个图像含边界框和注释)。经过预训练后的图像特征提取模型可用于从图像样本中提取出图像特征。通过将图像特征提取模型在ImageNet上进行数据处理,在训练结束时可以得到模型参数较好的ResNet-101模型,并将该特征提取预训练完成的ResNet-101模型作为图像特征提取模型,这样可以极大的缩短的图像特征提取模型的训练时间。
在一实施方式中,在根据第一数据集对图像特征提取模型进行预训练之前,还包括:
对第一数据集进行预处理;
根据第一数据集对图像特征提取模型进行预训练的步骤,包括:
根据预处理后的第一数据集对图像特征提取模型进行预训练。
在一实施方式中,可以通过对第一数据集进行数据增益来对第一数据集进行预处理。其中,数据增益的方式包括:对第一数据集中的初始图像进行随机变化。例如,对第一数据集中的初始图像进行水平镜像翻转、垂直镜像翻转、裁剪、亮度调整、饱和度调整和色相调整中的一种以上。然后,将预处理后的第一数据集输入图像特征提取模型,根据图像 特征提取模型输出的图像特征与第一数据集中原始图像自带的原始图像特征之间的差异,调整图像特征提取模型的各项参数,从而达到对图像特征提取模型预训练的目的。
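A sketch of the augmentations listed above using torchvision transforms; the probabilities, crop size, and jitter ranges are illustrative assumptions.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # horizontal mirror flip
    transforms.RandomVerticalFlip(p=0.5),     # vertical mirror flip
    transforms.RandomResizedCrop(224),        # random crop and rescale
    transforms.ColorJitter(brightness=0.2,    # brightness adjustment
                           saturation=0.2,    # saturation adjustment
                           hue=0.1),          # hue adjustment
    transforms.ToTensor(),
])
# `augment` is applied to each image of the first data set before pre-training.
```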
206、根据第二数据集对音频特征提取模型进行预训练。
其中,第二数据集可以为AudioSet数据集。AudioSet数据集包含了多类音频类别以及大量人工打标的声音片段,覆盖大范围的人类与动物声音、乐器与音乐流派声音、日常的环境声音等。
在一实施方式中,在根据第二数据集对音频特征提取模型进行预训练之前,还包括:
对第二数据集进行预处理;
根据第二数据集对音频特征提取模型进行预训练,包括:
根据预处理后的第二数据集对音频特征提取模型进行预训练。
在一实施方式中,可以通过对第二数据集中的声音片段进行短时傅里叶变换来对第二数据集进行预处理。其中,短时傅里叶变换是一种时谱分析方法,它通过时间窗内的一段信号来表示某一时刻的信号特征。在短时傅里叶变换过程中,窗的长度决定频谱图的时间分辨率和频率分辨率,窗长越长,截取的信号越长,信号越长,傅里叶变换后频率分辨率越高。
简单来说,短时傅里叶变换就是先把一个函数和窗函数进行相乘,然后再进行一维的傅里叶变换。并通过窗函数的滑动得到一系列的傅里叶变换结果,将这些结果排开便得到二维的表象。为方便处理,可以通过短时傅里叶变换将声音信号进行离散化处理。
在通过短时傅里叶变换得到第二数据集中声音片段的频谱图,并将频谱图输入至音频特征提取模型之后,根据音频特征提取模型输出的音频特征与第二数据集中声音片段自带的原始音频特征之间的差异,调整音频特征提取模型的各项参数,从而达到对音频特征提取模型预训练的目的。
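A short sketch of this short-time Fourier transform preprocessing with torch.stft; the window length and hop size, which set the time/frequency resolution trade-off described above, are assumed example values.

```python
import torch

waveform = torch.randn(16000)                 # 1 s of mono audio at 16 kHz (assumed)
window = torch.hann_window(400)               # 25 ms analysis window
spec = torch.stft(waveform, n_fft=400, hop_length=160, win_length=400,
                  window=window, return_complex=True)
magnitude = spec.abs()                        # (freq_bins, frames) spectrogram
log_spec = torch.log(magnitude + 1e-6)        # log spectrogram fed to the audio model
```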
207、根据预训练后的图像特征提取模型和预训练后的音频特征提取模型对分类模型进行预训练。
在一实施方式中,构建基础模型之后,对图像特征提取模型、音频特征提取模型和分类模型进行预训练。在预训练时,先分别对图像特征提取模型、音频特征提取模型进行预训练,然后,根据预训练后的图像特征提取模型和预训练后的音频特征提取模型对分类模型进行预训练。
其中,根据预训练后的图像特征提取模型和预训练后的音频特征提取模型对分类模型进行预训练包括:
将图像样本输入预训练后的图像特征提取模型,得到图像样本的图像特征;
将音频样本输入预训练后的音频特征提取模组,得到音频样本的音频特征;
将图像特征和音频特征输入分类模型进行分类,得到对应视频样本的预测标签;
根据预测标签与分类标签的差异对分类模型的参数进行调整,直至分类模型收敛。
其中,预测标签与分类标签的差异可以通过第一BCE Loss(损失函数)体现。将预测标签与分类标签输入第一损失函数,得到第一损失值。通过第一损失值体现预测标签与分类标签的差异。当第一损失值满足预设条件时,认为分类模型收敛。例如,当第一损失值小于第一预设阈值时,判定分类模型收敛,分类模型的预训练完成。
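A sketch of this pre-training step: both pre-trained extractors are frozen, only the classification model is updated with a BCE loss, and training stops once the average loss falls below a threshold. The classifier call signature and the threshold value are assumptions for illustration.

```python
import torch
import torch.nn as nn

def pretrain_classifier(image_model, audio_model, classifier, loader,
                        lr=1e-3, loss_threshold=0.1, max_epochs=50):
    for p in list(image_model.parameters()) + list(audio_model.parameters()):
        p.requires_grad = False                        # freeze both extractors
    bce = nn.BCEWithLogitsLoss()                       # first BCE loss
    opt = torch.optim.Adam(classifier.parameters(), lr=lr)
    for _ in range(max_epochs):
        total = 0.0
        for images, audios, labels in loader:          # labels: multi-hot tags
            with torch.no_grad():
                img_feat = image_model(images)
                aud_feat = audio_model(audios)
            logits = classifier(img_feat, aud_feat)
            loss = bce(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        if total / max(len(loader), 1) < loss_threshold:   # classifier deemed converged
            break
    return classifier
```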
208、通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征。
在图像特征提取模型、音频特征提取模型和分类模型的预训练完成后,将预训练后的图像特征提取模型、音频特征提取模型和分类模型进行端对端的联合训练。
在联合训练中,图像样本和音频样本输入基础模型中的图像特征提取模型、音频特征提取模型进行特征提取,将图像特征提取模型、音频特征提取模型的输出作为分类模型的输入,并最终从分类模型输出视频样本的预测标签。整个训练过程在基础模型内部完成,在联合训练的过程中,不借助除基础模型以外的其它算法。
在一实施例中，在联合训练时，首先将图像样本输入至图像特征提取模型如ResNet-101模型中进行图像特征提取，得到图像样本的图像特征，将音频样本输入至音频特征提取模型如VGG深度网络模型中进行音频特征提取，得到音频样本的音频特征。例如，将图像样本输入至ResNet-101模型，将ResNet-101模型的101层网络隐藏层中最后一层全连接前的特征（倒数第二层全连接层输出的特征）作为图像样本的图像特征；将音频样本输入至VGG深度网络模型，将VGG深度网络模型中最后一层全连接前的特征（倒数第二层全连接层输出的特征）作为音频样本的音频特征。
其中,图像特征可以为多帧图像特征,音频特征可以为多帧音频特征。当视频样本只有一个时,提取出的图像特征和音频特征一起代表同一段视频片段的特征。
209、将图像特征及音频特征输入特征融合模块中进行特征融合,得到视频样本的视频特征。
其中,将图像特征及音频特征输入特征融合模块中进行特征融合,得到视频样本的视频特征的步骤包括:
(1)将图像特征输入特征融合神经网络模型中进行特征融合,得到目标视频的视觉特征。
(2)将音频特征输入特征融合神经网络模型中进行特征融合,得到目标视频的声音特征。
(3)将视觉特征与声音特征结合为目标视频的视频特征。
特征融合模块将可以将多帧特征融合为一个特征,例如,将多帧图像特征融合为一个图像特征,将多帧音频特征融合为一个音频特征。
在特征融合模块，可以采用NeXtVLAD算法。将多帧图像特征作为变量x输入NeXtVLAD算法中，x可以为x1、x2、x3等等。获取图像特征提取模块通过预训练得到的聚类中心C，具体的算法可以为：C与x1相减后的差值乘以x1对应的权重+C与x2相减后的差值乘以x2对应的权重+C与x3相减后的差值乘以x3对应的权重+…。以此，通过加权和与归一化的方式得到多帧图像特征融合而成的视觉特征以及多帧音频特征融合而成的声音特征。视觉特征和声音特征结合后，即形成该视频样本的视频特征。若图像样本和音频样本为视频样本中截取的一段视频片段划分而来，则通过NeXtVLAD算法得到的视觉特征和声音特征为对应于同一段视频片段的视觉特征和声音特征，视觉特征和声音特征结合形成的视频特征也即该段视频片段的视频特征。
210、将视频特征输入权重分配单元中进行计算权重,得到视频特征对应的特征权重。
在一实施例中,分类模型包括特征融合模块和分类模块,分类模块包括权重分配单元和权重加权单元。将视频样本的视频特征输入特征分类模块中进行分类,得到对应视频样本的预测标签的步骤,包括:
将视频特征输入权重分配单元计算权重,得到视频特征对应的特征权重;
将视频特征及对应的特征权重输入权重加权单元中计算加权和,得到对应视频样本的预测标签。
在分类模块的权重分配单元中,可以使用SE Context Gate。SE Context Gate用于压制视频特征中不重要的信息,凸显重要的信息,例如,视频中女孩在路上滑滑板,则滑板与女孩是重要信息,行人与汽车为不重要信息。
请参阅图2,图2为本申请实施例提供的模型训练方法的原理示意图,其中为SE Context Gate的结构示意。
在一实施例中,将视频特征X输入至全连接层,对视频特征X进行批量归一化后,输入至ReLU激活函数。将ReLU激活函数的输出值输入至下一个全连接层再次进行批量归一化后,将再次批量归一化的结果输入至Sigmoid激活函数,由Sigmoid激活函数计算出视频特征X的特征权重,并将特征权重与视频特征X相乘后得到输出Y。从输出Y中可以得到视频特征及对应的特征权重。
其中,ReLU激活函数的表达式为:
f(x) = max(0, x)
当x<0时,ReLU硬饱和,而当x>0时,则不存在饱和问题。所以,ReLU能够在x>0时保持梯度不衰减。
其中,Sigmoid激活函数的表达式为:
f(x) = 1 / (1 + e^(-x))
Sigmoid激活函数具有指数函数形状,它在物理意义上接近生物神经元。此外,由于Sigmoid激活函数的值始终位于区间(0,1)中,因而Sigmoid激活函数的输出还可以用于表示概率,或用于输入的归一化。
在一实施方式中,将视频特征输入SE Context Gate,计算得到视频特征的特征权重,其中,各项视频特征的特征权重不同,对于重要的信息,对应视频特征的特征权重大,而对于不重要的信息,对应视频特征的特征权重小。最终通过SE Context Gate将视频特征和视频特征对应的特征权重一起输出,并输入至MoE模型。
211、将视频特征及对应的特征权重输入权重加权单元中计算加权和,得到对应视频样本的预测标签。
在一实施方式中,权重加权单元包括多个预设分类器,将视频特征及对应的特征权重输入权重加权单元中,得到对应视频样本的预测标签的步骤,包括:
将视频特征及对应的特征权重输入多个预设分类器中,得到多个分类结果及对应分类结果的权重;
根据多个分类结果及对应多个分类结果的权重计算加权和,得到对应视频样本的预测标签。
在分类模型的权重加权单元中,可以使用MoE模型。MoE模型包含多个softmax分类器在内的分类算法。MoE模型接收到SE Context Gate传来的视频特征及对应的权重,利用MoE模型中的分类算法,将这些视频特征及对应的权重输入多个softmax分类器,对多个softmax分类器的分类结果进行加权投票,得到最终结果。
MoE模型中的分类算法可以为:
p(e|x) = Σ_{h∈He} g(h|x)·p(h|x)
g(h|x) = exp(w_h·x) / Σ_{h′∈He} exp(w_{h′}·x)
其中，类别h和类别h′均表示总类别He中的其中一个类别，p(h|x)为单个softmax分类器的分类结果，g(h|x)为分类结果p(h|x)对应的权重，p(e|x)为对MoE模型中多个softmax分类器的分类结果及对应的权重进行加权求和，最终得到的视频样本的类别。
212、根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
在图像特征提取模型、音频特征提取模型与分类模型的联合训练过程中,预测标签与分类标签的差异不止是用于调整分类模型的参数,而是用于一并调整图像特征提取模型、音频特征提取模型与分类模型这三个模型的参数。
在一实施例中,使用第二损失函数计算预测标签与分类标签之间的差距。将预测标签与分类标签代入第二损失函数,得到第二损失值,若第二损失值满足预设条件则判定基础模型收敛。例如,随着基础模型的训练,第二损失函数输出的第二损失值越来越小,根据需求设置第二预设阈值,当第二损失值小于第二预设阈值时,认为基础模型收敛,基础模型的训练结果符合预期,将该收敛的基础模型作为用于视频分类的视频分类模型。
或者，在基础模型的迭代过程中，两次迭代之间的权值变化已经很小，预先设置第三预设阈值，当两次迭代之间的权值变化小于第三预设阈值时，认为基础模型收敛，基础模型的训练结果符合预期，将该收敛的基础模型作为用于视频分类的视频分类模型。
又或者,预先设置基础模型的迭代次数,当基础模型的迭代次数超过预设迭代次数时,停止迭代,并认为基础模型收敛,基础模型的训练结果符合预期,将该收敛的基础模型作为用于视频分类的视频分类模型。
请参阅图4,图4是本申请实施例提供的模型训练方法的第三种流程示意图。
在一实施例中,从经过人工打标的视频样本中截取视频片段,对视频片段进行图像帧采样和音频采样,将视频片段划分为图像样本和音频样本,对图像样本和音频样本分别进行预处理。其中,对图像样本的预处理包括对图像尺寸的缩放;对音频样本的预处理包括对音频样本的音频信号进行短时傅里叶变换。
然后,构建基础模型,基础模型中包括图像特征提取模型、音频特征提取模型和分类模型。将预处理后的图像样本输入图像特征提取模型提取出图像特征,将预处理后的音频样本输入至音频特征提取模型提取出音频特征,将由图像特征提取模型和音频特征提取模型输出的图像特征和音频特征输入到分类模型中。在分类模型中,分别通过分类模型中特征融合模型的图像部分对输入的图像特征进行融合,得到视频样本的视觉特征;通过分类模型中特征融合模型的音频部分对输入的音频特征进行融合,得到视频样本的声音特征;将视觉特征与声音特征结合为视频样本的视频特征,并将视频特征输入至权重分配单元,计算得到视频特征对应的权重。将视频特征及对应的权重输入至权重加权单元,得到视频样本的预测结果,即预测标签。
根据预测标签与人工打标的分类标签的差异,不断调整基础模型的参数,包括调整基础模型中图像特征提取模型、音频特征提取模型和分类模型(包括特征融合模型、权重分配单元、权重加权单元)的参数。在图4中,加粗的模块即为参与训练的模块,也即在联合训练过程中,根据预测标签与分类标签的差异调整参数的模块。
由上可知,本申请实施例提供的模型训练方法,通过获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本;构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型;通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征;将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签;根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。以此可以通过损失函数对基础模型进行训练,以得到准确率更高的视频分类模型,提升了视频分类的准确率。
请参阅图5,图5是本申请实施例提供的视频处理方法的流程示意图。该视频处理方法的流程可以包括:
301、接收视频处理请求。
其中,当电子设备接收到目标组件触控操作、预设语音操作或预设目标应用的开启指令等方式时触发生成视频处理请求。另外,电子设备还可以在间隔预设时长或者基于一定的触发规则去自动触发生成视频处理请求。例如,当电子设备检测到当前显示界面包括视频时,如检测到电子设备启动浏览器浏览视频时,可以自动触发生成视频处理请求,根据视频分类模型对当前目标视频进行分类。使得电子设备可以通过机器学习算法,自动生成目标视频的预测标签。
302、根据视频处理请求获取需要进行分类的目标视频,并将目标视频划分为目标图像和目标音频。
其中,目标视频可以是存储在电子设备中的视频,此时视频处理请求中包括用于指示目标视频所存储的位置的路径信息,电子设备可以通过该路径信息去获取到需要进行标签预测的目标视频。当然,当目标视频不为存储在电子设备中的视频时,电子设备可以根据视频处理请求通过有线连接或者无线连接的方式获取需要进行分类的目标视频。
在一实施方式中,在目标视频中截取视频片段,视频片段可以为一段,也可以为多段。当从目标视频中截取一段视频片段时,将这一段视频片段划分为一个目标图像和一个目标音频。当从目标视频中截取多段视频片段时,将多段视频片段划分为多个目标图像和多个目标音频。当只有一个目标图像和一个目标音频时,目标图像和目标音频对应目标视频中的同一个视频片段。
303、调用预先训练的视频分类模型。
其中,视频分类模型采用本实施例提供的模型训练方法训练得到。具体的模型训练过程可以参见上述实施例的相关描述,在此不再赘述。
304、将目标图像与目标音频输入视频分类模型进行分类,获得目标视频的分类标签。
其中,将目标图像与目标音频输入视频分类模型进行分类,以得到目标视频对应的分类标签。该分类标签可以代表目标视频的类别。
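An inference-time sketch of this step; the trained video_model, its (frames, audio) call signature, and the 0.5 decision threshold are assumptions for illustration, not details fixed by the embodiment.

```python
import torch

def classify_video(video_model, frames, audio, threshold=0.5, tag_names=None):
    """Run the trained video classification model on one target video clip.

    `frames` and `audio` are the tensors produced by the same frame sampling and
    audio extraction used at training time.
    """
    video_model.eval()
    with torch.no_grad():
        probs = video_model(frames, audio)               # predicted tag probabilities
    picked = (probs > threshold).nonzero(as_tuple=True)[-1].tolist()
    if tag_names is not None:
        return [tag_names[i] for i in picked]            # classification labels of the video
    return picked
```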
由上可知,本申请实施例提供的视频处理方法,通过接收视频处理请求;根据视频处理请求获取需要进行分类的目标视频,并将目标视频划分为目标图像和目标音频;调用预先训练的视频分类模型;将目标图像与目标音频输入视频分类模型进行分类,获得目标视频的分类标签;以此通过视频分类模型去对目标视频进行分类。
请参阅图6,图6为本申请实施例提供的模型训练装置400的第一种结构示意图。该模型训练装置可以包括第一获取模型401、构建模块402、提取模块403、分类模块404和调整模块405:
第一获取模块401,用于获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本;
构建模块402,用于构建基础模型,基础模型包括图像特征提取模型、音频特征提取 模型和分类模型;
提取模块403,用于通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征;
分类模块404,用于将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签;
调整模块405,用于根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
在一些实施方式中,第一获取模块401,具体用于从视频样本中截取视频片段;将视频片段划分为图像样本和音频样本。
在一些实施方式中,分类模块404,具体用于将图像特征及音频特征输入特征融合模块中进行特征融合,得到视频样本的视频特征;将视频样本的视频特征输入特征分类模块中进行分类,得到对应视频样本的预测标签。
在一些实施方式中,分类模块404,具体用于将图像特征输入特征融合模块中进行特征融合,得到视频样本的视觉特征;将音频特征输入特征融合模块中进行特征融合,得到视频样本的声音特征;将视觉特征与声音特征结合为视频样本的视频特征。
在一些实施方式中,特征分类模块包括权重分配单元和权重加权单元,分类模块404,具体用于将视频特征输入权重分配单元中计算权重,得到视频特征对应的特征权重;将视频特征及对应的特征权重输入权重加权单元中计算加权和,得到对应视频样本的预测标签。
在一些实施方式中,权重加权单元包括多个预设分类器,分类模块404,具体用于将视频特征及对应的特征权重输入多个预设分类器中,得到多个分类结果及对应的权重;根据多个分类结果及对应的权重计算加权和,得到对应视频样本的预测标签。
请参阅图7,图7为本申请实施例提供的模型训练装置的第二种结构示意图。在一些实施方式中,本申请实施例提供的模型训练装置还包括预训练模块406:
预训练模块406，用于根据第一数据集对图像特征提取模型进行预训练；根据第二数据集对音频特征提取模型进行预训练；根据预训练后的图像特征提取模型和预训练后的音频特征提取模型对分类模型进行预训练。
其中,在预训练模块406根据第一数据集对图像特征提取模型进行预训练之前,对第一数据集进行预处理。然后,根据预处理后的第一数据集对图像特征提取模型进行预训练。
由上可知,本申请实施例提供的模型训练装置,通过第一获取模型401获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本;构建模块402构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型;提取模块403通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征;分类模块404将图像特征以及音频特征输入分类模型进行分 类,得到对应视频样本的预测标签;调整模块405根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。以此可以通过损失函数对基础模型进行训练,以得到准确率更高的视频分类模型,提升了视频分类的准确率。
应当说明的是,本申请实施例提供的模型训练装置与上文实施例中的模型训练方法属于同一构思,在模型训练装置上可以运行模型训练方法实施例中提供的任一方法,其具体实现过程详见模型训练方法实施例,此处不再赘述。
请参阅图8,图8为本申请实施例提供的视频处理装置的结构示意图。该视频处理装置可以包括:接收模块501、第二获取模块502、调用模型503、预测模块504。
接收模块501,用于接收视频处理请求;
第二获取模块502,用于根据视频处理请求获取需要进行分类的目标视频,并将目标视频划分为目标图像和目标音频;
调用模块503,用于调用预先训练的视频分类模型;
预测模块504,用于将目标图像与目标音频输入视频分类模型进行标签预测,获得目标视频的目标标签;
其中,视频分类模型采用本申请实施例提供的模型训练方法训练得到。
由上可知,本申请实施例提供的视频处理装置,通过接收模块501接收视频处理请求;第二获取模块502根据视频处理请求获取需要进行分类的目标视频,并将目标视频划分为目标图像和目标音频;调用模块503调用预先训练的视频分类模型;预测模块504将目标图像与目标音频输入视频分类模型进行分类,获得目标视频的分类标签;以此通过视频分类模型去对目标视频进行分类。
应当说明的是,本申请实施例提供的视频处理装置与上文实施例中的视频处理方法属于同一构思,在视频处理装置上可以运行视频处理方法实施例中提供的任一方法,其具体实现过程详见视频处理方法实施例,此处不再赘述。
本申请实施例提供一种计算机可读的存储介质,其上存储有计算机程序,当其存储的计算机程序在计算机上执行时,使得计算机执行如本申请实施例提供的模型训练方法或视频处理方法。
其中,存储介质可以是磁碟、光盘、只读存储器(Read Only Memory,ROM,)或者随机存取器(Random Access Memory,RAM)等。
本申请实施例还提供一种电子设备,包括存储器,处理器,存储器中存储有计算机程序,处理器通过调用存储器中存储的计算机程序,用于执行如本申请实施例提供的模型训练方法或视频处理方法。
例如,上述电子设备可以是诸如平板电脑或者智能手机等移动终端。请参阅图9,图9为本申请实施例提供的电子设备的第一种结构示意图。
该电子设备600可以包括存储器601、处理器602等部件。本领域技术人员可以理解,图9中示出的电子设备结构并不构成对电子设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
存储器601可用于存储软件程序以及模块,处理器602通过运行存储在存储器601的计算机程序以及模块,从而执行各种功能应用以及数据处理。存储器601可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的计算机程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据电子设备的使用所创建的数据等。
处理器602是电子设备的控制中心,利用各种接口和线路连接整个电子设备的各个部分,通过运行或执行存储在存储器601内的应用程序,以及调用存储在存储器601内的数据,执行电子设备的各种功能和处理数据,从而对电子设备进行整体监控。
此外,存储器601可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器601还可以包括存储器控制器,以提供处理器602对存储器601的访问。
在本实施例中,电子设备中的处理器602会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行代码加载到存储器601中,并由处理器602来运行存储在存储器601中的应用程序,从而实现流程:
获取视频样本以及对应视频样本的分类标签,将视频样本划分为图像样本和音频样本;
构建基础模型,基础模型包括图像特征提取模型、音频特征提取模型和分类模型;
通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征;
将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签;
根据预测标签与分类标签的差异对图像特征提取模型、音频特征提取模型以及分类模型的参数进行调整,直至基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
在一些实施方式中,分类模型包括特征融合模块和特征分类模块,处理器502执行将图像特征以及音频特征输入分类模型进行分类,得到对应视频样本的预测标签时,可以执行:
将图像特征及音频特征输入特征融合模块中进行特征融合,得到视频样本的视频特征;
将视频样本的视频特征输入特征分类模块中进行分类,得到对应视频样本的预测标签。
在一些实施方式中,处理器602执行将图像特征及音频特征输入特征融合模块中进行特征融合,得到视频样本的视频特征时,可以执行:
将图像特征输入特征融合模块中进行特征融合,得到视频样本的视觉特征;
将音频特征输入特征融合模块中进行特征融合,得到视频样本的声音特征;
将视觉特征与声音特征结合为视频样本的视频特征。
在一些实施方式中,特征分类模块包括权重分配单元和权重加权单元,处理器602执行将视频样本的视频特征输入特征分类模块中进行分类,得到对应视频样本的预测标签时,可以执行:
将视频特征输入权重分配单元中计算权重,得到视频特征对应的特征权重;
将视频特征及对应的特征权重输入权重加权单元中计算加权和,得到对应视频样本的预测标签。
在一些实施方式中,权重加权单元包括多个预设分类器,处理器502执行将视频特征及对应的特征权重输入权重加权单元中,得到对应视频样本的预测标签时,可以执行:
将视频特征及对应的特征权重输入多个预设分类器中,得到多个分类结果及对应的权重;
根据多个分类结果及对应的权重计算加权和,得到对应视频样本的预测标签。
在一些实施方式中,处理器602执行将视频样本划分为图像样本和音频样本时,可以执行:
从视频样本中截取视频片段;
将视频片段划分为图像样本和音频样本。
在一些实施方式中,处理器602执行通过图像特征提取模型提取得到图像样本的图像特征,以及通过音频特征提取模型提取得到音频样本的音频特征之前,可以执行:
根据第一数据集对图像特征提取模型进行预训练;
根据第二数据集对音频特征提取模型进行预训练；
根据预训练后的图像特征提取模型和预训练后的音频特征提取模型对分类模型进行预训练。
在一些实施方式中,处理器602执行根据第一数据集对图像特征提取模型进行预训练之前,可以执行:
对第一数据集进行预处理;
根据第一数据集对图像特征提取模型进行预训练的步骤,包括:
根据预处理后的第一数据集对图像特征提取模型进行预训练。
在本实施例中,电子设备中的处理器602会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行代码加载到存储器601中,并由处理器601来运行存储在存储器601中的应用程序,从而实现流程:
接收视频处理请求;
根据视频处理请求获取需要进行分类的目标视频,并将目标视频划分为目标图像和目标音频;
调用预先训练的视频分类模型;
将目标图像与目标音频输入视频分类模型进行分类,获得目标视频的分类标签;
其中,视频分类模型采用本申请实施例提供的模型训练方法训练得到。
请参照图10，图10为本申请实施例提供的电子设备的第二种结构示意图，与图9所示电子设备的区别在于，电子设备还包括：显示器603、射频电路604、音频电路605以及电源606。其中，显示器603、射频电路604、音频电路605以及电源606分别与处理器602电性连接。
该显示器603可以用于显示由用户输入的信息或提供给用户的信息以及各种图形用户接口,这些图形用户接口可以由图形、文本、图标、视频和其任意组合来构成。显示器603可以包括显示面板,在某些实施方式中,可以采用液晶显示器(Liquid Crystal Display,LCD)、或者有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板。
射频电路604可以用于收发射频信号,以通过无线通信与网络设备或其他电子设备建立无线通讯,与网络设备或其他电子设备之间收发信号。
音频电路605可以用于通过扬声器、传声器提供用户与电子设备之间的音频接口。
电源606可以用于给电子设备600的各个部件供电。在一些实施例中,电源606可以通过电源管理系统与处理器602逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
尽管图10中未示出,电子设备600还可以包括摄像组件、蓝牙模块等,摄像组件可以包括图像处理电路,图像处理电路可以利用硬件和/或软件组件实现,可包括定义图像信号处理(Image Signal Processing)管线的各种处理单元。图像处理电路至少可以包括:多个摄像头、图像信号处理器(Image Signal Processor,ISP处理器)、控制逻辑器、图像存储器以及显示器等。其中每个摄像头至少可以包括一个或多个透镜和图像传感器。图像传感器可包括色彩滤镜阵列(如Bayer滤镜)。图像传感器可获取用图像传感器的每个成像像素捕捉的光强度和波长信息,并提供可由图像信号处理器处理的一组原始图像数据。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见上文针对模型训练方法/视频处理方法的详细描述,此处不再赘述。
本申请实施例提供的模型训练方法/视频处理方法装置与上文实施例中的模型训练方法/视频处理方法属于同一构思,在模型训练方法/视频处理方法装置上可以运行模型训练方法/视频处理方法实施例中提供的任一方法,其具体实现过程详见模型训练方法/视频处理方法实施例,此处不再赘述。
需要说明的是,对本申请实施例模型训练方法/视频处理方法而言,本领域普通技术人员可以理解实现本申请实施例模型训练方法/视频处理方法的全部或部分流程,是可以通过计算机程序来控制相关的硬件来完成,计算机程序可存储于一计算机可读取存储介质中,如存储在存储器中,并被至少一个处理器执行,在执行过程中可包括如模型训练方法/视频处理方法的实施例的流程。其中,的存储介质可为磁碟、光盘、只读存储器(ROM,Read  Only Memory)、随机存取记忆体(RAM,Random Access Memory)等。
对本申请实施例的模型训练方法/视频处理方法装置而言,其各功能模块可以集成在一个处理芯片中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中,存储介质譬如为只读存储器,磁盘或光盘等。
以上对本申请实施例所提供的一种模型训练方法、视频处理方法、装置、存储介质及电子设备进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上,本说明书内容不应理解为对本申请的限制。

Claims (19)

  1. 一种模型训练方法,包括:
    获取视频样本以及对应所述视频样本的分类标签,将所述视频样本划分为图像样本和音频样本;
    构建基础模型,所述基础模型包括图像特征提取模型、音频特征提取模型和分类模型;
    通过所述图像特征提取模型提取得到所述图像样本的图像特征,以及通过所述音频特征提取模型提取得到所述音频样本的音频特征;
    将所述图像特征以及所述音频特征输入所述分类模型进行分类,得到对应所述视频样本的预测标签;
    根据所述预测标签与所述分类标签的差异对所述图像特征提取模型、所述音频特征提取模型以及所述分类模型的参数进行调整,直至所述基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
  2. 根据权利要求1所述的方法,其中,所述分类模型包括特征融合模块和特征分类模块,所述将所述图像特征以及所述音频特征输入所述分类模型进行分类,得到对应所述视频样本的预测标签的步骤,包括:
    将所述图像特征及所述音频特征输入所述特征融合模块中进行特征融合,得到所述视频样本的视频特征;
    将所述视频样本的视频特征输入所述特征分类模块中进行分类,得到对应所述视频样本的预测标签。
  3. 根据权利要求2所述的方法,其中,所述将所述图像特征及所述音频特征输入所述特征融合模块中进行特征融合,得到所述视频样本的视频特征的步骤,包括:
    将所述图像特征输入所述特征融合模块中进行特征融合,得到所述视频样本的视觉特征;
    将所述音频特征输入所述特征融合模块中进行特征融合,得到所述视频样本的声音特征;
    将所述视觉特征与所述声音特征结合为所述视频样本的视频特征。
  4. 根据权利要求2所述的方法,其中,所述特征分类模块包括权重分配单元和权重加权单元,所述将所述视频样本的视频特征输入所述特征分类模块中进行分类,得到对应所述视频样本的预测标签的步骤,包括:
    将所述视频特征输入所述权重分配单元中计算权重,得到所述视频特征对应的特征权重;
    将所述视频特征及对应的特征权重输入所述权重加权单元中计算加权和,得到对应所述视频样本的预测标签。
  5. 根据权利要求4所述的方法,其中,所述权重加权单元包括多个预设分类器,所述 将所述视频特征及对应的特征权重输入所述权重加权单元中计算加权和,得到对应所述视频样本的预测标签的步骤,包括:
    将所述视频特征及对应的特征权重输入所述多个预设分类器中,得到多个分类结果及对应的权重;
    根据所述多个分类结果及对应的权重计算加权和,得到对应所述视频样本的预测标签。
  6. 根据权利要求1所述的方法,其中,所述将所述视频样本划分为图像样本和音频样本的步骤,包括:
    从所述视频样本中截取视频片段;
    将所述视频片段划分为图像样本和音频样本。
  7. 根据权利要求1所述的方法,其中,所述通过所述图像特征提取模型提取得到所述图像样本的图像特征,以及通过所述音频特征提取模型提取得到所述音频样本的音频特征的步骤之前,还包括:
    根据第一数据集对所述图像特征提取模型进行预训练;
    根据第二数据集对所述音频特征提取模型进行预训练;
    根据预训练后的所述图像特征提取模型和预训练后的所述音频特征提取模型对所述分类模型进行预训练。
  8. 根据权利要求7所述的方法,其中,所述根据第一数据集对所述图像特征提取模型进行预训练的步骤之前,还包括:
    对所述第一数据集进行预处理;
    所述根据第一数据集对所述图像特征提取模型进行预训练的步骤,包括:
    根据预处理后的第一数据集对所述图像特征提取模型进行预训练。
  9. 一种视频处理方法,包括:
    接收视频处理请求;
    根据所述视频处理请求获取需要进行分类的目标视频,并将所述目标视频划分为目标图像和目标音频;
    调用预先训练的视频分类模型;
    将所述目标图像与所述目标音频输入所述视频分类模型进行分类,获得所述目标视频的分类标签;
    其中，所述视频分类模型采用权利要求1至8任一项所述的模型训练方法训练得到。
  10. 一种模型训练装置,包括:
    第一获取模块,用于获取视频样本以及对应所述视频样本的分类标签,将所述视频样本划分为图像样本和音频样本;
    构建模块,用于构建基础模型,所述基础模型包括图像特征提取模型、音频特征提取模型和分类模型;
    提取模块,用于通过所述图像特征提取模型提取得到所述图像样本的图像特征,以及通过所述音频特征提取模型提取得到所述音频样本的音频特征;
    分类模块,用于将所述图像特征以及所述音频特征输入所述分类模型进行分类,得到对应所述视频样本的预测标签;
    调整模块,用于根据所述预测标签与所述分类标签的差异对所述图像特征提取模型、所述音频特征提取模型以及所述分类模型的参数进行调整,直至所述基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
  11. 一种视频处理装置,包括:
    接收模块,用于接收视频处理请求;
    第二获取模块,用于根据所述视频处理请求获取需要进行分类的目标视频,并将所述目标视频划分为目标图像和目标音频;
    调用模块,用于调用预先训练的视频分类模型;
    预测模块,用于将所述目标图像与所述目标音频输入所述视频分类模型进行分类,获得所述目标视频的分类标签;
    其中,所述视频分类模型采用权利要求1至8任一项所述的模型训练方法训练得到。
  12. 一种存储介质,其中,所述存储介质中存储有计算机程序,当所述计算机程序在计算机上运行时,使得所述计算机执行权利要求1至8任一项所述的模型训练方法或权利要求9所述的视频处理方法。
  13. 一种电子设备,其中,所述电子设备包括处理器和存储器,所述存储器中存储有计算机程序,所述处理器通过调用所述存储器中存储的所述计算机程序,用于执行:
    获取视频样本以及对应所述视频样本的分类标签,将所述视频样本划分为图像样本和音频样本;
    构建基础模型,所述基础模型包括图像特征提取模型、音频特征提取模型和分类模型;
    通过所述图像特征提取模型提取得到所述图像样本的图像特征,以及通过所述音频特征提取模型提取得到所述音频样本的音频特征;
    将所述图像特征以及所述音频特征输入所述分类模型进行分类,得到对应所述视频样本的预测标签;
    根据所述预测标签与所述分类标签的差异对所述图像特征提取模型、所述音频特征提取模型以及所述分类模型的参数进行调整,直至所述基础模型收敛,并将收敛的基础模型作为用于视频分类的视频分类模型。
  14. 根据权利要求13所述的电子设备,其中,所述分类模型包括特征融合模块和特征分类模块,所述处理器用于执行:
    将所述图像特征及所述音频特征输入所述特征融合模块中进行特征融合,得到所述视频样本的视频特征;
    将所述视频样本的视频特征输入所述特征分类模块中进行分类,得到对应所述视频样本的预测标签。
  15. 根据权利要求14所述的电子设备,其中,所述处理器用于执行:
    将所述图像特征输入所述特征融合模块中进行特征融合,得到所述视频样本的视觉特征;
    将所述音频特征输入所述特征融合模块中进行特征融合,得到所述视频样本的声音特征;
    将所述视觉特征与所述声音特征结合为所述视频样本的视频特征。
  16. 根据权利要求14所述的电子设备,其中,所述特征分类模块包括权重分配单元和权重加权单元,所述处理器用于执行:
    将所述视频特征输入所述权重分配单元中计算权重,得到所述视频特征对应的特征权重;
    将所述视频特征及对应的特征权重输入所述权重加权单元中计算加权和,得到对应所述视频样本的预测标签。
  17. 根据权利要求16所述的电子设备,其中,所述权重加权单元包括多个预设分类器,所述处理器用于执行:
    将所述视频特征及对应的特征权重输入所述多个预设分类器中,得到多个分类结果及对应的权重;
    根据所述多个分类结果及对应的权重计算加权和,得到对应所述视频样本的预测标签。
  18. 根据权利要求13所述的电子设备,其中,所述处理器用于执行:
    根据第一数据集对所述图像特征提取模型进行预训练;
    根据第二数据集对所述音频特征提取模型进行预训练；
    根据预训练后的所述图像特征提取模型和预训练后的所述音频特征提取模型对所述分类模型进行预训练。
  19. 一种电子设备,其中,所述电子设备包括处理器和存储器,所述存储器中存储有计算机程序,所述处理器通过调用所述存储器中存储的所述计算机程序,用于执行:
    接收视频处理请求;
    根据所述视频处理请求获取需要进行分类的目标视频,并将所述目标视频划分为目标图像和目标音频;
    调用预先训练的视频分类模型;
    将所述目标图像与所述目标音频输入所述视频分类模型进行分类,获得所述目标视频的分类标签;
    其中,所述视频分类模型采用权利要求1至8任一项所述的模型训练方法训练得到。
PCT/CN2020/071021 2020-01-08 2020-01-08 模型训练方法、视频处理方法、装置、存储介质及电子设备 WO2021138855A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/071021 WO2021138855A1 (zh) 2020-01-08 2020-01-08 模型训练方法、视频处理方法、装置、存储介质及电子设备
CN202080084487.XA CN114787844A (zh) 2020-01-08 2020-01-08 模型训练方法、视频处理方法、装置、存储介质及电子设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/071021 WO2021138855A1 (zh) 2020-01-08 2020-01-08 模型训练方法、视频处理方法、装置、存储介质及电子设备

Publications (1)

Publication Number Publication Date
WO2021138855A1 true WO2021138855A1 (zh) 2021-07-15

Family

ID=76787433

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/071021 WO2021138855A1 (zh) 2020-01-08 2020-01-08 模型训练方法、视频处理方法、装置、存储介质及电子设备

Country Status (2)

Country Link
CN (1) CN114787844A (zh)
WO (1) WO2021138855A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408664A (zh) * 2021-07-20 2021-09-17 北京百度网讯科技有限公司 训练方法、分类方法、装置、电子设备以及存储介质
CN113672252A (zh) * 2021-07-23 2021-11-19 浙江大华技术股份有限公司 模型升级方法、视频监控系统、电子设备和可读存储介质
CN113807281A (zh) * 2021-09-23 2021-12-17 深圳信息职业技术学院 图像检测模型的生成方法、检测方法、终端及存储介质
CN113806536A (zh) * 2021-09-14 2021-12-17 广州华多网络科技有限公司 文本分类方法及其装置、设备、介质、产品
CN114528762A (zh) * 2022-02-17 2022-05-24 腾讯科技(深圳)有限公司 一种模型训练方法、装置、设备和存储介质
CN116996680A (zh) * 2023-09-26 2023-11-03 上海视龙软件有限公司 一种用于视频数据分类模型训练的方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125223A1 (en) * 2003-12-05 2005-06-09 Ajay Divakaran Audio-visual highlights detection using coupled hidden markov models
US20150082349A1 (en) * 2013-09-13 2015-03-19 Arris Enterprises, Inc. Content Based Video Content Segmentation
CN104866596A (zh) * 2015-05-29 2015-08-26 北京邮电大学 一种基于自动编码器的视频分类方法及装置
CN109344781A (zh) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 一种基于声音视觉联合特征的视频内表情识别方法
CN109840509A (zh) * 2019-02-15 2019-06-04 北京工业大学 网络直播视频中不良主播的多层次协同识别方法及装置
CN110263217A (zh) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 一种视频片段标签识别方法及装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125223A1 (en) * 2003-12-05 2005-06-09 Ajay Divakaran Audio-visual highlights detection using coupled hidden markov models
US20150082349A1 (en) * 2013-09-13 2015-03-19 Arris Enterprises, Inc. Content Based Video Content Segmentation
CN104866596A (zh) * 2015-05-29 2015-08-26 北京邮电大学 一种基于自动编码器的视频分类方法及装置
CN109344781A (zh) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 一种基于声音视觉联合特征的视频内表情识别方法
CN109840509A (zh) * 2019-02-15 2019-06-04 北京工业大学 网络直播视频中不良主播的多层次协同识别方法及装置
CN110263217A (zh) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 一种视频片段标签识别方法及装置

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408664A (zh) * 2021-07-20 2021-09-17 北京百度网讯科技有限公司 训练方法、分类方法、装置、电子设备以及存储介质
CN113408664B (zh) * 2021-07-20 2024-04-16 北京百度网讯科技有限公司 训练方法、分类方法、装置、电子设备以及存储介质
CN113672252A (zh) * 2021-07-23 2021-11-19 浙江大华技术股份有限公司 模型升级方法、视频监控系统、电子设备和可读存储介质
CN113806536A (zh) * 2021-09-14 2021-12-17 广州华多网络科技有限公司 文本分类方法及其装置、设备、介质、产品
CN113806536B (zh) * 2021-09-14 2024-04-16 广州华多网络科技有限公司 文本分类方法及其装置、设备、介质、产品
CN113807281A (zh) * 2021-09-23 2021-12-17 深圳信息职业技术学院 图像检测模型的生成方法、检测方法、终端及存储介质
CN113807281B (zh) * 2021-09-23 2024-03-29 深圳信息职业技术学院 图像检测模型的生成方法、检测方法、终端及存储介质
CN114528762A (zh) * 2022-02-17 2022-05-24 腾讯科技(深圳)有限公司 一种模型训练方法、装置、设备和存储介质
CN114528762B (zh) * 2022-02-17 2024-02-20 腾讯科技(深圳)有限公司 一种模型训练方法、装置、设备和存储介质
CN116996680A (zh) * 2023-09-26 2023-11-03 上海视龙软件有限公司 一种用于视频数据分类模型训练的方法及装置
CN116996680B (zh) * 2023-09-26 2023-12-12 上海视龙软件有限公司 一种用于视频数据分类模型训练的方法及装置

Also Published As

Publication number Publication date
CN114787844A (zh) 2022-07-22

Similar Documents

Publication Publication Date Title
WO2021138855A1 (zh) 模型训练方法、视频处理方法、装置、存储介质及电子设备
WO2020221278A1 (zh) 视频分类方法及其模型的训练方法、装置和电子设备
CN109543714B (zh) 数据特征的获取方法、装置、电子设备及存储介质
CN111209970B (zh) 视频分类方法、装置、存储介质及服务器
Lee et al. Multi-view automatic lip-reading using neural network
WO2020108396A1 (zh) 视频分类的方法以及服务器
CN112069414A (zh) 推荐模型训练方法、装置、计算机设备及存储介质
Xu et al. A multi-view CNN-based acoustic classification system for automatic animal species identification
CN111491187B (zh) 视频的推荐方法、装置、设备及存储介质
CN112818861A (zh) 一种基于多模态上下文语义特征的情感分类方法及系统
CN113421547B (zh) 一种语音处理方法及相关设备
KR101617649B1 (ko) 영상의 관심 구간 추천 시스템 및 방법
WO2022048239A1 (zh) 音频的处理方法和装置
CN113515942A (zh) 文本处理方法、装置、计算机设备及存储介质
WO2021092808A1 (zh) 网络模型的训练方法、图像的处理方法、装置及电子设备
WO2023207541A1 (zh) 一种语音处理方法及相关设备
US20230134852A1 (en) Electronic apparatus and method for providing search result related to query sentence
CN113656563A (zh) 一种神经网络搜索方法及相关设备
WO2021134485A1 (zh) 视频评分方法、装置、存储介质及电子设备
CN116958852A (zh) 视频与文本的匹配方法、装置、电子设备和存储介质
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
Carneiro et al. FaVoA: Face-Voice association favours ambiguous speaker detection
CN114912540A (zh) 迁移学习方法、装置、设备及存储介质
CN115129975A (zh) 推荐模型训练方法、推荐方法、装置、设备及存储介质
KR20210078122A (ko) 정보 처리 방법 및 정보 처리 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911370

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911370

Country of ref document: EP

Kind code of ref document: A1