CN114787844A - Model training method, video processing method, device, storage medium and electronic equipment - Google Patents

Model training method, video processing method, device, storage medium and electronic equipment

Info

Publication number
CN114787844A
CN114787844A (application CN202080084487.XA)
Authority
CN
China
Prior art keywords
video
model
classification
audio
image
Prior art date
Legal status
Pending
Application number
CN202080084487.XA
Other languages
Chinese (zh)
Inventor
郭子亮
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd, Shenzhen Huantai Technology Co Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Publication of CN114787844A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a model training method, a video processing method, an apparatus, a storage medium, and an electronic device. The model training method comprises the following steps: obtaining a video sample and a classification label corresponding to the video sample, and dividing the video sample into an image sample and an audio sample; constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model; extracting image features of the image sample through the image feature extraction model, and extracting audio features of the audio sample through the audio feature extraction model; inputting the image features and the audio features into the classification model for classification to obtain a prediction label corresponding to the video sample; and adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.

Description

Model training method, video processing method, device, storage medium and electronic equipment
Technical Field
The present application relates to the field of machine learning, and in particular, to a model training method, a video processing method, an apparatus, a storage medium, and an electronic device.
Background
With the rapid development of the mobile internet and the rapid popularization of smartphones, visual content data such as images and videos are increasing day by day, and video tags have emerged accordingly. A video tag is a hierarchical classification label formed by performing multi-dimensional analysis on a video, such as scene classification, person recognition, speech recognition, and text recognition. The process of acquiring video tags can be called video marking. The content of videos is classified through video marking, which can serve as a basis for users to find videos of interest and for merchants or platforms to recommend videos.
At present, videos are marked manually. However, the manual marking method is inefficient.
Disclosure of Invention
The embodiments of the present application provide a model training method, a video processing method, an apparatus, a storage medium, and an electronic device, so that the efficiency of video marking can be improved by training a model.
In a first aspect, an embodiment of the present application provides a model training method, including:
acquiring a video sample and a classification label corresponding to the video sample, and dividing the video sample into an image sample and an audio sample;
constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model;
extracting the image characteristics of the image sample through the image characteristic extraction model, and extracting the audio characteristics of the audio sample through the audio characteristic extraction model;
inputting the image characteristics and the audio characteristics into the classification model for classification to obtain a prediction label corresponding to the video sample;
and adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until a basic model converges, and taking the converged basic model as a video classification model for video classification.
In a second aspect, an embodiment of the present application provides a video processing method, including:
receiving a video processing request;
acquiring a target video to be classified according to the video processing request, and dividing the target video into a target image and a target audio;
calling a pre-trained video classification model;
inputting the target image and the target audio into the video classification model for classification to obtain a classification label of the target video;
wherein the video classification model is trained using the model training method provided by the embodiments.
In a third aspect, an embodiment of the present application provides a model training apparatus, including:
the first acquisition module is used for acquiring the video samples and the classification labels corresponding to the video samples, and dividing the video samples into image samples and audio samples;
the building module is used for building a basic model, and the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model;
the extraction module is used for extracting the image characteristics of the image sample through the image characteristic extraction model and extracting the audio characteristics of the audio sample through the audio characteristic extraction model;
the classification module is used for inputting the image characteristics and the audio characteristics into the classification model for classification to obtain a prediction label corresponding to the video sample;
and the adjusting module is used for adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.
In a fourth aspect, an embodiment of the present application provides a video processing apparatus, including:
the receiving module is used for receiving a video processing request;
the second acquisition module is used for acquiring a target video to be classified according to the video processing request and dividing the target video into a target image and a target audio;
the calling module is used for calling a pre-trained video classification model;
the prediction module is used for inputting the target image and the target audio into the video classification model for classification to obtain a classification label of the target video;
wherein the video classification model is trained using the model training method provided by the embodiments.
In a fifth aspect, an embodiment of the present application provides a storage medium, on which a computer program is stored, wherein when the computer program is executed on a computer, the computer is caused to execute the model training method or the video processing method provided by the embodiment.
In a sixth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to, by calling the computer program stored in the memory, execute:
the method comprises the steps of obtaining a video sample and a classification label corresponding to the video sample, and dividing the video sample into an image sample and an audio sample;
constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model;
extracting image characteristics of the image sample through an image characteristic extraction model, and extracting audio characteristics of the audio sample through an audio characteristic extraction model;
inputting the image characteristics and the audio characteristics into a classification model for classification to obtain a prediction label corresponding to the video sample;
and adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.
In a seventh aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor, by calling the computer program stored in the memory, is configured to execute:
receiving a video processing request;
acquiring a target video to be classified according to the video processing request, and dividing the target video into a target image and a target audio;
calling a pre-trained video classification model;
inputting the target image and the target audio into the video classification model for classification to obtain a classification label of the target video;
wherein the video classification model is trained using the model training method provided by the embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a first schematic flowchart of a model training method provided in an embodiment of the present application.
Fig. 2 is a schematic diagram of a model training method provided in an embodiment of the present application.
Fig. 3 is a second schematic flowchart of a model training method provided in an embodiment of the present application.
Fig. 4 is a third schematic flowchart of a model training method provided in an embodiment of the present application.
Fig. 5 is a schematic flowchart of a video processing method according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a first model training device provided in an embodiment of the present application.
Fig. 7 is a schematic structural diagram of a second model training apparatus provided in an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a first electronic device according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a second electronic device according to an embodiment of the present application.
Detailed Description
Referring to the drawings, wherein like reference numbers refer to like elements, the principles of the present application are illustrated as being implemented in a suitable computing environment. The following description is based on illustrated embodiments of the application and should not be taken as limiting the application with respect to other embodiments that are not detailed herein.
The embodiment of the application provides a model training method. The execution subject of the model training method may be the model training apparatus provided in the embodiment of the present application, or an electronic device integrated with the model training apparatus. The model training device can be realized in a hardware or software mode, and the electronic equipment can be equipment with processing capability provided with a processor, such as a smart phone, a tablet computer, a palm computer, a notebook computer or a desktop computer. For convenience of description, the electronic device will be exemplified as an execution subject of the model training method.
Referring to fig. 1, fig. 1 is a first flowchart illustrating a model training method according to an embodiment of the present disclosure. The flow of the model training method can comprise the following steps:
101. the method comprises the steps of obtaining a video sample and a classification label corresponding to the video sample, and dividing the video sample into an image sample and an audio sample.
The electronic device can acquire the video sample through a wired connection or a wireless connection. The labels can reflect the content of the video samples, and the same video sample can have a plurality of corresponding labels.
For example, if the content of the video sample is a girl skateboarding on a street, the labels of the video sample may be girl, skateboard, and street, while unimportant surrounding pedestrians, vehicles, buildings, and the like are ignored. The process of acquiring video tags can be called video marking. The content of videos is classified through video marking, which can serve as a basis for users to find videos of interest and for merchants or platforms to recommend videos. The classification label may be a manually set label (i.e., set by a manual marking method).
In one embodiment, a video segment is cut out of a video sample, and the video segment is divided into an image sample and an audio sample. The video clip may be a single segment or multiple segments. When a video segment is cut out of a video, the video segment is divided into an image sample and an audio sample. When a plurality of video segments are cut out of a video, the plurality of video segments are divided into a plurality of image samples and a plurality of audio samples.
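For illustration only, the following is a minimal Python sketch of one way to split a video clip into sampled image frames and an audio track using the ffmpeg command-line tool; it is not the implementation disclosed in this application, and the file names, frame rate, and audio sample rate are assumptions.

```python
# Illustrative sketch: split a video clip into image frames and an audio track with ffmpeg.
import subprocess
from pathlib import Path

def split_video(video_path: str, out_dir: str, fps: int = 1, sample_rate: int = 16000) -> None:
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    # Sample image frames at `fps` frames per second.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}",
         str(out / "frames" / "%05d.jpg")],
        check=True,
    )
    # Extract the audio track as mono 16-bit PCM at `sample_rate` Hz.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
         "-ar", str(sample_rate), "-acodec", "pcm_s16le", str(out / "audio.wav")],
        check=True,
    )

# Example: split_video("sample.mp4", "sample_out")
```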
Wherein the image sample is the visually observed content of the video sample, for example, an image sample showing a person playing a piano; the audio sample is the aurally observed content of the video sample, for example, the sound of the piano being played.
102. And constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model.
The constructed basic model comprises an image feature extraction model, an audio feature extraction model, and a classification model. By constructing the basic model, the trained basic model can be applied to an electronic device such as a smartphone to classify videos on the device.
In one embodiment, the image feature extraction model may employ the ResNet-101 model. The ResNet-101 model is a CNN (Convolutional Neural Network) model with 101 hidden layers. The ResNet-101 model uses multiple parameter layers to learn the residual representation between input and output, rather than attempting to learn the mapping between input and output directly with parameter layers as in typical CNN networks. Using parameter layers to directly learn the residual is much easier (faster convergence) and much more effective (higher classification accuracy can be achieved by using more layers) than directly learning the mapping between input and output.
In one embodiment, the audio feature extraction model may use a VGG (Oxford Visual Geometry Group) deep network model. The VGG deep network model is composed of 5 convolutional layers, 3 fully-connected layers, and a softmax output layer, with max-pooling between the layers. The increased depth of the VGG deep network model and its use of small convolution kernels contribute greatly to the final audio feature extraction performance.
103. And extracting the image characteristics of the image sample through an image characteristic extraction model, and extracting the audio characteristics of the audio sample through an audio characteristic extraction model.
In one embodiment, before the image features of the image sample are extracted by the image feature extraction model and the audio features of the audio sample are extracted by the audio feature extraction model, the image feature extraction model, the audio feature extraction model, and the classification model are each pre-trained. The image features of the image sample are then extracted by the pre-trained image feature extraction model, and the audio features of the audio sample are extracted by the pre-trained audio feature extraction model.
In one embodiment, the image sample is input into the image feature extraction model, such as a pre-trained ResNet-101 model, for image feature extraction, so as to obtain the image features of the image sample. For example, the image sample may be input into the pre-trained ResNet-101 model, and the features before the last fully-connected layer among the 101 hidden layers of the ResNet-101 model (i.e., the features output by the penultimate fully-connected layer) may be used as the image features of the image sample.
In an embodiment, the audio sample is input into the audio feature extraction model, such as a pre-trained VGG deep network model, for audio feature extraction, so as to obtain the audio features of the audio sample. For example, the audio sample is input into the pre-trained VGG deep network model, and the features before the last fully-connected layer in the VGG deep network model (i.e., the features output by the penultimate fully-connected layer) are taken as the audio features of the audio sample.
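For illustration only, the following minimal sketch, assuming PyTorch and torchvision, shows how features can be taken from just before the last fully-connected layer. torchvision's ResNet-101 stands in for the image branch; a torchvision VGG is used only as a placeholder for the audio branch, since the VGG described in this application operates on audio spectrograms.

```python
# Illustrative sketch: truncate backbones before the last fully-connected layer.
import torch
import torch.nn as nn
from torchvision import models

image_backbone = models.resnet101(weights=None)   # pre-trained weights could be loaded instead
image_backbone.fc = nn.Identity()                 # drop the final FC layer -> 2048-d features
image_backbone.eval()

audio_backbone = models.vgg16(weights=None)       # stand-in only; the patent's VGG takes spectrograms
audio_backbone.classifier = nn.Sequential(*list(audio_backbone.classifier.children())[:-1])
audio_backbone.eval()                             # classifier now ends at the penultimate FC -> 4096-d

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)          # 8 sampled frames (placeholder data)
    image_features = image_backbone(frames)       # shape: (8, 2048)
    spectrogram_images = torch.randn(8, 3, 224, 224)
    audio_features = audio_backbone(spectrogram_images)  # shape: (8, 4096)
```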
Wherein the extracted image features and audio features can reflect the features of the video sample. When the video sample has only one video segment, the extracted image features and audio features together reflect the features of the segment.
104. And inputting the image characteristics and the audio characteristics into a classification model for classification to obtain a prediction label corresponding to the video sample.
In one embodiment, the classification model includes two modules, a feature fusion module and a feature classification module. Inputting the image characteristics and the audio characteristics into a classification model for classification to obtain a prediction label corresponding to the video sample, wherein the method comprises the following steps:
inputting the image characteristics and the audio characteristics into a characteristic fusion module for characteristic fusion to obtain video characteristics of the video sample;
and inputting the video characteristics of the video samples into a characteristic classification module for classification to obtain the prediction labels corresponding to the video samples.
After the image feature extraction model outputs the image features of the image sample and the audio feature extraction model outputs the audio features of the audio sample, the image features and the audio features are not input into any algorithm outside the basic model; instead, after being output by the image feature extraction model and the audio feature extraction model in the basic model, they directly enter the classification model in the basic model. The classification model receives the image features extracted by the image feature extraction model and the audio features extracted by the audio feature extraction model, and obtains the prediction label of the video sample by fusing and classifying the image features and the audio features.
The feature fusion module may fuse the multi-frame features into one feature, for example, may fuse the multi-frame image features into one image feature, and fuse the multi-frame audio features into one audio feature.
In the feature fusion module, the NeXtVLAD algorithm may be used. The multi-frame image features are input into the NeXtVLAD algorithm as a variable x, where x may be x1, x2, x3, and so on. A cluster center C, obtained through pre-training of the image feature extraction module, is acquired, and the specific computation may be: the difference between x1 and C multiplied by the weight corresponding to x1, plus the difference between x2 and C multiplied by the weight corresponding to x2, plus the difference between x3 and C multiplied by the weight corresponding to x3, and so on. In this way, the visual feature fused from the image features and the sound feature fused from the audio features are obtained through weighted summation and normalization. The visual feature and the sound feature are combined to form the video feature of the video sample. If the image sample and the audio sample are obtained by dividing a video segment cut from the video sample, the visual feature and the sound feature obtained through the NeXtVLAD algorithm correspond to the same video segment, and the video feature formed by combining them is the video feature of that video segment.
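For illustration only, the following is a simplified, single-cluster PyTorch sketch of the weighted-residual aggregation described above; the real NeXtVLAD uses grouped features and many cluster centers, and all shapes and the way the per-frame weights are produced here are assumptions.

```python
# Illustrative sketch: fuse frame-level features into one feature via a weighted residual sum.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedResidualPool(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.center = nn.Parameter(torch.randn(dim))   # cluster center C
        self.assign = nn.Linear(dim, 1)                # produces a weight per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_frames, dim) frame-level features
        weights = torch.softmax(self.assign(x), dim=0)   # (num_frames, 1)
        residuals = x - self.center                      # (num_frames, dim)
        pooled = (weights * residuals).sum(dim=0)        # weighted sum over frames
        return F.normalize(pooled, dim=0)                # L2 normalization

frame_features = torch.randn(8, 2048)
video_visual_feature = WeightedResidualPool(2048)(frame_features)  # shape: (2048,)
```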
In one embodiment, the feature classification module includes a weight assignment unit and a weight weighting unit. The weight assignment unit may use an SE Context Gate (a gating layer in a neural network), and the weight weighting unit may use a MoE (Mixture of Experts) model.
The SE Context Gate is used to suppress the unimportant information in the video feature and highlight the important information. For example, if a girl is skateboarding on a road in the video, the skateboard and the girl are important information, while pedestrians and cars are unimportant information.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a model training method according to an embodiment of the present application, where the diagram illustrates a structure of a SE Context Gate.
In one embodiment, the video feature X is input to a fully-connected layer and batch-normalized, and the result is passed to a ReLU activation function. The output of the ReLU activation function is input to the next fully-connected layer and batch-normalized again; the result of this second batch normalization is passed to a Sigmoid activation function, which computes the feature weight of the video feature X. The feature weight is multiplied by the video feature X to obtain the output Y. From the output Y, the video features and the corresponding feature weights can be obtained.
Wherein, the expression of the ReLU activation function is:
f(x) = max(0, x)
when x <0, ReLU is hard saturated, and when x >0, there is no saturation problem. Therefore, ReLU can keep the gradient unattenuated when x > 0.
The expression of the Sigmoid activation function is:
f(x) = 1 / (1 + e^(-x))
the Sigmoid activation function has an exponential function shape, which is physically close to a biological neuron. Furthermore, since the value of the Sigmoid activation function is always located in the interval (0,1), the output of the Sigmoid activation function can also be used to represent the probability, or for normalization of the input.
In one embodiment, the video features are input into the SE Context Gate, and the feature weights of the video features are calculated. The feature weights differ across video features: for important information, the corresponding feature weight is large, and for unimportant information, the corresponding feature weight is small. Finally, the SE Context Gate outputs the video features together with their corresponding feature weights, which are then input into the MoE model.
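For illustration only, a minimal PyTorch sketch of the gating structure described above (fully-connected layer, batch normalization, ReLU, fully-connected layer, batch normalization, Sigmoid, then element-wise multiplication with the input) is given below; the layer sizes and bottleneck dimension are assumptions.

```python
# Illustrative sketch: a context-gate style layer that reweights the video feature.
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.BatchNorm1d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim),
            nn.BatchNorm1d(dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) video features; the sigmoid output acts as per-dimension weights.
        return x * self.gate(x)

y = ContextGate(1024)(torch.randn(4, 1024))   # shape: (4, 1024)
```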
The MoE model receives the video features and the corresponding weights transmitted by the SE Context Gate, inputs the video features and the corresponding weights into a plurality of softmax classifiers by using a classification algorithm in the MoE model, and performs weighted voting on classification results of the softmax classifiers to obtain a final result.
The classification algorithm in the MoE model may be:
p(h|x) = exp(s_h(x)) / Σ_{h'∈He} exp(s_{h'}(x))
p(e|x) = Σ_k g_k(x) · p_k(h|x)
wherein the category h and the category h' both represent one of the total categories He, s_h(x) is the score of category h, p(h|x) is the classification result of a single softmax classifier, g_k(x) is the weight corresponding to the k-th softmax classifier, and p(e|x), obtained by weighted summation of the classification results of the plurality of softmax classifiers in the MoE model with their corresponding weights, gives the finally obtained category of the video sample.
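For illustration only, the following minimal PyTorch sketch shows a mixture-of-experts classifier of the kind described above: several softmax classifiers whose outputs are combined by a softmax gate through weighted summation; the number of experts and all dimensions are assumptions.

```python
# Illustrative sketch: weighted voting over several softmax classifiers (experts).
import torch
import torch.nn as nn

class MoEClassifier(nn.Module):
    def __init__(self, dim: int, num_classes: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) gated video features
        gate_weights = torch.softmax(self.gate(x), dim=-1)            # (batch, num_experts)
        expert_probs = torch.stack(
            [torch.softmax(e(x), dim=-1) for e in self.experts], dim=1
        )                                                              # (batch, num_experts, num_classes)
        return (gate_weights.unsqueeze(-1) * expert_probs).sum(dim=1)  # weighted sum -> p(e|x)

probs = MoEClassifier(dim=1024, num_classes=100)(torch.randn(4, 1024))  # shape: (4, 100)
```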
105. And adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.
In one embodiment, the distance between the predicted label and the classification label is calculated using a loss function. Substituting the prediction label and the classification label into a loss function to obtain a loss value, and if the loss value meets a preset condition, considering that the basic model is converged. For example, as the base model is trained, the loss value output by the loss function is smaller and smaller, a loss value threshold is set according to the requirement, when the loss value is smaller than the loss value threshold, the base model is considered to be converged, the training result of the base model is in accordance with the expectation, and the converged base model is used as a video classification model for video classification.
Alternatively, during the iteration of the basic model, a weight threshold is preset; when the weight change between two successive iterations of the basic model is smaller than the preset weight threshold, the basic model is considered to have converged, the training result of the basic model meets expectations, and the converged basic model is used as the video classification model for video classification.
Or the iteration times of the basic model are preset, when the iteration times of the basic model exceed the preset iteration times, the iteration is stopped, the basic model is considered to be converged, the training result of the basic model is in accordance with expectation, and the converged basic model is used as a video classification model for video classification.
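For illustration only, the following Python sketch checks the three convergence criteria mentioned above (loss threshold, minimal weight change between iterations, and maximum number of iterations); the threshold values are placeholders, not values from this application.

```python
# Illustrative sketch: convergence test combining the three criteria described above.
import torch

def has_converged(loss: float, params_prev, params_curr, step: int,
                  loss_thresh: float = 1e-3, delta_thresh: float = 1e-6,
                  max_steps: int = 100000) -> bool:
    if loss < loss_thresh:                     # criterion 1: loss below the loss threshold
        return True
    delta = sum((p - q).abs().sum().item()     # criterion 2: weights barely change between iterations
                for p, q in zip(params_curr, params_prev))
    if delta < delta_thresh:
        return True
    return step >= max_steps                   # criterion 3: iteration budget exhausted
```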
As can be seen from the above, in the model training method provided in the embodiment of the present application, the video sample is divided into the image sample and the audio sample by obtaining the video sample and the classification label corresponding to the video sample; constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model; extracting the image characteristics of the image sample through an image characteristic extraction model, and extracting the audio characteristics of the audio sample through an audio characteristic extraction model; inputting the image characteristics and the audio characteristics into a classification model for classification to obtain a prediction label corresponding to a video sample; and adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification. Therefore, the basic model can be trained through the loss function, so that a video classification model with higher accuracy is obtained, and the accuracy of video classification is improved.
Referring to fig. 3, fig. 3 is a schematic flow chart of a model training method according to an embodiment of the present application. The process of the model training method can comprise the following steps:
201. and acquiring the video sample and the classification label corresponding to the video sample.
The electronic device can acquire the video sample through a wired or wireless connection. The duration of the video sample may vary from a few seconds to tens of hours. The labels can reflect the content of the video sample, and the same video sample can have a plurality of corresponding labels.
For example, if the content of the video sample is a girl skateboarding on a street, the labels of the video sample may be girl, skateboard, and street, while unimportant surrounding pedestrians, vehicles, buildings, and the like are ignored. The process of acquiring video tags can be called video marking. The content of videos is classified through video marking, which can serve as a basis for users to find videos of interest and for merchants or platforms to recommend videos. The classification label may be a manually set label (i.e., set by a manual marking method).
202. A video segment is cut from a video sample.
203. The video segment is divided into image samples and audio samples.
In one embodiment, a video segment is cut from the video sample, and the video segment is divided into an image sample and an audio sample. The video clip may be a single segment or multiple segments. When one video segment is cut out of a video, the video segment is divided into an image sample and an audio sample. When a plurality of video segments are cut out of a video, the plurality of video segments are divided into a plurality of image samples and a plurality of audio samples.
Wherein the image sample is the visually observed content of the video sample, for example, an image sample showing a person playing a piano; the audio sample is the aurally observed content of the video sample, for example, the sound of the piano being played.
204. And constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model.
The constructed basic model comprises an image feature extraction model, an audio feature extraction model, and a classification model. By constructing the basic model, the trained basic model can be applied to an electronic device such as a smartphone to classify videos on the device.
In one embodiment, the image feature extraction model may employ the ResNet-101 model. The ResNet-101 model is a CNN (Convolutional Neural Network) model with 101 hidden layers. The ResNet-101 model uses multiple parameter layers to learn the residual representation between input and output, rather than attempting to learn the mapping between input and output directly with parameter layers as in typical CNN networks. Using parameter layers to directly learn the residual is much easier (faster convergence) and much more effective (higher classification accuracy can be achieved by using more layers) than directly learning the mapping between input and output.
In one embodiment, the audio feature extraction model may adopt a VGG (Oxford Visual Geometry Group) deep network model. The VGG deep network model is composed of 5 convolutional layers, 3 fully-connected layers, and a softmax output layer, with max-pooling between the layers. The increased depth of the VGG deep network model and its use of small convolution kernels contribute greatly to the final audio feature extraction performance.
205. And pre-training the image feature extraction model according to the first data set.
Wherein the first data set may be the ImageNet data set. ImageNet is a large visual database for visual object recognition software research, containing over 14 million images, of which about 1.2 million are divided into 1000 categories (about 1 million images include bounding boxes and annotations). The pre-trained image feature extraction model can be used for extracting image features from the image sample. By pre-training the image feature extraction model on ImageNet, a ResNet-101 model with better model parameters can be obtained after training, and using this pre-trained ResNet-101 model as the image feature extraction model can greatly shorten the training time of the image feature extraction model.
In one embodiment, before pre-training the image feature extraction model according to the first data set, the method further includes:
preprocessing the first data set;
the pre-training step of the image feature extraction model according to the first data set comprises:
and pre-training the image feature extraction model according to the pre-processed first data set.
In an embodiment, the first data set may be preprocessed by applying data augmentation to the first data set. The data augmentation includes randomly varying the initial images in the first data set, for example, applying one or more of horizontal mirror flipping, vertical mirror flipping, cropping, brightness adjustment, saturation adjustment, and hue adjustment to the initial images. The preprocessed first data set is then input into the image feature extraction model, and the parameters of the image feature extraction model are adjusted according to the difference between the image features output by the image feature extraction model and the original image features of the original images in the first data set, thereby pre-training the image feature extraction model.
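For illustration only, the following minimal sketch, assuming torchvision, composes the random image augmentations listed above; the probabilities and jitter ranges are assumptions.

```python
# Illustrative sketch: random image augmentations for pre-training data.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                               # random cropping and resizing
    transforms.RandomHorizontalFlip(p=0.5),                          # horizontal mirror flip
    transforms.RandomVerticalFlip(p=0.5),                            # vertical mirror flip
    transforms.ColorJitter(brightness=0.2, saturation=0.2, hue=0.1), # brightness/saturation/hue adjustment
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # apply to a PIL image from the data set
```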
206. The audio feature extraction model is pre-trained according to the second data set.
Wherein the second data set may be the AudioSet data set. The AudioSet data set comprises a plurality of audio categories and a large number of manually labeled sound clips, covering a wide range of human and animal sounds, musical instrument and music genre sounds, everyday environmental sounds, and the like.
In one embodiment, before pre-training the audio feature extraction model according to the second data set, the method further comprises:
preprocessing the second data set;
pre-training the audio feature extraction model according to a second data set, comprising:
and pre-training the audio feature extraction model according to the pre-processed second data set.
In an embodiment, the second data set may be preprocessed by applying a short-time Fourier transform to the sound clips in the second data set. The short-time Fourier transform is a time-frequency analysis method that represents the signal characteristics at a certain moment by a section of the signal within a time window. In the short-time Fourier transform, the window length determines the time resolution and frequency resolution of the spectrogram: the longer the window, the longer the intercepted signal, and the higher the frequency resolution after the Fourier transform (at the cost of time resolution).
Briefly, the short-time Fourier transform multiplies the signal by a window function and then applies a one-dimensional Fourier transform. A series of Fourier transform results is obtained by sliding the window function, and arranging these results yields a two-dimensional representation. For convenience of processing, the sound signal may be discretized for the short-time Fourier transform.
After the spectrograms of the sound clips in the second data set are obtained through the short-time Fourier transform and input into the audio feature extraction model, the parameters of the audio feature extraction model are adjusted according to the difference between the audio features output by the audio feature extraction model and the original audio features of the sound clips in the second data set, thereby pre-training the audio feature extraction model.
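For illustration only, the following minimal PyTorch sketch converts a mono waveform into a log-magnitude spectrogram via a short-time Fourier transform; the window length, hop length, and FFT size are assumptions.

```python
# Illustrative sketch: short-time Fourier transform of an audio waveform into a spectrogram.
import torch

def log_spectrogram(waveform: torch.Tensor, n_fft: int = 400, hop: int = 160) -> torch.Tensor:
    # waveform: (num_samples,) mono audio signal
    window = torch.hann_window(n_fft)
    stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      win_length=n_fft, window=window, return_complex=True)
    return torch.log(stft.abs() + 1e-6)       # shape: (n_fft // 2 + 1, num_frames)

spec = log_spectrogram(torch.randn(16000))    # 1 s of 16 kHz audio (placeholder data)
```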
207. And pre-training the classification model according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
In one embodiment, after the base model is constructed, the image feature extraction model, the audio feature extraction model, and the classification model are pre-trained. During pre-training, the image feature extraction model and the audio feature extraction model are pre-trained respectively, and then the classification model is pre-trained according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
Pre-training the classification model according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model comprises the following steps:
inputting the image sample into a pre-trained image feature extraction model to obtain the image features of the image sample;
inputting the audio sample into the pre-trained audio feature extraction model to obtain the audio features of the audio sample;
inputting the image characteristics and the audio characteristics into a classification model for classification to obtain a prediction label corresponding to the video sample;
and adjusting parameters of the classification model according to the difference between the prediction label and the classification label until the classification model is converged.
Wherein the difference between the prediction label and the classification label can be measured by a first BCE loss (binary cross-entropy loss function). The prediction label and the classification label are input into the first loss function to obtain a first loss value; the difference between the prediction label and the classification label is reflected by the first loss value. When the first loss value meets a preset condition, the classification model is considered to have converged. For example, when the first loss value is smaller than a first preset threshold, the classification model is determined to have converged, and the pre-training of the classification model is completed.
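For illustration only, the following minimal PyTorch sketch shows a pre-training step of this kind: a simple classifier over concatenated image and audio features is optimized with a BCE loss between the prediction and the ground-truth labels. The classifier, feature dimensions, optimizer, and label count are assumptions, not the classification model disclosed in this application.

```python
# Illustrative sketch: pre-train a classifier on frozen image/audio features with a BCE loss.
import torch
import torch.nn as nn

classifier = nn.Linear(2048 + 4096, 100)          # stand-in for the classification model
criterion = nn.BCEWithLogitsLoss()                # multi-label binary cross-entropy loss
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

def pretrain_step(image_feat, audio_feat, labels):
    # image_feat: (batch, 2048), audio_feat: (batch, 4096), labels: (batch, 100) float 0/1 values
    logits = classifier(torch.cat([image_feat, audio_feat], dim=-1))
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()                            # compare against a preset threshold
```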
208. And extracting the image characteristics of the image sample through an image characteristic extraction model, and extracting the audio characteristics of the audio sample through an audio characteristic extraction model.
After the pre-training of the image feature extraction model, the audio feature extraction model and the classification model is completed, the pre-trained image feature extraction model, the audio feature extraction model and the classification model are subjected to end-to-end combined training.
In the joint training, the image sample and the audio sample are input into an image feature extraction model and an audio feature extraction model in a basic model for feature extraction, the output of the image feature extraction model and the output of the audio feature extraction model are used as the input of a classification model, and finally a prediction label of the video sample is output from the classification model. The whole training process is completed in the basic model, and other algorithms except the basic model are not used in the combined training process.
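For illustration only, the following minimal PyTorch sketch shows end-to-end joint training in which a single optimizer covers the image feature extraction model, the audio feature extraction model, and the classification model, so that one loss adjusts the parameters of all three. The module interfaces and hyperparameters are assumptions.

```python
# Illustrative sketch: one optimizer and one loss update all three sub-models jointly.
import itertools
import torch
import torch.nn as nn

def make_joint_optimizer(image_model: nn.Module, audio_model: nn.Module,
                         classifier: nn.Module, lr: float = 1e-5):
    params = itertools.chain(image_model.parameters(),
                             audio_model.parameters(),
                             classifier.parameters())
    return torch.optim.Adam(params, lr=lr)

def joint_step(image_model, audio_model, classifier, optimizer, criterion,
               frames, spectrograms, labels) -> float:
    image_feat = image_model(frames)              # frame-level image features
    audio_feat = audio_model(spectrograms)        # frame-level audio features
    logits = classifier(image_feat, audio_feat)   # fuse and classify inside the base model
    loss = criterion(logits, labels)              # e.g. nn.BCEWithLogitsLoss()
    optimizer.zero_grad()
    loss.backward()                               # gradients reach all three sub-models
    optimizer.step()
    return loss.item()
```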
In an embodiment, during the joint training, the image sample is first input into the image feature extraction model, such as the ResNet-101 model, for image feature extraction to obtain the image features of the image sample, and the audio sample is input into the audio feature extraction model, such as the VGG deep network model, for audio feature extraction to obtain the audio features of the audio sample. For example, the image sample is input into the ResNet-101 model, and the features before the last fully-connected layer among the 101 hidden layers of the ResNet-101 model (the features output by the penultimate fully-connected layer) are used as the image features of the image sample; the audio sample is input into the VGG deep network model, and the features before the last fully-connected layer in the VGG deep network model (the features output by the penultimate fully-connected layer) are taken as the audio features of the audio sample.
The image features may be multi-frame image features, and the audio features may be multi-frame audio features. When the video sample has only one video segment, the extracted image features and audio features together represent the features of that same video segment.
209. And inputting the image characteristics and the audio characteristics into a characteristic fusion module for characteristic fusion to obtain the video characteristics of the video sample.
Inputting the image features and the audio features into the feature fusion module for feature fusion to obtain the video features of the video sample includes:
(1) and inputting the image characteristics into the characteristic fusion neural network model for characteristic fusion to obtain the visual characteristics of the target video.
(2) And inputting the audio features into the feature fusion neural network model for feature fusion to obtain the sound features of the target video.
(3) And combining the visual features and the sound features into video features of the target video.
The feature fusion module can fuse multi-frame features into one feature, for example, fuse multi-frame image features into one image feature, and fuse multi-frame audio features into one audio feature.
In the feature fusion module, the NeXtVLAD algorithm may be employed. The multi-frame image features are input into the NeXtVLAD algorithm as a variable x, where x may be x1, x2, x3, and so on. A cluster center C, obtained through pre-training of the image feature extraction module, is acquired, and the specific computation may be: the difference between x1 and C multiplied by the weight corresponding to x1, plus the difference between x2 and C multiplied by the weight corresponding to x2, plus the difference between x3 and C multiplied by the weight corresponding to x3, and so on. In this way, the visual feature formed by fusing the multi-frame image features and the sound feature formed by fusing the multi-frame audio features are obtained through weighted summation and normalization. The visual feature and the sound feature are combined to form the video feature of the video sample. If the image sample and the audio sample are obtained by dividing a video segment cut from the video sample, the visual feature and the sound feature obtained through the NeXtVLAD algorithm correspond to the same video segment, and the video feature formed by combining them is the video feature of that video segment.
210. And inputting the video features into a weight distribution unit to calculate the weight, so as to obtain the feature weight corresponding to the video features.
In one embodiment, the classification model comprises a feature fusion module and a feature classification module, and the feature classification module comprises a weight distribution unit and a weight weighting unit. Inputting the video features of the video sample into the feature classification module for classification to obtain the prediction label corresponding to the video sample includes:
inputting the video features into a weight distribution unit to calculate weights, and obtaining feature weights corresponding to the video features;
and inputting the video features and the corresponding feature weights into a weight weighting unit to calculate a weighted sum, so as to obtain a prediction label corresponding to the video sample.
In the weight distribution unit of the feature classification module, an SE Context Gate may be used. The SE Context Gate is used for suppressing unimportant information in the video features and highlighting important information. For example, if a girl is skateboarding on a road in the video, the skateboard and the girl are important information, while pedestrians and cars are unimportant information.
Referring to fig. 2, fig. 2 is a schematic diagram of a model training method according to an embodiment of the present application, in which a structure of an SE Context Gate is illustrated.
In one embodiment, the video feature X is input to a fully-connected layer and batch-normalized, and the result is passed to a ReLU activation function. The output of the ReLU activation function is input to the next fully-connected layer and batch-normalized again; the result of this second batch normalization is passed to a Sigmoid activation function, which computes the feature weight of the video feature X. The feature weight is multiplied by the video feature X to obtain the output Y. From the output Y, the video features and the corresponding feature weights can be obtained.
Wherein, the expression of the ReLU activation function is:
f(x) = max(0, x)
when x <0, the ReLU is hard saturated, and when x >0, there is no saturation problem. Therefore, ReLU can keep the gradient unattenuated when x > 0.
The expression of the Sigmoid activation function is:
f(x) = 1 / (1 + e^(-x))
sigmoid activation functions have an exponential function shape, which is close to biological neurons in a physical sense. Furthermore, since the value of the Sigmoid activation function is always located in the interval (0,1), the output of the Sigmoid activation function can also be used to represent probabilities, or for normalization of the inputs.
In one embodiment, the video features are input into the SE Context Gate, and the feature weights of the video features are calculated. The feature weights differ across video features: for important information, the corresponding feature weight is large, and for unimportant information, the corresponding feature weight is small. Finally, the SE Context Gate outputs the video features together with their corresponding feature weights, which are then input into the MoE model.
211. And inputting the video features and the corresponding feature weights into a weight weighting unit to calculate a weighted sum, so as to obtain a prediction label corresponding to the video sample.
In one embodiment, the weighting unit includes a plurality of preset classifiers, and the step of inputting the video features and the corresponding feature weights into the weighting unit to obtain the prediction labels corresponding to the video samples includes:
inputting the video features and the corresponding feature weights into a plurality of preset classifiers to obtain a plurality of classification results and weights corresponding to the classification results;
and calculating a weighted sum according to the classification results and the weights corresponding to the classification results to obtain a prediction label corresponding to the video sample.
In the weight weighting unit of the classification model, a MoE model may be used. The MoE model comprises a classification algorithm including a plurality of softmax classifiers. The MoE model receives the video features and the corresponding weights transmitted by the SE Context Gate, inputs the video features and the corresponding weights into a plurality of softmax classifiers by using a classification algorithm in the MoE model, and performs weighted voting on classification results of the softmax classifiers to obtain a final result.
The classification algorithm in the MoE model may be:
p(h|x) = exp(s_h(x)) / Σ_{h'∈He} exp(s_{h'}(x))
p(e|x) = Σ_k g_k(x) · p_k(h|x)
wherein the category h and the category h' both represent one of the total categories He, s_h(x) is the score of category h, p(h|x) is the classification result of a single softmax classifier, g_k(x) is the weight corresponding to the k-th softmax classifier, and p(e|x), obtained by weighted summation of the classification results of the plurality of softmax classifiers in the MoE model with their corresponding weights, gives the finally obtained category of the video sample.
212. And adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.
In the joint training process of the image feature extraction model, the audio feature extraction model and the classification model, the difference between the prediction label and the classification label is not only used for adjusting the parameters of the classification model, but also used for adjusting the parameters of the three models, namely the image feature extraction model, the audio feature extraction model and the classification model.
In one embodiment, the distance between the prediction label and the classification label is calculated using a second loss function. And substituting the prediction label and the classification label into a second loss function to obtain a second loss value, and judging the convergence of the basic model if the second loss value meets the preset condition. For example, as the base model is trained, the second loss value output by the second loss function becomes smaller and smaller, the second preset threshold is set according to the requirement, when the second loss value is smaller than the second preset threshold, the base model is considered to be converged, the training result of the base model is in accordance with the expectation, and the converged base model is used as the video classification model for video classification.
Alternatively, during the iteration of the basic model, a third preset threshold is set in advance; when the weight change between two successive iterations is smaller than the third preset threshold, the basic model is considered to have converged, the training result of the basic model meets expectations, and the converged basic model is used as the video classification model for video classification.
Or the iteration times of the basic model are preset, when the iteration times of the basic model exceed the preset iteration times, the iteration is stopped, the basic model is considered to be converged, the training result of the basic model is in accordance with expectation, and the converged basic model is used as a video classification model for video classification.
Referring to fig. 4, fig. 4 is a third flowchart illustrating a model training method according to an embodiment of the present disclosure.
In one embodiment, video segments are cut from manually marked video samples, image frame sampling and audio sampling are performed on the video segments, the video segments are divided into image samples and audio samples, and the image samples and the audio samples are preprocessed respectively. Wherein the pre-processing of the image sample comprises scaling the image size; the pre-processing of the audio samples comprises a short-time fourier transform of the audio signal of the audio samples.
And then, constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model. Inputting the preprocessed image sample into the image feature extraction model to extract image features, inputting the preprocessed audio sample into the audio feature extraction model to extract audio features, and inputting the image features and the audio features output by the image feature extraction model and the audio feature extraction model into the classification model. In the classification model, respectively fusing the input image characteristics through the image part of the characteristic fusion model in the classification model to obtain the visual characteristics of the video sample; fusing the input audio features through an audio part of a feature fusion model in the classification model to obtain the sound features of the video sample; and combining the visual features and the sound features into video features of the video sample, inputting the video features into a weight distribution unit, and calculating to obtain weights corresponding to the video features. And inputting the video features and the corresponding weights into a weight weighting unit to obtain a prediction result of the video sample, namely a prediction label.
The parameters of the basic model are continuously adjusted according to the difference between the prediction label and the manually marked classification label, including the parameters of the image feature extraction model, the audio feature extraction model, and the classification model (comprising the feature fusion model, the weight distribution unit, and the weight weighting unit) in the basic model. In fig. 4, the bolded modules are the modules that participate in training, that is, the modules whose parameters are adjusted according to the difference between the prediction label and the classification label during the joint training.
As can be seen from the above, in the model training method provided in the embodiment of the present application, the video sample is divided into the image sample and the audio sample by obtaining the video sample and the classification label corresponding to the video sample; constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model; extracting the image characteristics of the image sample through an image characteristic extraction model, and extracting the audio characteristics of the audio sample through an audio characteristic extraction model; inputting the image characteristics and the audio characteristics into a classification model for classification to obtain a prediction label corresponding to the video sample; and adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification. Therefore, the basic model can be trained through the loss function, so that a video classification model with higher accuracy is obtained, and the accuracy of video classification is improved.
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating a video processing method according to an embodiment of the present disclosure. The flow of the video processing method can comprise the following steps:
301. A video processing request is received.
When the electronic device detects a touch operation on a target component, a preset voice command, a preset start instruction of a target application, or the like, it triggers the generation of a video processing request. In addition, the electronic device may also automatically trigger the generation of a video processing request at preset time intervals or based on a certain trigger rule. For example, when the electronic device detects that the current display interface includes a video, such as when it detects that a browser is started to browse a video, the electronic device may automatically trigger the generation of a video processing request, so as to classify the current target video according to the video classification model. The electronic device can then automatically generate the prediction label of the target video through the machine learning algorithm.
302. And acquiring a target video needing to be classified according to the video processing request, and dividing the target video into a target image and a target audio.
The target video may be a video stored in the electronic device; in this case, the video processing request includes path information indicating the location where the target video is stored, and the electronic device may acquire the target video requiring label prediction according to the path information. Of course, when the target video is not stored in the electronic device, the electronic device may acquire the target video to be classified through a wired or wireless connection according to the video processing request.
In an embodiment, a video clip is cut out from the target video, and the video clip may be one segment or multiple segments. When a video segment is cut out from a target video, the video segment is divided into a target image and a target audio. When multiple segments of video clips are cut out from a target video, the multiple segments of video clips are divided into multiple target images and multiple target audios. When there is only one target image and one target audio, the target image and the target audio correspond to the same video clip in the target video.
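As a rough illustration, a single video clip can be split into sampled image frames and an audio track as sketched below, assuming OpenCV for frame sampling and an ffmpeg executable on the system path; the sampling step and the output file name are hypothetical.

    import cv2
    import subprocess

    def split_video_clip(clip_path, audio_out="clip_audio.wav", frame_step=30):
        # Sample one frame every `frame_step` frames as the target images.
        frames = []
        capture = cv2.VideoCapture(clip_path)
        index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if index % frame_step == 0:
                frames.append(frame)
            index += 1
        capture.release()
        # Extract the audio track of the same clip as the target audio.
        subprocess.run(["ffmpeg", "-y", "-i", clip_path, "-vn", audio_out], check=True)
        return frames, audio_out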
303. A pre-trained video classification model is invoked.
The video classification model is obtained by training using the model training method provided in the foregoing embodiment. For the specific model training process, reference may be made to the related description of the foregoing embodiments, and details are not repeated herein.
304. And inputting the target image and the target audio into the video classification model for classification to obtain a classification label of the target video.
And inputting the target image and the target audio into the video classification model for classification to obtain a classification label corresponding to the target video. The classification label may represent the category of the target video.
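Once the video classification model has been trained, classifying a target video reduces to a single forward pass. A minimal usage sketch follows; the wrapper model, the label_names list, and the tensor shapes are hypothetical illustrations rather than details from the embodiment.

    import torch

    def classify_target_video(video_model, target_image, target_audio, label_names):
        # `video_model` is assumed to wrap the image feature extraction model,
        # the audio feature extraction model and the classification model.
        video_model.eval()
        with torch.no_grad():
            scores = video_model(target_image, target_audio)
            index = scores.argmax(dim=-1).item()
        return label_names[index]   # classification label of the target video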
As can be seen from the above, the video processing method provided in the embodiment of the present application receives a video processing request; acquires a target video to be classified according to the video processing request, and divides the target video into a target image and a target audio; calls a pre-trained video classification model; and inputs the target image and the target audio into the video classification model for classification to obtain a classification label of the target video. In this way, the target video is classified through the video classification model.
Referring to fig. 6, fig. 6 is a first structural diagram of a model training apparatus 400 according to an embodiment of the present disclosure. The model training apparatus may include a first obtaining module 401, a construction module 402, an extraction module 403, a classification module 404, and an adjusting module 405:
a first obtaining module 401, configured to obtain a video sample and a classification label corresponding to the video sample, and divide the video sample into an image sample and an audio sample;
a building module 402, configured to build a basic model, where the basic model includes an image feature extraction model, an audio feature extraction model, and a classification model;
an extracting module 403, configured to obtain an image feature of the image sample through extraction by the image feature extraction model, and obtain an audio feature of the audio sample through extraction by the audio feature extraction model;
a classification module 404, configured to input the image features and the audio features into a classification model for classification, so as to obtain a prediction label corresponding to the video sample;
an adjusting module 405, configured to adjust parameters of the image feature extraction model, the audio feature extraction model, and the classification model according to a difference between the prediction label and the classification label until the basic model converges, and use the converged basic model as a video classification model for video classification.
In some embodiments, the first obtaining module 401 is specifically configured to intercept a video segment from a video sample; a video segment is divided into image samples and audio samples.
In some embodiments, the classification module 404 is specifically configured to input the image features and the audio features into a feature fusion module for feature fusion, so as to obtain video features of the video sample; and inputting the video characteristics of the video samples into a characteristic classification module for classification to obtain the prediction labels corresponding to the video samples.
In some embodiments, the classification module 404 is specifically configured to input the image features into the feature fusion module for feature fusion, so as to obtain visual features of the video sample; inputting the audio features into a feature fusion module for feature fusion to obtain the sound features of the video sample; and combining the visual characteristic and the sound characteristic into a video characteristic of the video sample.
In some embodiments, the feature classification module includes a weight assignment unit and a weight weighting unit, and the classification module 404 is specifically configured to input the video features into the weight assignment unit to calculate weights, so as to obtain feature weights corresponding to the video features; and inputting the video features and the corresponding feature weights into a weight weighting unit to calculate a weighted sum so as to obtain a prediction label corresponding to the video sample.
In some embodiments, the weight weighting unit includes a plurality of preset classifiers, and the classification module 404 is specifically configured to input the video features and the corresponding feature weights into the plurality of preset classifiers to obtain a plurality of classification results and corresponding weights; and calculating a weighted sum according to the classification results and the corresponding weights to obtain a prediction label corresponding to the video sample.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure. In some embodiments, the model training apparatus provided in the embodiments of the present application further includes a pre-training module 406:
a pre-training module 406, configured to pre-train the image feature extraction model according to the first data set; pre-training the audio feature extraction model according to the second data set; and pre-training the classification model according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
Wherein the first data set is preprocessed before the pre-training module 406 pre-trains the image feature extraction model according to the first data set. And then, pre-training the image feature extraction model according to the pre-processed first data set.
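The pre-training order can be sketched as follows; the generic train_fn callback, the data set names, and the choice to keep the two extractors fixed while pre-training the classification model are assumptions for illustration only (the modules are assumed to be PyTorch modules).

    def pretrain_all(image_model, audio_model, cls_model,
                     first_data_set, second_data_set, fusion_data_set, train_fn):
        # 1. Pre-train the image feature extraction model on the (pre-processed) first data set.
        train_fn(image_model, first_data_set)
        # 2. Pre-train the audio feature extraction model on the second data set.
        train_fn(audio_model, second_data_set)
        # 3. Pre-train the classification model on features produced by the two
        #    pre-trained extractors, keeping the extractor parameters fixed at this stage.
        for param in list(image_model.parameters()) + list(audio_model.parameters()):
            param.requires_grad = False
        train_fn(cls_model, fusion_data_set)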
As can be seen from the above, in the model training apparatus provided in the embodiment of the present application, the first obtaining module 401 obtains the video sample and the classification label corresponding to the video sample, and divides the video sample into the image sample and the audio sample; the construction module 402 constructs a basic model, which includes an image feature extraction model, an audio feature extraction model and a classification model; the extraction module 403 extracts image features of the image sample through the image feature extraction model, and extracts audio features of the audio sample through the audio feature extraction model; the classification module 404 inputs the image features and the audio features into the classification model for classification, so as to obtain a prediction label corresponding to the video sample; the adjusting module 405 adjusts parameters of the image feature extraction model, the audio feature extraction model, and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and uses the converged basic model as a video classification model for video classification. Therefore, the basic model can be trained through the loss function, so that a video classification model with higher accuracy is obtained, and the accuracy of video classification is improved.
It should be noted that the model training apparatus provided in the embodiment of the present application and the model training method in the above embodiment belong to the same concept, and any method provided in the embodiment of the model training method may be run on the model training apparatus, and the specific implementation process thereof is described in detail in the embodiment of the model training method, and is not described herein again.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. The video processing apparatus may include: a receiving module 501, a second obtaining module 502, a calling module 503, and a prediction module 504.
A receiving module 501, configured to receive a video processing request;
a second obtaining module 502, configured to obtain a target video to be classified according to the video processing request, and divide the target video into a target image and a target audio;
a calling module 503, configured to call a pre-trained video classification model;
the prediction module 504 is configured to input the target image and the target audio into the video classification model for classification, so as to obtain a classification label of the target video;
the video classification model is obtained by training through the model training method provided by the embodiment of the application.
As can be seen from the above, in the video processing apparatus provided in the embodiment of the present application, the receiving module 501 receives a video processing request; the second obtaining module 502 acquires a target video to be classified according to the video processing request, and divides the target video into a target image and a target audio; the calling module 503 calls a pre-trained video classification model; and the prediction module 504 inputs the target image and the target audio into the video classification model for classification to obtain a classification label of the target video. In this way, the target video is classified through the video classification model.
It should be noted that the video processing apparatus provided in this embodiment of the present application and the video processing method in the foregoing embodiment belong to the same concept, and any method provided in the video processing method embodiment may be run on the video processing apparatus, and a specific implementation process thereof is described in detail in the video processing method embodiment, and is not described herein again.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when the stored computer program is executed on a computer, the computer is caused to execute a model training method or a video processing method as provided by embodiments of the present application.
The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
The embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the model training method or the video processing method provided in the embodiment of the present application by calling the computer program stored in the memory.
For example, the electronic device may be a mobile terminal such as a tablet computer or a smart phone. Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
The electronic device 600 may include components such as a memory 601, a processor 602, and the like. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 9 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The memory 601 may be used to store software programs and modules, and the processor 602 executes various functional applications and data processing by operating the computer programs and modules stored in the memory 601. The memory 601 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a computer program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like.
The processor 602 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing an application program stored in the memory 601 and calling the data stored in the memory 601, thereby performing overall monitoring of the electronic device.
Further, the memory 601 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. Accordingly, the memory 601 may also include a memory controller to provide the processor 602 with access to the memory 601.
In this embodiment, the processor 602 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 601 according to the following instructions, and the processor 602 runs the application programs stored in the memory 601, thereby implementing the following processes:
the method comprises the steps of obtaining a video sample and a classification label corresponding to the video sample, and dividing the video sample into an image sample and an audio sample;
constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model;
extracting the image characteristics of the image sample through an image characteristic extraction model, and extracting the audio characteristics of the audio sample through an audio characteristic extraction model;
inputting the image characteristics and the audio characteristics into a classification model for classification to obtain a prediction label corresponding to a video sample;
and adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.
In some embodiments, the classification model includes a feature fusion module and a feature classification module, and when the processor 602 inputs the image features and the audio features into the classification model for classification to obtain the prediction label corresponding to the video sample, the following steps may be performed:
inputting the image characteristics and the audio characteristics into a characteristic fusion module for characteristic fusion to obtain video characteristics of the video sample;
and inputting the video characteristics of the video samples into a characteristic classification module for classification to obtain the prediction labels corresponding to the video samples.
In some embodiments, when the processor 602 performs feature fusion by inputting the image features and the audio features into the feature fusion module to obtain the video features of the video sample, the following steps may be performed:
inputting the image characteristics into a characteristic fusion module for characteristic fusion to obtain the visual characteristics of the video sample;
inputting the audio features into a feature fusion module for feature fusion to obtain the sound features of the video sample;
and combining the visual features and the sound features into video features of the video sample.
In some embodiments, the feature classification module includes a weight assignment unit and a weight weighting unit, and when the processor 602 inputs the video features of the video sample into the feature classification module for classification to obtain the prediction label corresponding to the video sample, the following steps may be performed:
inputting the video features into a weight distribution unit to calculate weights, and obtaining feature weights corresponding to the video features;
and inputting the video features and the corresponding feature weights into a weight weighting unit to calculate a weighted sum so as to obtain a prediction label corresponding to the video sample.
In some embodiments, the weight weighting unit includes a plurality of preset classifiers, and when the processor 602 inputs the video features and the corresponding feature weights into the weight weighting unit to calculate a weighted sum and obtain the prediction label corresponding to the video sample, the following steps may be performed:
inputting the video features and the corresponding feature weights into a plurality of preset classifiers to obtain a plurality of classification results and corresponding weights;
and calculating a weighted sum according to the classification results and the corresponding weights to obtain a prediction label corresponding to the video sample.
In some embodiments, when the processor 602 performs the division of the video sample into the image sample and the audio sample, it may perform:
intercepting a video segment from a video sample;
a video segment is divided into image samples and audio samples.
In some embodiments, before the processor 602 performs the image feature extraction to obtain the image feature of the image sample through the image feature extraction model and performs the audio feature extraction to obtain the audio feature of the audio sample through the audio feature extraction model, it may perform:
pre-training the image feature extraction model according to the first data set;
pre-training the audio feature extraction model according to the second data set;
and pre-training the classification model according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
In some embodiments, before the processor 602 performs pre-training of the image feature extraction model according to the first dataset, it may perform:
preprocessing the first data set;
the pre-training step of the image feature extraction model according to the first data set comprises:
and pre-training the image feature extraction model according to the pre-processed first data set.
In this embodiment, the processor 602 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 601 according to the following instructions, and the processor 602 runs the application programs stored in the memory 601, thereby implementing the following processes:
receiving a video processing request;
acquiring a target video to be classified according to the video processing request, and dividing the target video into a target image and a target audio;
calling a pre-trained video classification model;
inputting the target image and the target audio into the video classification model for classification to obtain a classification label of the target video;
the video classification model is obtained by training through the model training method provided by the embodiment of the application.
Referring to fig. 10, fig. 10 is a schematic diagram of a second structure of an electronic device according to an embodiment of the present application. The difference from the electronic device shown in fig. 9 is that the electronic device further includes: a display 603, a radio frequency circuit 604, an audio circuit 605, and a power supply 606. The display 603, the radio frequency circuit 604, the audio circuit 605, and the power supply 606 are electrically connected to the processor 602, respectively.
The display 603 may be used to display information input by or provided to the user as well as various graphical user interfaces, which may be made up of graphics, text, icons, video, and any combination thereof. The display 603 may include a display panel, and in some embodiments, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The radio frequency circuit 604 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or other electronic devices and exchange signals with the network device or the other electronic devices.
The audio circuit 605 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone.
The power supply 606 may be used to power various components of the electronic device 600. In some embodiments, power supply 606 may be logically coupled to processor 602 via a power management system, such that functions such as managing charging, discharging, and power consumption may be performed via the power management system.
Although not shown in fig. 10, the electronic device 600 may further include a camera assembly, a Bluetooth module, and the like. The camera assembly may include an image processing circuit, which may be implemented by hardware and/or software components and may include various processing units that define an image signal processing (ISP) pipeline. The image processing circuit may include at least: a plurality of cameras, an image signal processor (ISP processor), a control logic, an image memory, and a display. Each camera may include at least one or more lenses and an image sensor. The image sensor may include a color filter array (e.g., a Bayer filter). The image sensor may acquire light intensity and wavelength information captured by each imaging pixel of the image sensor and provide a set of raw image data that may be processed by the image signal processor.
In the above embodiments, the descriptions of the respective embodiments have their own emphases. For parts that are not described in detail in a certain embodiment, reference may be made to the above detailed description of the model training method/video processing method, which is not repeated herein.
The model training/video processing apparatus provided in the embodiments of the present application and the model training method/video processing method in the above embodiments belong to the same concept; any method provided in the model training method/video processing method embodiments may be run on the apparatus, and the specific implementation process thereof is described in detail in the corresponding method embodiments and is not repeated herein.
It should be noted that, for the model training method/video processing method of the embodiments of the present application, those skilled in the art can understand that all or part of the process for implementing the model training method/video processing method of the embodiments of the present application can be completed by controlling the related hardware through a computer program. The computer program may be stored in a computer-readable storage medium, such as a memory, and executed by at least one processor, and the execution process may include the processes of the embodiments of the model training method/video processing method. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
For the model training/video processing apparatus in the embodiments of the present application, each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as an independent product, may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disk.
The model training method, the video processing device, the storage medium and the electronic device provided by the embodiments of the present application are described in detail above, and specific examples are applied in the present application to explain the principles and embodiments of the present application, and the description of the embodiments above is only used to help understanding the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, the specific implementation manner and the application scope may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (19)

  1. A model training method, comprising:
    acquiring a video sample and a classification label corresponding to the video sample, and dividing the video sample into an image sample and an audio sample;
    constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model;
    extracting the image characteristics of the image sample through the image characteristic extraction model, and extracting the audio characteristics of the audio sample through the audio characteristic extraction model;
    inputting the image characteristics and the audio characteristics into the classification model for classification to obtain a prediction label corresponding to the video sample;
    and adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.
  2. The method of claim 1, wherein the classification model comprises a feature fusion module and a feature classification module, and the step of inputting the image feature and the audio feature into the classification model for classification to obtain the prediction label corresponding to the video sample comprises:
    inputting the image characteristics and the audio characteristics into the characteristic fusion module for characteristic fusion to obtain video characteristics of the video sample;
    and inputting the video characteristics of the video samples into the characteristic classification module for classification to obtain the prediction labels corresponding to the video samples.
  3. The method of claim 2, wherein the step of inputting the image feature and the audio feature into the feature fusion module for feature fusion to obtain the video feature of the video sample comprises:
    inputting the image features into the feature fusion module for feature fusion to obtain visual features of the video sample;
    inputting the audio features into the feature fusion module for feature fusion to obtain the sound features of the video sample;
    combining the visual features and the sound features into video features of the video sample.
  4. The method according to claim 2, wherein the feature classification module comprises a weight assignment unit and a weight weighting unit, and the step of inputting the video features of the video samples into the feature classification module for classification to obtain the prediction labels corresponding to the video samples comprises:
    inputting the video features into the weight distribution unit to calculate weights, and obtaining feature weights corresponding to the video features;
    and inputting the video features and the corresponding feature weights into the weight weighting unit to calculate a weighted sum, so as to obtain a prediction label corresponding to the video sample.
  6. The method of claim 4, wherein the weight weighting unit comprises a plurality of preset classifiers, and the step of inputting the video features and the corresponding feature weights into the weight weighting unit to calculate a weighted sum to obtain the prediction label corresponding to the video sample comprises:
    inputting the video features and the corresponding feature weights into the preset classifiers to obtain a plurality of classification results and corresponding weights;
    and calculating a weighted sum according to the classification results and the corresponding weights to obtain a prediction label corresponding to the video sample.
  6. The method of claim 1, wherein the step of dividing the video sample into an image sample and an audio sample comprises:
    intercepting a video segment from the video sample;
    the video segment is divided into image samples and audio samples.
  7. The method of claim 1, wherein the steps of obtaining the image feature of the image sample by the image feature extraction model and obtaining the audio feature of the audio sample by the audio feature extraction model are preceded by:
    pre-training the image feature extraction model according to a first data set;
    pre-training the audio feature extraction model according to a second data set;
    and pre-training the classification model according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
  8. The method of claim 7, wherein the pre-training of the image feature extraction model from the first dataset is preceded by:
    preprocessing the first data set;
    the pre-training step of the image feature extraction model according to the first data set comprises:
    and pre-training the image feature extraction model according to the pre-processed first data set.
  9. A video processing method, comprising:
    receiving a video processing request;
    acquiring a target video to be classified according to the video processing request, and dividing the target video into a target image and a target audio;
    calling a pre-trained video classification model;
    inputting the target image and the target audio into the video classification model for classification to obtain a classification label of the target video;
    wherein the video classification model is obtained by training using the model training method of any one of claims 1 to 8.
  10. A model training apparatus comprising:
    the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a video sample and a classification label corresponding to the video sample, and dividing the video sample into an image sample and an audio sample;
    the system comprises a construction module, a classification module and a storage module, wherein the construction module is used for constructing a basic model, and the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model;
    the extraction module is used for extracting the image characteristics of the image sample through the image characteristic extraction model and extracting the audio characteristics of the audio sample through the audio characteristic extraction model;
    the classification module is used for inputting the image characteristics and the audio characteristics into the classification model for classification to obtain a prediction label corresponding to the video sample;
    and the adjusting module is used for adjusting the parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.
  11. A video processing apparatus comprising:
    the receiving module is used for receiving a video processing request;
    the second acquisition module is used for acquiring a target video to be classified according to the video processing request and dividing the target video into a target image and a target audio;
    the calling module is used for calling a pre-trained video classification model;
    the prediction module is used for inputting the target image and the target audio into the video classification model for classification to obtain a classification label of the target video;
    wherein the video classification model is obtained by training using the model training method of any one of claims 1 to 8.
  12. A storage medium having stored therein a computer program which, when run on a computer, causes the computer to perform the model training method of any one of claims 1 to 8 or the video processing method of claim 9.
  13. An electronic device, wherein the electronic device comprises a processor and a memory, wherein the memory has a computer program stored therein, and the processor is configured to execute, by calling the computer program stored in the memory:
    acquiring a video sample and a classification label corresponding to the video sample, and dividing the video sample into an image sample and an audio sample;
    constructing a basic model, wherein the basic model comprises an image feature extraction model, an audio feature extraction model and a classification model;
    extracting the image characteristics of the image sample through the image characteristic extraction model, and extracting the audio characteristics of the audio sample through the audio characteristic extraction model;
    inputting the image characteristics and the audio characteristics into the classification model for classification to obtain a prediction label corresponding to the video sample;
    and adjusting parameters of the image feature extraction model, the audio feature extraction model and the classification model according to the difference between the prediction label and the classification label until the basic model converges, and taking the converged basic model as a video classification model for video classification.
  14. The electronic device of claim 13, wherein the classification model comprises a feature fusion module and a feature classification module, the processor to perform:
    inputting the image characteristics and the audio characteristics into the characteristic fusion module for characteristic fusion to obtain video characteristics of the video sample;
    and inputting the video characteristics of the video samples into the characteristic classification module for classification to obtain the prediction labels corresponding to the video samples.
  15. The electronic device of claim 14, wherein the processor is configured to perform:
    inputting the image features into the feature fusion module for feature fusion to obtain visual features of the video sample;
    inputting the audio features into the feature fusion module for feature fusion to obtain the sound features of the video sample;
    combining the visual features and the sound features into video features of the video sample.
  16. The electronic device of claim 14, wherein the feature classification module comprises a weight assignment unit and a weight weighting unit, the processor to perform:
    inputting the video features into the weight distribution unit to calculate weights, and obtaining feature weights corresponding to the video features;
    and inputting the video features and the corresponding feature weights into the weight weighting unit to calculate a weighted sum so as to obtain a prediction label corresponding to the video sample.
  17. The electronic device of claim 16, wherein the weight weighting unit comprises a plurality of preset classifiers, the processor configured to perform:
    inputting the video features and the corresponding feature weights into the preset classifiers to obtain a plurality of classification results and corresponding weights;
    and calculating a weighted sum according to the classification results and the corresponding weights to obtain a prediction label corresponding to the video sample.
  18. The electronic device of claim 13, wherein the processor is configured to perform:
    pre-training the image feature extraction model according to a first data set;
    pre-training the audio feature extraction model according to a second data set;
    and pre-training the classification model according to the pre-trained image feature extraction model and the pre-trained audio feature extraction model.
  19. An electronic device, wherein the electronic device comprises a processor and a memory, wherein the memory has a computer program stored therein, and the processor is configured to execute, by calling the computer program stored in the memory:
    receiving a video processing request;
    acquiring a target video to be classified according to the video processing request, and dividing the target video into a target image and a target audio;
    calling a pre-trained video classification model;
    inputting the target image and the target audio into the video classification model for classification to obtain a classification label of the target video;
    wherein the video classification model is obtained by training using the model training method of any one of claims 1 to 8.
CN202080084487.XA 2020-01-08 2020-01-08 Model training method, video processing method, device, storage medium and electronic equipment Pending CN114787844A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/071021 WO2021138855A1 (en) 2020-01-08 2020-01-08 Model training method, video processing method and apparatus, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN114787844A (en)

Family

ID=76787433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080084487.XA Pending CN114787844A (en) 2020-01-08 2020-01-08 Model training method, video processing method, device, storage medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN114787844A (en)
WO (1) WO2021138855A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408664B (en) * 2021-07-20 2024-04-16 北京百度网讯科技有限公司 Training method, classification method, device, electronic equipment and storage medium
CN113672252A (en) * 2021-07-23 2021-11-19 浙江大华技术股份有限公司 Model upgrading method, video monitoring system, electronic equipment and readable storage medium
CN113806536B (en) * 2021-09-14 2024-04-16 广州华多网络科技有限公司 Text classification method and device, equipment, medium and product thereof
CN114283350B (en) * 2021-09-17 2024-06-07 腾讯科技(深圳)有限公司 Visual model training and video processing method, device, equipment and storage medium
CN113807281B (en) * 2021-09-23 2024-03-29 深圳信息职业技术学院 Image detection model generation method, detection method, terminal and storage medium
CN114528762B (en) * 2022-02-17 2024-02-20 腾讯科技(深圳)有限公司 Model training method, device, equipment and storage medium
CN114821401A (en) * 2022-04-07 2022-07-29 腾讯科技(深圳)有限公司 Video auditing method, device, equipment, storage medium and program product
CN116996680B (en) * 2023-09-26 2023-12-12 上海视龙软件有限公司 Method and device for training video data classification model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125223A1 (en) * 2003-12-05 2005-06-09 Ajay Divakaran Audio-visual highlights detection using coupled hidden markov models
MX349609B (en) * 2013-09-13 2017-08-04 Arris Entpr Llc Content based video content segmentation.
CN104866596B (en) * 2015-05-29 2018-09-14 北京邮电大学 A kind of video classification methods and device based on autocoder
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN109840509B (en) * 2019-02-15 2020-12-01 北京工业大学 Multilayer cooperative identification method and device for bad anchor in network live video
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device

Also Published As

Publication number Publication date
WO2021138855A1 (en) 2021-07-15

Similar Documents

Publication Publication Date Title
CN114787844A (en) Model training method, video processing method, device, storage medium and electronic equipment
US20210326377A1 (en) Multi-stage image querying
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN111209970B (en) Video classification method, device, storage medium and server
Xu et al. A multi-view CNN-based acoustic classification system for automatic animal species identification
WO2022016556A1 (en) Neural network distillation method and apparatus
US11825278B2 (en) Device and method for auto audio and video focusing
CN113570029A (en) Method for obtaining neural network model, image processing method and device
US10810485B2 (en) Dynamic context-selective convolutional neural network for time series data classification
CN113421547B (en) Voice processing method and related equipment
WO2022048239A1 (en) Audio processing method and device
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN112529149B (en) Data processing method and related device
CN112766284B (en) Image recognition method and device, storage medium and electronic equipment
WO2023207541A1 (en) Speech processing method and related device
CN116861850A (en) Data processing method and device
CN112884147A (en) Neural network training method, image processing method, device and electronic equipment
US20220092334A1 (en) Contextual Matching
CN114429641A (en) Time sequence action detection method and device, storage medium and terminal
US10917721B1 (en) Device and method of performing automatic audio focusing on multiple objects
CN115292439A (en) Data processing method and related equipment
CN116052714A (en) Data processing method and device
CN116958852A (en) Video and text matching method and device, electronic equipment and storage medium
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination