CN110232340B - Method and device for establishing video classification model and video classification


Info

Publication number
CN110232340B
Authority
CN
China
Prior art keywords
video
features
classification model
feature
extracted
Prior art date
Legal status
Active
Application number
CN201910461500.8A
Other languages
Chinese (zh)
Other versions
CN110232340A (en)
Inventor
牛国成
何伯磊
肖欣延
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910461500.8A
Publication of CN110232340A
Application granted
Publication of CN110232340B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides a method and a device for establishing a video classification model and for classifying videos. The method comprises the following steps: acquiring videos and their corresponding category labeling results; extracting preset types of features from each video and constructing, based on these features, a combined feature corresponding to each video; taking each feature extracted from a video and the video's combined feature as inputs of a classification model, and obtaining the classification model's output result for each feature and for the combined feature of the video; obtaining, according to these output results and the category labeling result of the video, the loss function corresponding to each feature and to the combined feature of the same video; and determining the loss function of the classification model from the loss functions corresponding to the features and the combined feature of the same video, then adjusting the parameters of the classification model with this loss function to obtain the video classification model. The method and the device can improve the classification accuracy of the established video classification model.

Description

Method and device for establishing video classification model and video classification
[ technical field ]
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a computer storage medium for creating a video classification model and video classification.
[ background of the invention ]
In the prior art, when a video classification model is established from a plurality of features of a video, the parameters of the classification model are often optimized using only the loss function corresponding to the combined feature built from those features. However, when only the combined feature of the video is used to calculate the loss function of the classification model, the contribution of each individual feature cannot be taken into account and information contained in the original single features is lost, so that the accuracy of the finally trained video classification model is low.
[ summary of the invention ]
In view of this, the present invention provides a method, an apparatus, a device, and a computer storage medium for establishing a video classification model and video classification, which are used to consider contributions of all features in a video in a training process of the classification model, so as to improve classification accuracy of the established video classification model.
The technical scheme adopted by the invention for solving the technical problem is to provide a method for establishing a video classification model, which comprises the following steps: acquiring training data, wherein the training data comprises videos and category marking results corresponding to the videos; respectively extracting preset types of features from the videos, and constructing combined features corresponding to the videos based on the extracted features; respectively taking each feature extracted from a video and a combination feature corresponding to the video as the input of a classification model, and acquiring the output results of the classification model aiming at each feature and the combination feature of the video; respectively acquiring loss functions corresponding to all the features and the combined features of the same video according to the output results of all the features and the combined features of the video and the class marking results corresponding to the video by the classification model; and determining a loss function of the classification model according to the loss functions corresponding to the characteristics and the combined characteristics of the same video, and adjusting parameters of the classification model by using the loss function of the classification model to obtain the video classification model.
According to a preferred embodiment of the present invention, the preset type features are at least two of a visual feature of a video, an audio feature of a video, a text feature corresponding to an audio in a video, a feature of a subtitle and a drama name in a video, a face feature in a video, a feature of a cover picture of a video, a feature of a tag to which a video belongs, a feature of a video title, and a knowledge base feature of a video.
According to a preferred embodiment of the present invention, the constructing the combined feature corresponding to each video based on the extracted features includes: splicing all the features extracted from the video to obtain a splicing result; acquiring a combination relationship among all the features extracted from the video; adjusting each extracted feature according to the attention of that feature to the other extracted features, to obtain the adjusted features; and splicing the splicing result, the combination relationship among the features and the adjusted features, and taking this splicing result as the combined feature corresponding to each video.
According to a preferred embodiment of the present invention, the obtaining the loss functions corresponding to the features and the combination features of the same video respectively includes: determining a classification task to which each video belongs according to a category marking result corresponding to each video; determining a calculation formula of a loss function according to the classification task to which each video belongs; and respectively acquiring the loss functions corresponding to the characteristics and the combined characteristics of the same video according to the determined calculation formula of the loss functions.
According to a preferred embodiment of the present invention, the determining a loss function of a classification model according to the loss functions corresponding to the features and the combined features of the same video includes: and determining the sum of the loss functions of all the characteristics of the same video and the loss functions of the combined characteristics as the loss function of the classification model.
According to a preferred embodiment of the invention, the method further comprises: the training objective of the classification model is to minimize the loss function of the classification model.
The technical scheme adopted by the invention for solving the technical problem is to provide a video classification method, which comprises the following steps: acquiring a video to be classified; extracting preset type features from the video to be classified, and constructing combined features corresponding to the video to be classified based on the extracted features; and taking the combination characteristics corresponding to the video to be classified as the input of a video classification model, and determining the category of the video to be classified according to the output result of the video classification model.
According to a preferred embodiment of the present invention, the constructing the combined features corresponding to the video to be classified based on the extracted features includes: splicing the features extracted from the videos to be classified to obtain a splicing result; acquiring a combination relation among the features extracted from the video to be classified; adjusting the extracted features according to the attention of each feature extracted from the video to be classified to other features, and acquiring each adjusted feature; and splicing the splicing result, the combination relation among the characteristics and the adjusted characteristics, and taking the splicing result as the combination characteristic corresponding to the video to be classified.
According to a preferred embodiment of the present invention, the determining the category of the video to be classified according to the output result of the video classification model includes: determining a probability value which is greater than a preset threshold value in an output result of the video classification model; and determining the category corresponding to the determined probability value as the category to which the video to be classified belongs.
The technical scheme adopted by the invention for solving the technical problem is to provide a device for establishing a video classification model, and the device comprises: the system comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring training data, and the training data comprises videos and class marking results corresponding to the videos; the first construction unit is used for respectively extracting the characteristics of a preset type from each video and constructing the combined characteristics corresponding to each video based on the extracted characteristics; the first processing unit is used for respectively taking each feature extracted from a video and a combined feature corresponding to the video as the input of a classification model and acquiring the output results of the classification model aiming at each feature and the combined feature of the video; the first training unit is used for respectively acquiring loss functions corresponding to all the features and the combined features of the same video according to the output results of all the features and the combined features of the video and the class marking result corresponding to the video by the classification model; and the second training unit is used for determining a loss function of the classification model according to the loss functions corresponding to the features and the combined features of the same video, and adjusting parameters of the classification model by using the loss function of the classification model to obtain the video classification model.
According to a preferred embodiment of the present invention, the preset type features are at least two of a visual feature of a video, an audio feature of a video, a text feature corresponding to an audio in a video, a feature of a subtitle and a drama name in a video, a face feature in a video, a feature of a cover picture of a video, a feature of a tag to which a video belongs, a feature of a video title, and a knowledge base feature of a video.
According to a preferred embodiment of the present invention, when constructing the combined feature corresponding to each video based on the extracted features, the first constructing unit specifically performs: splicing all the features extracted from the video to obtain a splicing result; acquiring a combination relationship among all the features extracted from the video; adjusting each extracted feature according to the attention of that feature to the other extracted features, to obtain the adjusted features; and splicing the splicing result, the combination relationship among the features and the adjusted features, and taking this splicing result as the combined feature corresponding to each video.
According to a preferred embodiment of the present invention, when the first training unit respectively obtains the loss functions corresponding to the features and the combined features of the same video, the first training unit specifically performs: determining a classification task to which each video belongs according to a category marking result corresponding to each video; determining a calculation formula of a loss function according to the classification task to which each video belongs; and respectively acquiring the loss functions corresponding to the characteristics and the combined characteristics of the same video according to the determined calculation formula of the loss functions.
According to a preferred embodiment of the present invention, when determining the loss function of the classification model according to the loss functions corresponding to the features and the combined features of the same video, the second training unit specifically performs: and determining the sum of the loss functions of all the characteristics of the same video and the loss functions of the combined characteristics as the loss function of the classification model.
According to a preferred embodiment of the present invention, the second training unit further performs: a training objective for the classification model is to minimize a loss function of the classification model.
The technical solution adopted by the present invention to solve the technical problem is to provide a video classification apparatus, which includes: the second acquisition unit is used for acquiring the video to be classified; the second construction unit is used for extracting the characteristics of a preset type from the video to be classified and constructing the combined characteristics corresponding to the video to be classified based on the extracted characteristics; and the second processing unit is used for taking the combination characteristics corresponding to the video to be classified as the input of a video classification model and determining the category of the video to be classified according to the output result of the video classification model.
According to a preferred embodiment of the present invention, when the second constructing unit constructs the combined feature corresponding to the video to be classified based on the extracted features, the second constructing unit specifically performs: splicing the features extracted from the videos to be classified to obtain a splicing result; acquiring a combination relation among the features extracted from the video to be classified; adjusting the extracted features according to the attention of each feature extracted from the video to be classified to other features, and acquiring each adjusted feature; and splicing the splicing result, the combination relation among the characteristics and the adjusted characteristics, and taking the splicing result as the combination characteristic corresponding to the video to be classified.
According to a preferred embodiment of the present invention, when determining the category of the video to be classified according to the output result of the video classification model, the second processing unit specifically executes: determining a probability value which is greater than a preset threshold value in an output result of the video classification model; and determining the category corresponding to the determined probability value as the category to which the video to be classified belongs.
According to the above technical scheme, the loss function of the classification model is obtained by using both the loss function corresponding to each feature of a video and the loss function corresponding to the combined feature of the video, and the video classification model is trained according to the obtained loss function, so that the contributions of all the features of the video are considered during training, the problem of losing the information contained in the individual features is avoided, and the classification accuracy of the established video classification model is improved.
[ description of the drawings ]
Fig. 1 is a flowchart of a method for building a video classification model according to an embodiment of the present invention;
fig. 2 is a flowchart of a video classification method according to an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus for creating a video classification model according to an embodiment of the present invention;
FIG. 4 is a block diagram of an apparatus for video classification according to an embodiment of the present invention;
fig. 5 is a block diagram of a computer system/server according to an embodiment of the invention.
[ detailed description of the embodiments ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
Fig. 1 is a flowchart of a method for building a video classification model according to an embodiment of the present invention, as shown in fig. 1, the method includes:
in 101, training data is obtained, where the training data includes videos and category labeling results corresponding to the videos.
In this step, each video and the category labeling result corresponding to each video are obtained as training data, and the obtained training data are used for training to obtain a video classification model. In this step, the training data may be obtained through the terminal or the server.
Specifically, the category labeling result corresponding to each video obtained in this step is a label set of each category of the videos belonging to the preset category. In this step, the labels of the categories to which the videos belong may be manually labeled as "1", and the labels of the other categories may be labeled as "0".
For example, if the preset categories include category 1, category 2, category 3, and category 4, respectively, and if the categories to which the obtained video a belongs are category 1 and category 3, the category labeling result corresponding to the video a obtained in this step may be [1,0,1,0 ].
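As a minimal illustration of such a multi-hot labeling result (not part of the patent; the category names and helper function are hypothetical), the labeling vector of the above example could be built as follows:

```python
# Illustrative sketch only: building a multi-hot category labeling result like
# the [1, 0, 1, 0] example above. Category names and the helper are hypothetical.
PRESET_CATEGORIES = ["category 1", "category 2", "category 3", "category 4"]

def build_label_vector(video_categories, preset_categories=PRESET_CATEGORIES):
    """Return 1 for each preset category the video belongs to, 0 otherwise."""
    return [1 if c in video_categories else 0 for c in preset_categories]

# Video A belongs to category 1 and category 3.
print(build_label_vector({"category 1", "category 3"}))  # -> [1, 0, 1, 0]
```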
At 102, preset types of features are extracted from the videos respectively, and combined features corresponding to the videos are constructed based on the extracted features.
In this step, preset types of features are extracted from each video acquired in step 101, and then a combined feature corresponding to each video is constructed based on the extracted features. In this step, the server may extract and construct features according to the acquired video.
Specifically, the preset type features extracted in this step are at least two of visual features of a video, audio features of the video, text features corresponding to audio in the video, features of subtitles and drama names in the video, face features in the video, features of a video cover picture, features of a label to which the video belongs, features of a video title, and features of a video knowledge base.
It will be appreciated that this step may be performed using existing techniques when extracting features of a preset type from a video. The present invention does not limit the manner in which the preset type of features in the video are extracted.
For example, the visual features of a video may be extracted in the following manner: sampling 10 frames of images from the video; combining the frame images in temporal order to obtain an image sequence; predicting each frame of image in the image sequence by using a pre-trained detection model, for example a ResNet (Residual Network) model, and taking the output of the last fully-connected layer of the model for each frame of image as the representation of that image; and combining the representations of the images included in the image sequence, with the combination result used as the visual feature of the video.
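A rough sketch of this frame-sampling and ResNet-based extraction is given below. It assumes PyTorch/torchvision and a pretrained ResNet-50; the preprocessing, frame count of 10, and the simple stacking used as the "combination" step are illustrative assumptions, not the patent's exact configuration.

```python
# Minimal sketch (assumed: torchvision ResNet-50, 10 sampled frames as PIL images).
import torch
import torchvision.transforms as T
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # drop the classification layer so each frame yields its 2048-d representation
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_feature(frames):
    """frames: 10 PIL images sampled from the video, in temporal order."""
    batch = torch.stack([preprocess(f) for f in frames])   # (10, 3, 224, 224)
    with torch.no_grad():
        per_frame = model(batch)                           # (10, 2048) per-frame representations
    return per_frame                                        # stacked as the image-sequence representation
```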
For example, the audio features of a video may be extracted in the following manner: extracting Mel-Frequency Cepstral Coefficients (MFCC) from the audio of the video; and performing convolution processing on the extracted Mel-frequency cepstral coefficients, with the processing result taken as the audio feature of the video.
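The following sketch illustrates this audio branch under the assumption that librosa is used for MFCC extraction and a single 1-D convolution stands in for the "convolution processing"; the sample rate, coefficient count, and pooling are placeholders rather than the patent's settings.

```python
# Minimal sketch of MFCC extraction plus convolution (assumed libraries: librosa, PyTorch).
import librosa
import torch

def audio_feature(wav_path, n_mfcc=40):
    waveform, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)   # (40, T)
    x = torch.tensor(mfcc, dtype=torch.float32).unsqueeze(0)        # (1, 40, T)
    conv = torch.nn.Conv1d(in_channels=n_mfcc, out_channels=128, kernel_size=3, padding=1)
    with torch.no_grad():
        out = conv(x)                                               # (1, 128, T)
    return out.mean(dim=-1).squeeze(0)                              # pooled audio feature, shape (128,)
```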
For example, the following method can be used to extract the text features corresponding to the audio in the video: identifying the audio in the video as characters; performing word segmentation on the characters obtained by recognition, and obtaining a word vector sequence corresponding to the audio according to a word segmentation result; and processing the obtained word vector sequence by using a convolutional neural network, and taking a processing result as a character characteristic corresponding to the audio frequency in the video.
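A comparable sketch for this text branch (speech recognized to text, word segmentation, word-vector sequence, convolutional network) is shown below; the vocabulary size, embedding dimension, and max-pooling are assumptions, and the word IDs are taken as already produced by recognition and segmentation.

```python
# Minimal sketch of the transcript branch (assumed: word IDs from ASR + word segmentation).
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM = 50000, 300
embedding = nn.Embedding(VOCAB_SIZE, EMB_DIM)
conv = nn.Conv1d(in_channels=EMB_DIM, out_channels=128, kernel_size=3, padding=1)

def text_feature(token_ids):
    """token_ids: 1-D LongTensor of word indices from the segmented transcript."""
    vectors = embedding(token_ids).transpose(0, 1).unsqueeze(0)   # (1, EMB_DIM, seq_len)
    feat = conv(vectors)                                          # (1, 128, seq_len)
    return feat.max(dim=-1).values.squeeze(0)                     # max-pooled text feature, shape (128,)
```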
For example, the features of subtitles and drama names in a video can be extracted in the following manner: performing OCR (Optical Character Recognition) on the subtitles in the video, and generating a character sequence according to the recognition result; performing word segmentation on the generated character sequence, and obtaining a word vector sequence according to the word segmentation result; processing the obtained word vector sequence by using a convolutional neural network, and taking the processing result as the feature of the subtitles in the video; and extracting the drama name in the video, performing vector mapping on it, and taking the mapping result as the feature of the drama name in the video.
For example, the following method can be adopted to extract the face features in a video: identifying the person information and position information of the faces contained in the video by using a face recognition tool; and performing vector mapping on the recognized information by using a BoW (Bag of Words) model, and taking the mapping result as the face feature of the video.
For example, the features of a video cover map may be extracted in the following manner: extracting the cover map of the video; and predicting the cover map by using a pre-trained ResNet model, and taking the output of the last fully-connected layer of the model as the feature of the video cover map.
For example, the features of the label to which the video belongs can be extracted in the following ways: acquiring a label printed on a video by an author when the video is uploaded; and performing vector mapping on the label by using a BoW model, and taking a mapping result as the characteristic of the label of the video.
For example, the following way can be adopted to extract the title feature of a video: acquiring the title corresponding to the video; and processing the obtained title using an LSTM (Long Short-Term Memory) model, and taking the processing result as the title feature of the video.
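For the title branch, a minimal sketch assuming the title has already been mapped to word vectors and a single-layer LSTM could look like this; the dimensions are illustrative only.

```python
# Minimal sketch of the title branch (assumed: the title is already mapped to word vectors).
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=300, hidden_size=128, batch_first=True)

def title_feature(title_vectors):
    """title_vectors: FloatTensor of shape (1, title_len, 300)."""
    _, (h_n, _) = lstm(title_vectors)
    return h_n.squeeze(0).squeeze(0)   # final hidden state as the title feature, shape (128,)
```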
For example, the knowledge base features of a video may be extracted in the following manner: extracting contents such as author labels, character names and drama names from the video; inquiring a preset knowledge base to obtain knowledge base information corresponding to the extracted content; and carrying out vector mapping on the extracted knowledge base information, and taking a mapping result as the knowledge base characteristic of the video.
In this step, after the preset type of features are extracted from each video, fusion can be performed based on the features extracted from each video, so as to construct the combined features corresponding to each video.
Specifically, in this step, when constructing the combined feature corresponding to each video based on the extracted features, the following manner may be adopted: splicing all the features extracted from the video to obtain a splicing result; acquiring the combination relationship between the extracted features, for example by using a Factorization Machine (FM); adjusting each extracted feature according to the attention of that feature to the other extracted features, for example by using an attention mechanism, and obtaining the adjusted features; and splicing the obtained splicing result, the combination relationship between the features and the adjusted features, and taking this splicing result as the combined feature corresponding to each video.
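A hedged sketch of this combination step is given below. It assumes the N extracted features have already been projected to a common dimension d, and it reads the "combination relationship" as an FM-style second-order interaction and the "attention adjustment" as scaled dot-product attention; these are plausible instantiations, not necessarily the exact operators of the patent.

```python
# Sketch of combined-feature construction: concatenation + FM-style interaction + attention adjustment.
import torch
import torch.nn.functional as F

def build_combined_feature(features):
    """features: tensor of shape (N, d) holding the N per-modality features of one video."""
    n, d = features.shape

    # 1) Splice (concatenate) the extracted features.
    concat = features.reshape(-1)                                                        # (N*d,)

    # 2) FM-style second-order interaction between the features, computed per dimension:
    #    0.5 * ((sum_i x_i)^2 - sum_i x_i^2).
    fm_interaction = 0.5 * (features.sum(dim=0).pow(2) - features.pow(2).sum(dim=0))     # (d,)

    # 3) Attention adjustment: each feature attends to the other features.
    scores = features @ features.t() / d ** 0.5                                          # (N, N)
    attn = F.softmax(scores, dim=-1)
    adjusted = (attn @ features).reshape(-1)                                             # (N*d,)

    # Splice the three parts into the combined feature of the video.
    return torch.cat([concat, fm_interaction, adjusted], dim=0)
```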
In the prior art, a method of directly splicing a plurality of features to obtain a combined feature is generally used, but the method may cause a problem that the obtained combined feature cannot sufficiently acquire information contained in each feature, so that the effect of combining the features is not obvious enough. By using the combination method provided by the invention, the characteristics and the information contained in the characteristics can be fully acquired, so that the combined characteristics with more obvious combination effect among the characteristics can be acquired.
It can be understood that, in this step, at least one of the above-mentioned splicing result, the combination relationship between the features, and the adjusted features may also be directly used as the combination feature corresponding to each video; in this step, any two of the splicing results, the combination relationship between the features, and the adjusted features may be used as the combination features corresponding to the videos.
In 103, each feature extracted from the video and the corresponding combined feature of the video are used as input of a classification model, and output results of the classification model for each feature and combined feature of the video are obtained.
In this step, the features extracted from the video in step 102 and the combined features corresponding to the constructed videos are used as input of the classification model, so as to obtain the output results of the classification model for the features and the combined features of the video.
For example, if the feature 1, the feature 2, and the feature 3 are extracted from the video a, and if the combined feature constructed from the extracted 3 features is the feature 4, the step takes the feature 1, the feature 2, the feature 3, and the feature 4 as the inputs of the classification model, so as to obtain the output result of the classification model for each input.
The output result of the classification model obtained in this step is a probability set that the video corresponding to the input features belongs to each preset category.
For example, if the feature 1 of the video a is used as the input of the classification model in this step, the output result of the model may be [0.6,0.3,0.7,0.1], which means that the probability that the video a belongs to the category 1 is 0.6, the probability that the video a belongs to the category 2 is 0.3, the probability that the video a belongs to the category 3 is 0.7, and the probability that the video a belongs to the category 4 is 0.1; similarly, the output results of the classification model for the features 2, 3 and 4 of the video a can be obtained in this step.
It is to be understood that the classification model in this step may be a support vector machine, a regression model or a deep learning model, and the type of the classification model is not limited by the present invention.
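As one possible deep-learning instantiation of the classification model in this step (an assumption for illustration, not mandated by the patent), a small fully-connected head shared by every single feature and by the combined feature can output a probability per preset category; the layer sizes and the assumption that all inputs are projected to a common dimension are placeholders.

```python
# Sketch of a shared classification head (assumed: all inputs projected to in_dim beforehand).
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, in_dim, num_categories=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_categories),
            nn.Sigmoid(),   # one probability per preset category, as in the [0.6, 0.3, 0.7, 0.1] example
        )

    def forward(self, x):
        return self.net(x)
```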
At 104, according to the output results of the classification model for each feature and combination feature of the video and the class labeling result corresponding to the video, the loss functions corresponding to each feature and combination feature of the same video are respectively obtained.
In this step, loss functions corresponding to the features and the combined features of the same video are respectively obtained according to the class labeling result corresponding to each video obtained in step 101 and the output result of the classification model for each feature and the combined feature of each video in step 103.
For example, if feature 1, feature 2, and feature 3 are extracted from the video a, and the combined feature obtained from the extracted 3 features is feature 4, the feature 1, feature 2, feature 3, and feature 4 are used as inputs to obtain the output result of the classification model, and the loss functions corresponding to feature 1, feature 2, feature 3, and feature 4 are calculated.
In this step, when the loss functions corresponding to the features and the combined features are obtained, the corresponding calculation formula needs to be selected according to the classification task to which the class marking result corresponding to the video belongs.
Specifically, if the category labeling result corresponding to the video is a binary task, the calculation formula of the loss function is as follows:
loss = -Σ_i [ y_i · log(s_i) ]
in the formula: i represents the i-th preset category; y_i represents the labeling result corresponding to category i in the category labeling result of the video; s_i represents the output of the classification model for category i given the input.
If the category labeling result corresponding to the video is a multi-label classification task, the calculation formula of the loss function is as follows:
loss = -Σ_i [ y_i · log(s_i) + (1 - y_i) · log(1 - s_i) ]
in the formula: i represents the i-th preset category; y_i represents the labeling result corresponding to category i in the category labeling result of the video; s_i represents the output of the classification model for category i given the input.
For example, if the category labeling result of the video A is [0, 1, 0, 0], which indicates a binary classification task, the loss function of each feature and of the combined feature of the video A is calculated using the above-mentioned first calculation formula; if the output result obtained according to the feature 1 of the video A is [0.3, 0.8, 0.2, 0.4], the loss function corresponding to the feature 1 is: -log(0.8).
If the category labeling result of the video A is [0, 1, 1, 0], which indicates a multi-label classification task, the loss function of each feature and of the combined feature of the video A is calculated using the second calculation formula; if the output result obtained according to the feature 1 of the video A is [0.3, 0.8, 0.2, 0.4], the loss function corresponding to the feature 1 is: [-log(0.7)] + [-log(0.8)] + [-log(0.2)] + [-log(0.6)].
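The two formulas can be checked against the running example with a few lines of Python; this is only a verification sketch, not the patent's implementation.

```python
# Verification sketch for the two loss formulas above.
import math

def binary_task_loss(labels, outputs):
    # loss = -sum_i y_i * log(s_i): only the labeled category contributes.
    return -sum(y * math.log(s) for y, s in zip(labels, outputs))

def multilabel_task_loss(labels, outputs):
    # loss = -sum_i [ y_i*log(s_i) + (1 - y_i)*log(1 - s_i) ]
    return -sum(y * math.log(s) + (1 - y) * math.log(1 - s)
                for y, s in zip(labels, outputs))

outputs = [0.3, 0.8, 0.2, 0.4]
print(binary_task_loss([0, 1, 0, 0], outputs))      # -log(0.8) ≈ 0.223
print(multilabel_task_loss([0, 1, 1, 0], outputs))  # -log(0.7) - log(0.8) - log(0.2) - log(0.6) ≈ 2.700
```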
In 105, a loss function of the classification model is determined according to the loss functions corresponding to the features and the combined features of the same video, and parameters of the classification model are adjusted by using the loss function of the classification model to obtain a video classification model.
In this step, a loss function of the classification model is determined according to the loss function of each feature of the same video and the loss function of the combined feature of the video, which are obtained in step 104, and then parameters of the classification model are adjusted by using the determined loss function of the classification model to obtain the video classification model.
That is to say, in this step, in addition to the loss contributed by the combined feature corresponding to the video, the loss contributed by each feature extracted from the video can be obtained, and the problem that the obtained combined feature loses the information of the original feature is avoided, so that the classification model can better consider the contributions of all the features in the video.
Specifically, in this step, when determining the loss function of the classification model according to the loss functions corresponding to the features and the combined features of the same video, the following method may be adopted: and determining the sum of the loss functions of all the characteristics of the same video and the loss functions of the combined characteristics as the loss function of the classification model.
For example, if the loss function obtained for feature 1 of video A is loss_1, the loss function obtained for feature 2 is loss_2, the loss function obtained for feature 3 is loss_3, and the loss function obtained for feature 4 (the combined feature) is loss_4, then the loss function of the classification model is: loss_1 + loss_2 + loss_3 + loss_4.
It can be understood that, when training the classification model in this step, the training objective is to minimize the loss function of the classification model. Minimizing the loss function of the classification model may include: the loss function values obtained over a preset number of iterations are equal, or the differences between the loss function values obtained over a preset number of iterations are less than or equal to a preset threshold, and so on.
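A hedged sketch of one training step under this scheme is shown below: the same classification model scores every single feature and the combined feature of a video, the per-input losses are summed into the model loss, and the parameters are updated to minimize it. The classifier, the optimizer, and the assumption that all inputs share one dimension carry over from the earlier sketches and are not prescribed by the patent.

```python
# Training-step sketch (assumed: model outputs per-category probabilities, multi-label setting).
import torch
import torch.nn.functional as F

def training_step(model, optimizer, feature_list, combined_feature, label_vector):
    """feature_list: the single features of one video; combined_feature: its combined feature."""
    optimizer.zero_grad()
    labels = torch.tensor(label_vector, dtype=torch.float32)
    total_loss = 0.0
    for x in feature_list + [combined_feature]:
        probs = model(x)
        # Per-input loss: the multi-label formula above (summed binary cross-entropy).
        total_loss = total_loss + F.binary_cross_entropy(probs, labels, reduction="sum")
    total_loss.backward()      # model loss = sum of the per-feature and combined-feature losses
    optimizer.step()
    return total_loss.item()
```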
After the classification model is trained, the video classification model is obtained. By using the video classification model, the probability set of the video belonging to each preset category can be obtained according to the input combination characteristics of the video to be classified, and the category of the video to be classified is further determined according to the obtained probability set.
Fig. 2 is a flowchart of a method for video classification according to an embodiment of the present invention, as shown in fig. 2, the method includes:
in 201, a video to be classified is acquired.
In this step, the video to be classified is acquired. It can be understood that the acquired video to be classified may be a video shot by the user through a terminal in real time, a video stored locally on the user's terminal, or a video selected by the user from the Internet through the terminal; in this step, after the video to be classified is obtained, the video is sent to the server side for subsequent processing.
In 202, preset types of features are extracted from the video to be classified, and combined features corresponding to the video to be classified are constructed based on the extracted features.
In this step, a preset type of feature is extracted from the video to be classified acquired in step 201, and a combined feature corresponding to the video to be classified is constructed based on the extracted feature. In the step, the server side can extract and construct the features according to the acquired video to be classified.
The preset type of features extracted from the video to be classified in the step are at least two of the features of visual features of the video, audio features of the video, character features corresponding to audio in the video, features of subtitles and drama names in the video, human face features in the video, features of a video cover picture, features of a label to which the video belongs, features of a video title, and features of a video knowledge base. The extraction method for each feature is described above, and is not described herein.
Specifically, in this step, when constructing the combined feature corresponding to the video to be classified based on the extracted features, the following method may be adopted: splicing all the features extracted from the videos to be classified to obtain a splicing result; acquiring a combination relation among all features extracted from a video to be classified; according to the attention of each feature extracted from the video to be classified to other features, adjusting each extracted feature to obtain each adjusted feature; and splicing the obtained splicing result, the combination relation among the characteristics and the adjusted characteristics, and taking the splicing result as the combination characteristic corresponding to the video to be classified.
In 203, the combined features corresponding to the video to be classified are used as the input of a video classification model, and the category of the video to be classified is determined according to the output result of the video classification model.
In this step, the combined features corresponding to the video to be classified constructed in step 202 are input into a video classification model obtained by pre-training, and the category of the video to be classified is determined according to the output result of the video classification model. And the output result of the video classification model is a probability set of the video to be classified belonging to each preset category.
Specifically, when determining the category of the video to be classified according to the output result of the video classification model, the following method may be adopted: determining a probability value which is greater than a preset threshold value in an output result of the video classification model; and determining the category corresponding to the determined probability value as the category to which the video to be classified belongs.
For example, if the video to be classified is video B and the combined feature constructed for video B is feature 5, the feature 5 is input into the video classification model obtained by pre-training; if the obtained output result is [0.3, 0.7, 0.8, 0.2] and the preset threshold is 0.5, the categories corresponding to the probability values 0.7 and 0.8 are determined as the categories to which the video B belongs.
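The thresholding described above amounts to a one-line filter; the sketch below reproduces the video-B example under the assumed 0.5 threshold.

```python
# Sketch of the category-determination step for the video B example.
def predict_categories(probabilities, categories, threshold=0.5):
    return [c for c, p in zip(categories, probabilities) if p > threshold]

print(predict_categories([0.3, 0.7, 0.8, 0.2],
                         ["category 1", "category 2", "category 3", "category 4"]))
# -> ['category 2', 'category 3']
```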
Fig. 3 is a block diagram of an apparatus for building a video classification model according to an embodiment of the present invention, as shown in fig. 3, the apparatus includes: a first acquisition unit 31, a first construction unit 32, a first processing unit 33, a first training unit 34 and a second training unit 35.
The first obtaining unit 31 is configured to obtain training data, where the training data includes videos and category labeling results corresponding to the videos.
The first obtaining unit 31 obtains each video and a category labeling result corresponding to each video as training data, and the obtained training data is used for training to obtain a video classification model. The first obtaining unit 31 may obtain the training data through a terminal or a server.
Specifically, the category labeling result corresponding to each video acquired by the first acquiring unit 31 is a label set of each category of the videos belonging to the preset category. The first obtaining unit 31 may manually label the label of the category to which the video belongs as "1" and label of other categories as "0".
The first constructing unit 32 is configured to extract features of preset types from the videos, respectively, and construct a combined feature corresponding to each video based on the extracted features.
The first constructing unit 32 extracts features of preset types from each video acquired by the first acquiring unit 31, and further constructs a combined feature corresponding to each video based on the extracted features. The first constructing unit 32 may be configured to perform feature extraction and construction on the basis of the acquired video.
Specifically, the preset type features extracted by the first constructing unit 32 are at least two of visual features of a video, audio features of a video, text features corresponding to audio in a video, features of subtitles and drama names in a video, face features in a video, features of a video cover picture, features of a tag to which a video belongs, features of a video title, and knowledge base features of a video.
It will be appreciated that the first construction element 32 may use existing techniques for extracting the predetermined type of features from the video. The present invention does not limit the manner in which the preset type of features in the video are extracted.
After extracting the preset type of features from each video, the first constructing unit 32 can perform fusion based on the features extracted from each video, so as to construct the combined features corresponding to each video.
Specifically, the first constructing unit 32 may adopt the following manner when constructing the combined feature corresponding to each video based on the extracted features: splicing all the features extracted from the video to obtain a splicing result; acquiring the combination relationship among all the features extracted from the video; adjusting each extracted feature according to the attention of that feature to the other extracted features, to obtain the adjusted features; and splicing the obtained splicing result, the combination relationship among the features and the adjusted features, and taking this splicing result as the combined feature corresponding to each video.
In the prior art, a method of directly splicing a plurality of features to obtain a combined feature is generally used, but the method may cause a problem that the obtained combined feature cannot sufficiently acquire information contained in each feature, so that the effect of combining the features is not obvious enough. By using the combination method provided by the invention, the characteristics and the information contained in the characteristics can be fully acquired, so that the combined characteristics with more obvious combination effect among the characteristics can be acquired.
It is understood that the first constructing unit 32 may also directly use at least one of the above-mentioned splicing result, the combination relationship among the features, and the adjusted features as the combination feature corresponding to each video; the first constructing unit 32 may further use the splicing result, the combination relationship between the features, and the adjusted splicing result of any two of the features as the combination feature corresponding to each video.
The first processing unit 33 is configured to take each feature extracted from the video and the corresponding combined feature of the video as input of a classification model, and obtain an output result of the classification model for each feature and combined feature of the video.
The first processing unit 33 takes each feature extracted from the video by the first constructing unit 32 and the combined feature corresponding to each constructed video as input of the classification model, and obtains an output result output by the classification model for each feature and combined feature of the video.
The output result of the classification model obtained by the first processing unit 33 is a probability set that the video corresponding to the input features belongs to each preset class.
It is to be understood that the classification model in the first processing unit 33 may be a support vector machine, a regression model or a deep learning model, and the invention does not limit the type of the classification model.
The first training unit 34 is configured to obtain loss functions corresponding to each feature and combined feature of the same video according to the output result of each feature and combined feature of the video and the class labeling result corresponding to the video of the classification model.
The first training unit 34 obtains the loss functions corresponding to the features and the combined features of the same video respectively, based on the class labeling result corresponding to each video obtained by the first obtaining unit 31 and the output result of the classification model of the first processing unit 33 for each feature and the combined feature of each video.
When obtaining the loss functions corresponding to the features and the combined features, the first training unit 34 needs to select a corresponding calculation formula according to the classification task to which the class labeling result corresponding to the video belongs.
Specifically, if the category labeling result corresponding to the video is a binary task, the calculation formula of the loss function is as follows:
loss = -Σ_i [ y_i · log(s_i) ]
in the formula: i represents the i-th preset category; y_i represents the labeling result corresponding to category i in the category labeling result of the video; s_i represents the output of the classification model for category i given the input.
If the category labeling result corresponding to the video is a multi-label classification task, the calculation formula of the loss function is as follows:
loss = -Σ_i [ y_i · log(s_i) + (1 - y_i) · log(1 - s_i) ]
in the formula: i represents the i-th preset category; y_i represents the labeling result corresponding to category i in the category labeling result of the video; s_i represents the output of the classification model for category i given the input.
And the second training unit 35 is configured to determine a loss function of the classification model according to the loss function corresponding to each feature and the combined feature of the same video, and adjust parameters of the classification model by using the loss function of the classification model to obtain a video classification model.
The second training unit 35 determines a loss function of the classification model according to the loss function of each feature of the same video and the loss function of the combined feature of the video, which are obtained by the first training unit 34, and further adjusts the parameters of the classification model by using the determined loss function of the classification model to obtain the video classification model.
That is to say, the second training unit 35 can obtain the loss contributed by each feature extracted from the video in addition to the loss contributed by the combined feature corresponding to the video, and avoid the problem that the obtained combined feature loses the information of the original feature, so that the classification model can better consider the contributions of all the features in the video.
Specifically, when the second training unit 35 determines the loss function of the classification model according to the loss functions corresponding to the features and the combined features of the same video, the following method may be adopted: and determining the sum of the loss functions of all the characteristics of the same video and the loss functions of the combined characteristics as the loss function of the classification model.
It is to be understood that the second training unit 35, when training the classification model, trains the objective to minimize the loss function of the classification model. The minimizing the loss function of the classification model may include: the loss functions obtained within the preset number of times are equal, or the difference between the loss functions obtained within the preset number of times is less than or equal to a preset threshold, and so on.
The second training unit 35 obtains the video classification model after training the classification model. By using the video classification model, the probability set of the video belonging to each preset category can be obtained according to the input combination characteristics of the video to be classified, and the category of the video to be classified is further determined according to the obtained probability set.
Fig. 4 is a block diagram of an apparatus for classifying videos according to an embodiment of the present invention, as shown in fig. 4, the apparatus includes: a second acquisition unit 41, a second construction unit 42 and a second processing unit 43.
And a second obtaining unit 41, configured to obtain a video to be classified.
The second acquisition unit 41 acquires the video to be classified. It can be understood that the acquired video to be classified may be a video shot by the user through a terminal in real time, a video stored locally on the user's terminal, or a video selected by the user from the Internet through the terminal; after acquiring the video to be classified, the second acquisition unit 41 sends the video to the server side for subsequent processing.
And a second constructing unit 42, configured to extract a preset type of feature from the video to be classified, and construct a combined feature corresponding to the video to be classified based on the extracted feature.
The second constructing unit 42 extracts features of a preset type from the video to be classified acquired in the second acquiring unit 41, and constructs a combined feature corresponding to the video to be classified based on the extracted features. The second constructing unit 42 may be configured to perform feature extraction and construction on the video to be classified by the server according to the acquired video to be classified.
The preset type features extracted from the video to be classified by the second constructing unit 42 are at least two of the features of a visual feature of the video, an audio feature of the video, a character feature corresponding to an audio in the video, a feature of a caption and a drama name in the video, a face feature in the video, a feature of a cover picture of the video, a feature of a label to which the video belongs, a feature of a video title, a knowledge base feature of the video, and the like. The extraction method for each feature is described above, and is not described herein.
Specifically, the second constructing unit 42 may adopt the following manner when constructing the combined feature corresponding to the video to be classified based on the extracted feature: splicing all the features extracted from the videos to be classified to obtain a splicing result; acquiring a combination relation among all features extracted from a video to be classified; according to the attention of each feature extracted from the video to be classified to other features, adjusting each extracted feature to obtain each adjusted feature; and splicing the obtained splicing result, the combination relation among the characteristics and the adjusted characteristics, and taking the splicing result as the combination characteristic corresponding to the video to be classified.
And the second processing unit 43 is configured to use the combined features corresponding to the video to be classified as input of a video classification model, and determine the category of the video to be classified according to the output result of the video classification model.
The second processing unit 43 inputs the combination features corresponding to the video to be classified, which are constructed by the second construction unit 42, into the video classification model obtained by pre-training, and determines the category of the video to be classified according to the output result of the video classification model. And the output result of the video classification model is a probability set of the video to be classified belonging to each preset category.
Specifically, when determining the category of the video to be classified according to the output result of the video classification model, the second processing unit 43 may adopt the following manner: determining a probability value which is greater than a preset threshold value in an output result of the video classification model; and determining the category corresponding to the determined probability value as the category to which the video to be classified belongs.
As shown in fig. 5, the computer system/server 012 is embodied as a general purpose computing device. The components of computer system/server 012 may include, but are not limited to: one or more processors or processing units 016, a system memory 028, and a bus 018 that couples various system components including the system memory 028 and the processing unit 016.
Bus 018 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer system/server 012 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 012 and includes both volatile and nonvolatile media, removable and non-removable media.
System memory 028 can include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 030 and/or cache memory 032. The computer system/server 012 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 034 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be connected to bus 018 via one or more data media interfaces. Memory 028 can include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the present invention.
Program/utility 040 having a set (at least one) of program modules 042 can be stored, for example, in memory 028, such program modules 042 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may include an implementation of a networking environment. Program modules 042 generally perform the functions and/or methodologies of embodiments of the present invention as described herein.
The computer system/server 012 may also communicate with one or more external devices 014 (e.g., keyboard, pointing device, display 024, etc.). In the present invention, the computer system/server 012 communicates with an external radar device, and may also communicate with one or more devices that enable a user to interact with the computer system/server 012, and/or with any device (e.g., network card, modem, etc.) that enables the computer system/server 012 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 022. Also, the computer system/server 012 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 020. As shown, the network adapter 020 communicates with the other modules of the computer system/server 012 via bus 018. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the computer system/server 012, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 016 executes programs stored in the system memory 028, thereby executing various functional applications and data processing, such as implementing the method flow provided by the embodiment of the present invention.
As time and technology develop, the meaning of media becomes increasingly broad, and the propagation path of a computer program is no longer limited to tangible media; for example, a program may also be downloaded directly from a network. Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
With the technical solution provided by the present invention, the loss function of the classification model is obtained from the loss function corresponding to each individual feature of a video together with the loss function corresponding to the combined feature of that video, and the video classification model is then obtained by training according to the obtained loss function. The contribution of every feature in the video is therefore taken into account during training, and the information loss that comes from relying on a single feature of the video is avoided; because richer information is used to train the classification model, the classification accuracy of the established video classification model is improved. A training sketch of this multi-loss objective is given below.
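The following is a minimal sketch of this objective, written with PyTorch as an assumption (the patent does not prescribe a framework). The names classification_model_loss, classifier, feature_list, and combined_feature are hypothetical, and cross-entropy merely stands in for whatever loss formula the classification task determines.

```python
import torch
import torch.nn.functional as F

def classification_model_loss(classifier, feature_list, combined_feature, labels):
    """Loss of the classification model: the sum of the loss for each
    individual feature and the loss for the combined feature of the same
    video. Inputs are batched tensors; `classifier` maps any feature
    tensor to per-category logits of shape (batch, num_categories)."""
    losses = []
    for feat in feature_list + [combined_feature]:
        logits = classifier(feat)                       # per-category scores
        losses.append(F.cross_entropy(logits, labels))  # one possible loss formula
    return torch.stack(losses).sum()

# The training objective is to minimize this loss, e.g.:
#   loss = classification_model_loss(model, feats, combined, labels)
#   loss.backward()
#   optimizer.step()
```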
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: various media capable of storing program codes, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (18)

1. A method of building a video classification model, the method comprising:
acquiring training data, wherein the training data comprises videos and category marking results corresponding to the videos;
extracting at least two preset types of features from each video respectively, and constructing a combined feature corresponding to each video based on each extracted feature;
respectively taking each feature extracted from the same video and the combined feature corresponding to the same video as the input of a classification model, and acquiring the output results of the classification model for each feature and the combined feature of the same video;
respectively acquiring loss functions corresponding to the features and the combined features of the same video according to the output results of the classification model for the features and the combined features of the same video and the category marking result corresponding to the same video;
and determining a loss function of the classification model according to the loss functions corresponding to the features and the combined features of the same video, and adjusting parameters of the classification model by using the loss function of the classification model to obtain the video classification model.
2. The method of claim 1, wherein the constructing the corresponding combined features for the videos based on the extracted features comprises:
splicing all the features extracted from the video to obtain a splicing result;
acquiring a combination relation among all features extracted from the video;
adjusting the extracted features according to the attention of each feature extracted from the video to the other features to obtain the adjusted features;
and splicing the splicing result, the combination relation among the features and the adjusted features to be used as the combined feature corresponding to the video.
3. The method according to claim 1, wherein the obtaining the loss functions corresponding to the features and the combined features of the same video respectively comprises:
determining a classification task to which each video belongs according to a category marking result corresponding to each video;
determining a calculation formula of a loss function according to the classification task to which each video belongs;
and respectively acquiring the loss functions corresponding to the features and the combined features of the same video according to the determined calculation formula of the loss function.
4. The method according to claim 1, wherein the determining the loss function of the classification model according to the loss functions corresponding to the features and the combined features of the same video comprises:
and determining the sum of the loss functions of all the features of the same video and the loss function of the combined feature as the loss function of the classification model.
5. The method of claim 1, further comprising:
the training objective of the classification model is to minimize the loss function of the classification model.
6. A method of video classification, the method comprising:
acquiring a video to be classified;
extracting preset type features from the video to be classified, and constructing combined features corresponding to the video to be classified based on the extracted features;
the combined features corresponding to the video to be classified are used as the input of a video classification model, and the category of the video to be classified is determined according to the output result of the video classification model;
the video classification model is constructed in advance by the method for establishing a video classification model according to any one of claims 1 to 5.
7. The method according to claim 6, wherein the constructing the corresponding combined features of the video to be classified based on the extracted features comprises:
splicing the features extracted from the video to be classified to obtain a splicing result;
acquiring a combination relation among the features extracted from the video to be classified;
adjusting the extracted features according to the attention of each feature extracted from the video to be classified to other features, and acquiring each adjusted feature;
and splicing the splicing result, the combination relation among the features and the adjusted features to be used as the combined feature corresponding to the video to be classified.
8. The method according to claim 6, wherein the determining the category of the video to be classified according to the output result of the video classification model comprises:
determining a probability value which is greater than a preset threshold value in an output result of the video classification model;
and determining the category corresponding to the determined probability value as the category to which the video to be classified belongs.
9. An apparatus for building a video classification model, the apparatus comprising:
the first acquisition unit is used for acquiring training data, wherein the training data comprises videos and category marking results corresponding to the videos;
the first constructing unit is used for respectively extracting at least two preset types of features from each video and constructing a combined feature corresponding to each video based on each extracted feature;
the first processing unit is used for respectively taking each feature extracted from the same video and the combined feature corresponding to the same video as the input of a classification model and acquiring the output results of the classification model for each feature and the combined feature of the same video;
the first training unit is used for respectively acquiring loss functions corresponding to the features and the combined features of the same video according to the output results of the classification model for the features and the combined features of the same video and the category marking result corresponding to the same video;
and the second training unit is used for determining a loss function of the classification model according to the loss functions corresponding to the features and the combined features of the same video, and adjusting parameters of the classification model by using the loss function of the classification model to obtain the video classification model.
10. The apparatus according to claim 9, wherein the first constructing unit, when constructing the combined feature corresponding to each video based on the extracted features, specifically performs:
splicing all the features extracted from the video to obtain a splicing result;
acquiring a combination relation among all features extracted from the video;
adjusting the extracted features according to the attention of each feature extracted from the video to the other features to obtain the adjusted features;
and splicing the splicing result, the combination relation among the features and the adjusted features to be used as the combined feature corresponding to the video.
11. The apparatus according to claim 9, wherein the first training unit, when obtaining the loss functions corresponding to the features and the combined features of the same video, specifically performs:
determining a classification task to which each video belongs according to a category marking result corresponding to each video;
determining a calculation formula of a loss function according to the classification task to which each video belongs;
and respectively acquiring the loss functions corresponding to the features and the combined features of the same video according to the determined calculation formula of the loss function.
12. The apparatus according to claim 9, wherein the second training unit, when determining the loss function of the classification model according to the loss functions corresponding to the features and the combined features of the same video, specifically performs:
and determining the sum of the loss functions of all the features of the same video and the loss function of the combined feature as the loss function of the classification model.
13. The apparatus of claim 9, wherein the second training unit further performs:
a training objective for the classification model is to minimize a loss function of the classification model.
14. An apparatus for video classification, the apparatus comprising:
the second acquisition unit is used for acquiring the video to be classified;
the second constructing unit is used for extracting features of a preset type from the video to be classified and constructing the combined feature corresponding to the video to be classified based on the extracted features;
the second processing unit is used for taking the combined feature corresponding to the video to be classified as the input of a video classification model and determining the category of the video to be classified according to the output result of the video classification model;
the video classification model is constructed in advance according to the device for establishing the video classification model in any one of claims 9-13.
15. The apparatus according to claim 14, wherein the second constructing unit specifically performs, when constructing the combined feature corresponding to the video to be classified based on the extracted feature:
splicing the features extracted from the video to be classified to obtain a splicing result;
acquiring a combination relation among the features extracted from the video to be classified;
adjusting the extracted features according to the attention of each feature extracted from the video to be classified to other features, and acquiring each adjusted feature;
and splicing the splicing result, the combination relation among the features and the adjusted features to be used as the combined feature corresponding to the video to be classified.
16. The apparatus according to claim 14, wherein the second processing unit, when determining the category of the video to be classified according to the output result of the video classification model, specifically performs:
determining a probability value which is greater than a preset threshold value in an output result of the video classification model;
and determining the category corresponding to the determined probability value as the category to which the video to be classified belongs.
17. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any one of claims 1 to 8.
18. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN201910461500.8A 2019-05-30 2019-05-30 Method and device for establishing video classification model and video classification Active CN110232340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910461500.8A CN110232340B (en) 2019-05-30 2019-05-30 Method and device for establishing video classification model and video classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910461500.8A CN110232340B (en) 2019-05-30 2019-05-30 Method and device for establishing video classification model and video classification

Publications (2)

Publication Number Publication Date
CN110232340A CN110232340A (en) 2019-09-13
CN110232340B true CN110232340B (en) 2021-01-22

Family

ID=67858724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910461500.8A Active CN110232340B (en) 2019-05-30 2019-05-30 Method and device for establishing video classification model and video classification

Country Status (1)

Country Link
CN (1) CN110232340B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717067B (en) * 2019-12-16 2020-05-05 北京海天瑞声科技股份有限公司 Method and device for processing audio clustering in video
CN112016586A (en) * 2020-07-08 2020-12-01 武汉智筑完美家居科技有限公司 Picture classification method and device
CN112165634B (en) * 2020-09-29 2022-09-16 北京百度网讯科技有限公司 Method for establishing audio classification model and method and device for automatically converting video
CN112418215A (en) * 2020-11-17 2021-02-26 峰米(北京)科技有限公司 Video classification identification method and device, storage medium and equipment
CN113032627A (en) * 2021-03-25 2021-06-25 北京小米移动软件有限公司 Video classification method and device, storage medium and terminal equipment
CN113378784B (en) * 2021-07-01 2022-06-07 北京百度网讯科技有限公司 Training method of video label recommendation model and method for determining video label
CN113449700B (en) * 2021-08-30 2021-11-23 腾讯科技(深圳)有限公司 Training of video classification model, video classification method, device, equipment and medium
CN114064973B (en) * 2022-01-11 2022-05-03 人民网科技(北京)有限公司 Video news classification model establishing method, classification method, device and equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8452778B1 (en) * 2009-11-19 2013-05-28 Google Inc. Training of adapted classifiers for video categorization
CN105975594A (en) * 2016-05-09 2016-09-28 清华大学 Sentiment classification method and device based on combined feature vector and SVM[perf] (Support Vector Machine)
CN107169455B (en) * 2017-05-16 2020-08-28 中山大学 Face attribute recognition method based on depth local features
CN109190482B (en) * 2018-08-06 2021-08-20 北京奇艺世纪科技有限公司 Multi-label video classification method and system, and system training method and device

Also Published As

Publication number Publication date
CN110232340A (en) 2019-09-13

Similar Documents

Publication Publication Date Title
CN110232340B (en) Method and device for establishing video classification model and video classification
CN109117777B (en) Method and device for generating information
US11151406B2 (en) Method, apparatus, device and readable storage medium for image-based data processing
CN110245259B (en) Video labeling method and device based on knowledge graph and computer readable medium
CN111709406B (en) Text line identification method and device, readable storage medium and electronic equipment
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
CN109271542A (en) Cover determines method, apparatus, equipment and readable storage medium storing program for executing
CN110942011B (en) Video event identification method, system, electronic equipment and medium
CN110188766B (en) Image main target detection method and device based on convolutional neural network
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN112650867B (en) Picture matching method and device, electronic equipment and storage medium
CN107948730B (en) Method, device and equipment for generating video based on picture and storage medium
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN112149663A (en) RPA and AI combined image character extraction method and device and electronic equipment
CN112613293B (en) Digest generation method, digest generation device, electronic equipment and storage medium
CN113762303B (en) Image classification method, device, electronic equipment and storage medium
CN114218945A (en) Entity identification method, device, server and storage medium
CN111986259B (en) Training of pigment and text detection model, auditing method of video data and related device
CN111477212A (en) Content recognition, model training and data processing method, system and equipment
CN109657127A (en) A kind of answer acquisition methods, device, server and storage medium
CN111460224B (en) Comment data quality labeling method, comment data quality labeling device, comment data quality labeling equipment and storage medium
CN112822506A (en) Method and apparatus for analyzing video stream
CN112542163B (en) Intelligent voice interaction method, device and storage medium
CN116484864A (en) Data identification method and related equipment
CN113177479B (en) Image classification method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant