CN110781960B - Training method, classification method, device and equipment of video classification model - Google Patents

Training method, classification method, device and equipment of video classification model

Info

Publication number
CN110781960B
Authority
CN
China
Prior art keywords
video
classification model
data set
label data
soft
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911025860.XA
Other languages
Chinese (zh)
Other versions
CN110781960A (en)
Inventor
尹康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201911025860.XA priority Critical patent/CN110781960B/en
Publication of CN110781960A publication Critical patent/CN110781960A/en
Application granted granted Critical
Publication of CN110781960B publication Critical patent/CN110781960B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The application discloses a training method, a classification method, a device and equipment for a video classification model. The method comprises the following steps: acquiring a coarse label data set; obtaining a first classification model and a second classification model, wherein the classification precision of the second classification model is higher than that of the first classification model; calling the second classification model to predict the soft labels of the videos in the coarse label data set to obtain a soft label data set, wherein a soft label is a label which adopts probability to represent the category of the video; and carrying out fine-tuning training on the first classification model according to the soft label data set to obtain the video classification model. The soft label data set is generated by a machine instead of being obtained by manual labeling, which solves the problems of high cost and low efficiency of manual labeling.

Description

Training method, classification method, device and equipment of video classification model
Technical Field
The present application relates to the field of computer vision, and in particular, to a training method, a classification method, an apparatus, and a device for a video classification model.
Background
Automatic understanding of video content has become a key technology for many application scenarios, such as autopilot, video-based search, and intelligent robotics, among others. Video tag classification through machine learning is one way to automatically understand video content.
In the related art, a video is encoded into a series of feature vectors, including visual features and audio features, and the feature vectors are input into a trained deep learning model to obtain a label corresponding to the video. The label is a video-level label. Typically, the deep learning model is trained on the Youtube-8M dataset, a large tagged video dataset that includes 6.1 million videos and 3862 classes.
Since the prediction accuracy of a deep learning model depends heavily on the volume of the data set and the accuracy of its labels, manual labeling is attractive because it markedly improves label accuracy; however, manual labeling is costly and inefficient, and the labeling difficulty increases further as the number of categories grows.
Disclosure of Invention
The embodiments of the application provide a training method, a classification method, a device and equipment for a video classification model, which can solve the problem that manual labeling, while markedly improving label accuracy, is costly and inefficient. The technical scheme is as follows:
according to an aspect of the present application, there is provided a training method of a video classification model, the method including:
Acquiring a coarse label data set;
obtaining a first classification model and a second classification model, wherein the classification precision of the second classification model is higher than that of the first classification model;
calling a second classification model to predict a soft label of the video in the coarse label data set to obtain a soft label data set, wherein the soft label is a label which adopts probability to represent the category of the video;
performing fine tuning training on a first classification model according to the soft label data set to obtain the video classification model;
wherein the second classification model has a higher classification accuracy than the first classification model.
According to another aspect of the present application, there is provided a video classification method, the method including:
acquiring a video to be classified;
extracting the features of the video to obtain a feature vector of the video;
calling a video classification model to predict the feature vector to obtain a classification label of the video; the video classification model is obtained by performing fine tuning training on a first classification model according to a soft label data set, the soft label data set is obtained by calling a second classification model to predict soft labels of the videos in the coarse label data set, and the soft labels are labels which represent the categories of the videos by adopting probability;
Wherein the second classification model has a higher classification accuracy than the first classification model.
According to another aspect of the present application, there is provided an apparatus for training a video classification model, the apparatus comprising:
the sample acquisition module is used for acquiring a coarse label data set;
the model obtaining module is used for obtaining a first classification model and a second classification model, and the classification precision of the second classification model is higher than that of the first classification model;
the soft label prediction module is used for calling the second classification model to predict the soft label of the video in the coarse label data set to obtain a soft label data set, wherein the soft label is a label which adopts probability to represent the category of the video;
and the fine tuning training module is used for performing fine tuning training on the first classification model according to the soft label data set to obtain the video classification model.
According to another aspect of the present application, there is provided a video classification apparatus, the apparatus including:
the acquisition module is used for acquiring videos to be classified;
the extraction module is used for extracting the features of the video to obtain the feature vector of the video;
the calling module is used for calling a video classification model to predict the feature vector to obtain a label of the video; the video classification model is obtained by performing fine tuning training on a first classification model according to a soft label data set, the soft label data set is obtained by calling a second classification model to predict soft labels of the videos in the coarse label data set, and the soft labels are labels which represent the categories of the videos by adopting probability;
Wherein the second classification model has a higher classification accuracy than the first classification model.
According to another aspect of the present application, there is provided a computer device including: a processor and a memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement a method of training a video classification model as described above, or a method of video classification as in the above aspects.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or a set of instructions, which is loaded and executed by a processor to implement the method of training a video classification model as described above, or the method of video classification as described above.
The embodiment of the application has at least the following beneficial effects:
the soft labels of the videos in the coarse label data set are predicted by calling the second classification model to obtain a soft label data set, and fine-tuning training is performed on the first classification model according to the soft label data set to obtain a video classification model. Because the classification precision of the second classification model is higher than that of the first classification model, the label accuracy of the soft label data set is superior to that of the coarse label data set, so the prediction accuracy of the first classification model can be improved in the fine-tuning training process, yielding a video classification model with higher accuracy. Meanwhile, the soft label data set is generated by a machine instead of being obtained by manual labeling, which solves the problems of high cost and low efficiency of manual labeling.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for training a video classification model according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a method for training a video classification model according to an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a method for training a video classification model according to another exemplary embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a method for training a video classification model according to another exemplary embodiment of the present application;
FIG. 5 is a flowchart of a method for training a video classification model according to another exemplary embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a method for training a video classification model according to another exemplary embodiment of the present application;
FIG. 7 is a flow chart of a video classification method provided by another illustrative embodiment of the present application;
FIG. 8 is a block diagram of an apparatus for training a video classification model according to another exemplary embodiment of the present application;
FIG. 9 is a block diagram of a video classification apparatus provided in another illustrative embodiment of the present application;
FIG. 10 is a block diagram of a computer device provided in another illustrative embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, a number of technical terms provided in the embodiments of the present application are summarized:
Youtube-8M video understanding challenge: a video understanding competition sponsored by Kaggle and Google, which requires a machine learning model smaller than 1 GB to perform a video label classification task. It is held once a year; it has been held twice, with the third edition held in 2019.
Youtube-8M dataset: a large tagged data set containing 6.1 million videos and 3862 classes (or tags). The data set encodes each original video as a series of feature vectors, including visual features and audio features. These features are extracted from frames sampled from the original video at a frequency of 1 Hz and are generated by a pre-trained convolutional neural network. The dataset includes three levels of datasets:
-frame level datasets
The frame-level data set is data obtained by performing feature acquisition in units of "frames". Each frame corresponds to a respective label.
-video level datasets
The feature vector corresponding to a video in the video-level data set is obtained by averaging the feature-vector sequence of that video in the frame-level data set. Each video corresponds to a respective tag.
In the first two data sets, the labels are generated by two strategies, automatic machine labeling and manual labeling, so the accuracy is low. According to the technical report of the Youtube-8M dataset, the label accuracy and recall rate of the frame-level data set are only 78.8% and 14.5%, respectively.
Hard label: a label denoted by 0 or 1. For example, for a certain label category, a value of 1 represents that the video belongs to that category, and a value of 0 represents that the video does not belong to that category.
Soft label: a label denoted by a probability between 0 and 1. For example, for a certain label category, a value of 0.68 represents a 68% probability that the video belongs to that category.
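To make the two label types concrete, the following snippet contrasts them for a small multi-label example; the category names and values are purely illustrative, not taken from the disclosure.

```python
# Illustrative contrast between hard and soft labels for a 4-category
# multi-label task (category names and values are hypothetical).
categories = ["sports", "music", "travel", "cooking"]

hard_label = [1, 0, 0, 1]              # 0/1 membership per category
soft_label = [0.92, 0.05, 0.11, 0.68]  # probability per category

for name, h, s in zip(categories, hard_label, soft_label):
    print(f"{name}: hard={h}, soft={s:.2f}")
```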
Fig. 1 shows a flowchart of a training method of a video classification model according to an exemplary embodiment of the present application. The method can be applied to computer equipment. The method comprises the following steps:
102, acquiring a coarse label data set;
the coarse label dataset includes: correspondence between video and tags. The length of the video in the bold label dataset is greater than a first time period, such as 20 seconds. The tags in the coarse tag dataset are hard tags. In one example, the number of videos in the coarse label dataset is a first number, which is large, such as millions of videos, tens of millions of videos, or billions of videos. The coarse label dataset may be the Youtube-8M dataset.
All or a portion of the hard tags in the set of coarse tag data are machine labeled, such as by video title, video comment, category manually given by the user while watching the video, and so forth.
104, acquiring a first classification model and a second classification model, wherein the classification precision of the second classification model is higher than that of the first classification model;
illustratively, the first classification model is a classification model trained from the coarse label data set, and the second classification model is a classification model trained from a manually labeled fine label data set. The fine label dataset includes correspondences between videos and manual labels. The accuracy of manually labeled tags is much higher than that of the hard labels in the coarse label dataset.
In one example, the fine label dataset is manually tagged with a portion of the video in the coarse label dataset. The videos in the fine label dataset are a subset of the videos in the coarse label dataset. The number of videos in the fine label dataset is a second number, which is small, such as hundreds of thousands of videos.
In one example, the fine label dataset may also be a number of additionally collected videos. In this case, the videos in the fine label data set may have no intersection with the videos in the coarse label data set, or only a small intersection.
Step 106, calling the second classification model to predict soft labels for the videos in the coarse label data set to obtain a soft label data set, wherein a soft label is a label that uses a probability to express the category of the video;
and step 108, performing fine-tuning training on the first classification model according to the soft label data set to obtain a video classification model.
As shown in fig. 2, there are a first classification model 21 with poor classification accuracy and a second classification model 23 with good classification accuracy. Because the number of videos in the coarse label data set 22 is large, the high-accuracy second classification model 23 is used to predict the videos in the coarse label data set 22, so that a soft label data set 24 with high accuracy can be obtained, and the first classification model 21 is subjected to fine-tuning training 26 with the soft label data set 24 to obtain a video classification model 25.
Since the accuracy of the first classification model after fine-tuning training is higher than before the fine-tuning training, step 108 may be performed multiple times in order to optimize the first classification model as far as possible.
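A minimal sketch of one round of this pipeline is given below; the model objects and their predict/fine_tune methods are hypothetical placeholders standing in for the first and second classification models, not interfaces defined by the disclosure.

```python
# One round of the Fig. 2 pipeline; the predict/fine_tune interfaces are
# assumed placeholders for the first and second classification models.
def distill_once(first_model, second_model, coarse_videos):
    # Relabel the large coarse label data set with the more accurate
    # second model, producing probabilistic (soft) labels.
    soft_label_set = [(v, second_model.predict(v)) for v in coarse_videos]
    # Fine-tune the weaker first model on the machine-generated labels.
    return first_model.fine_tune(soft_label_set)
```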
In summary, in the method provided by this embodiment, the second classification model is called to predict the soft labels of the videos in the coarse label data set to obtain a soft label data set, and the first classification model is fine-tuned according to the soft label data set to obtain the video classification model. Because the classification precision of the second classification model is higher, the label accuracy of the soft label data set is superior to that of the coarse label data set, so fine-tuning improves the prediction accuracy of the first classification model. Meanwhile, the soft label data set is generated by a machine instead of being obtained by manual labeling, which solves the problems of high cost and low efficiency of manual labeling.
Fig. 3 is a flowchart illustrating a method for training a video classification model according to another exemplary embodiment of the present application. The method can be applied to computer equipment. The method comprises the following steps:
step 301, acquiring a coarse label data set;
The coarse label dataset includes correspondences between videos and labels. The length of a video in the coarse label dataset is greater than a first time period, such as 20 seconds. The labels in the coarse label dataset are hard labels. In one example, the number of videos in the coarse label dataset is a first number, which is large, such as millions, tens of millions, or even billions of videos.
All or a portion of the hard labels in the coarse label dataset are machine-labeled, for example based on the video title, video comments, categories given manually by users while watching the video, and so on.
In one example, the coarse label dataset is a video-level dataset in the Youtube-8M dataset.
Step 302, training a first classification model according to the coarse label data set;
For each video in the coarse label dataset, a feature vector is extracted for the video, the feature vector comprising at least one of a visual feature and an auditory feature. The feature vector of the video is input into the first classification model for prediction to obtain a predicted label of the video. The first classification model is then trained with an error back-propagation algorithm according to the error between the predicted label and the hard label of the video.
Optionally, extracting a feature vector for the video includes: sampling video frames in the video at a preset frequency (1 Hz or 24 Hz), extracting a frame feature vector for each video frame with a two-dimensional convolutional neural network, and performing feature fusion on the frame feature vectors of the plurality of video frames to obtain the feature vector of the video. The frame feature vectors may be extracted with InceptionNet, provided by Google, or with the mobile-side lightweight network MobileNet; feature fusion may employ a NetVLAD network.
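A minimal sketch of this feature-extraction step is shown below; a hash-seeded random vector stands in for the CNN features, and mean pooling stands in for the NetVLAD fusion, so all dimensions and helpers are assumptions for illustration.

```python
import numpy as np

FRAME_DIM = 1024  # assumed per-frame feature dimensionality

def cnn_frame_features(frame):
    """Stand-in for InceptionNet/MobileNet per-frame features."""
    seed = abs(hash(frame.tobytes())) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(FRAME_DIM)

def video_feature_vector(frames):
    """Fuse per-frame features into one video-level vector; mean pooling
    is used here in place of the NetVLAD fusion described above."""
    return np.stack([cnn_frame_features(f) for f in frames]).mean(axis=0)

# Usage with dummy frames, as if sampled at 1 Hz from a 25-second video:
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(25)]
print(video_feature_vector(frames).shape)  # (1024,)
```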
Referring collectively to FIG. 4, the first classification model may be a pre-trained model and may employ a NetVLAD model.
Step 303, randomly selecting videos in the coarse label data set to obtain a candidate video subset;
In the coarse label dataset, a portion of the videos (or video segments) is randomly extracted as the candidate video subset. In one example, more than 10% of the videos are randomly drawn as the candidate video subset. In another example, each video in the coarse label data set is equally divided into several video segments, and more than 10% of the video segments are randomly extracted as the candidate video subset. Using video segments as the candidate video subset improves the model's ability to classify short videos.
Step 304, manually labeling videos in the candidate video subset to obtain a fine label data set;
Exemplarily, the manual labeling method includes: for each video (or video segment) in the candidate video subset, the computer device asks the annotator whether a certain label applies to the video (or video segment) (labels that are not queried need not be annotated), so that the annotation difficulty and the probability of wrong or missed annotations do not increase as the number of label categories grows; a sketch of this query strategy follows below. To increase the percentage of manually labeled positive samples, the labels to be queried may be selected from the coarse labels to which the video (or video segment) belongs, since each video may have more than one coarse label. The manually labeled data set is called the fine label data set; it includes correspondences between videos and manually labeled hard labels, has a small number of samples, and has high label precision. That is, the number of videos in the fine label data set is a second number smaller than the first number; for example, the second number is about 10% of the first number.
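A minimal sketch of this single-label query strategy, under the assumption of a hypothetical ask_annotator interface for the labeling tool:

```python
import random

def ask_annotator(video_id, label):
    """Hypothetical stand-in for the labeling UI: the annotator answers
    yes/no for one queried label, so labeling difficulty does not grow
    with the number of categories."""
    return True  # placeholder answer

def query_one_label(video_id, coarse_labels):
    # Draw the queried label from the video's own coarse labels to raise
    # the share of positive answers among the manual labels.
    label = random.choice(coarse_labels)
    return label, ask_annotator(video_id, label)

print(query_one_label("vid_0001", [1, 3]))
```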
305, training the first classification model according to the fine label data set to obtain a second classification model;
For each video in the fine label dataset, a feature vector is extracted for the video, the feature vector comprising at least one of a visual feature and an auditory feature. The feature vector of the video is input into the first classification model for prediction to obtain a predicted label of the video. The model is then trained with an error back-propagation algorithm according to the error between the predicted label of the video and the manually labeled hard label, to obtain a second classification model.
With combined reference to FIG. 4, the first classification model may be the pre-trained model and the second classification model may be model A. Since the second classification model is trained from the manually labeled fine label dataset, the accuracy of the second classification model is higher than that of the first classification model.
Step 306, calling the second classification model to predict the soft labels of the videos in the coarse label data set to obtain a soft label data set, wherein a soft label is a label that uses a probability to represent the category to which the video belongs;
In one example, the second classification model is called to perform soft label prediction on the videos in the massive coarse label data set to obtain the soft label of each video. The correspondence between videos and soft labels is stored as the soft label data set.
In another example, each video in the massive coarse label data set is equally divided into a plurality of video segments, and the second classification model is called to perform soft label prediction on the massive video segments (a part of the video segments can be randomly extracted instead of all of them) to obtain the soft label of each video segment. The correspondence between video segments and soft labels is stored as the soft label data set.
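A minimal sketch of the first (whole-video) variant of step 306 is given below; the linear-plus-sigmoid "second model" and the feature dimensionality are illustrative assumptions.

```python
import numpy as np

NUM_CLASSES = 3862   # class count of the Youtube-8M dataset
FEATURE_DIM = 1152   # assumed video-level feature dimensionality

rng = np.random.default_rng(0)
W = rng.standard_normal((FEATURE_DIM, NUM_CLASSES)) * 0.01  # stand-in weights

def second_model_predict(feature):
    """Hypothetical second model: one linear layer plus sigmoid, so each
    class gets a probability in [0, 1] (the soft label vector)."""
    return 1.0 / (1.0 + np.exp(-(feature @ W)))

def build_soft_label_dataset(video_features):
    """Pair each video with its predicted soft label vector."""
    return [(idx, second_model_predict(f))
            for idx, f in enumerate(video_features)]

soft_dataset = build_soft_label_dataset(
    [rng.standard_normal(FEATURE_DIM) for _ in range(10)])
print(len(soft_dataset), soft_dataset[0][1].shape)  # 10 (3862,)
```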
The label accuracy of the soft label dataset is higher than the label accuracy of the coarse label dataset and equal to or lower than the label accuracy of the fine label dataset.
Step 307, training the first classification model according to the soft label data set to obtain an ith fine-tuning classification model, wherein the initial value of i is 1;
For each video in the soft label dataset, a feature vector is extracted for the video, the feature vector comprising at least one of a visual feature and an auditory feature. The feature vector of the video is input into the first classification model for prediction to obtain a predicted label of the video. The model is then trained with an error back-propagation algorithm according to the error between the predicted label and the soft label of the video, to obtain the 1st fine-tuning classification model.
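The disclosure only specifies error back-propagation; one common choice of error for probabilistic targets, shown below as an assumption, is binary cross-entropy computed against the soft labels, which reduces to the usual hard-label loss when the targets are exactly 0 or 1.

```python
import numpy as np

def soft_label_bce(pred, soft_target):
    """Binary cross-entropy between predicted probabilities and soft
    labels; an assumed loss, as the text only names back-propagation."""
    eps = 1e-7
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(soft_target * np.log(p)
                          + (1 - soft_target) * np.log(1 - p)))

pred = np.array([0.80, 0.10, 0.30])  # model outputs
soft = np.array([0.71, 0.05, 0.24])  # teacher soft labels as targets
print(f"{soft_label_bce(pred, soft):.4f}")
```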
Referring collectively to fig. 4, the 1st fine-tuning classification model may be model B. Since model B is trained on the 1st soft label dataset, the accuracy of model B is higher than that of the pre-trained model.
Step 308, calling the ith fine-tuning classification model to predict the soft labels of the videos in the coarse label data set to obtain an (i + 1) th soft label data set;
in one example, the ith fine-tuning classification model is called to perform soft label prediction on videos in a massive coarse label data set, so as to obtain a soft label of each video. And storing the corresponding relation between the video and the soft label as the (i + 1) th soft label data set.
In another example, a video in a massive coarse label data set is equally divided into a plurality of video segments, and an ith fine-tuning classification model is called to perform soft label prediction on the massive video segments (a part of the video segments can be randomly extracted instead of all the video segments), so as to obtain a soft label of each video segment. And storing the corresponding relation between the video clip and the soft label as the (i + 1) th soft label data set.
The label accuracy of the (i + 1)-th soft label data set is higher than that of the i-th soft label data set.
Step 309, performing fine tuning training on the first classification model according to the (i + 1) th soft label data set to obtain an (i + 1) th fine tuning classification model;
For each video in the soft label dataset, a feature vector is extracted for the video, the feature vector comprising at least one of a visual feature and an auditory feature. The feature vector of the video is input into the i-th fine-tuning classification model for prediction to obtain a predicted label of the video. Training with an error back-propagation algorithm according to the error between the predicted label and the soft label of the video then yields the (i + 1)-th fine-tuning classification model.
Referring collectively to FIG. 4, the 2nd fine-tuning classification model may be model C. Since model C is trained on the 2nd soft label dataset, the accuracy of model C is higher than that of model B.
Step 310, when i + 1 is smaller than the threshold n, setting i = i + 1 and then performing the previous two steps again;
When i + 1 is smaller than the threshold n, steps 308 and 309 are performed again after setting i = i + 1. In the present embodiment, the threshold n is 2, but n may take different values in different embodiments. According to the inventor's experiments, n is preferably 2 or 3; more than 3 rounds of fine-tuning training bring no obvious performance improvement to the classification model.
And step 311, when i + 1 is equal to n, determining the (i + 1)-th fine-tuning classification model as the video classification model.
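Putting steps 307-311 together, a sketch of the progressive fine-tuning loop follows; the predict/fine_tune interfaces are hypothetical placeholders, with n = 2 as in this embodiment.

```python
# Sketch of the progressive fine-tuning loop of steps 307-311, assuming
# hypothetical predict/fine_tune model interfaces.
def progressive_fine_tuning(first_model, second_model, coarse_videos, n=2):
    # Step 307: train on the second model's soft labels (i starts at 1).
    soft_set = [(v, second_model.predict(v)) for v in coarse_videos]
    tuned = first_model.fine_tune(soft_set)
    for i in range(1, n):
        # Step 308: the i-th fine-tuned model relabels the coarse set,
        # producing the (i + 1)-th soft label data set.
        soft_set = [(v, tuned.predict(v)) for v in coarse_videos]
        # Step 309: fine-tune the first model on it for the (i + 1)-th model.
        tuned = first_model.fine_tune(soft_set)
    # Step 311: once i + 1 == n, the latest model is the video classifier.
    return tuned
```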
In summary, in the method provided in this embodiment, the second classification model is called to predict the soft label of the video in the coarse label data set to obtain the soft label data set, and the first classification model is subjected to fine tuning training according to the soft label data set to obtain the video classification model. Meanwhile, the soft label data set is generated by a machine instead of being obtained by adopting a manual labeling mode, and the problems of high cost and low efficiency of manual labeling are solved.
The method provided by this embodiment performs the whole training process based on a "massive but inaccurate" coarse label data set together with a "small but accurate" manually labeled fine label data set as the original data, which reduces the dependence on manual labeling work as much as possible while improving the classification accuracy of the model as much as possible.
The method provided by this embodiment further adopts at least two rounds of progressive fine-tuning training, so that the label accuracy of the soft label data set, and with it the classification accuracy of the model, improves progressively. With 2 or 3 rounds, the classification accuracy of the model can be improved as much as possible under a limited amount of computation.
It should be noted that the first classification model may also be a classification model that is not pre-trained. But the final classification effect will be worse than the pre-trained classification model.
In an alternative embodiment, referring to fig. 5 and 6 in combination, step 306 may include the following steps:
Step 306a, segmenting a video in the coarse label data set to obtain a plurality of video segments of the video;
A video in the coarse label data set is randomly equally divided into 5-10 video segments. In one example, the video frames in the coarse label data set are sampled at 1 Hz, a number from 5-10 is randomly selected as the number of groups M, and the sampled video frames are equally divided by the group number M to obtain M video segments of the video.
For example, in FIG. 6, 25 video frames are sampled from a video: frames 0-4 form the first video segment, frames 5-9 the second, frames 10-14 the third, frames 15-19 the fourth, and frames 20-24 the fifth.
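A minimal sketch of this splitting step, assuming the frames have already been sampled at 1 Hz:

```python
import random

def split_into_segments(frames, m=None):
    """Step 306a sketch: split a frame sequence into M nearly equal
    segments, with M drawn at random from 5-10 when not given."""
    m = m if m is not None else random.randint(5, 10)
    size, extra = divmod(len(frames), m)
    segments, start = [], 0
    for i in range(m):
        end = start + size + (1 if i < extra else 0)
        segments.append(frames[start:end])
        start = end
    return segments

# The FIG. 6 example: 25 sampled frames split into 5 segments of 5.
print([len(s) for s in split_into_segments(list(range(25)), m=5)])
```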
Step 306b, randomly extracting k x D video segments from the plurality of video segments of each video, wherein D is the number of coarse labels of the video in the coarse label data set, and k and D are integers;
Each video has at least one label in the coarse label dataset. FIG. 6 illustrates a video with 2 labels, "1" and "3". From the 5 video segments of the video, 2 video segments are randomly extracted (or 4 video segments, an integer multiple of the 2 labels).
Step 306c, for the i-th group of video segments, calling the second classification model to predict the probability that each video segment in the i-th group belongs to the i-th coarse label of the video, wherein the i-th group of video segments comprises k video segments, and i is an integer not greater than D;
For the randomly extracted 3rd video segment (the 1st group of video segments), the second classification model is called to predict the probability that the 3rd video segment belongs to the video's coarse label "1", and the probability 0.71 is taken as the soft label of the 3rd video segment.
For the randomly extracted 4th video segment (the 2nd group of video segments), the second classification model is called to predict the probability that the 4th video segment belongs to the video's coarse label "3", and the probability 0.24 is taken as the soft label of the 4th video segment.
Fig. 6 illustrates that each group of video segments includes one video segment, but in other embodiments, each group of video segments may include 2, 3 or more video segments, which is not repeated herein.
And step 306d, determining all randomly extracted video clips and corresponding probabilities as a soft label data set.
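A minimal sketch of steps 306b-306d, with a hypothetical predict_prob scorer standing in for the second classification model:

```python
import random

def predict_prob(segment, label):
    """Stand-in for the second classification model's probability that a
    segment belongs to one given coarse label (hypothetical scorer)."""
    return random.random()

def soft_label_entries(segments, coarse_labels, k=1):
    # Step 306b: draw k * D segments, D being the number of coarse labels.
    drawn = random.sample(segments, k * len(coarse_labels))
    entries = []
    for i, label in enumerate(coarse_labels):
        # Step 306c: the i-th group of k segments is scored against the
        # i-th coarse label of the video.
        for seg in drawn[i * k:(i + 1) * k]:
            # Step 306d: keep (segment, label, probability) triples.
            entries.append((seg, label, predict_prob(seg, label)))
    return entries

# The FIG. 6 example: 5 segments, coarse labels "1" and "3", k = 1.
segments = [tuple(range(j * 5, j * 5 + 5)) for j in range(5)]
print(soft_label_entries(segments, coarse_labels=[1, 3]))
```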
It should be noted that the (i + 1)-th soft label data set in step 308 may be generated in a manner similar to this embodiment.
Fig. 7 is a flowchart illustrating a video classification method according to an exemplary embodiment of the present application. The method may be performed by a computer device, the method comprising:
step 701, acquiring a video to be classified;
step 702, extracting the characteristics of the video to obtain a characteristic vector of the video;
Illustratively, video frames in the video are sampled at a preset frequency (1 Hz or 24 Hz), a frame feature vector is extracted for each video frame with a two-dimensional convolutional neural network, and feature fusion is performed on the frame feature vectors of the plurality of video frames to obtain the feature vector of the video. The frame feature vectors may be extracted with InceptionNet or the mobile-side lightweight network MobileNet; feature fusion may employ a NetVLAD network.
Step 703, calling a video classification model to predict the feature vector to obtain a video label;
the video classification model is obtained by performing fine tuning training on a first classification model according to a soft label data set, the soft label data set is obtained by calling a second classification model to predict soft labels of videos in a coarse label data set, and the soft labels are labels which represent the categories of the videos by adopting probabilities. And the classification precision of the second classification model is higher than that of the first classification model.
The video classification model is trained according to the above embodiment.
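A minimal sketch of this classification flow, with stand-in components for the feature extractor and the trained model (the 0.5 decision threshold and all dimensions are illustrative assumptions):

```python
import numpy as np

def classify_video(frames, extract_feature, video_model, threshold=0.5):
    """Steps 701-703: extract the feature vector, predict per-class
    probabilities, and threshold them into predicted labels."""
    feature = extract_feature(frames)
    probs = video_model(feature)
    return np.flatnonzero(probs >= threshold).tolist()

# Usage with stand-in components:
rng = np.random.default_rng(1)
extract = lambda frames: rng.standard_normal(1152)  # fused feature vector
model = lambda feature: rng.uniform(size=3862)      # class probabilities
print(classify_video([None] * 25, extract, model)[:5])
```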
In the following, embodiments of the apparatus of the present application are referred to, and details not described in detail in the embodiments of the apparatus refer to the embodiments of the method described above.
Fig. 8 is a block diagram of an apparatus for training a video classification model according to an exemplary embodiment of the present application, the apparatus including:
a sample acquisition module 820 for acquiring a coarse label data set;
a model obtaining module 840, configured to obtain a first classification model and a second classification model, where the classification accuracy of the second classification model is higher than that of the first classification model;
a soft label prediction module 860, configured to invoke the second classification model to predict a soft label of the video in the coarse label data set, so as to obtain a soft label data set, where the soft label is a label that represents a category to which the video belongs by using probability;
And the fine tuning training module 880 is configured to perform fine tuning training on the first classification model according to the soft label data set, so as to obtain the video classification model.
In one embodiment, the first classification model is a classification model trained from the coarse label dataset and the second classification model is a classification model trained from a manually labeled fine label dataset.
In an embodiment, the fine tuning training module 880 is configured to train the first classification model according to the soft label data set to obtain an i-th fine-tuning classification model, where the initial value of i is 1; call the i-th fine-tuning classification model to predict the soft labels of the videos in the coarse label data set to obtain an (i + 1)-th soft label data set; perform fine-tuning training on the first classification model according to the (i + 1)-th soft label data set to obtain an (i + 1)-th fine-tuning classification model; when i + 1 is smaller than the threshold n, set i = i + 1 and then perform the two preceding steps again; and when i + 1 is equal to n, determine the (i + 1)-th fine-tuning classification model as the video classification model.
In one embodiment, the apparatus further comprises:
And the model obtaining module 840 is configured to obtain a first classification model according to the coarse label data set.
In one embodiment, the apparatus further comprises:
a model obtaining module 840, configured to randomly select videos in the coarse label dataset to obtain a candidate video subset; manually labeling the videos in the candidate video subset to obtain a fine label data set; and training the first classification model according to the fine label data set to obtain the second classification model.
In an embodiment, the model obtaining module 840 is configured to segment videos in the candidate video subset to obtain a plurality of video segments of the videos; for a plurality of video segments of each video, randomly extracting m video segments; and manually labeling the m video segments to obtain the fine label data set.
In one embodiment, the soft label prediction module 860 is configured to segment a video in the coarse label data set to obtain a number of video segments of the video; for the plurality of video segments of each video, randomly extract k x D video segments, where D is the number of coarse labels of the video in the coarse label data set, and k and D are integers; for the i-th group of video segments, call the second classification model to predict the probability that each video segment in the i-th group belongs to the i-th coarse label of the video, where the i-th group of video segments comprises k video segments and i is an integer not greater than D; and determine all randomly extracted video segments and corresponding probabilities as the soft label data set.
Fig. 9 is a block diagram of a video classification apparatus provided in an exemplary embodiment of the present application, the apparatus including:
an obtaining module 920, configured to obtain a video to be classified;
an extracting module 940, configured to perform feature extraction on the video to obtain a feature vector of the video;
a calling module 960, configured to call a video classification model to predict the feature vector, so as to obtain a label of the video; the video classification model is obtained by performing fine tuning training on a first classification model according to a soft label data set, the soft label data set is obtained by calling a second classification model to predict soft labels of the videos in the coarse label data set, and the soft labels are labels which represent the categories of the videos by adopting probability;
wherein the second classification model has a higher classification accuracy than the first classification model.
In one embodiment, the first classification model is a classification model trained from the coarse label dataset and the second classification model is a classification model trained from a manually labeled fine label dataset.
In one embodiment, the video classification model is trained by the following steps:
Training the first classification model according to the soft label data set to obtain an ith fine-tuning classification model, wherein the initial value of i is 1;
calling the ith fine-tuning classification model to predict the soft label of the video in the coarse label data set to obtain an (i + 1) th soft label data set;
performing fine tuning training on the first classification model according to the (i + 1) th soft label data set to obtain an (i + 1) th fine tuning classification model;
when the i + 1 is smaller than the threshold n, setting i = i + 1 and then performing the two steps again;
when the i +1 is equal to the n, determining the i +1 th fine-tuning classification model as the video classification model.
The application further provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the video classification model training method or the video classification method provided by the above method embodiments. It should be noted that the computer device may be a computer device as provided in fig. 10 below.
Referring to fig. 10, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. Specifically, the computer apparatus 1000 includes a Central Processing Unit (CPU) 1001, a system memory 1004 including a Random Access Memory (RAM) 1002 and a Read Only Memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The computer device 1000 also includes a basic input/output system (I/O system) 1006, which facilitates the transfer of information between devices within the computer, and a mass storage device 1007, which stores an operating system 1013, application programs 1014, and other program modules 1010.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse or keyboard, for user input of information. The display 1008 and the input device 1009 are both connected to the central processing unit 1001 via an input-output controller 1010 connected to the system bus 1005. The basic input/output system 1006 may also include the input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1010 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1007 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1001, the one or more programs containing instructions for implementing the video classification model training method or the video classification method described above, and the central processing unit 1001 executes the one or more programs to implement the video classification model training method or the video classification method provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1000 may also run by connecting to a remote computer on a network through a network such as the Internet. That is, the computer device 1000 may be connected to the network 1012 through the network interface unit 1011 connected to the system bus 1005, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1011.
The memory further includes one or more programs, which are stored in the memory and include instructions for performing the training method of the video classification model or the video classification method provided in the embodiments of the present application.
The embodiment of the present application further provides a computer device, where the computer device includes a memory and a processor, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded by the processor and implements the video classification model training method or the video classification method.
Embodiments of the present application further provide a computer-readable storage medium, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the computer-readable storage medium, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the video classification model training method or the video classification method described above.
The present application further provides a computer program product, which when running on a computer, causes the computer to execute the training method of the video classification model or the video classification method provided by the above-mentioned method embodiments.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A method for training a video classification model, the method comprising:
acquiring a coarse label data set;
obtaining a first classification model and a second classification model, wherein the classification precision of the second classification model is higher than that of the first classification model;
calling the second classification model to predict the soft label of the video in the coarse label data set to obtain a soft label data set, wherein the soft label is a label which adopts probability to represent the category of the video;
training the first classification model according to the soft label data set to obtain an ith fine tuning classification model, wherein the initial value of i is 1;
calling the ith fine-tuning classification model to predict the soft label of the video in the coarse label data set to obtain an (i + 1) th soft label data set;
performing fine tuning training on the first classification model according to the (i + 1) th soft label data set to obtain an (i + 1) th fine tuning classification model;
When the i + 1 is smaller than the threshold n, after i is set to i + 1, the two steps are executed again, and n is an integer larger than 1;
when the i +1 is equal to the n, determining the i +1 th fine-tuning classification model as the video classification model.
2. The method of claim 1, wherein the first classification model is a classification model trained from the coarse label dataset and the second classification model is a classification model trained from a manually labeled fine label dataset.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
randomly selecting the videos in the coarse label data set to obtain a candidate video subset;
manually labeling the videos in the candidate video subset to obtain a fine label data set;
and training the first classification model according to the fine label data set to obtain the second classification model.
4. The method of claim 3, wherein the manually labeling the videos in the candidate video subset to obtain the fine label data set comprises:
segmenting videos in the candidate video subset to obtain a plurality of video segments of the videos;
For a plurality of video segments of each video, randomly extracting m video segments, wherein m is an integer greater than 1;
and manually labeling the m video segments to obtain the fine label data set.
5. The method of claim 1 or 2, wherein said invoking a second classification model to predict soft labels of said video in said coarse label dataset to obtain a soft label dataset comprises:
segmenting the video in the coarse label data set to obtain a plurality of video segments of the video;
for a plurality of video segments of each video, randomly extracting k x D video segments, wherein D is the number of coarse labels of the video in the coarse label data set, and k and D are integers;
calling the second classification model to predict the probability that each video clip in the ith group of video clips belongs to the ith coarse label of the video for the ith group of video clips, wherein the ith group of video clips comprises k video clips, and i is an integer not greater than D;
and determining all randomly extracted video clips and corresponding probabilities as the soft label data set.
6. A method for video classification, the method comprising:
Acquiring videos to be classified;
extracting the features of the video to obtain a feature vector of the video;
calling a video classification model to predict the feature vector to obtain a label of the video; the video classification model is obtained by performing fine tuning training on a first classification model according to a soft label data set, the soft label data set is obtained by calling a second classification model to predict soft labels of videos in a coarse label data set, and the soft labels are labels which represent the categories of the videos by adopting probability;
wherein the second classification model has a higher classification accuracy than the first classification model;
the video classification model is obtained by training through the following steps:
training the first classification model according to the soft label data set to obtain an ith fine-tuning classification model, wherein the initial value of i is 1;
calling the ith fine-tuning classification model to predict the soft label of the video in the coarse label data set to obtain an (i + 1) th soft label data set;
performing fine tuning training on the first classification model according to the (i + 1) th soft label data set to obtain an (i + 1) th fine tuning classification model;
When the i + 1 is smaller than the threshold n, setting i to i + 1 and then repeating the two steps, wherein n is an integer larger than 1;
when the i +1 is equal to the n, determining the i +1 th fine-tuning classification model as the video classification model.
7. The method of claim 6, wherein the first classification model is a classification model trained from the coarse label dataset and the second classification model is a classification model trained from a manually labeled fine label dataset.
8. An apparatus for training a video classification model, the apparatus comprising:
the sample acquisition module is used for acquiring a coarse label data set;
the model obtaining module is used for obtaining a first classification model and a second classification model, and the classification precision of the second classification model is higher than that of the first classification model;
the soft label prediction module is used for calling the second classification model to predict the soft label of the video in the coarse label data set to obtain a soft label data set, wherein the soft label is a label which adopts probability to represent the category of the video;
the fine tuning training module is used for training the first classification model according to the soft label data set to obtain an ith fine tuning classification model, and the initial value of i is 1; calling the ith fine-tuning classification model to predict the soft labels of the videos in the coarse label data set to obtain an (i + 1) th soft label data set; performing fine tuning training on the first classification model according to the (i + 1) th soft label data set to obtain an (i + 1) th fine tuning classification model; when the i + 1 is smaller than the threshold n, setting i to i + 1 and then performing the two steps again, n being an integer larger than 1; when the i + 1 is equal to the n, determining the (i + 1) th fine-tuning classification model as the video classification model.
9. An apparatus for video classification, the apparatus comprising:
the acquisition module is used for acquiring videos to be classified;
the extraction module is used for extracting the features of the video to obtain a feature vector of the video;
the calling module is used for calling a video classification model to predict the feature vector to obtain a label of the video; the video classification model is obtained by performing fine tuning training on a first classification model according to a soft label data set, the soft label data set is obtained by calling a second classification model to predict soft labels of videos in a coarse label data set, and the soft labels are labels which represent the categories of the videos by adopting probability;
wherein the second classification model has a higher classification accuracy than the first classification model;
the video classification model is obtained by training through the following steps:
training the first classification model according to the soft label data set to obtain an ith fine tuning classification model, wherein the initial value of i is 1;
calling the ith fine-tuning classification model to predict the soft label of the video in the coarse label data set to obtain an (i + 1) th soft label data set;
Performing fine tuning training on the first classification model according to the (i + 1) th soft label data set to obtain an (i + 1) th fine tuning classification model;
when the i + 1 is smaller than the threshold n, setting i to i + 1 and then repeating the two steps, wherein n is an integer larger than 1;
when the i +1 is equal to the n, determining the i +1 th fine-tuning classification model as the video classification model.
10. A computer device, characterized in that the computer device comprises: a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the method of training a video classification model as claimed in any one of claims 1 to 5 above, or the method of video classification as claimed in claim 6 or 7 above.
11. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement a method of training a video classification model as claimed in any one of claims 1 to 5 above or a method of video classification as claimed in claim 6 or 7 above.
CN201911025860.XA 2019-10-25 2019-10-25 Training method, classification method, device and equipment of video classification model Active CN110781960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911025860.XA CN110781960B (en) 2019-10-25 2019-10-25 Training method, classification method, device and equipment of video classification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911025860.XA CN110781960B (en) 2019-10-25 2019-10-25 Training method, classification method, device and equipment of video classification model

Publications (2)

Publication Number Publication Date
CN110781960A CN110781960A (en) 2020-02-11
CN110781960B true CN110781960B (en) 2022-06-28

Family

ID=69386725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911025860.XA Active CN110781960B (en) 2019-10-25 2019-10-25 Training method, classification method, device and equipment of video classification model

Country Status (1)

Country Link
CN (1) CN110781960B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401474B (en) * 2020-04-13 2023-09-08 Oppo广东移动通信有限公司 Training method, device, equipment and storage medium for video classification model
CN112052357B (en) * 2020-04-15 2022-04-01 上海摩象网络科技有限公司 Video clip marking method and device and handheld camera
CN111753790B (en) * 2020-07-01 2023-12-12 武汉楚精灵医疗科技有限公司 Video classification method based on random forest algorithm
CN113392864A (en) * 2020-10-13 2021-09-14 腾讯科技(深圳)有限公司 Model generation method, video screening method, related device and storage medium
CN113033631A (en) * 2021-03-09 2021-06-25 北京百度网讯科技有限公司 Model incremental training method and device
CN113378895B (en) * 2021-05-24 2024-03-01 成都欧珀通信科技有限公司 Classification model generation method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677735A (en) * 2015-12-30 2016-06-15 腾讯科技(深圳)有限公司 Video search method and apparatus
CN106598970A (en) * 2015-10-14 2017-04-26 阿里巴巴集团控股有限公司 Tag determination method, equipment and system
CN108229527A (en) * 2017-06-29 2018-06-29 北京市商汤科技开发有限公司 Training and video analysis method and apparatus, electronic equipment, storage medium, program
CN109740018A (en) * 2019-01-29 2019-05-10 北京字节跳动网络技术有限公司 Method and apparatus for generating video tab model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10382770B2 (en) * 2017-02-06 2019-08-13 Google Llc Multi-level machine learning-based early termination in partition search for video encoding


Also Published As

Publication number Publication date
CN110781960A (en) 2020-02-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant