CN111444878A - Video classification method and device and computer readable storage medium - Google Patents

Video classification method and device and computer readable storage medium Download PDF

Info

Publication number
CN111444878A
Authority
CN
China
Prior art keywords
video
classification
sample set
training sample
augmented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010272792.3A
Other languages
Chinese (zh)
Other versions
CN111444878B (en)
Inventor
尹康
吴宇斌
郭烽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010272792.3A priority Critical patent/CN111444878B/en
Publication of CN111444878A publication Critical patent/CN111444878A/en
Application granted granted Critical
Publication of CN111444878B publication Critical patent/CN111444878B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a video classification method, a video classification device and a computer-readable storage medium, wherein the video classification method comprises the following steps: acquiring an original training sample set comprising a plurality of video samples marked with classification labels; selecting video sample combinations and the corresponding classification labels from the original training sample set for weighted fusion to obtain an augmented training sample set; inputting the video samples in the augmented training sample set into a neural network for training to obtain a video classification model; and classifying videos to be classified based on the video classification model. Through implementation of this scheme, the original video samples and their classification labels are fused by weighted fusion in the model training stage to obtain the augmented training sample set, which guarantees the scale and diversity of the training sample set while effectively reducing the operational complexity of constructing it and improving the realizability of its construction.

Description

Video classification method and device and computer readable storage medium
Technical Field
The present application relates to the field of electronic technologies, and in particular, to a video classification method and apparatus, and a computer-readable storage medium.
Background
As a fundamental task in the field of computer vision, video classification has long been a research focus in the industry. With the continuous development of hardware such as high-definition video equipment, artificial intelligence solutions based on video classification technology are widely applied to video interest recommendation, video security, smart home, and other areas, and the range of application scenarios is extremely broad.
In practical applications, compared with an image classification model that classifies single-frame images, a video classification model must capture the correlation among multiple input image frames and therefore requires a larger model structure, which in turn demands a larger amount of training data during model training. At present, however, the training data set is usually constructed by manual labeling, so constructing it is operationally complex and its realizability is poor.
Disclosure of Invention
The embodiment of the application provides a video classification method, a video classification device and a computer-readable storage medium, which can at least solve the problems of high operation complexity and poor realizability caused by performing class marking on training data required by a video classification model in a manual marking mode in the related art.
A first aspect of an embodiment of the present application provides a video classification method, including:
acquiring an original training sample set comprising a plurality of video samples marked with classification labels;
selecting a video sample combination and the corresponding classification label from the original training sample set for weighted fusion to obtain an augmented training sample set; wherein the sample size of the augmented training sample set is larger than that of the original training sample set;
inputting the video samples in the augmented training sample set into a neural network for training to obtain a video classification model;
and classifying the video to be classified based on the video classification model.
A second aspect of the embodiments of the present application provides a video classification apparatus, including:
the system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring an original training sample set comprising a plurality of video samples marked with classification labels;
the augmentation module is used for selecting a video sample combination and the corresponding classification label from the original training sample set to carry out weighted fusion to obtain an augmented training sample set; wherein the sample size of the augmented training sample set is larger than that of the original training sample set;
the training module is used for inputting the video samples in the augmented training sample set into a neural network for training to obtain a video classification model;
and the classification module is used for classifying the video to be classified based on the video classification model.
A third aspect of embodiments of the present application provides an electronic apparatus, including: a memory, a processor, and a bus; the bus is used for realizing the connection communication between the memory and the processor; a processor for executing a computer program stored on the memory; when the processor executes the computer program, the steps in the video classification method provided by the first aspect of the embodiment of the present application are implemented.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the video classification method provided in the first aspect of the embodiments of the present application.
As can be seen from the above, according to the video classification method, apparatus, and computer-readable storage medium provided in the present application, an original training sample set comprising a plurality of video samples marked with classification labels is obtained; video sample combinations and the corresponding classification labels are selected from the original training sample set for weighted fusion to obtain an augmented training sample set; the video samples in the augmented training sample set are input into a neural network for training to obtain a video classification model; and videos to be classified are classified based on the video classification model. Through implementation of this scheme, the original video samples and their classification labels are fused by weighted fusion in the model training stage to obtain the augmented training sample set, which guarantees the scale and diversity of the training sample set while effectively reducing the operational complexity of constructing it and improving the realizability of its construction.
Drawings
Fig. 1 is a schematic basic flowchart of a video classification method according to a first embodiment of the present application;
fig. 2 is a flowchart illustrating a specific video classification method according to a first embodiment of the present application;
fig. 3 is a schematic flowchart of a training sample augmentation method according to a first embodiment of the present disclosure;
fig. 4 is a schematic diagram of sample weighted fusion according to a first embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating a model training method according to a first embodiment of the present disclosure;
fig. 6 is a schematic flowchart of a model testing method according to a first embodiment of the present application;
fig. 7 is a schematic flowchart of a refinement method of a video classification method according to a second embodiment of the present application;
fig. 8 is a schematic diagram illustrating program modules of a video classification apparatus according to a third embodiment of the present application;
fig. 9 is a schematic diagram illustrating program modules of another video classification apparatus according to a third embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to overcome the defects of high operation complexity and poor realizability caused by performing class marking on training data required by a video classification model in a manual marking mode in the related art, a first embodiment of the application provides a video classification method. As shown in fig. 1, which is a basic flowchart of a video classification method provided in this embodiment, the video classification method includes the following steps:
step 101, obtaining an original training sample set comprising a plurality of video samples marked with classification labels.
Specifically, in practical application, the neural network is trained under a supervised learning framework, so that training samples need to be obtained in the embodiment, and the neural network is trained based on different training samples. Wherein each sample in each sample set is provided with a classification label for representing the category of each sample, such as drama, war, psychology, comedy, etc.
It should be noted that, in this embodiment, the original training sample set is a batch of video samples with manually labeled categories, acquired by the user through self-collection and labeling or by downloading a public data set; the original training sample set is a small-scale training sample set.
In some embodiments of this embodiment, in order to ensure the accuracy of the subsequently trained model, after the original training sample set is obtained, the video samples in it may be subjected to an adjustment process. First, each video sample is uniformly sampled at a preset sampling frequency fs, where fs may preferably be 0.5 Hz. Then each sampled image frame is scaled so that its long side has a preset length W, and its short side is padded to W with black pixels (RGB value (0, 0, 0)); in this embodiment, W may be 512 pixels.
In addition, in order to increase the speed of subsequently reading the video samples, the adjusted video samples may be stored as binary files in this embodiment; the binary file format may be TFRecord, which can effectively improve the efficiency of the subsequent model training. A sketch of this adjustment step follows.
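As a concrete illustration, the following minimal sketch samples a video at fs = 0.5 Hz and letterboxes every kept frame to W × W with black padding. It assumes OpenCV and NumPy; the function name preprocess_video and the loop details are illustrative, not part of the disclosed method.

```python
import cv2
import numpy as np

def preprocess_video(path, fs=0.5, W=512):
    """Uniformly sample a video at fs Hz and letterbox every kept frame to
    W x W, padding the short side with black pixels (RGB (0, 0, 0))."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if unknown
    step = max(int(round(native_fps / fs)), 1)      # keep every step-th frame

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            h, w = frame.shape[:2]
            scale = W / max(h, w)                   # scale the long side to W
            frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
            canvas = np.zeros((W, W, 3), dtype=np.uint8)  # black padding
            canvas[:frame.shape[0], :frame.shape[1]] = frame
            frames.append(canvas)
        idx += 1
    cap.release()
    return np.stack(frames)  # (N, W, W, 3) array, ready for serialization
```

Serializing the returned array, for example into the TFRecord format mentioned above, would then speed up repeated reads during training.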
Step 102, selecting a video sample combination and a corresponding classification label from the original training sample set for weighted fusion to obtain an augmented training sample set.
Specifically, the sample size of the augmented training sample set of the present embodiment is larger than that of the original training sample set, that is, the augmented training sample set contains more samples. In the present embodiment, "augmentation" may be understood as adding and expanding: the augmented training sample set is a large-scale training sample set obtained by expanding the samples of the original training sample set.
In this embodiment, before selecting a video sample combination and performing weighted fusion of the corresponding classification labels, the training samples in the original training sample set may be preprocessed. A first preset value M and a second preset value H are set, where M may preferably be 12 and H may preferably be 448 pixels. First, the video samples in the original training sample set are randomly cropped in the spatial dimension: an H × H sub-image region is randomly selected within the W × W image frame. Then the video samples are randomly cropped in the temporal dimension: let the original video sample have N frames; if M is less than N, M consecutive frames are randomly selected from the original video; if M is greater than N, M − N pure black frames (RGB value (0, 0, 0)) of H × H pixels are appended after the original video sample; and if M equals N, no operation is performed. This cropping and padding is sketched below.
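A minimal sketch of the spatial and temporal random cropping, assuming NumPy and a clip already adjusted to shape (N, W, W, 3); the function name is illustrative.

```python
import numpy as np

def random_crop_or_pad(video, M=12, H=448):
    """Randomly crop a (N, W, W, 3) clip to (M, H, H, 3), padding with pure
    black frames (RGB (0, 0, 0)) when the clip has fewer than M frames."""
    N, W = video.shape[0], video.shape[1]

    # Spatial dimension: pick a random H x H window inside the W x W frame.
    top = np.random.randint(0, W - H + 1)
    left = np.random.randint(0, W - H + 1)
    video = video[:, top:top + H, left:left + H]

    # Temporal dimension: M consecutive frames, or black-frame padding.
    if M < N:
        start = np.random.randint(0, N - M + 1)
        video = video[start:start + M]
    elif M > N:
        pad = np.zeros((M - N, H, H, 3), dtype=video.dtype)
        video = np.concatenate([video, pad], axis=0)
    return video  # (M, H, H, 3); unchanged in time when M == N
```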
Step 103, inputting the video samples in the augmented training sample set into a neural network for training to obtain a video classification model.
Specifically, the present embodiment implements video classification based on a deep learning algorithm, wherein the neural network used may include any one of a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), and a Recurrent Neural Network (RNN). In this embodiment, based on the training samples in the augmented training sample set, a certain optimization algorithm is adopted to train the neural network in a specific training environment; the learning rate and the number of training iterations may be determined according to actual requirements and are not specifically limited here. It should be understood that the neural network of this embodiment may be chosen according to the operating scenario of the algorithm; for example, in a scenario insensitive to the running time of the classification algorithm, a neural network with higher structural complexity may be adopted to guarantee algorithm performance, thereby improving the accuracy of the final classification result. An illustrative model structure is sketched below.
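Since the embodiment leaves the concrete backbone open (DNN, CNN, or RNN), the following PyTorch sketch uses a deliberately small 3D CNN only to make the input/output contract concrete: a clip tensor of shape (batch, 3, M, H, H) in, n class logits out. It is a placeholder, not the patent's architecture.

```python
import torch.nn as nn

class SimpleVideoClassifier(nn.Module):
    """Toy stand-in for the video classification network: a tiny 3D CNN.
    Purely illustrative; the embodiment does not fix the architecture."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # (B, 3, M, H, H) in
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),                     # global spatiotemporal pooling
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))  # raw logits
```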
Step 104, classifying the video to be classified based on the video classification model.
Specifically, in this embodiment, the video to be classified is used as the input of the trained video classification model; the model predicts the category of the video to be classified and assigns it the corresponding classification label, thereby classifying the video. Because the video classification model of this embodiment is trained on the augmented training sample set, it has strong generalization capability, and the accuracy of its classification results is higher.
As shown in fig. 2, which is a schematic flow chart of a specific video classification method provided in this embodiment, in some embodiments of this embodiment, classifying videos to be classified based on a video classification model specifically includes the following steps:
step 201, preprocessing a video to be classified to obtain a plurality of video segments;
step 202, inputting a plurality of video segments into a video classification model to obtain a plurality of prediction classification label vectors;
step 203, determining the classification of the video to be classified based on the prediction classification label vector of which the maximum value of the classification label is greater than a preset threshold value among the plurality of prediction classification label vectors.
Further, preprocessing the video to be classified to obtain a plurality of video segments may include: uniformly sampling videos to be classified according to a preset sampling frequency; and equally dividing the sampled video to be classified according to the preset video segment length to obtain a plurality of video segments.
Specifically, in this embodiment, the video to be classified may first be adjusted by the method described above for the video samples in the original training sample set: the video to be classified is uniformly sampled at fs, each frame is adjusted to W × W, and finally scaled to H × H. Let the video to be classified have T frames in total. If T is smaller than M, the video is padded to M frames according to the preprocessing method described in the foregoing embodiment; if T is greater than M, the video is divided equally into segments of M frames each, and the last segment is padded to M frames if it is shorter than M frames; and if T equals M, the video forms a single segment of M frames.
In addition, in this embodiment, the m video segments obtained by preprocessing are input into the trained model to obtain m prediction vectors. If there are n classes in total to which the video to be classified may belong, the i-th prediction vector is denoted pred_i = {p_i_1, p_i_2, …, p_i_n}. The m prediction vectors are then traversed position by position: if the maximum of p_1_j, p_2_j, …, p_m_j is greater than the preset threshold t, the input video belongs to the j-th class; otherwise it does not. In this embodiment, t may preferably be set to 0.5. This decision rule is sketched below.
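The per-class maximum rule can be written compactly; the sketch below assumes the m prediction vectors are stacked into an (m, n) NumPy array of per-class scores.

```python
import numpy as np

def classify_segments(preds, t=0.5):
    """preds: (m, n) array of per-segment prediction vectors. Class j is
    assigned when the maximum of p_1_j, ..., p_m_j exceeds threshold t."""
    per_class_max = np.asarray(preds).max(axis=0)  # position-wise traversal
    return np.flatnonzero(per_class_max > t)       # indices of assigned classes
```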
Further, before uniformly sampling the video to be classified according to the preset sampling frequency, the method includes: obtaining the allowed time consumption of the classification operation corresponding to the video to be classified; and determining the sampling frequency according to the allowed time consumption of the classification operation.
Specifically, in this embodiment, the sampling frequencies used for uniform sampling may differ between scenarios. For example, the corresponding sampling frequency may be determined according to the allowed time consumption of the classification operation in each classification scenario; that is, in a scenario insensitive to the running time of the classification algorithm, a relatively high sampling frequency may be used. One way to pick the frequency is sketched below.
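The embodiment does not specify how the allowed time maps to a sampling frequency, so the sketch below assumes an invented linear cost model; cost_per_frame and the nominal clip length are placeholders chosen only to illustrate picking the highest frequency that fits the budget.

```python
def choose_sampling_frequency(allowed_seconds,
                              clip_seconds=60.0,
                              cost_per_frame=0.05,
                              fs_options=(0.25, 0.5, 1.0, 2.0)):
    """Return the highest candidate frequency whose estimated classification
    cost (frames sampled x assumed per-frame model time) fits the budget."""
    for fs in sorted(fs_options, reverse=True):
        if fs * clip_seconds * cost_per_frame <= allowed_seconds:
            return fs
    return min(fs_options)  # fall back to the cheapest option
```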
As shown in fig. 3, which is a flow diagram of the training sample augmentation method provided in this embodiment, in some embodiments of this embodiment, selecting a video sample combination and the corresponding classification label from the original training sample set for weighted fusion to obtain an augmented training sample set specifically includes the following steps:
step 301, randomly selecting two video samples from an original training sample set;
step 302, performing weighted fusion on the two video samples and the corresponding classification labels according to a preset weighted fusion formula to obtain augmented video samples correspondingly marked with the classification labels;
step 303, obtaining an augmented training sample set based on all the augmented video samples.
Specifically, fig. 4 shows a sample weighted-fusion diagram provided in this embodiment, and the weighted fusion formula of this embodiment is expressed as:

x = β · x1 + (1 − β) · x2
y = β · y1 + (1 − β) · y2

wherein x1 and x2 respectively represent the two video samples, y1 and y2 respectively represent the classification labels corresponding to the two video samples, x represents the augmented video sample, y represents the classification label corresponding to the augmented video sample, and β ~ B(a, a) represents a weight drawn from a Beta distribution with preset parameter a, where a may preferably be 0.4.
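In code, the weighted fusion is a single convex combination per sample pair; this sketch assumes float arrays for the clips and one-hot (or soft) label vectors, with a = 0.4 as the preferred Beta parameter.

```python
import numpy as np

def weighted_fusion(x1, y1, x2, y2, a=0.4):
    """Fuse two video clips and their label vectors with a weight beta
    drawn from a Beta(a, a) distribution, per the fusion formula above."""
    beta = np.random.beta(a, a)
    x = beta * x1 + (1.0 - beta) * x2  # element-wise blend of the clips
    y = beta * y1 + (1.0 - beta) * y2  # matching blend of the labels
    return x, y
```

Repeating this over randomly selected sample pairs and pooling the results yields the augmented training sample set of step 303.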
As shown in fig. 5, which is a schematic flow chart of a model training method provided in this embodiment, in some embodiments of this embodiment, inputting video samples in an augmented training sample set to a neural network for training, and obtaining a video classification model specifically includes the following steps:
step 501, inputting video samples in an augmented training sample set into a neural network for training to obtain a predicted classification label vector actually output by the iterative training;
step 502, comparing the classification label vector corresponding to the augmented training sample set with the prediction classification label vector by adopting a preset loss function;
step 503, when the comparison result meets a preset model convergence condition, determining the network model obtained by the iterative training as the trained video classification model.
Specifically, in this embodiment, the training process is repeated for multiple iterations of optimization. The output predicted by the neural network in each training pass and the classification label carried by the sample are used to compute a loss function (Loss Function); for example, if the CNN structure is MobileNet-V3 + NeXtVLAD, the loss function may be cross-entropy loss. Then, for example, the BP algorithm is used to update the trainable parameters in the network by reverse gradient propagation, adjusting parameters such as the weights of the neural network so as to reduce the loss function value of the next iteration. When the loss function value satisfies a preset standard, the model convergence condition is determined to be satisfied, that is, the training process of the whole neural network model is completed; otherwise, the next training iteration continues until the model convergence condition is satisfied. A sketch of such a training loop follows.
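A sketch of one possible training loop follows. The Adam optimizer, learning rate, and the soft cross-entropy form (needed because fused labels are weighted mixtures rather than hard class indices) are assumptions; the embodiment only requires some optimization algorithm driven by a loss function.

```python
import torch

def train(model, loader, epochs=10, lr=1e-3):
    """Iterative training: forward pass, loss against (possibly soft) label
    vectors, then reverse-gradient (BP) update of trainable parameters."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for clips, labels in loader:          # labels: (B, n) soft vectors
            log_probs = torch.log_softmax(model(clips), dim=1)
            loss = -(labels * log_probs).sum(dim=1).mean()  # soft cross-entropy
            opt.zero_grad()
            loss.backward()                   # backpropagate gradients
            opt.step()                        # reduce next iteration's loss
```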
As shown in fig. 6, which is a schematic flow chart of a model testing method provided in this embodiment, in some embodiments of this embodiment, after inputting video samples in an augmented training sample set to a neural network for training to obtain a video classification model, the method further includes the following steps:
step 601, obtaining a test sample set comprising a plurality of video samples marked with classification labels;
step 602, inputting video samples in a test sample set into a video classification model to obtain a test classification label vector;
step 603, carrying out correlation calculation on the test classification label vector and the classification label vector marked by the test sample set;
step 604, determining that the video classification model is effective when the correlation degree is greater than a preset correlation degree threshold value.
Specifically, in this embodiment, after the video classification model is trained, test samples are further used to verify the validity of the trained video classification model: the test samples are input into the trained model, and the correlation between the output label vectors and the label vectors of the test samples is computed to determine the validity of the model. When the correlation is greater than the preset threshold, the trained video classification model is determined to be a correct and effective model, and the step of classifying the video to be classified based on the video classification model is then allowed to proceed; otherwise, the model performance is poor and the model needs to be retrained. A sketch of this check follows.
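The validity check can be sketched as a correlation test between predicted and ground-truth label vectors. Pearson correlation and the 0.8 threshold are illustrative assumptions; the embodiment only specifies a preset correlation threshold.

```python
import numpy as np

def model_is_valid(pred_vectors, true_vectors, corr_threshold=0.8):
    """Flatten predicted and labeled test vectors and compare their Pearson
    correlation against the preset threshold."""
    p = np.asarray(pred_vectors, dtype=float).ravel()
    g = np.asarray(true_vectors, dtype=float).ravel()
    corr = np.corrcoef(p, g)[0, 1]
    return corr > corr_threshold
```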
Based on the technical scheme of the embodiment of the application, an original training sample set comprising a plurality of video samples marked with classification labels is obtained; video sample combinations and the corresponding classification labels are selected from the original training sample set for weighted fusion to obtain an augmented training sample set; the video samples in the augmented training sample set are input into a neural network for training to obtain a video classification model; and videos to be classified are classified based on the video classification model. Through implementation of this scheme, the original video samples and their classification labels are fused by weighted fusion in the model training stage to obtain the augmented training sample set, which guarantees the scale and diversity of the training sample set while effectively reducing the operational complexity of constructing it and improving the realizability of its construction.
The method in fig. 7 is a refined video classification method provided in the second embodiment of the present application, and the video classification method includes:
Step 701, obtaining an original training sample set including a plurality of video samples marked with classification labels.
In this embodiment, the original training sample set is a batch of video samples with manually labeled categories, acquired by the user through self-collection and labeling or by downloading a public data set; the original training sample set is a small-scale training sample set.
Step 702, selecting a video sample combination and a corresponding classification label from an original training sample set to perform weighted fusion to obtain an augmented training sample set.
Specifically, the augmented training sample set of the present embodiment is a large-scale training sample set obtained after sample expansion is performed on the basis of the original training sample set, and the sample scale of the augmented training sample set is larger than that of the original training sample set. In this embodiment, any two samples and their corresponding label vectors are taken, and the augmented sample and its label vector are generated according to a weighted fusion method.
Step 703, inputting the video samples in the augmented training sample set into a neural network for training to obtain a predicted classification label vector actually output by the iterative training.
Step 704, comparing the classification label vector corresponding to the augmented training sample set with the predicted classification label vector by using a preset loss function.
In this embodiment, the training process is repeated for multiple iterations of optimization: the output predicted by the neural network in each training pass and the classification label carried by the sample are used to compute a loss function (Loss Function); then, for example, the BP algorithm is used to update the trainable parameters in the network by reverse gradient propagation, adjusting parameters such as the weights of the neural network to reduce the loss function value of the next iteration.
Step 705, when the comparison result meets the model convergence condition, determining the network model obtained by the iterative training as the trained video classification model.
Specifically, in this embodiment, when the loss function value satisfies the preset standard, it is determined that the model convergence condition is satisfied, that is, the training process of the whole neural network model is completed, otherwise, the next iterative training is continued until the model convergence condition is satisfied.
Step 706, inputting a plurality of video segments obtained by preprocessing the video to be classified into the video classification model to obtain a plurality of prediction classification label vectors.
Specifically, this embodiment uniformly samples the video to be classified according to a preset sampling frequency, and equally divides the sampled video to be classified according to a preset video segment length to obtain a plurality of video segments. In addition, in this embodiment, the m video segments obtained by preprocessing are input into the trained model to obtain m prediction vectors; if there are n classes in total to which the video to be classified may belong, the i-th prediction vector is denoted pred_i = {p_i_1, p_i_2, …, p_i_n}.
Step 707, determining the classification of the video to be classified based on the predicted classification label vector of which the maximum value of the classification label is greater than a preset threshold value among the plurality of predicted classification label vectors.
Specifically, this embodiment traverses the m prediction vectors position by position: if the maximum of p_1_j, p_2_j, …, p_m_j is greater than the preset threshold t, the input video belongs to the j-th class; otherwise, the input video does not belong to the j-th class. This embodiment may preferably set t to 0.5.
It should be understood that the sequence numbers of the steps in this embodiment do not imply an execution order; the execution order of the steps should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The embodiment of the application discloses a video classification method: obtaining an original training sample set comprising a plurality of video samples marked with classification labels; selecting video sample combinations and the corresponding classification labels from the original training sample set for weighted fusion to obtain an augmented training sample set; inputting the video samples in the augmented training sample set into a neural network for training to obtain a video classification model; and classifying videos to be classified based on the video classification model. Through implementation of this scheme, the original video samples and their classification labels are fused by weighted fusion in the model training stage to obtain the augmented training sample set, which guarantees the scale and diversity of the training sample set while effectively reducing the operational complexity of constructing it and improving the realizability of its construction.
Fig. 8 is a video classification apparatus according to a third embodiment of the present application. The video classification apparatus can be used to implement the video classification method in the foregoing embodiments. As shown in fig. 8, the video classification apparatus mainly includes:
an obtaining module 801, configured to obtain an original training sample set including a plurality of video samples labeled with classification labels;
the augmentation module 802 is configured to select a video sample combination and a corresponding classification label from an original training sample set to perform weighted fusion, so as to obtain an augmented training sample set; wherein the sample scale of the augmented training sample set is larger than that of the original training sample set;
the training module 803 is configured to input the video samples in the augmented training sample set to a neural network for training to obtain a video classification model;
the classification module 804 is configured to classify the video to be classified based on the video classification model.
In some embodiments of the present embodiment, the augmentation module 802 is specifically configured to: randomly selecting two video samples from an original training sample set; performing weighted fusion on the two video samples and the corresponding classification labels according to a preset weighted fusion formula to obtain the corresponding augmented video samples marked with the classification labels; and obtaining an augmented training sample set based on all the augmented video samples. The weighted fusion formula of the present embodiment can be expressed as:
x = β · x1 + (1 − β) · x2
y = β · y1 + (1 − β) · y2

wherein x1 and x2 respectively represent the two video samples, y1 and y2 respectively represent the classification labels corresponding to the two video samples, x represents the augmented video sample, y represents the classification label corresponding to the augmented video sample, and β represents a weight drawn from a Beta distribution subject to a preset parameter.
In some embodiments of this embodiment, the training module 803 is specifically configured to: inputting the video samples in the augmented training sample set into a neural network for training to obtain a predicted classification label vector actually output by the iterative training; comparing the classification label vector corresponding to the augmented training sample set with the prediction classification label vector by adopting a preset loss function; and when the comparison result meets the preset model convergence condition, determining the network model obtained by the iterative training as the trained video classification model.
As shown in fig. 9, in another implementation of this embodiment, the video classification apparatus further includes: a testing module 805, configured to, after the video samples in the augmented training sample set are input into a neural network for training to obtain a video classification model, obtain a test sample set including a plurality of video samples labeled with classification labels; input the video samples in the test sample set into the video classification model to obtain a test classification label vector; perform correlation calculation between the test classification label vector and the classification label vector labeled in the test sample set; and determine that the video classification model is valid when the correlation is greater than a preset correlation threshold. Correspondingly, the classification module 804 performs its function when the video classification model is valid.
In other embodiments of this embodiment, the classification module 804 is specifically configured to: preprocessing a video to be classified to obtain a plurality of video segments; inputting a plurality of video segments into a video classification model to obtain a plurality of prediction classification label vectors; and determining the classification of the video to be classified based on the prediction classification label vector of which the maximum value of the classification label is greater than a preset threshold value in the prediction classification label vectors.
Further, in some embodiments of this embodiment, when the classification module 804 preprocesses the video to be classified to obtain a plurality of video segments, it is specifically configured to: uniformly sampling videos to be classified according to a preset sampling frequency; and equally dividing the sampled video to be classified according to the preset video segment length to obtain a plurality of video segments.
Referring to fig. 9, in some implementations of this embodiment, the video classification apparatus further includes: a determining module 806, configured to obtain the allowed time consumption of the classification operation corresponding to the video to be classified before the video to be classified is uniformly sampled according to the preset sampling frequency, and to determine the sampling frequency according to the allowed time consumption of the classification operation.
It should be noted that, the video classification methods in the first and second embodiments can be implemented based on the video classification device provided in this embodiment, and persons skilled in the art can clearly understand that, for convenience and simplicity of description, the specific working process of the video classification device described in this embodiment may refer to the corresponding process in the foregoing method embodiment, and details are not described here.
According to the video classification apparatus provided by this embodiment, an original training sample set comprising a plurality of video samples marked with classification labels is obtained; video sample combinations and the corresponding classification labels are selected from the original training sample set for weighted fusion to obtain an augmented training sample set; the video samples in the augmented training sample set are input into a neural network for training to obtain a video classification model; and videos to be classified are classified based on the video classification model. Through implementation of this scheme, the original video samples and their classification labels are fused by weighted fusion in the model training stage to obtain the augmented training sample set, which guarantees the scale and diversity of the training sample set while effectively reducing the operational complexity of constructing it and improving the realizability of its construction.
Referring to fig. 10, fig. 10 is an electronic device according to a fourth embodiment of the present disclosure. The electronic device can be used for implementing the video classification method in the foregoing embodiment. As shown in fig. 10, the electronic device mainly includes:
a memory 1001, a processor 1002, a bus 1003 and a computer program stored on the memory 1001 and executable on the processor 1002, the memory 1001 and the processor 1002 being connected by the bus 1003. The processor 1002, when executing the computer program, implements the video classification method in the foregoing embodiments. Wherein the number of processors may be one or more.
The Memory 1001 may be a high-speed Random Access Memory (RAM) Memory or a non-volatile Memory (e.g., a disk Memory). The memory 1001 is used for storing executable program code, and the processor 1002 is coupled to the memory 1001.
Further, an embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium may be provided in an electronic device in the foregoing embodiments, and the computer-readable storage medium may be the memory in the foregoing embodiment shown in fig. 10.
The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the video classification method in the foregoing embodiments. Further, the computer-readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a RAM, a magnetic disk, or an optical disk.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a readable storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. And the aforementioned readable storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The video classification method, apparatus, and computer-readable storage medium provided by the present application have been described above. Those skilled in the art will recognize that the specific embodiments and application scope of the method, apparatus, and storage medium may vary.

Claims (10)

1. A method of video classification, comprising:
acquiring an original training sample set comprising a plurality of video samples marked with classification labels;
selecting a video sample combination and the corresponding classification label from the original training sample set for weighted fusion to obtain an augmented training sample set; wherein the sample size of the augmented training sample set is larger than that of the original training sample set;
inputting the video samples in the augmented training sample set into a neural network for training to obtain a video classification model;
and classifying the video to be classified based on the video classification model.
2. The video classification method according to claim 1, wherein the selecting a video sample combination and the corresponding classification label from the original training sample set for weighted fusion to obtain an augmented training sample set comprises:
randomly selecting two video samples from the original training sample set;
performing weighted fusion on the two video samples and the corresponding classification labels according to a preset weighted fusion formula to obtain the augmented video samples marked with the classification labels correspondingly; the weighted fusion formula is expressed as:
x = β · x1 + (1 − β) · x2
y = β · y1 + (1 − β) · y2

wherein x1 and x2 respectively represent the two video samples, y1 and y2 respectively represent the classification labels corresponding to the two video samples, x represents the augmented video sample, y represents the classification label corresponding to the augmented video sample, and β represents a weight drawn from a Beta distribution subject to a preset parameter;
and obtaining an augmented training sample set based on all the augmented video samples.
3. The video classification method according to claim 1, wherein the inputting the video samples in the augmented training sample set into a neural network for training to obtain a video classification model comprises:
inputting the video samples in the augmented training sample set into a neural network for training to obtain a predicted classification label vector actually output by the iterative training;
comparing the classification label vector corresponding to the augmented training sample set with the prediction classification label vector by adopting a preset loss function;
and when the comparison result meets a preset model convergence condition, determining the network model obtained by the iterative training as a trained video classification model.
4. The video classification method according to claim 1, wherein after the video samples in the augmented training sample set are input to a neural network for training, and a video classification model is obtained, the method further comprises:
obtaining a test sample set comprising a plurality of video samples marked with classification labels;
inputting the video samples in the test sample set into the video classification model to obtain a test classification label vector;
performing correlation calculation on the test classification label vector and the classification label vector marked by the test sample set;
and when the correlation degree is greater than a preset correlation degree threshold value, determining that the video classification model is effective, and then executing the step of classifying the video to be classified based on the video classification model.
5. The video classification method according to any one of claims 1 to 4, wherein the classifying the video to be classified based on the video classification model comprises:
preprocessing a video to be classified to obtain a plurality of video segments;
inputting the video segments into the video classification model to obtain a plurality of prediction classification label vectors;
and determining the classification of the video to be classified based on the prediction classification label vector of which the maximum value of the classification label is greater than a preset threshold value in the plurality of prediction classification label vectors.
6. The video classification method according to claim 5, wherein the preprocessing the video to be classified to obtain a plurality of video segments comprises:
uniformly sampling the video to be classified according to a preset sampling frequency;
and equally dividing the sampled video to be classified according to the preset video segment length to obtain a plurality of video segments.
7. The video classification method according to claim 6, wherein before uniformly sampling the video to be classified according to a preset sampling frequency, the method comprises:
acquiring allowable time consumption of classification operation corresponding to the video to be classified;
the sampling frequency is determined according to the allowed time of the classification operation.
8. A video classification apparatus, comprising:
the system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring an original training sample set comprising a plurality of video samples marked with classification labels;
the augmentation module is used for selecting a video sample combination and the corresponding classification label from the original training sample set to carry out weighted fusion to obtain an augmented training sample set; wherein the sample size of the augmented training sample set is larger than that of the original training sample set;
the training module is used for inputting the video samples in the augmented training sample set into a neural network for training to obtain a video classification model;
and the classification module is used for classifying the video to be classified based on the video classification model.
9. An electronic device, comprising: a memory, a processor, and a bus;
the bus is used for realizing connection communication between the memory and the processor;
the processor is configured to execute a computer program stored on the memory;
the processor, when executing the computer program, performs the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010272792.3A 2020-04-09 2020-04-09 Video classification method, device and computer readable storage medium Active CN111444878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010272792.3A CN111444878B (en) 2020-04-09 2020-04-09 Video classification method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010272792.3A CN111444878B (en) 2020-04-09 2020-04-09 Video classification method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111444878A true CN111444878A (en) 2020-07-24
CN111444878B CN111444878B (en) 2023-07-18

Family

ID=71650174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010272792.3A Active CN111444878B (en) 2020-04-09 2020-04-09 Video classification method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111444878B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402691B1 (en) * 2018-10-04 2019-09-03 Capital One Services, Llc Adjusting training set combination based on classification accuracy
CN109978071A (en) * 2019-04-03 2019-07-05 西北工业大学 Hyperspectral image classification method based on data augmentation and Multiple Classifier Fusion
CN110263217A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video clip label identification method and device
CN110633751A (en) * 2019-09-17 2019-12-31 上海眼控科技股份有限公司 Training method of car logo classification model, car logo identification method, device and equipment
CN110751224A (en) * 2019-10-25 2020-02-04 Oppo广东移动通信有限公司 Training method of video classification model, video classification method, device and equipment
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN110807437A (en) * 2019-11-08 2020-02-18 腾讯科技(深圳)有限公司 Video granularity characteristic determination method and device and computer-readable storage medium
CN110929622A (en) * 2019-11-15 2020-03-27 腾讯科技(深圳)有限公司 Video classification method, model training method, device, equipment and storage medium

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860671A (en) * 2020-07-28 2020-10-30 中山大学 Classification model training method and device, terminal equipment and readable storage medium
CN111783902B (en) * 2020-07-30 2023-11-07 腾讯科技(深圳)有限公司 Data augmentation, service processing method, device, computer equipment and storage medium
CN112052356A (en) * 2020-08-14 2020-12-08 腾讯科技(深圳)有限公司 Multimedia classification method, apparatus and computer-readable storage medium
CN112052356B (en) * 2020-08-14 2023-11-24 腾讯科技(深圳)有限公司 Multimedia classification method, apparatus and computer readable storage medium
CN112000842A (en) * 2020-08-31 2020-11-27 北京字节跳动网络技术有限公司 Video processing method and device
CN112131430A (en) * 2020-09-24 2020-12-25 腾讯科技(深圳)有限公司 Video clustering method and device, storage medium and electronic equipment
CN113392269A (en) * 2020-10-22 2021-09-14 腾讯科技(深圳)有限公司 Video classification method, device, server and computer readable storage medium
US11928563B2 (en) 2020-12-18 2024-03-12 Beijing Baidu Netcom Science Technology Co., Ltd. Model training, image processing method, device, storage medium, and program product
CN112489043A (en) * 2020-12-21 2021-03-12 无锡祥生医疗科技股份有限公司 Heart disease detection device, model training method, and storage medium
CN112651356A (en) * 2020-12-30 2021-04-13 杭州菲助科技有限公司 Video difficulty grading model obtaining method and video difficulty grading method
CN112651356B (en) * 2020-12-30 2024-01-23 杭州菲助科技有限公司 Video difficulty grading model acquisition method and video difficulty grading method
CN112686193B (en) * 2021-01-06 2024-02-06 东北大学 Action recognition method and device based on compressed video and computer equipment
CN112686193A (en) * 2021-01-06 2021-04-20 东北大学 Action recognition method and device based on compressed video and computer equipment
CN112883861A (en) * 2021-02-07 2021-06-01 同济大学 Feedback-type bait-casting control method based on fine-grained classification of fish-school feeding state
CN112784111A (en) * 2021-03-12 2021-05-11 有半岛(北京)信息科技有限公司 Video classification method, device, equipment and medium
CN113705315A (en) * 2021-04-08 2021-11-26 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN113705315B (en) * 2021-04-08 2024-05-14 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN113178189A (en) * 2021-04-27 2021-07-27 科大讯飞股份有限公司 Information classification method and device and information classification model training method and device
CN113011534A (en) * 2021-04-30 2021-06-22 平安科技(深圳)有限公司 Classifier training method and device, electronic equipment and storage medium
CN113011534B (en) * 2021-04-30 2024-03-29 平安科技(深圳)有限公司 Classifier training method and device, electronic equipment and storage medium
CN113344214A (en) * 2021-05-31 2021-09-03 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
CN113344214B (en) * 2021-05-31 2022-06-14 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111444878B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN111444878A (en) Video classification method and device and computer readable storage medium
Giraldo et al. Graph moving object segmentation
Suganuma et al. Attention-based adaptive selection of operations for image restoration in the presence of unknown combined distortions
Xiong et al. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks
Dosovitskiy et al. Generating images with perceptual similarity metrics based on deep networks
Vondrick et al. Generating the future with adversarial transformers
CN109543502B (en) Semantic segmentation method based on deep multi-scale neural network
Zhang et al. Unifying motion deblurring and frame interpolation with events
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
Pang et al. Visual haze removal by a unified generative adversarial network
CN113378600B (en) Behavior recognition method and system
KR20200145827A (en) Facial feature extraction model learning method, facial feature extraction method, apparatus, device, and storage medium
CN111488932B (en) Self-supervision video time-space characterization learning method based on frame rate perception
Zhang et al. Single image dehazing via dual-path recurrent network
CN112560827B (en) Model training method, model training device, model prediction method, electronic device, and medium
CN114494981B (en) Action video classification method and system based on multi-level motion modeling
CN114463218B (en) Video deblurring method based on event data driving
CN112257855B (en) Neural network training method and device, electronic equipment and storage medium
Desai et al. Next frame prediction using ConvLSTM
CN112633100B (en) Behavior recognition method, behavior recognition device, electronic equipment and storage medium
CN112383824A (en) Video advertisement filtering method, device and storage medium
CN114170484B (en) Picture attribute prediction method and device, electronic equipment and storage medium
CN112084371B (en) Movie multi-label classification method and device, electronic equipment and storage medium
CN112926517B (en) Artificial intelligence monitoring method
Razali et al. A log-likelihood regularized KL divergence for video prediction with a 3D convolutional variational recurrent network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant