CN110347873B - Video classification method and device, electronic equipment and storage medium - Google Patents

Video classification method and device, electronic equipment and storage medium

Info

Publication number
CN110347873B
CN110347873B (application number CN201910562350.XA)
Authority
CN
China
Prior art keywords
video
network
key frames
preset model
processed
Prior art date
Legal status
Active
Application number
CN201910562350.XA
Other languages
Chinese (zh)
Other versions
CN110347873A (en)
Inventor
康健
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910562350.XA priority Critical patent/CN110347873B/en
Publication of CN110347873A publication Critical patent/CN110347873A/en
Application granted granted Critical
Publication of CN110347873B publication Critical patent/CN110347873B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a video classification method, apparatus, electronic device, and computer-readable storage medium, relating to the technical field of image processing. The video classification method includes: performing sparse sampling on a video to be processed to obtain a plurality of key frames; processing the plurality of key frames through a feature extraction network in a preset model to extract features of the plurality of key frames; and fusing the features of the plurality of key frames through a trained attention network in the preset model, and processing the fused features to obtain a classification result of the video to be processed. The video classification method and device can reduce the amount of calculation and improve the speed and efficiency of video classification.

Description

Video classification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a video classification method, a video classification apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of video technology, users can obtain various videos through various channels. Because the number of videos is so large, classifying them makes it convenient for users to search for and use the videos, which improves the user experience.
In the related art, video classification methods may include long short-term memory network-based methods, 3D convolution-based methods, and two-stream network-based methods.
In the above methods, the network structures are large and many parameters must be computed, so the processing speed is slow. In addition, when inter-frame information is processed in these methods, a global operation is performed on each single frame, which wastes computing resources; and because inter-frame information cannot be utilized, the classification result may be inaccurate.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the present disclosure is to provide a video classification method, apparatus, electronic device and computer-readable storage medium, which overcome, at least to some extent, the problem of slow video classification speed due to the limitations and disadvantages of the related art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, there is provided a video classification method including: performing sparse sampling on a video to be processed to obtain a plurality of key frames; processing the plurality of key frames through a feature extraction network in a preset model to extract features of the plurality of key frames; and fusing the features of the plurality of key frames through a trained attention network in the preset model, and processing the fused features to obtain a classification result of the video to be processed.
In an exemplary embodiment of the present disclosure, the feature extraction network includes a residual network, and processing the plurality of key frames through the feature extraction network in the preset model to extract features of the plurality of key frames includes: taking the plurality of key frames as one batch and inputting the batch into the residual network to extract the features of the plurality of key frames.
In an exemplary embodiment of the present disclosure, the fusing the features of the plurality of key frames through the trained attention network in the preset model, and processing the fused features to obtain the classification result of the to-be-processed video includes: inputting the features of the plurality of key frames into the trained attention network to obtain fused features; and determining the probability of the video to be processed belonging to each category according to the fused features, and determining the classification result according to the probability.
In an exemplary embodiment of the disclosure, before inputting the features of the plurality of keyframes into the trained attention network and obtaining the fused features, the method further includes: fixing the residual network, and training the attention network to obtain the trained attention network.
In an exemplary embodiment of the present disclosure, the method further comprises: and after the trained attention network is obtained, training the preset model to obtain the trained preset model.
In an exemplary embodiment of the present disclosure, training the preset model, and obtaining the trained preset model includes: and performing end-to-end training on the preset model to obtain the trained preset model.
In an exemplary embodiment of the present disclosure, the method further comprises: compressing the trained preset model based on regression loss; and/or adjusting the parameter type of the trained preset model.
According to an aspect of the present disclosure, there is provided a video classification apparatus including: a key frame acquisition module, configured to perform sparse sampling on a video to be processed to obtain a plurality of key frames; a feature extraction module, configured to process the plurality of key frames through a feature extraction network in a preset model to extract features of the plurality of key frames; and a classification result determining module, configured to fuse the features of the plurality of key frames through a trained attention network in the preset model and process the fused features to obtain a classification result of the video to be processed.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the video classification method of any one of the above via execution of the executable instructions.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video classification method of any one of the above.
In the video classification method, video classification apparatus, electronic device, and computer-readable storage medium provided by the present exemplary embodiment, the features of the key frames of the video to be processed are extracted, and the features of the key frames are fused by an attention network, so that the video to be processed is classified. On the one hand, the features of the plurality of key frames of the video to be processed are extracted through the feature extraction network in the preset model. This reduces the parameters input to the feature extraction network, and because the feature extraction network has a small structure, the number of parameters to be processed is reduced. The time wasted in the related art on extracting features from all frames of the video is thereby avoided, the feature extraction speed is increased, and the processing efficiency is improved. On the other hand, the attention network is used to fuse the features of the plurality of key frames to obtain the classification result of the video to be processed. Fusing the features of the plurality of key frames allows the information among different frames to be processed uniformly, avoids the global operation performed on each single key frame in the related art, and reduces the waste and consumption of computing resources. In addition, the inter-frame information can be effectively utilized, so the videos to be processed can be classified accurately and the accuracy of the classification result is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 schematically illustrates a video classification method in an exemplary embodiment of the present disclosure.
Fig. 2 schematically illustrates a structural diagram of a preset model according to an exemplary embodiment of the present disclosure.
Fig. 3 schematically illustrates a flow chart for determining a classification result in an exemplary embodiment of the present disclosure.
Fig. 4 schematically illustrates an overall flow chart for classifying videos in an exemplary embodiment of the present disclosure.
Fig. 5 schematically illustrates a block diagram of a video classification apparatus in an exemplary embodiment of the present disclosure.
Fig. 6 schematically shows a schematic view of an electronic device in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In this exemplary embodiment, a video classification method is first provided; it can be applied to any scenario in which photos, videos, or pictures need to be classified. Next, the video classification method in the present exemplary embodiment will be described in detail with reference to fig. 1.
In step S110, sparse sampling is performed on the video to be processed to obtain a plurality of key frames.
In this exemplary embodiment, the videos to be processed may include a large number of videos stored in a certain folder in a terminal (for example, videos in an album of a smart terminal), or a large number of videos uploaded to and stored on some information interaction platform. The specific type of the video to be processed can be determined according to the actual functional requirement; for example, when classification is required, the video to be processed refers to the video to be classified.
Since there is little difference between consecutive frames of the video to be processed, it is not necessary in this exemplary embodiment to take the information of every frame as the input of the subsequent processing. To select a portion of the frames of the video to be processed, the video may be sampled; sampling refers to taking frames from the video to be processed at intervals in the time domain. For example, given a video sequence V to be processed with duration T, the video sequence may be divided equally into T+1 segments, each segment containing the same number of video frames, and a frame is then randomly selected from each segment as a sample. In this way, a plurality of key frames of the video to be processed are obtained from the T+1 segments. In this exemplary embodiment, obtaining the plurality of key frames through sparse sampling reduces the number of sampling points while keeping the data within a fidelity range, reduces the parameters input to the feature extraction network, and reduces the amount of computation.
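As an illustration, the following is a minimal sketch of this segment-based sparse sampling in Python, assuming OpenCV is available for decoding; the function name, the segment count of 8, and the random in-segment offset are illustrative choices rather than values fixed by the disclosure.

```python
import random
import cv2  # assumed available for frame decoding

def sparse_sample_key_frames(video_path, num_segments=8):
    """Divide the video into equal segments and randomly take one frame from each segment."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    seg_len = max(total // num_segments, 1)
    key_frames = []
    for s in range(num_segments):
        idx = s * seg_len + random.randrange(seg_len)  # random offset inside the s-th segment
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            key_frames.append(frame)
    cap.release()
    return key_frames
```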
With continued reference to fig. 1, in step S120, the plurality of key frames are processed through a feature extraction network in a preset model to extract features of the plurality of key frames.
In this exemplary embodiment, the preset model refers to an entire model for processing a plurality of key frames to obtain a classification result of a video to be processed. The preset model may mainly comprise two parts: wherein the first part is a feature extraction network and the second part is an attention network. The features of the plurality of key frames may specifically be represented by feature vectors.
First, the feature extraction network will be explained. The feature extraction network is mainly used for extracting the features of the plurality of key frames of each video to be processed that are input into it. The feature extraction network may be any network model capable of extracting features, such as a suitable machine learning model, which may include, but is not limited to, convolutional neural networks, recurrent neural networks, residual networks, and so on. If the feature extraction network is a convolutional neural network, it may include a plurality of convolutional layers and pooling layers, where each convolutional layer extracts a different feature and the pooling layers reduce dimensionality to extract the main features, so that subsequent processing is performed with the main features as the final features.
In this exemplary embodiment, the feature extraction network uses a PC-side network as its backbone, but performing the feature extraction with a mobile-side network such as MobileNet or ThunderNet also falls within the scope of the present application.
In this exemplary embodiment, if the feature extraction network is a residual network, the specific process of processing the plurality of key frames through the feature extraction network to extract their features includes: taking the plurality of key frames as one batch and inputting the batch into the residual network to extract the features of the plurality of key frames. The residual network may be any one of various residual networks, such as an 18-layer residual network or a 34-layer residual network; here, the 18-layer residual network ResNet18 is taken as an example.
A residual network is composed of residual blocks (a residual being the difference between the output and the input) and uses identity mappings to pass the output of a previous layer directly to a later layer. Assuming the input of a certain segment of the neural network is x and the expected output is H(x), in a residual network the input x can be passed directly to the output as part of the result, and the objective to learn is the residual H(x) - x rather than the complete output.
A ResNet is constructed by stacking multiple residual blocks; an ordinary convolutional network is turned into a residual network by adding skip connections, with a shortcut added across every two layers to form a residual block. For example, five residual blocks connected in sequence form a residual network. Any one of the sequentially connected residual blocks in the residual network includes an identity mapping and at least two convolutional layers, and the identity mapping of a residual block points from the input end of that residual block to its output end. The specific network structure, number of layers, and so on of the residual network may be set according to requirements such as computing resource consumption and recognition performance, and are not particularly limited here. It should be noted that the ResNet18 used for the coding part in this step is a model trained in advance, so in this exemplary embodiment it does not need to be trained or optimized.
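For reference, a minimal PyTorch sketch of the kind of residual block described here is shown below; the channel count and the two-convolution layout follow the common ResNet basic block and are assumptions made for illustration, not the exact structure used in the embodiment.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut: the block learns F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # identity mapping from input to output
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)          # output H(x) = F(x) + x
```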
In the present exemplary embodiment, after the plurality of key frames of the videos to be processed are acquired in step S110, they may be treated as one batch. The batch size is a hyper-parameter that defines the number of samples to be processed before the internal model parameters are updated. A batch is processed as a loop that iterates over one or more samples and makes predictions; at the end of the batch, the predictions are compared with the expected output variables and an error is calculated. Based on this error, the update algorithm improves the model, for example by moving down the error gradient. When all samples are used to create one batch, the learning algorithm is called batch gradient descent. Since all the key frames constitute one batch, the update frequency and the number of updates to the network can be reduced.
Specifically, the plurality of key frames are input as one batch into the first residual block of the residual network; any given residual block receives the output of the previous residual block and performs feature extraction on that output through its first, second, and third convolutional layers; the output of the third convolutional layer, together with the output of the previous residual block, is transmitted to the next residual block; and the output of the last residual block of the residual network yields the features of the plurality of key frames.
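One possible way to realize this step is sketched below, under the assumption that torchvision's pretrained ResNet18 is an acceptable stand-in for the coding network: the final classification layer is replaced with an identity so that each key frame in the batch yields a 512-dimensional feature vector. The tensor shape and preprocessing are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(pretrained=True)   # pre-trained coding network
backbone.fc = nn.Identity()                   # drop the ImageNet classifier, keep 512-d features
backbone.eval()

key_frames = torch.randn(8, 3, 224, 224)      # one batch: 8 key frames, already resized/normalized
with torch.no_grad():
    features = backbone(key_frames)           # shape (8, 512): one feature vector per key frame
```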
In this exemplary embodiment, the 18-layer residual network is used as the feature extraction network. It has a strong capability for extracting image features while having relatively few layers, which reduces the network parameters. The residual structure also alleviates the problem of vanishing gradients caused by an overly deep network, so a deeper network structure can be used for feature extraction, ensuring the accuracy of feature extraction while reducing the amount of calculation.
In steps S110 and S120, a plurality of key frames are obtained by sparsely sampling the video to be processed, and the information of every frame of the video is no longer used as the input of the next step, which reduces the input parameters. Moreover, the residual network extracts image features with few network layers, which further reduces the number of parameters. Therefore, by extracting the features of key frames through sparse sampling and a feature extraction network with fewer layers, the number of parameters to be transmitted and calculated is reduced, and computing resources are saved.
Continuing to refer to fig. 1, in step S130, the features of the plurality of key frames are fused through the attention network trained in the preset model, and the fused features are processed to obtain the classification result of the video to be processed.
In the present exemplary embodiment, the preset model refers to a trained preset model. Fig. 2 schematically shows a specific structure diagram of the preset model, and referring to fig. 2, the preset model may further include a BN layer, a fully connected layer, and a softmax in addition to the feature extraction network and the attention network, so as to obtain a multi-class label result according to a vector output by the softmax. The network inputs a batch frame and outputs a plurality of feature vectors of key frames; the attention network is connected with the feature extraction network, the input of the attention network is the feature vectors of a plurality of key frames, and the output of the attention network is the fused vector; the BN layer is connected with the attention network and is used for carrying out normalization processing on each neuron so as to accelerate the training speed and improve the model precision; a fully connected layer (FC) is connected to the BN layer, which functions as a classifier in the entire convolutional neural network; and connecting softmax with the full-link layer, and finally outputting a prediction vector, wherein each dimension of the prediction vector represents the probability of the corresponding category.
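The structure of fig. 2 could be sketched roughly as follows; the class name, the number of classes, the feature dimension, and the AttentionFusion module (a sketch of which follows the attention formulas later in this description) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

class VideoClassifier(nn.Module):
    """Feature extraction network -> inter-frame attention -> BN -> FC -> softmax."""
    def __init__(self, num_classes=10, feat_dim=512):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        backbone.fc = nn.Identity()                 # ResNet18 with its classifier removed
        self.backbone = backbone
        self.attention = AttentionFusion(feat_dim)  # inter-frame attention network (sketched later)
        self.bn = nn.BatchNorm1d(feat_dim)          # BN layer
        self.fc = nn.Linear(feat_dim, num_classes)  # fully connected classifier

    def forward(self, frames):                      # frames: (B, T, 3, H, W), T key frames per video
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        fused = self.attention(feats)               # (B, feat_dim) fused vector per video
        logits = self.fc(self.bn(fused))
        return torch.softmax(logits, dim=-1)        # each dimension is the probability of a category
```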
Because a convolutional neural network cannot by itself fuse inter-frame information, an attention network can be used to fuse the extracted features of the different key frames so as to obtain a classification result for the video to be processed. The attention network may be an inter-frame attention network, whose input may be the features of the batch composed of the plurality of key frames obtained in step S120 and whose output is the fused vector.
Fig. 3 schematically shows a flowchart for determining the classification result, and referring to fig. 3, mainly includes step S310 and step S320, where:
in step S310, the features of the plurality of key frames are input into the trained attention network to obtain a fused feature.
In this step, the attention network refers to a network based on an attention mechanism, which allows a neural network to focus on only part of the information in its input and to select particular inputs. The attention mechanism can be applied to inputs of any type and shape, such as inputs in the form of a matrix (e.g., an image) or a vector.
In order to ensure the accuracy of the fused features, the attention network can be trained before the fused features are calculated, so that the features of the plurality of key frames of the video to be processed are fused through the trained attention network. The specific process of training the attention network may include: fixing the residual network, and training the attention network to obtain the trained attention network. That is, in the whole training process of the model, since the ResNet18 used for the coding part is a pre-trained model, its parameters are fixed first and only the subsequent attention network is trained; after the loss function of the attention network stabilizes, the training of the attention network is stopped, and the trained attention network is obtained. Specifically, the attention network in this exemplary embodiment may be expressed as formula (1):
c = Σ_i α_i a_i    formula (1)
where a_i denotes the i-th vector input to the attention network, namely the feature vector of the i-th key frame, and c is the computed fused vector of the features of the plurality of key frames. The attention weights α_i applied to the input vectors a_i are calculated as shown in formulas (2) and (3):
e_i = w^T a_i    formula (2)
α_i = exp(e_i) / Σ_j exp(e_j)    formula (3)
where w is a parameter learned during training. With the learned parameter, the vector c obtained by fusing the features of the plurality of key frames can be calculated by the trained attention network.
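A minimal PyTorch rendering of formulas (1) to (3) is given below; the module name and the use of a single learned vector w are assumptions consistent with the formulas, not necessarily the exact implementation of the embodiment.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse per-frame feature vectors a_i into one vector c using formulas (1)-(3)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(feat_dim))       # learned parameter w

    def forward(self, feats):                              # feats: (B, T, feat_dim)
        e = feats @ self.w                                 # formula (2): e_i = w^T a_i
        alpha = torch.softmax(e, dim=1)                    # formula (3): normalized attention weights
        c = (alpha.unsqueeze(-1) * feats).sum(dim=1)       # formula (1): c = sum_i alpha_i a_i
        return c                                           # (B, feat_dim) fused vector
```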
When the attention network is trained, image data of the plurality of key frames may first be obtained, and the category of the video to be processed is determined manually; the attention network is then trained with the category and the image data, continuously adjusting the weight of each convolution kernel in the attention network until the predicted category matches the manually set category, thereby obtaining the trained attention network.
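A hedged sketch of this training stage follows: the backbone parameters are frozen and only the attention network and classifier head are updated. The optimizer, learning rate, loss, and placeholder data are illustrative; NLLLoss is applied to the log of the model's softmax output.

```python
import torch
import torch.nn as nn

model = VideoClassifier(num_classes=10)            # hypothetical model from the earlier sketches
for p in model.backbone.parameters():
    p.requires_grad = False                        # fix the residual network

head_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(head_params, lr=1e-3)
criterion = nn.NLLLoss()                           # expects log-probabilities

# placeholder data: 4 videos, 8 key frames each, 10 categories (illustrative only)
frames = torch.randn(4, 8, 3, 224, 224)
labels = torch.randint(0, 10, (4,))

for step in range(100):                            # in practice, stop once the loss stabilizes
    probs = model(frames)
    loss = criterion(torch.log(probs + 1e-8), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```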
The specific steps of fusion through the attention network may include the following. The information of the whole convolutional layer is taken as input to obtain an initial point of attention, which indicates the positions to attend to. After the attention vector is obtained, it is multiplied by the vector of the convolutional layer, and the resulting vector represents the position information of the points to be attended to. After the position information is combined with the timing information and passed into the network, a new position vector and the output prediction probability information under the current time step are obtained by calculation. New position information is continuously generated by combining the output with the convolutional layer to obtain new attention, and new output information is obtained by using the new attention in combination with the input. In this exemplary embodiment, ResNet18 with its softmax removed is used as the feature extraction network: it takes the batch composed of the plurality of key frames as input and outputs the feature vectors corresponding to the plurality of key frames, and the inter-frame attention network is then connected to obtain the fused vector corresponding to these feature vectors.
Based on the method, the interframe information can be effectively utilized through the attention network, the step that global operation is carried out on each single key frame in the related technology is avoided, the waste of computing resources is reduced, and the resource consumption is reduced. Through the fused vector, the characteristics of the video to be processed can be more accurately represented, and therefore classification can be more accurately carried out. In addition, since the attention network can effectively utilize the inter-frame information, the videos to be processed can be accurately classified based on the inter-frame information.
After the trained attention network is obtained, the whole preset model can be trained to obtain the trained preset model. For example, the feature extraction network and the attention network are fine-tuned until the predicted category of a video to be processed is consistent with its manually set category, so that a trained preset model with good performance is obtained and the accuracy of video classification is improved through this preset model. When the preset model is trained, end-to-end training can be used. End-to-end training may include: obtaining a predicted result from the input at the output end, comparing it with the true result to obtain an error, back-propagating the error through each layer of the model, and adjusting the representation of each layer according to the error until the model converges or the desired effect is achieved. In end-to-end training, raw data is fed in and the task result is output without additional intermediate processing; the whole training and prediction process is completed within the model. For example, there is no separate stand-alone module in the whole model; instead, a neural network connects the input end directly to the output end and takes over the functions of all the original modules. End-to-end training reduces the number of operation steps and improves training efficiency.
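Continuing the sketch above, end-to-end fine-tuning might look like this; the learning rate is an illustrative choice.

```python
import torch

# `model` is the VideoClassifier from the sketches above, with its attention network already trained
for p in model.parameters():
    p.requires_grad = True                                   # unfreeze the whole preset model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # smaller learning rate for fine-tuning
# raw key frames go in and class probabilities come out; running the same loop as above
# back-propagates the error through every layer until the model converges
```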
It should be added that, in order to further optimize performance, the whole trained preset model may be adjusted, and the adjustments specifically include the following. Firstly, the trained preset model is compressed based on a regression loss, that is, model pruning is performed on the layers of the preset model. A neural network has numerous parameters, but some of them contribute little to the final output and are redundant, so these redundant parameters need to be pruned away. The pruning method may, for example, prune according to weight values. In this exemplary embodiment, the number of channels of the preset model may be adjusted based on a LASSO regression loss, removing channels whose regression loss is small and which have little effect on the classification result, thereby reducing the amount of calculation. Pruning the trained preset model increases the operation speed and reduces the size of the model file.
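The LASSO-based channel selection is not specified in detail here, but the general idea can be illustrated with scikit-learn's Lasso on placeholder data: channels whose coefficients shrink to zero are candidates for removal. The regularization strength, the construction of X and y from real layer responses, and the rebuilding of the pruned layer are all assumptions left out of this toy sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

# X: per-channel responses of one layer flattened to (num_samples, num_channels); y: the output
# of the following layer that the pruned layer should still reproduce. Both are placeholders here.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))          # a 64-channel layer sampled at 1000 positions
y = X @ rng.standard_normal(64)

lasso = Lasso(alpha=0.05)                    # larger alpha drives more channel coefficients to zero
lasso.fit(X, y)
keep = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print(f"keeping {keep.size} of {X.shape[1]} channels")
# the layer would then be rebuilt with only the `keep` channels and the model fine-tuned
```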
Secondly, the parameter type of the trained preset model is adjusted. Specifically, the parameter type in the preset model is generally float32; in this exemplary embodiment, the parameters may be converted from float32 to float16, which reduces the model size and the consumption of computing resources without noticeably affecting the computation results.
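Casting the parameters to half precision can be done directly in PyTorch, as in the sketch below; the model class is the hypothetical one from the earlier sketches, and half-precision inference is generally assumed to run on a GPU.

```python
import torch

model = VideoClassifier(num_classes=10)      # hypothetical model from the earlier sketches
model_fp16 = model.half()                    # parameters and buffers cast from float32 to float16

size_mb = sum(p.numel() * p.element_size() for p in model_fp16.parameters()) / 1e6
print(f"parameter storage: {size_mb:.1f} MB")   # roughly half the float32 footprint
```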
It should be noted that, in this exemplary embodiment, only model compression may be performed, only parameter type adjustment may be performed, or both may be performed at the same time, so as to increase the operation speed and reduce the consumption of computing resources.
Next, in step S320, a probability that the video to be processed belongs to each category is determined according to the fused features, so as to determine the classification result according to the probability.
In this step, the classification result may be represented by the probability that the video to be processed belongs to each category. Specifically, a probability threshold may be set in advance; when a probability value is greater than or equal to the probability threshold, it may be determined that the video to be processed belongs to the corresponding category.
After the fused features are obtained, the fused features can be input into a BN layer for normalization processing, then input into a full connection layer for classification, and further input into a softmax layer to obtain a prediction vector, so that the probability that the video to be processed belongs to a certain category is obtained according to each dimensionality of the prediction vector, and the classification result is determined according to the probability value.
For example, the probability threshold may be 0.7, and when the probability that the to-be-processed video 1 belongs to the category 1 is 0.9, and the probability that the to-be-processed video 1 belongs to the category 2 is 0.1, it may be determined that the classification result of the to-be-processed video 1 is the category 1.
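As a tiny illustration of this rule, using the threshold and probabilities from the example above:

```python
import torch

probs = torch.tensor([0.9, 0.1])             # per-category probabilities from the softmax layer
threshold = 0.7                              # illustrative probability threshold
categories = [i for i, p in enumerate(probs.tolist()) if p >= threshold]
print(categories)                            # -> [0], i.e. the video is assigned to category 1
```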
In this exemplary embodiment, the videos to be processed are classified by the preset model composed of the residual network and the attention network. Compared with the related art, this reduces the number of parameters and the time consumed while losing little precision. Meanwhile, the attention network effectively utilizes the information among the different key frames and saves computing resources.
An overall flow chart of video classification is schematically shown in fig. 4, and referring to fig. 4, mainly includes the following steps:
in step S401, the video to be processed is split into frames; specifically, sparse sampling may be used to extract a plurality of key frames of the video to be processed.
In step S402, a plurality of key frames are input into an underlying feature extraction network, which may be a residual network ResNet18, to obtain a vector representing features.
In step S403, a vector representing a high-dimensional feature corresponding to each key frame is obtained.
In step S404, the high-dimensional features are input into the attention network to obtain a fused vector.
In step S405, a classification result is obtained from the fused vector. Specifically, the fused vector is input into the BN layer, the full link layer and the softmax layer to obtain the probability that the video to be processed belongs to each category, and then the classification result is determined according to the probability.
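Putting the pieces together, an inference pass over steps S401 to S405 might look like the sketch below; sparse_sample_key_frames and VideoClassifier are the hypothetical helpers from the earlier sketches, and the preprocessing transform is an assumption.

```python
import torch
from torchvision import transforms

preprocess = transforms.Compose([            # resize and convert each key frame to a tensor
    transforms.ToPILImage(),                 # OpenCV frames are BGR; a channel swap may be needed
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

frames = sparse_sample_key_frames("video.mp4", num_segments=8)        # step S401
batch = torch.stack([preprocess(f) for f in frames]).unsqueeze(0)     # (1, 8, 3, 224, 224)

model = VideoClassifier(num_classes=10).eval()
with torch.no_grad():
    probs = model(batch)                     # steps S402-S405: features, attention fusion, BN/FC/softmax
category = int(probs.argmax(dim=-1))         # category with the highest probability
```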
In summary, in the technical solution of this exemplary embodiment, sparse sampling is first performed on the video to be processed to obtain key frames, and features are extracted through a residual network. The extracted features are then further fused with an attention network to obtain fusion features among the different key frames, and the prediction result is finally output. In this way, the parameters input into the feature extraction network are reduced, and because the feature extraction network has a small structure, the number of processed parameters is reduced, avoiding the time wasted in the related art on extracting features of all frames of the video to be processed and improving the efficiency and speed of feature extraction. In addition, the features of the plurality of key frames can be fused, avoiding the global operation performed on each single key frame in the related art and reducing the waste and consumption of computing resources. Furthermore, model parameters can be compressed and the speed increased by further processing with model pruning.
In the present exemplary embodiment, there is also provided a video classification apparatus, and as shown in fig. 5, the apparatus 500 may include:
a key frame obtaining module 501, configured to perform sparse sampling on a video to be processed to obtain multiple key frames;
a feature extraction module 502, configured to process the multiple key frames through a feature extraction network in a preset model to extract features of the multiple key frames;
a classification result determining module 503, configured to fuse the features of the multiple key frames through the attention network trained in the preset model, and process the fused features to obtain a classification result of the video to be processed.
In an exemplary embodiment of the disclosure, the feature extraction network comprises a residual network, and the feature extraction module is configured to: take the plurality of key frames as one batch and input the batch into the residual network to extract the features of the plurality of key frames.
In an exemplary embodiment of the present disclosure, the classification result determination module includes: the feature fusion module is used for inputting the features of the plurality of key frames into the trained attention network to obtain fused features; and the probability calculation module is used for determining the probability that the video to be processed belongs to each category according to the fused features so as to determine the classification result according to the probability.
In an exemplary embodiment of the disclosure, before the features of the plurality of key frames are input into the trained attention network to obtain the fused features, the apparatus further includes: a network training module, configured to fix the residual network and train the attention network to obtain the trained attention network.
In an exemplary embodiment of the present disclosure, the apparatus further includes: and the preset model training module is used for training the preset model after the trained attention network is obtained, so as to obtain the trained preset model.
In an exemplary embodiment of the present disclosure, the preset model training module includes: and the training control module is used for carrying out end-to-end training on the preset model so as to obtain the trained preset model.
In an exemplary embodiment of the present disclosure, the apparatus further includes: the model compression module is used for compressing the trained preset model based on the regression loss; and/or the parameter adjusting module is used for adjusting the parameter type of the trained preset model.
It should be noted that the specific details of each module in the video classification apparatus have been set forth in detail in the corresponding method, and therefore are not described herein again.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken into multiple step executions, etc.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: the at least one processing unit 610, the at least one memory unit 620, and a bus 630 that couples the various system components including the memory unit 620 and the processing unit 610.
Wherein the storage unit stores program code that is executable by the processing unit 610 to cause the processing unit 610 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 610 may perform the steps as shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 can be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The display unit 640 may be a display having a display function to show a processing result obtained by the processing unit 610 performing the method in the present exemplary embodiment through the display. The display includes, but is not limited to, a liquid crystal display or other display.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
A program product for implementing the above method may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (8)

1. A method for video classification, comprising:
sparse sampling is carried out on a video to be processed to obtain a plurality of key frames;
taking the plurality of key frames as a batch, and inputting the batch into a residual network included in a feature extraction network in a preset model so as to extract features of the plurality of key frames;
inputting the features of the plurality of key frames into the trained attention network in the preset model to obtain fused features, determining the probability of the video to be processed belonging to each category according to the fused features, and determining the classification result of the video to be processed according to the probability.
2. The method of claim 1, wherein before inputting the features of the plurality of key frames into the trained attention network to obtain the fused features, the method further comprises:
and fixing the residual network, and training the attention network to obtain the trained attention network.
3. The video classification method according to claim 1, characterized in that the method further comprises:
and after the trained attention network is obtained, training the preset model to obtain the trained preset model.
4. The video classification method according to claim 3, wherein the training of the preset model to obtain the trained preset model comprises:
and performing end-to-end training on the preset model to obtain the trained preset model.
5. The video classification method according to claim 3, characterized in that the method further comprises:
compressing the trained preset model based on regression loss; and/or
And adjusting the parameter type of the trained preset model.
6. A video classification apparatus, comprising:
the key frame acquisition module is used for carrying out sparse sampling on a video to be processed to obtain a plurality of key frames;
the feature extraction module is used for taking the plurality of key frames as a batch, and inputting the batch into a residual network included in a feature extraction network in a preset model so as to extract the features of the plurality of key frames;
and the classification result determining module is used for inputting the characteristics of the plurality of key frames into the trained attention network in the preset model to obtain fused characteristics, determining the probability of the video to be processed belonging to each category according to the fused characteristics, and determining the classification result of the video to be processed according to the probability.
7. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the video classification method of any of claims 1-5 via execution of the executable instructions.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the video classification method of any one of claims 1 to 5.
CN201910562350.XA 2019-06-26 2019-06-26 Video classification method and device, electronic equipment and storage medium Active CN110347873B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910562350.XA CN110347873B (en) 2019-06-26 2019-06-26 Video classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910562350.XA CN110347873B (en) 2019-06-26 2019-06-26 Video classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110347873A CN110347873A (en) 2019-10-18
CN110347873B true CN110347873B (en) 2023-04-07

Family

ID=68183260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910562350.XA Active CN110347873B (en) 2019-06-26 2019-06-26 Video classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110347873B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026915B (en) * 2019-11-25 2023-09-15 Oppo广东移动通信有限公司 Video classification method, video classification device, storage medium and electronic equipment
CN111177460B (en) * 2019-12-20 2023-04-18 腾讯科技(深圳)有限公司 Method and device for extracting key frame
CN111160191B (en) * 2019-12-23 2024-05-14 腾讯科技(深圳)有限公司 Video key frame extraction method, device and storage medium
CN111246124B (en) * 2020-03-09 2021-05-25 三亚至途科技有限公司 Multimedia digital fusion method and device
CN111611435A (en) * 2020-04-01 2020-09-01 中国科学院深圳先进技术研究院 Video classification method and device and storage medium
CN111626251A (en) * 2020-06-02 2020-09-04 Oppo广东移动通信有限公司 Video classification method, video classification device and electronic equipment
CN111680624A (en) * 2020-06-08 2020-09-18 上海眼控科技股份有限公司 Behavior detection method, electronic device, and storage medium
CN111737520B (en) * 2020-06-22 2023-07-25 Oppo广东移动通信有限公司 Video classification method, video classification device, electronic equipment and storage medium
CN111553169B (en) * 2020-06-25 2023-08-25 北京百度网讯科技有限公司 Pruning method and device of semantic understanding model, electronic equipment and storage medium
CN112000842A (en) * 2020-08-31 2020-11-27 北京字节跳动网络技术有限公司 Video processing method and device
CN112613308B (en) * 2020-12-17 2023-07-25 中国平安人寿保险股份有限公司 User intention recognition method, device, terminal equipment and storage medium
CN112863650A (en) * 2021-01-06 2021-05-28 中国人民解放军陆军军医大学第二附属医院 Cardiomyopathy identification system based on convolution and long-short term memory neural network
CN113191401A (en) * 2021-04-14 2021-07-30 中国海洋大学 Method and device for three-dimensional model recognition based on visual saliency sharing
CN113065533B (en) * 2021-06-01 2021-11-02 北京达佳互联信息技术有限公司 Feature extraction model generation method and device, electronic equipment and storage medium
CN115311584A (en) * 2022-08-15 2022-11-08 贵州电网有限责任公司 Unmanned aerial vehicle high-voltage power grid video inspection floating hanging method based on deep learning
CN115376052B (en) * 2022-10-26 2023-04-07 山东百盟信息技术有限公司 Long video classification method based on key frame sampling and multi-scale dense network
CN116824641B (en) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107995536B (en) * 2017-11-28 2020-01-21 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting video preview and computer storage medium
CN109359592B (en) * 2018-10-16 2019-12-06 北京达佳互联信息技术有限公司 Video frame processing method and device, electronic equipment and storage medium
CN109862391B (en) * 2019-03-18 2021-10-19 网易(杭州)网络有限公司 Video classification method, medium, device and computing equipment

Also Published As

Publication number Publication date
CN110347873A (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN110347873B (en) Video classification method and device, electronic equipment and storage medium
US10552737B2 (en) Artificial neural network class-based pruning
US20190294975A1 (en) Predicting using digital twins
JP2022058915A (en) Method and device for training image recognition model, method and device for recognizing image, electronic device, storage medium, and computer program
CN111819580A (en) Neural architecture search for dense image prediction tasks
CN111523640B (en) Training method and device for neural network model
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN113361578B (en) Training method and device for image processing model, electronic equipment and storage medium
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN114663798A (en) Single-step video content identification method based on reinforcement learning
CN111199540A (en) Image quality evaluation method, image quality evaluation device, electronic device, and storage medium
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
CN117036834B (en) Data classification method and device based on artificial intelligence and electronic equipment
CN115186738B (en) Model training method, device and storage medium
WO2023244407A1 (en) Sampling technique for data clustering
CN114419327B (en) Image detection method and training method and device of image detection model
CN116090543A (en) Model compression method and device, computer readable medium and electronic equipment
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
US11546417B2 (en) Method for managing artificial intelligence application, device, and program product
CN116702835A (en) Neural network reasoning acceleration method, target detection method, device and storage medium
CN114330239A (en) Text processing method and device, storage medium and electronic equipment
JP2022088341A (en) Apparatus learning device and method
EP3683733A1 (en) A method, an apparatus and a computer program product for neural networks
CN115661238B (en) Method and device for generating travelable region, electronic equipment and computer readable medium
CN113140012B (en) Image processing method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant