CN111026915B - Video classification method, video classification device, storage medium and electronic equipment - Google Patents

Video classification method, video classification device, storage medium and electronic equipment

Info

Publication number
CN111026915B
CN111026915B
Authority
CN
China
Prior art keywords
video
feature
classified
key frame
frame images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911168580.4A
Other languages
Chinese (zh)
Other versions
CN111026915A (en)
Inventor
彭冬炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201911168580.4A
Publication of CN111026915A
Application granted
Publication of CN111026915B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a video classification method, a video classification device, a storage medium and electronic equipment, and relates to the technical field of computer vision. The method comprises the following steps: acquiring a plurality of key frame images from a video to be classified; extracting features from the key frame images respectively by using a pre-trained convolutional neural network; arranging the features corresponding to the key frame images according to the time stamps of the key frame images in the video to be classified, to obtain a feature sequence; and processing the feature sequence according to the attention weight of each feature in the feature sequence, to obtain a classification result of the video to be classified. By exploiting the temporal distribution information among the key frames, the method and the device can mine the semantics produced by the arrangement of the key frame images during video classification, thereby improving the accuracy of the video classification result.

Description

Video classification method, video classification device, storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of computer vision, and in particular relates to a video classification method, a video classification device, a computer readable storage medium and electronic equipment.
Background
Video classification refers to identifying video content to determine the category to which it belongs, and has important applications in scenarios such as online video viewing, video content auditing, and vlog (video blog) services.
In the related art, video classification is mostly based on image classification: for example, single frame images are extracted from a video, classified with an image recognition method, and the classification results of the multiple frames are then combined to classify the video. However, this approach depends heavily on the selection of the single frame images. If the selected images are poorly representative, or the video content itself is complex, for example the video contains multiple subjects or abrupt changes in content, the subject of the selected images will deviate from the actual subject of the video, which affects the accuracy of the video classification result.
Therefore, a new video classification method is needed to solve the above technical problems.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The disclosure provides a video classification method, a video classification device, a computer readable storage medium and an electronic device, so as to improve, at least to some extent, the accuracy of video classification results in the related art.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a video classification method, comprising: acquiring a plurality of key frame images from a video to be classified; extracting features from the plurality of key frame images respectively by utilizing a pre-trained convolutional neural network; according to the time stamps of the plurality of key frame images in the video to be classified, arranging the features corresponding to the key frame images to obtain a feature sequence; and processing the feature sequence according to the attention weight of each feature in the feature sequence to obtain a classification result of the video to be classified.
According to a second aspect of the present disclosure, there is provided a video classification apparatus comprising: the image acquisition module is used for acquiring a plurality of key frame images from the video to be classified; the feature extraction module is used for respectively extracting features from the plurality of key frame images by utilizing a pre-trained convolutional neural network; the feature arrangement module is used for arranging the features corresponding to the key frame images according to the time stamps of the plurality of key frame images in the video to be classified to obtain a feature sequence; and the feature processing module is used for processing the feature sequence according to the attention weight of each feature in the feature sequence to obtain the classification result of the video to be classified.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described video classification method.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the video classification method described above via execution of the executable instructions.
The technical scheme of the present disclosure has the following beneficial effects:
according to the video classification method, the video classification device, the computer readable storage medium and the electronic equipment described above, a plurality of key frame images are obtained from the video to be classified, features are extracted from them with a convolutional neural network, the features are arranged according to their temporal order in the video to obtain a feature sequence, and the feature sequence is processed according to the attention weight of each feature to obtain the video classification result. On the one hand, the key frame images extracted from the video are not treated as isolated from one another: by extracting and arranging the features, the temporal distribution information among the key frames is integrated into the feature sequence, and this information allows the semantics produced by the arrangement of images within the video to be mined during processing, so that more accurate video classification is achieved. On the other hand, the attention weight of each feature in the feature sequence reflects the local, inter-frame characteristics of the video to be classified, so that the frame information of each part is fully preserved even when the video content is unevenly distributed; the features are thus processed at both the local and the global level, which further improves the accuracy of the video classification result.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely some embodiments of the present disclosure and that other drawings may be derived from these drawings without undue effort.
Fig. 1 shows a flowchart of a video classification method in the present exemplary embodiment;
fig. 2 shows a sub-flowchart of a video classification method in the present exemplary embodiment;
fig. 3 shows a sub-flowchart of another video classification method in the present exemplary embodiment;
fig. 4 is a block diagram showing a configuration of a video classification apparatus in the present exemplary embodiment;
fig. 5 illustrates a computer-readable storage medium for implementing the above-described method in the present exemplary embodiment;
fig. 6 shows an electronic device for implementing the above method in the present exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
In the related art, image selection for video classification is mostly based on splitting and comparing video image frames; for example, based on frame-by-frame comparison, the section of the video whose image content lasts the longest is found and taken as the main part of the video, and a certain number of images are extracted from that section for classification and recognition. However, the inventors found that this related-art approach has a limitation: the processing of the video stops at the purely visual level, whereas a video is essentially a sequence of images, and the context between its image frames carries a large amount of semantic information that is very important for recognizing the video content. This shortcoming of the related art is a main cause of the lower accuracy of its video classification results.
In view of one or more of the above problems, exemplary embodiments of the present disclosure first provide a video classification method. The method may be applied to a server of a video service platform, for example to classify the videos on the platform at the service end and add classification tags so that users can search them; it may also be applied to a terminal device such as a personal computer or a smart phone, for example to automatically classify videos shot or downloaded by the user.
Fig. 1 shows a flow of the present exemplary embodiment, and may include the following steps S110 to S140:
step S110, a plurality of key frame images are acquired from the video to be classified.
The key frame image refers to an image capable of reflecting the static content or dynamic content change of the video to be classified. The following provides several embodiments for how to determine a keyframe:
(1) Considering that video frames usually need to be decoded when extracting complete images from a video, step S110 may specifically include:
extracting a plurality of intra-frame coding frames from the video to be classified;
and decoding the intra-frame coding frame to obtain a plurality of key frame images.
An intra-coded frame (I frame) is a frame coded independently from a single image: it fully preserves the image of the current frame, and only the data of the current frame is needed to decode it. Alongside I frames there are forward-predicted frames (P frames) and bi-directionally predicted frames (B frames). A P frame records its difference from the preceding frame, so the preceding frame data must be referenced when decoding it; a B frame records its differences from both the preceding and the following frames, so data from both must be referenced for complete decoding.
It follows that if a P frame or a B frame were chosen as a key frame, the corresponding I frame would have to be decoded first and the target P or B frame then reconstructed from the inter-frame differences, which is inefficient. Therefore, I frames can be used directly as key frames: only the key frame images themselves need to be decoded, no other frames have to be decoded, the number of frames to decode is minimal, and the key frame images are extracted fastest.
To further improve efficiency, multiple threads can be invoked when decoding the I frames, so that each thread decodes one I frame. Video tools (e.g., playback or editing software) typically include a decoder for decoding video frames. In this exemplary embodiment, such a decoder can be embedded in the video classification program and its threading code adapted: after the video classification flow starts, N I frames are obtained as key frames, N threads are started accordingly, the decoding task of each I frame is assigned to its corresponding thread, and each thread executes its decoding task independently, so that the key frame images are extracted quickly and concurrently (see the sketch after this list of embodiments).
(2) Calculate the similarity between every two adjacent frames in the video to be classified, take positions with low similarity (for example, below a given threshold) as content mutation points of the video, and select key frames around each mutation point, for example one frame immediately before and one immediately after it, giving two key frames; within sections of continuous content, select key frames sparsely.
(3) Select one frame as a key frame at fixed time intervals throughout the video to be classified.
It should be noted that, to facilitate subsequent processing, step S110 may acquire a fixed number of key frame images, for example 64 or 128. The relevant parameters of each mode can then be determined from this number, for example: the number of I frames to extract in mode (1), where P frames or B frames can be extracted to make up any shortfall if the video does not contain enough I frames; the sparsity of key frame selection in mode (2); and the interval length in mode (3).
Further, the above three modes can also be combined in this exemplary embodiment; for example, modes (1) and (2) can be used together, with I frames before and after each content mutation point selected as key frames and a small number of I frames selected as key frames in the sections of continuous content.
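As a non-limiting illustration of mode (1) above, the following is a minimal Python sketch of concurrent I-frame extraction. It assumes the PyAV library as the decoder (this disclosure does not prescribe a particular decoder); the one-container-per-thread design, the function names and the num_keyframes parameter are illustrative choices rather than part of this disclosure.

from concurrent.futures import ThreadPoolExecutor
import av  # assumed decoder library; any decoder exposing I-frame access would do

def keyframe_timestamps(path):
    # First pass: demux only (no decoding) and record the pts of every I-frame packet.
    with av.open(path) as container:
        stream = container.streams.video[0]
        return [p.pts for p in container.demux(stream)
                if p.is_keyframe and p.pts is not None]

def decode_keyframe(path, pts):
    # Each worker thread opens its own container, seeks to the target I-frame
    # and decodes only that frame.
    with av.open(path) as container:
        stream = container.streams.video[0]
        container.seek(pts, stream=stream, backward=True, any_frame=False)
        for frame in container.decode(stream):
            return frame.to_ndarray(format="rgb24")  # H x W x 3 key frame image

def extract_keyframe_images(path, num_keyframes=64, num_threads=8):
    pts_list = keyframe_timestamps(path)[:num_keyframes]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda pts: decode_keyframe(path, pts), pts_list))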
Step S120, extracting features from the plurality of key frame images by using a pre-trained convolutional neural network.
In this exemplary embodiment, the convolutional neural network is used mainly to extract features from the key frame images rather than to classify or identify them, so there is no restriction on what type of data the convolutional neural network finally outputs. The advantage is that the type of label used to train the convolutional neural network is not restricted either: whichever labels are available or easy to obtain can be used for training. For example, an open-source image dataset containing a large number of image classification labels can be used, and the convolutional neural network can be trained for image classification accordingly.
In step S120, each key frame image may be input into the convolutional neural network and, after a series of convolution and pooling operations, features may be taken from a fully connected layer. The first fully connected layer may be chosen, where the features are denser, or a later fully connected layer, where the amount of data is generally smaller; this disclosure does not limit the choice.
It should be noted that the convolutional neural network may process each key frame image separately, so that one group of features is extracted for each key frame image, in the form of a vector or a matrix.
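As a non-limiting illustration of step S120, the following minimal sketch extracts one feature vector per key frame image. It assumes a torchvision ResNet-50 pre-trained on an open-source image classification dataset; this disclosure does not fix a particular network structure, and the 2048-dimensional output here simply corresponds to the layer just before that network's classification head.

import torch
from torchvision import models, transforms

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()  # keep the pooled 2048-d feature, drop the classifier
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(keyframe_images):
    # keyframe_images: list of H x W x 3 uint8 arrays -> (K, 2048) feature matrix,
    # one group of features per key frame image.
    batch = torch.stack([preprocess(img) for img in keyframe_images])
    return backbone(batch)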
Step S130, according to the time stamps of the plurality of key frame images in the video to be classified, arranging the features corresponding to the key frame images to obtain a feature sequence.
In this exemplary embodiment, the timestamp of each key frame in the video to be classified may be obtained to determine the temporal order of the key frames, and the features extracted in step S120 for the key frame images are then arranged in that order to obtain a feature sequence; the feature sequence therefore carries the temporal distribution information of the groups of features. Taking features in vector form as an example, the vectors can be concatenated one after another into a feature sequence, or organized two-dimensionally in temporal order into a feature sequence in matrix form, in which the feature vector of the earliest key frame image occupies the first row of the matrix.
In an alternative embodiment, besides arranging the features in temporal order, a time feature may be added to each feature; for example, a time dimension can be appended to the end of each feature vector to record the time data of the corresponding key frame, so that the feature sequence carries more complete temporal information, which helps improve the accuracy of subsequent processing. This disclosure does not limit the specific form of the time feature: it may be, for example, time data in seconds or milliseconds, a frame number (essentially also a kind of time data), the time difference from the previous key frame, or a frame-number difference.
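A minimal sketch of step S130 and of the optional time feature follows; the array shapes and the use of a normalized time value as the extra dimension are illustrative assumptions.

import numpy as np

def build_feature_sequence(features, timestamps, total_duration=None):
    # features: (K, D) array with one feature vector per key frame image,
    # timestamps: length-K timestamps (in seconds) of the key frames in the video.
    order = np.argsort(timestamps)                 # chronological order of the key frames
    seq = np.asarray(features)[order]              # (K, D) feature sequence in matrix form
    if total_duration is not None:                 # optional time feature appended per vector
        progress = np.asarray(timestamps)[order] / float(total_duration)
        seq = np.concatenate([seq, progress[:, None]], axis=1)   # (K, D + 1)
    return seq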
And step S140, processing the feature sequence according to the attention weight of each feature in the feature sequence to obtain a classification result of the video to be classified.
In this exemplary embodiment, an attention (Attention) mechanism is used to process the feature sequence, which requires determining the attention weight of each feature in the sequence. The attention weight represents the association of each feature with the other features, that is, the association information between key frames; using the attention weights, the weight of each feature can be redistributed so as to rebalance the semantics of the feature sequence. The rebalanced feature sequence is then further recognized to obtain the classification result of the video to be classified.
In an alternative embodiment, an attention model may be trained in advance, and the feature sequence is processed by the attention model to obtain the classification result. The attention model is a neural network containing an attention layer; for example, a general RNN (Recurrent Neural Network) or LSTM (Long Short-Term Memory network) with an attention layer added can be used. The attention layer, i.e., the part that reassigns feature weights through the attention weights, can represent the information of the whole sequence; on the basis of attending to the global information, it correspondingly enlarges the attention range, so the model learns more inter-frame local information, and the output of the attention layer is generated from the processing of the features of each part. Subsequent intermediate layers learn further, recognition is performed on the basis of the temporal distribution information of the feature sequence, and the classification result of the video to be classified is finally output. In an RNN or LSTM in particular, as the state is propagated, the information of earlier frames in the sequence is diluted by that of later frames and has less influence on the subsequent output, which means the model loses the earlier frame information to some extent; the attention layer can reduce this effect, increasing the completeness and richness of feature learning and improving the accuracy of the classification result.
The attention weights may be calculated from the similarity between features. For example, for a feature sequence containing K features C1 to CK, the similarity of C1 to each other feature can be calculated, summed, and normalized to give the attention weight of C1. Alternatively, the feature sequence can be processed by the intermediate layers of the attention model and the processed features aligned to obtain the attention weights. This disclosure does not limit the specific algorithm used for the attention weights.
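A minimal sketch of the similarity-based option follows: the attention weight of each feature is its summed cosine similarity to the other features, normalized over the sequence. The use of cosine similarity and softmax normalization is an illustrative assumption; other similarity measures and normalizations can be used.

import numpy as np

def similarity_attention_weights(seq):
    # seq: (K, D) feature sequence -> (K,) attention weights summing to 1.
    normed = seq / (np.linalg.norm(seq, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                 # (K, K) pairwise cosine similarities
    scores = sim.sum(axis=1) - 1.0          # similarity of each feature to the others
    exp = np.exp(scores - scores.max())     # softmax normalization
    return exp / exp.sum()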
It should be noted that the convolutional neural network used in step S120 can be regarded as a base model (Base Net): it processes each key frame image through a single-channel or three-channel input (a single channel corresponding to grayscale key frame images, three channels corresponding to RGB) and is a network model at the image level. The attention model used in step S140 can be regarded as a fusion model (Fusion Net): it processes the feature sequence corresponding to the plurality of key frame images and is a network model at the image-sequence or video level.
Based on the above, in this exemplary embodiment a plurality of key frame images are obtained from the video to be classified, features are extracted with a convolutional neural network, the features are arranged according to their temporal order in the video to obtain a feature sequence, and the feature sequence is processed according to the attention weight of each feature to obtain the video classification result. On the one hand, the key frame images extracted from the video are not treated as isolated from one another: by extracting and arranging the features, the temporal distribution information among the key frames is integrated into the feature sequence, and this information allows the semantics produced by the arrangement of images within the video to be mined during processing, so that more accurate video classification is achieved. On the other hand, the attention weight of each feature in the feature sequence reflects the local, inter-frame characteristics of the video to be classified, so that the frame information of each part is fully preserved even when the video content is unevenly distributed; the features are thus processed at both the local and the global level, which further improves the accuracy of the video classification result.
The training process of the attention model may include the following. First, a large number of sample videos and their classification labels are obtained; the labels may be manually annotated classification results or taken from an existing dataset. Features are then extracted with the trained convolutional neural network, in the same way as in step S120, and the features of each sample video are arranged into a feature sequence, in the same way as in step S130. An initial attention model is constructed, for example based on an RNN or LSTM with an attention layer added, and its parameters are set to initial values (e.g., by random initialization). Finally, the feature sequences of the sample videos are input into the model, and the model parameters are updated according to the output data and the labels until a certain accuracy is reached, at which point training is complete.
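A minimal sketch of this training procedure follows, assuming the attention model takes a feature sequence and a state sequence as inputs (in the manner of the two-channel model sketched later in this description); the optimizer, loss function and hyperparameters are illustrative assumptions.

import torch
from torch.utils.data import DataLoader

def train_attention_model(model, dataset, num_epochs=10, learning_rate=1e-3):
    # dataset yields (feature_sequence, state_sequence, label) triples for sample videos.
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(num_epochs):
        for feature_seq, state_seq, labels in loader:
            logits = model(feature_seq, state_seq)   # classification scores per sample video
            loss = loss_fn(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model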
In an alternative embodiment, model pruning or model quantization may also be performed when training the attention model.
The process of pruning the model may include the following steps S210 and S220, as shown in fig. 2:
step S210, determining invalid channels in the attention model, and removing the invalid channels from the attention model;
in step S220, parameters of the attention model are adjusted to minimize the reconstruction error.
Step S210 may be implemented by an algorithm such as lasso regression (Lasso Regression). Lasso regression is a shrinkage estimation method that minimizes the residual sum of squares under the constraint that the sum of the absolute values of the regression coefficients is less than a constant; this drives some regression coefficients to exactly zero and yields a reduced model. In this exemplary embodiment, the intermediate-layer (including attention-layer) neurons corresponding to each group of features in the feature sequence are taken as one channel, residuals are calculated by lasso regression for different channel combinations to determine the invalid channels, and the invalid channels are then pruned. Pruning invalid channels generally affects the output of the model, so the accuracy may drop, and the parameters therefore need to be fine-tuned. Fine-tuning is equivalent to a simplified training process: gradient descent on a loss function is used to adjust the parameters so that the reconstruction error is minimized. The resulting model can be put into practical use, and in particular, when deployed on a client it can respond and process quickly, which makes it highly practical.
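A minimal sketch of the lasso-based channel selection follows: a lasso regression reconstructs a layer's output from per-channel contributions, and channels whose coefficients shrink to zero are treated as invalid. The data layout and the regularization strength are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

def find_invalid_channels(channel_contributions, layer_output, alpha=0.01):
    # channel_contributions: (num_samples, num_channels) contribution of each channel,
    # layer_output: (num_samples,) output values to reconstruct.
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(channel_contributions, layer_output)
    return np.where(np.isclose(lasso.coef_, 0.0))[0]   # indices of channels to prune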
The process of model quantization may be as shown in fig. 3, including the following steps S310 to S330:
step S310, counting the numerical distribution of the parameters of the attention model, and determining a reference threshold according to the counting result;
step S320, a preset numerical value range is obtained, and a numerical value mapping relation is determined according to the preset numerical value range and a reference threshold value;
in step S330, the parameters of the attention model are mapped to the preset numerical range through the numerical mapping relationship.
The numerical distribution of the parameters of the attention model includes their numerical range, distribution characteristics (e.g., whether they follow a normal or linear distribution), and so on. The reference threshold may be the maximum, minimum, median, mean, or span (maximum minus minimum) of the numerical range, or the value corresponding to the peak of the distribution; to some extent it represents the numerical distribution characteristics of the attention model. The preset numerical range is the range of numerical processing for video classification determined in advance according to actual demand, scene characteristics and system resources; it may be [-2^n + 1, 2^n - 1], where n is a positive integer such as 1, 2, 3, etc. Taking n = 7 as an example, the preset numerical range is [-127, 127], and each parameter of the model becomes int8 data. In this exemplary embodiment, a linear numerical mapping may be determined from the ratio between the upper (or lower) limit of the preset numerical range and the reference threshold, or the parameters may be processed in a normalization-like manner; this disclosure does not limit the choice. The purpose is to map the parameters of the attention model into the preset numerical range, at which point the model can be normalized and encoded.
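A minimal sketch of such a linear mapping follows, taking the maximum absolute parameter value as the reference threshold; n = 7 gives the int8 example above. Using the maximum absolute value as the reference threshold is one of the options mentioned in the text, not the only one.

import numpy as np

def quantize_parameters(params, n=7):
    # Map float parameters into the preset range [-(2**n - 1), 2**n - 1].
    params = np.asarray(params, dtype=np.float64)
    upper = 2 ** n - 1
    reference = float(np.max(np.abs(params)))       # reference threshold from the distribution
    scale = upper / reference if reference > 0 else 1.0
    quantized = np.clip(np.round(params * scale), -upper, upper).astype(np.int8)
    return quantized, scale                          # np.int8 fits n = 7; scale allows dequantization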
In an alternative embodiment, step S140 may be specifically implemented by:
normalizing the time stamps of the plurality of key frame images in the video to be classified according to the total duration of the video to be classified to obtain a state sequence corresponding to the feature sequence;
inputting the feature sequence into a feature input channel of a pre-trained attention model, and obtaining intermediate data corresponding to each feature in the feature sequence through a coding layer of the attention model;
inputting the state sequence into a state input channel of the attention model, and calculating the attention weight of each feature in the feature sequence through the state input channel;
and weighting the intermediate data by using the attention weight, processing the weighted data by a decoding layer of the attention model, and outputting a classification result of the video to be classified.
In the normalization process, the time-progress percentage of each key frame image within the video to be classified (such as 50%, 80%, etc.) can be calculated, and these time-progress values are arranged into the state sequence, which accurately represents the temporal distribution information of the feature sequence.
The attention model is configured with two input channels: a feature input channel and a state input channel. The state sequence is input into the state input channel, which is connected directly to the attention layer; after the parameter operations of the attention layer, the attention weight corresponding to each feature is obtained. The feature sequence is input into the feature input channel and processed by the coding layer (for example, a recurrent intermediate layer or a long short-term memory layer) to obtain intermediate data, with each feature corresponding to one piece of intermediate data. In the attention layer of the attention model, each piece of intermediate data is weighted by its attention weight. It should be noted that the attention weights may form a weight matrix containing the association weight between every pair of features; during weighting, the vector formed by the intermediate data is multiplied by the attention weight matrix to obtain the weighted intermediate data. The decoding layer of the attention model (the intermediate layers after the attention layer) then performs the subsequent processing and outputs the final video classification result.
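A minimal PyTorch sketch of this two-channel structure follows. For simplicity it derives a per-frame attention weight from the state input channel rather than a full pairwise weight matrix, and it uses an LSTM as the coding layer and a single linear decoding layer; all layer sizes and these simplifications are illustrative assumptions.

import torch
import torch.nn as nn

class TwoChannelAttentionModel(nn.Module):
    def __init__(self, feature_dim=2048, hidden_dim=512, num_classes=10):
        super().__init__()
        self.coding_layer = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.state_channel = nn.Linear(1, 1)          # state input channel feeding the attention layer
        self.decoding_layer = nn.Linear(hidden_dim, num_classes)

    def forward(self, feature_seq, state_seq):
        # feature_seq: (B, K, feature_dim) feature sequences,
        # state_seq:   (B, K) normalized timestamps (time-progress values).
        intermediate, _ = self.coding_layer(feature_seq)          # (B, K, hidden_dim) intermediate data
        scores = self.state_channel(state_seq.unsqueeze(-1))      # (B, K, 1) attention scores
        weights = torch.softmax(scores, dim=1)                    # attention weight per feature
        weighted = (weights * intermediate).sum(dim=1)            # weighted intermediate data, (B, hidden_dim)
        return self.decoding_layer(weighted)                      # classification result scores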
In this way, the parameters of the attention layer are in effect adjusted with the temporal distribution information of the key frame images in the video to be classified, and the differences between videos and between their key frames are reflected in the attention weights, so inter-frame attention is computed more accurately and the classification accuracy is further improved.
The exemplary embodiments of the present disclosure also provide a video classification apparatus, as shown in fig. 4, the video classification apparatus 400 may include: an image acquisition module 410, configured to acquire a plurality of key frame images from a video to be classified; a feature extraction module 420, configured to extract features from the plurality of key frame images respectively using a convolutional neural network trained in advance; the feature arrangement module 430 is configured to arrange features corresponding to the key frame images according to the timestamps of the plurality of key frame images in the video to be classified, so as to obtain a feature sequence; the feature processing module 440 is configured to process the feature sequence according to the attention weight of each feature in the feature sequence, so as to obtain a classification result of the video to be classified.
In an alternative embodiment, the image obtaining module 410 may be further configured to extract a plurality of intra-frame encoded frames from the video to be classified and decode the frames to obtain a plurality of key frame images.
In an alternative embodiment, the image acquisition module 410 may be further configured to invoke a plurality of threads to decode the intra-coded frames, such that each thread decodes a respective intra-coded frame.
In an alternative embodiment, the feature processing module 440 may be further configured to obtain the classification result of the video to be classified by performing the following steps: according to the total duration of the video to be classified, carrying out normalization processing on time stamps of a plurality of key frame images in the video to be classified to obtain a state sequence corresponding to the feature sequence; inputting the feature sequence into a feature input channel of a pre-trained attention model, and obtaining intermediate data corresponding to each feature in the feature sequence through a coding layer of the attention model; inputting the state sequence into a state input channel of the attention model, and calculating the attention weight of each feature in the feature sequence through the state input channel; and weighting the intermediate data by using the attention weight, processing the weighted data by a decoding layer of the attention model, and outputting a classification result of the video to be classified.
In an alternative embodiment, the video classification apparatus 400 may further include a model training module, configured to perform model pruning when training the attention model by executing the following steps: determining invalid channels in the attention model and removing them from the attention model; and adjusting the parameters of the attention model to minimize the reconstruction error.
In an alternative embodiment, the video classification apparatus 400 may further include a model training module, configured to perform model quantization when training the attention model by executing the following steps: counting the numerical distribution of the parameters of the attention model, and determining a reference threshold according to the counting result; acquiring a preset numerical range, and determining a numerical mapping relation according to the preset numerical range and the reference threshold; and mapping the parameters of the attention model into the preset numerical range through the numerical mapping relation.
Further, the preset numerical range includes: [-2^n + 1, 2^n - 1], where n is a positive integer.
The specific details of each module in the above apparatus are already described in the method section, and the details that are not disclosed can be referred to the embodiment of the method section, so that they will not be described in detail.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification. In some possible implementations, aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing an electronic device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on an electronic device.
Referring to fig. 5, a program product 500 for implementing the above-described method according to an exemplary embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on an electronic device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
The exemplary embodiment of the disclosure also provides an electronic device capable of implementing the method. An electronic device 600 according to such an exemplary embodiment of the present disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 6, the electronic device 600 may be embodied in the form of a general purpose computing device. Components of electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different system components (including the memory unit 620 and the processing unit 610), and a display unit 640.
The storage unit 620 stores program codes that can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present disclosure described in the above "exemplary method" section of the present specification. For example, the processing unit 610 may perform any one or more of the method steps of fig. 1-3.
The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 621 and/or cache memory 622, and may further include Read Only Memory (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit bus, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. As shown, network adapter 660 communicates with other modules of electronic device 600 over bus 630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the exemplary embodiments of the present disclosure.
Furthermore, the above-described figures are only schematic illustrations of processes included in the method according to the exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with exemplary embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (9)

1. A method of video classification, comprising:
acquiring a plurality of key frame images from a video to be classified;
extracting features from the plurality of key frame images respectively by utilizing a pre-trained convolutional neural network;
according to the time stamps of the plurality of key frame images in the video to be classified, arranging the features corresponding to the key frame images to obtain a feature sequence;
processing the feature sequence according to the attention weight of each feature in the feature sequence to obtain a classification result of the video to be classified;
the processing the feature sequence according to the attention weight of each feature in the feature sequence to obtain the classification result of the video to be classified comprises the following steps:
normalizing the time stamps of the plurality of key frame images in the video to be classified according to the total duration of the video to be classified to obtain a state sequence corresponding to the feature sequence;
inputting the feature sequence into a feature input channel of a pre-trained attention model, and obtaining intermediate data corresponding to each feature in the feature sequence through a coding layer of the attention model;
inputting the state sequence into a state input channel of the attention model, and calculating the attention weight of each feature in the feature sequence through the state input channel;
and weighting the intermediate data by using the attention weight, processing the weighted data by a decoding layer of the attention model, and outputting a classification result of the video to be classified.
2. The method of claim 1, wherein the obtaining a plurality of keyframe images from the video to be classified comprises:
extracting a plurality of intra-frame coding frames from the video to be classified;
and decoding the intra-frame coding frame to obtain a plurality of key frame images.
3. The method of claim 2, wherein a plurality of threads are invoked when decoding the intra-coded frames, such that each of the threads decodes a respective one of the intra-coded frames.
4. The method of claim 1, wherein in training the attention model, the method further comprises:
determining inactive channels in the attention model, removing the inactive channels from the attention model;
parameters of the attention model are adjusted to minimize reconstruction errors.
5. The method of claim 1, wherein in training the attention model, the method further comprises:
counting the numerical distribution of the parameters of the attention model, and determining a reference threshold according to a counting result;
acquiring a preset numerical value range, and determining a numerical value mapping relation according to the preset numerical value range and the reference threshold value;
and mapping the parameters of the attention model into the preset numerical range through the numerical mapping relation.
6. The method of claim 5, wherein the preset numerical range comprises: [-2^n + 1, 2^n - 1], where n is a positive integer.
7. A video classification apparatus, comprising:
the image acquisition module is used for acquiring a plurality of key frame images from the video to be classified;
the feature extraction module is used for respectively extracting features from the plurality of key frame images by utilizing a pre-trained convolutional neural network;
the feature arrangement module is used for arranging the features corresponding to the key frame images according to the time stamps of the plurality of key frame images in the video to be classified to obtain a feature sequence;
the feature processing module is used for processing the feature sequence according to the attention weight of each feature in the feature sequence to obtain a classification result of the video to be classified;
the feature processing module is used for obtaining a classification result of the video to be classified by executing the following steps: normalizing the time stamps of the plurality of key frame images in the video to be classified according to the total duration of the video to be classified to obtain a state sequence corresponding to the feature sequence; inputting the feature sequence into a feature input channel of a pre-trained attention model, and obtaining intermediate data corresponding to each feature in the feature sequence through a coding layer of the attention model; inputting the state sequence into a state input channel of the attention model, and calculating the attention weight of each feature in the feature sequence through the state input channel; and weighting the intermediate data by using the attention weight, processing the weighted data by a decoding layer of the attention model, and outputting a classification result of the video to be classified.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any one of claims 1 to 6.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 6 via execution of the executable instructions.
CN201911168580.4A 2019-11-25 2019-11-25 Video classification method, video classification device, storage medium and electronic equipment Active CN111026915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911168580.4A CN111026915B (en) 2019-11-25 2019-11-25 Video classification method, video classification device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911168580.4A CN111026915B (en) 2019-11-25 2019-11-25 Video classification method, video classification device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111026915A CN111026915A (en) 2020-04-17
CN111026915B true CN111026915B (en) 2023-09-15

Family

ID=70202124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911168580.4A Active CN111026915B (en) 2019-11-25 2019-11-25 Video classification method, video classification device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111026915B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723239B (en) * 2020-05-11 2023-06-16 华中科技大学 Video annotation method based on multiple modes
CN112084371B (en) * 2020-07-21 2024-04-16 中国科学院深圳先进技术研究院 Movie multi-label classification method and device, electronic equipment and storage medium
CN111813996B (en) * 2020-07-22 2022-03-01 四川长虹电器股份有限公司 Video searching method based on sampling parallelism of single frame and continuous multi-frame
CN112307883B (en) * 2020-07-31 2023-11-07 北京京东尚科信息技术有限公司 Training method, training device, electronic equipment and computer readable storage medium
CN112001754A (en) * 2020-08-21 2020-11-27 上海风秩科技有限公司 User portrait generation method, device, equipment and computer readable medium
IL301593B1 (en) * 2020-09-23 2024-03-01 Proscia Inc Critical component detection using deep learning and attention
CN111931732B (en) * 2020-09-24 2022-07-15 苏州科达科技股份有限公司 Method, system, device and storage medium for detecting salient object of compressed video
CN113762571A (en) * 2020-10-27 2021-12-07 北京京东尚科信息技术有限公司 Short video category prediction method, system, electronic device and storage medium
CN113496208B (en) * 2021-05-20 2022-03-04 华院计算技术(上海)股份有限公司 Video scene classification method and device, storage medium and terminal
CN113627534A (en) * 2021-08-11 2021-11-09 百度在线网络技术(北京)有限公司 Method and device for identifying type of dynamic image and electronic equipment
CN113673588A (en) * 2021-08-12 2021-11-19 连尚(北京)网络科技有限公司 Method, apparatus, medium, and program product for video classification
CN113722541A (en) * 2021-08-30 2021-11-30 深圳市商汤科技有限公司 Video fingerprint generation method and device, electronic equipment and storage medium
CN115272831B (en) * 2022-09-27 2022-12-09 成都中轨轨道设备有限公司 Transmission method and system for monitoring images of suspension state of contact network
CN116366790B (en) * 2023-05-26 2023-10-27 深圳市神飞致远技术有限公司 Network video storage method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862391A (en) * 2019-03-18 2019-06-07 网易(杭州)网络有限公司 Video classification methods, medium, device and calculating equipment
CN110119757A (en) * 2019-03-28 2019-08-13 北京奇艺世纪科技有限公司 Model training method, video category detection method, device, electronic equipment and computer-readable medium
CN110347873A (en) * 2019-06-26 2019-10-18 Oppo广东移动通信有限公司 Video classification methods, device, electronic equipment and storage medium
CN110491502A (en) * 2019-03-08 2019-11-22 腾讯科技(深圳)有限公司 Microscope video stream processing method, system, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491502A (en) * 2019-03-08 2019-11-22 腾讯科技(深圳)有限公司 Microscope video stream processing method, system, computer equipment and storage medium
CN109862391A (en) * 2019-03-18 2019-06-07 网易(杭州)网络有限公司 Video classification methods, medium, device and calculating equipment
CN110119757A (en) * 2019-03-28 2019-08-13 北京奇艺世纪科技有限公司 Model training method, video category detection method, device, electronic equipment and computer-readable medium
CN110347873A (en) * 2019-06-26 2019-10-18 Oppo广东移动通信有限公司 Video classification methods, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111026915A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN111026915B (en) Video classification method, video classification device, storage medium and electronic equipment
CN109740499B (en) Video segmentation method, video motion recognition method, device, equipment and medium
CN109104620B (en) Short video recommendation method and device and readable medium
CN108882020B (en) Video information processing method, device and system
CN109874029B (en) Video description generation method, device, equipment and storage medium
CN110347873B (en) Video classification method and device, electronic equipment and storage medium
US20200320307A1 (en) Method and apparatus for generating video
CN110674673A (en) Key video frame extraction method, device and storage medium
CN109871736B (en) Method and device for generating natural language description information
CN111783712A (en) Video processing method, device, equipment and medium
CN111199541A (en) Image quality evaluation method, image quality evaluation device, electronic device, and storage medium
CN110781818B (en) Video classification method, model training method, device and equipment
CN113221983B (en) Training method and device for transfer learning model, image processing method and device
CN110414335A (en) Video frequency identifying method, device and computer readable storage medium
CN113901909A (en) Video-based target detection method and device, electronic equipment and storage medium
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN113409803B (en) Voice signal processing method, device, storage medium and equipment
CN112364933A (en) Image classification method and device, electronic equipment and storage medium
CN116662604A (en) Video abstraction method based on layered Transformer
CN115935010A (en) Method and device for generating video by text, computer equipment and storage medium
CN115269998A (en) Information recommendation method and device, electronic equipment and storage medium
CN116702835A (en) Neural network reasoning acceleration method, target detection method, device and storage medium
JP2018137639A (en) Moving image processing system, encoder and program, decoder and program
CN114066841A (en) Sky detection method and device, computer equipment and storage medium
CN110209878B (en) Video processing method and device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant