CN115049963A - Video classification method and device, processor and electronic equipment


Info

Publication number
CN115049963A
CN115049963A (application CN202210720251.1A)
Authority
CN
China
Prior art keywords
frames
video
frame
classified
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210720251.1A
Other languages
Chinese (zh)
Inventor
张宏韬
刘华杰
杨晓诚
冯如
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210720251.1A
Publication of CN115049963A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 - Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video classification method and device, a processor and electronic equipment, and relates to the field of artificial intelligence. The method comprises the following steps: acquiring a video to be classified; determining key frames and sampling frames of the video to be classified; merging the key frames and the sampling frames according to their time order in the video to be classified to obtain an image frame sequence; and determining a classification result of the video to be classified based on the image frame sequence. The application thereby solves the problem of poor video classification performance in the related art.

Description

Video classification method and device, processor and electronic equipment
Technical Field
The application relates to the field of artificial intelligence, and in particular to a video classification method and device, a processor and electronic equipment.
Background
For video classification using a deep neural network, the current scheme is to decompose a video frame by frame, classify and identify each frame with the deep neural network (or extract frames at random, for example keeping one frame out of every 5), and finally compute the final classification result by weighted averaging.
This approach is time- and resource-consuming. A typical video runs at 30 frames per second, so a 10-minute video produces 18,000 frames, i.e., 18,000 images. A V100 graphics card running real-time ResNet inference reaches roughly 60 FPS, so fully processing a 10-minute video takes about 5 minutes; to accelerate prediction, parallel computation across multiple GPUs is therefore often required. Random frame extraction reduces the GPU cost, but because of the randomness of the extraction it is difficult to guarantee that the selected frames contain the key information, and the result is often badly degraded by interference. Even when all frame information is used, the large amount of irrelevant information in the video can still distort the final weighted result.
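For illustration only (not part of the disclosure), this baseline can be sketched in Python as follows; the classifier callable `model`, the stride, and the confidence-based weighting are assumptions:

```python
# Hypothetical sketch of the frame-by-frame baseline: classify every
# `stride`-th decoded frame, then weight-average the per-frame outputs.
import cv2
import numpy as np

def classify_video_naive(path, model, stride=5):
    cap = cv2.VideoCapture(path)
    probs, weights = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:        # fixed frame skipping
            p = model(frame)         # assumed: returns a (num_classes,) probability vector
            probs.append(p)
            weights.append(p.max())  # assumed weighting: per-frame confidence
        idx += 1
    cap.release()
    probs, weights = np.asarray(probs), np.asarray(weights)
    return (probs * weights[:, None]).sum(axis=0) / weights.sum()
```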
No effective solution has yet been proposed for the problem of poor video classification performance in the related art.
Disclosure of Invention
The present application mainly aims to provide a video classification method, device, processor and electronic device, so as to solve the problem of poor video classification effect in the related art.
To achieve the above object, according to one aspect of the present application, there is provided a video classification method. The method comprises the following steps: acquiring a video to be classified; determining key frames and sampling frames of the video to be classified; merging the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain an image frame sequence; determining a classification result of the video to be classified based on the image frame sequence.
Optionally, determining the key frames and the sampling frames of the video to be classified comprises: determining a plurality of image frames of the video to be classified; performing frame extraction on the plurality of image frames to obtain the sampling frames; and clustering the plurality of image frames to determine the key frames.
Optionally, performing frame extraction on the plurality of image frames to obtain the sampling frames includes: extracting image frames from the plurality of image frames as the sampling frames at a predetermined frame-number interval; or extracting image frames from the plurality of image frames as the sampling frames at a predetermined time interval.
Optionally, the merging the key frame and the sample frame according to the time sequence of the key frame and the sample frame in the video to be classified to obtain an image frame sequence includes: determining the playing time of the key frame in the video to be classified as a first playing time; determining the playing time of the sampling frame in the video to be classified as a second playing time; determining the arrangement sequence of the key frames and the sampling frames according to the time sequence of the first playing time and the second playing time; and determining the image frame sequence according to the arrangement sequence of the key frames and the sampling frames.
Optionally, determining a classification result of the video to be classified based on the image frame sequence comprises: inputting a plurality of predetermined image frames in the image frame sequence into a preset convolutional neural network model, and determining a feature matrix and a probability of each predetermined image frame, wherein the predetermined image frames comprise the key frames and the sampling frames, and the preset convolutional neural network model is trained on sample images with calibrated features; and determining the classification result of the video to be classified according to the feature matrices and probabilities of the plurality of predetermined image frames in the image frame sequence.
Optionally, determining the classification result of the video to be classified according to the feature matrices and probabilities of the plurality of predetermined image frames in the image frame sequence comprises: multiplying the feature matrix of each predetermined image frame by its probability to obtain a feature result; arranging the feature results according to the order of the predetermined image frames in the image frame sequence to obtain a feature result sequence; and inputting the feature result sequence into a preset recurrent neural network model to determine the classification result of the video to be classified, wherein the preset recurrent neural network model is trained on sample videos with calibrated classification results and the feature result sequences corresponding to the sample videos.
Optionally, after inputting a plurality of predetermined image frames in the image frame sequence into a preset convolutional neural network model, and determining a feature matrix and a probability of each of the predetermined image frames, the method further comprises: adjusting the probability of the key frame to a preset value; and deleting the sampling frames with the probability lower than a preset threshold value in the image frame sequence.
In order to achieve the above object, according to another aspect of the present application, there is provided a video classification apparatus including: the acquiring unit is used for acquiring a video to be classified; the first determining unit is used for determining key frames and sampling frames of the video to be classified; the merging unit is used for merging the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain an image frame sequence; a second determining unit, configured to determine a classification result of the video to be classified based on the image frame sequence.
To achieve the above object, according to another aspect of the present application, there is provided a processor. The processor is used for running a program, wherein the program executes the video classification method during running.
To achieve the above object, according to another aspect of the present application, there is provided an electronic device. The electronic device comprises one or more processors and memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video classification method described above.
Through the application, the following steps are adopted: acquiring a video to be classified; determining key frames and sampling frames of the video to be classified; merging the key frames and the sampling frames according to their time order in the video to be classified to obtain an image frame sequence; and determining the classification result of the video to be classified based on the image frame sequence. The problem of poor video classification performance in the related art is thereby solved, and the video classification result can be determined accurately.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
fig. 1 is a flowchart of a video classification method provided according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a predetermined convolutional neural network model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a preset recurrent neural network model according to an embodiment of the present application;
fig. 4 is a schematic diagram of a video classification apparatus according to an embodiment of the present application;
fig. 5 is a schematic diagram of an electronic device provided according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances, such that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The present invention is described below with reference to preferred implementation steps, and fig. 1 is a flowchart of a video classification method provided in an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
step S101, acquiring a video to be classified;
step S102, determining a key frame and a sampling frame of a video to be classified;
step S103, merging the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain an image frame sequence;
and step S104, determining a classification result of the video to be classified based on the image frame sequence.
It should be noted that a video is composed of still pictures, called frames; the video to be classified therefore includes a plurality of image frames.
In the above step S101, the video to be classified may be in any known non-encrypted video format (MP4, MAV, MOV, etc.), and the video duration is not limited.
In step S102, the sampling frames may be randomly extracted from the image frames of the video to be classified, and the key frames may be image frames having special image characteristics in the video to be classified.
Optionally, the key frames differ from the other image frames in image characteristics such as color, contrast or brightness.
In step S103, each image frame represents the picture displayed by the video to be classified at a specific playing time, so every image frame carries a time attribute, as do the key frames and sampling frames derived from the image frames. After the key frames and sampling frames are determined, the playing time of each key frame and each sampling frame in the video to be classified can therefore be determined, and the key frames and sampling frames are merged in the order of these playing times to obtain the image frame sequence.
In the above step S103, the image frame sequence includes at least one key frame and at least one sample frame.
In the above embodiment of the present invention, the image frame sequence contains both sampling frames and key frames, so the key information of the video to be classified is largely retained in the image frame sequence, and the video to be classified can be classified accurately according to the image frame sequence.
In step S104, the image frame sequence may be analyzed in a machine learning manner, and a classification result of the video to be classified is determined.
Optionally, in the video classification method provided in the embodiment of the present application, determining the key frame and the sample frame of the video to be classified includes: determining a plurality of image frames of a video to be classified; extracting a plurality of image frames to obtain sampling frames; clustering is carried out on the plurality of image frames, and a key frame is determined.
In the above embodiment of the present invention, the sampling frames are extracted from the plurality of image frames of the video to be classified by simple frame extraction, which means removing some images according to a fixed rule; the key frames are determined by clustering the plurality of image frames of the video to be classified.
Optionally, the video to be classified is processed frame by frame to generate images, yielding an original frame picture set (i.e., a plurality of image frames) P = {F1, F2, F3, ..., Fn}.
Optionally, performing simple frame extraction on the original frame picture set P = {F1, F2, F3, ..., Fn} yields the picture set after simple frame extraction (i.e., the sampling frames) F = {F1, F6, F11, ..., Fn}.
It should be noted that simple frame extraction means removing some pictures according to a fixed rule; a common approach is to keep one frame out of every 5, or to keep 1/3 of the frames in each second (for a 30 FPS video, 10 frames per second are kept).
Optionally, the key frame calculation is based on the original frame picture set (i.e., the plurality of image frames) P = {F1, F2, F3, ..., Fn}. Key frames are distinguished from other frames in that their image features, such as color, contrast or brightness, are distinctive. A K-means clustering algorithm is therefore applied to P to obtain different clusters; a histogram analysis is then performed on the images in each cluster, and the frames whose histogram mean deviates strongly from the cluster average are selected as key frames, yielding the key frame set K.
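A minimal Python sketch of this clustering step is given below for illustration; the grayscale-histogram features, the cluster count and the deviation threshold are all assumptions, since the text does not fix them:

```python
# Hypothetical sketch of key-frame selection: cluster frame histograms
# with K-means, then keep frames whose histogram deviates strongly from
# their cluster's mean. Cluster count and threshold are assumptions.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(frames, n_clusters=8, deviation=2.0):
    # normalized grayscale histograms as per-frame features
    hists = np.stack([
        cv2.calcHist([cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)],
                     [0], None, [64], [0, 256]).ravel()
        for f in frames
    ])
    hists /= hists.sum(axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(hists)
    key_idx = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(hists[members] - hists[members].mean(axis=0), axis=1)
        # frames far from the cluster mean are treated as key frames
        key_idx.extend(members[dists > deviation * dists.std()].tolist())
    return sorted(key_idx)  # indices of the key frame set K
```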
Optionally, in the video classification method provided in the embodiment of the present application, performing frame extraction on the plurality of image frames to obtain the sampling frames includes: extracting image frames from the plurality of image frames as sampling frames at a predetermined frame-number interval; or extracting image frames from the plurality of image frames as sampling frames at a predetermined time interval.
In the above embodiment of the present invention, the sampling frames may be obtained at a predetermined frame-number interval, for example by keeping one frame out of every 5 after splitting the video to be classified into image frames; or at a predetermined time interval, for example by keeping 1/3 of the image frames corresponding to each second of the video to be classified.
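Both sampling rules amount to strided selection over the decoded frame list; a minimal sketch, with interval values taken from the examples above:

```python
# Hypothetical sketch of the two sampling rules described above.
def sample_by_frame_interval(frames, interval=5):
    """Keep one frame out of every `interval` frames."""
    return frames[::interval]

def sample_by_time_interval(frames, fps=30.0, seconds=0.1):
    """Keep one frame every `seconds` of playback time
    (0.1 s at 30 FPS keeps 1/3 of the frames, i.e. 10 per second)."""
    step = max(1, round(fps * seconds))
    return frames[::step]
```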
Optionally, in the video classification method provided in the embodiment of the present application, merging the key frame and the sample frame according to the time sequence of the key frame and the sample frame in the video to be classified to obtain an image frame sequence includes: determining the playing time of the key frame in the video to be classified as a first playing time; determining the playing time of the sampling frame in the video to be classified as a second playing time; determining the arrangement sequence of the key frames and the sampling frames according to the time sequence of the first playing time and the second playing time; and determining the image frame sequence according to the arrangement sequence of the key frames and the sampling frames.
In the above embodiment of the present invention, the key frames and the sample frames are image frames in the video to be classified, so that each key frame and each sample frame have corresponding playing time in the video to be classified, and the arrangement sequence of the key frames and the sample frames is determined according to the playing time of the key frames and the sample frames in the video to be classified, so as to obtain an image frame sequence retaining the key information of the video to be classified.
Optionally, in the video classification method provided in the embodiment of the present application, determining a classification result of the video to be classified based on the image frame sequence includes: inputting a plurality of predetermined image frames in the image frame sequence into a preset convolutional neural network model, and determining a feature matrix and a probability of each predetermined image frame, wherein the predetermined image frames comprise the key frames and the sampling frames, and the preset convolutional neural network model is trained on sample images with calibrated features; and determining the classification result of the video to be classified according to the feature matrices and probabilities of the plurality of predetermined image frames in the image frame sequence.
According to the embodiment of the invention, the plurality of predetermined image frames in the image frame sequence are analyzed with the preset convolutional neural network model, so the feature matrix and probability of each predetermined image frame can be determined rapidly, and the video to be classified is then classified according to the determined feature matrices and probabilities.
Optionally, in the video classification method provided in the embodiment of the present application, determining the classification result of the video to be classified according to the feature matrices and probabilities of the plurality of predetermined image frames in the image frame sequence includes: determining a feature result of each predetermined image frame as the product of its feature matrix and probability; sorting the feature results according to the order of the predetermined image frames in the image frame sequence to obtain a feature result sequence; and inputting the feature result sequence into a preset recurrent neural network model to determine the classification result of the video to be classified, wherein the preset recurrent neural network model is trained on sample videos with calibrated classification results and the feature result sequences corresponding to the sample videos.
According to the embodiment of the invention, the feature result of each predetermined image frame can be determined from the product of its feature matrix and probability; the feature results are then further analyzed with the preset recurrent neural network model, so the video to be classified can be classified according to the feature results of the plurality of predetermined image frames in the image frame sequence.
Optionally, in the video classification method provided in this embodiment of the present application, after inputting a plurality of predetermined image frames in an image frame sequence into a preset convolutional neural network model, and determining a feature matrix and a probability of each predetermined image frame, the method further includes: adjusting the probability of the key frame to a preset value; and deleting the sampling frames with the probability lower than a preset threshold value in the image frame sequence.
According to the embodiment of the invention, adjusting the probability of the key frames to the preset value highlights the key information of the video to be classified, and deleting the sampling frames whose probability is below the preset threshold from the image frame sequence reduces interference information, so the classification result of the video to be classified can be determined accurately from the adjusted image frame sequence.
The present invention also provides a preferred embodiment which provides a method for classifying a video based on a key frame.
The invention pre-extracts the key frames of the video to solve the problem of missing key information caused by random frame extraction, dynamically adjusts the final weighting factors, and uses an RNN model (namely the preset recurrent neural network model) to suppress the influence of irrelevant information in the video on the overall result.
As an alternative example, the present invention provides a flow of a video classification method based on key frames, which includes the following steps:
s201: and inputting the video to be classified.
Alternatively, the video to be classified may be in any known non-encrypted video format (MP4, MAV, MOV, etc.), with unlimited video duration.
S202: and after the frame extraction processing is carried out on the video to be classified, inputting the video to be classified into a preset convolutional neural network model to obtain picture characteristics and probability.
Optionally, the video to be classified is processed frame by frame to generate a plurality of image frames, and simple frame extraction is then performed. In addition, the key frame calculation must be carried out on all the image frames; if a qualifying key frame was filtered out by the simple frame extraction, it needs to be added back in time order.
Optionally, the structure of the preset convolutional neural network model may adopt ResNet or VGG, and it needs to output a feature matrix of each image frame and the classification probability corresponding to that image frame.
S203: inputting the feature matrix and probability of the image frame into a preset recurrent neural network model to obtain a classification result
Each feature matrix output in step S202 is multiplied by its probability to generate a new matrix (namely a feature result); the feature results are stacked in time order and input into the preset recurrent neural network model, and the final classification result is obtained through softmax.
As an alternative example, the detailed flow of the video classification method based on key frames of the present invention includes the following steps:
s301: and inputting the video to be classified.
S302: decomposing the video to be classified into a plurality of image frames frame by frame, and calculating a key frame set (namely a plurality of key frames) K. The picture is retained to obtain a sample frame set (i.e., a plurality of sample frames) F, one sample per N frames. The key frame set (i.e. a plurality of key frames) K and the sampling frame set (i.e. a plurality of sampling frames) F are merged according to time to obtain an image frame sequence FN.
Optionally, the video to be classified is processed into pictures frame by frame to obtain the original frame picture set (i.e., a plurality of image frames) P = {F1, F2, F3, ..., Fn}, and simple frame extraction is then performed. Simple frame extraction means removing some pictures according to a fixed rule; a common approach is to keep one frame out of every 5, or 1/3 of the frames in each second (for a 30 FPS video, 10 frames per second are kept). This yields the sampling frame set (i.e., a plurality of sampling frames) after simple frame extraction, F = {F1, F6, F11, ..., Fn}.
Optionally, the key frame calculation is based on the original frame picture set (i.e., the plurality of image frames) P. Key frames differ from other frames in that their picture features, such as color, contrast or brightness, are distinctive. Therefore, the K-means clustering algorithm is used to cluster P into different clusters, a histogram analysis is performed on the pictures in each cluster, and the frames whose histogram mean deviates too far from the average are selected as key frames, yielding the key frame set (i.e., a plurality of key frames) K.
Optionally, the sampling frame set (i.e., a plurality of sampling frames) F and the key frame set (i.e., a plurality of key frames) K are merged in time order: if a frame in K is already in F, nothing is done; if a frame in K is not in F, it is inserted into F at its position in time order. This yields the image frame sequence FN containing the key frames.
For example, if F = {F1, F6, F11} and K = {F1, F3, F9}, then FN = {F1, F3, F6, F9, F11}.
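Treating each frame index as its playback time, this merge reduces to a sorted set union; a minimal sketch that reproduces the example:

```python
# Hypothetical sketch of the merge in S302: frame indices stand in for
# playback times, so the merge is a sorted union of the two index sets.
def merge_by_time(sample_idx, key_idx):
    return sorted(set(sample_idx) | set(key_idx))

# reproduces the example above
assert merge_by_time([1, 6, 11], [1, 3, 9]) == [1, 3, 6, 9, 11]
```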
S303: and inputting the image frame sequence FN into a preset convolutional neural network model to obtain a feature matrix of each preset image frame and the probability of belonging to a certain classification.
FIG. 2 is a schematic diagram of the preset convolutional neural network model according to an embodiment of the present application. As shown in FIG. 2, the predetermined image frames contained in the image frame sequence FN are input one by one into the pre-trained preset convolutional neural network model; the feature matrix is taken from the layer before the fully connected layer, and the probability is output through softmax.
Optionally, the preset convolutional neural network model is trained in the conventional manner, i.e., the CNN and the fully connected layer are trained together.
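For illustration, the dual-output structure of FIG. 2 might be sketched in PyTorch as follows; the ResNet-18 backbone, the 512-dimensional feature and the class count are assumptions, the text only requiring a feature matrix taken before the fully connected layer and a softmax probability:

```python
# Hypothetical PyTorch sketch of the dual-output CNN of S303 / FIG. 2:
# the feature matrix is taken before the fully connected layer and the
# probability from softmax. The ResNet-18 backbone is an assumption.
import torch
import torch.nn as nn
from torchvision import models

class FrameClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # everything up to (and including) global average pooling
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, num_classes)

    def forward(self, x):                        # x: (B, 3, H, W)
        feat = self.features(x).flatten(1)       # feature matrix, (B, 512)
        prob = torch.softmax(self.fc(feat), 1)   # class probabilities
        return feat, prob
```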
S304: and multiplying the probability by the characteristic matrix of the preset image frame, and overlapping and inputting the result into a preset cyclic neural network model according to the original sequence of time.
Optionally, the probabilities and feature matrices both come from the output of S303. It should be noted, however, that the probability of each key frame is manually adjusted to 1, and frames with a probability below 0.5 are discarded, because a frame with too low a probability becomes an interference term that affects the final recognition result.
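A minimal sketch of this re-weighting step, reading "probability" as the maximum class probability of a frame (an interpretation, not stated explicitly in the text):

```python
# Hypothetical sketch of the re-weighting in S304: key frames are forced
# to probability 1, sampled frames below 0.5 are dropped, and each
# surviving feature is scaled by its (maximum class) probability.
import torch

def build_feature_sequence(feats, probs, is_key, threshold=0.5):
    """feats: (N, D); probs: (N, C); is_key: (N,) bool mask, time-ordered."""
    conf = probs.max(dim=1).values                 # per-frame confidence
    conf = torch.where(is_key, torch.ones_like(conf), conf)
    keep = conf >= threshold                       # key frames always pass
    return feats[keep] * conf[keep].unsqueeze(1)   # (M, D), still time-ordered
```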
FIG. 3 is a schematic diagram of the preset recurrent neural network model according to an embodiment of the present application. As shown in FIG. 3, the preset recurrent neural network model consists of an RNN followed by a fully connected layer.
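For illustration, the RNN-plus-fully-connected structure of FIG. 3 might be sketched as follows; the choice of an LSTM cell, the hidden size and the class count are assumptions:

```python
# Hypothetical sketch of the preset recurrent neural network model of
# FIG. 3: an RNN over the time-ordered feature results, followed by a
# fully connected layer and softmax. LSTM and sizes are assumptions.
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, seq):             # seq: (B, T, feat_dim) feature results
        _, (h, _) = self.rnn(seq)       # h: (num_layers, B, hidden)
        return torch.softmax(self.fc(h[-1]), dim=1)
```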
S305: and outputting a classification result.
It should be noted that previous video recognition approaches perform fixed frame extraction, which easily misses key information and lets interference distort the result; randomness in the extracted frames can likewise bias the result. Moreover, even with frame extraction, the computation remains expensive.
The invention introduces key frame pre-extraction, so that the extracted frames are fewer and sparser, which reduces computation time while improving the accuracy of the result. It also introduces an RNN to perform global analysis of the video after the simple image classification, which is more accurate than the traditional weighted average.
The invention adopts a frame extraction technique based on key frames: the key information of the video is largely retained even under sparser frame extraction, and the introduced RNN can better understand the video as a whole stream.
The video classification method provided by the embodiment of the application obtains videos to be classified; determining a key frame and a sampling frame of a video to be classified; merging the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain an image frame sequence; the classification result of the video to be classified is determined based on the image frame sequence, and the problem of poor video classification effect in the related technology is solved. And then the effect of accurately determining the video classification result is achieved.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that presented herein.
The embodiment of the present application further provides a video classification device, and it should be noted that the video classification device according to the embodiment of the present application can be used for executing the method for classifying videos provided by the embodiment of the present application. The following describes a video classification apparatus according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a video classification apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes: an obtaining unit 41, configured to obtain a video to be classified; a first determining unit 42 for determining key frames and sample frames of the video to be classified; a merging unit 43, configured to merge the key frames and the sample frames according to the time sequence of the key frames and the sample frames in the video to be classified to obtain an image frame sequence; a second determining unit 44 for determining a classification result of the video to be classified based on the image frame sequence.
It should be noted that the obtaining unit 41 in this embodiment may be configured to execute step S101 in this embodiment, the first determining unit 42 in this embodiment may be configured to execute step S102 in this embodiment, the combining unit 43 in this embodiment may be configured to execute step S103 in this embodiment, and the second determining unit 44 in this embodiment may be configured to execute step S104 in this embodiment. The above units are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the above embodiments.
Optionally, in the video classification apparatus provided in an embodiment of the present application, the first determining unit includes: a first determining module, configured to determine a plurality of image frames of the video to be classified; an extraction module, configured to perform frame extraction on the plurality of image frames to obtain the sampling frames; and a clustering module, configured to cluster the plurality of image frames to determine the key frames.
Optionally, in the video classification apparatus provided in the embodiment of the present application, the extraction module includes: a first extraction module, configured to extract image frames from the plurality of image frames as sampling frames at a predetermined frame-number interval; or a second extraction module, configured to extract image frames from the plurality of image frames as sampling frames at a predetermined time interval.
Optionally, in the video classification apparatus provided in this embodiment of the present application, the merging unit includes: a third determining module, configured to determine the playing time of the key frames in the video to be classified as a first playing time; a fourth determining module, configured to determine the playing time of the sampling frames in the video to be classified as a second playing time; a fifth determining module, configured to determine the arrangement order of the key frames and the sampling frames according to the time order of the first playing time and the second playing time; and a sixth determining module, configured to determine the image frame sequence according to the arrangement order of the key frames and the sampling frames.
Optionally, in the video classification device provided in the embodiment of the present application, the second determining unit includes: a sixth determining module, configured to input a plurality of predetermined image frames in the image frame sequence into a preset convolutional neural network model and determine a feature matrix and a probability of each predetermined image frame, wherein the predetermined image frames comprise the key frames and the sampling frames, and the preset convolutional neural network model is trained on sample images with calibrated features; and a seventh determining module, configured to determine the classification result of the video to be classified according to the feature matrices and probabilities of the plurality of predetermined image frames in the image frame sequence.
Optionally, in the video classification apparatus provided in this embodiment of the present application, the sixth determining module includes: a seventh determining module, configured to determine a feature result of each predetermined image frame according to the product of the feature matrix and the probability; an eighth determining module, configured to sort the feature results of the plurality of predetermined image frames according to their order in the image frame sequence to obtain a feature result sequence; and a ninth determining module, configured to input the feature result sequence into a preset recurrent neural network model and determine the classification result of the video to be classified, wherein the preset recurrent neural network model is trained on sample videos with calibrated classification results and the feature result sequences corresponding to the sample videos.
Optionally, the video classification apparatus provided in the embodiment of the present application further includes: an adjusting unit, configured to adjust the probability of the key frames to a preset value after the plurality of predetermined image frames in the image frame sequence are input into the preset convolutional neural network model and the feature matrix and probability of each predetermined image frame are determined; and a deleting unit, configured to delete the sampling frames with a probability lower than a preset threshold from the image frame sequence.
The video classification device provided by the embodiment of the application acquires a video to be classified; determines key frames and sampling frames of the video to be classified; merges the key frames and the sampling frames according to their time order in the video to be classified to obtain an image frame sequence; and determines the classification result of the video to be classified based on the image frame sequence, solving the problem of poor video classification performance in the related art and thereby achieving the effect of accurately determining the video classification result.
The video classification device comprises a processor and a memory, wherein the units and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises one or more kernels, and a kernel calls the corresponding program unit from the memory. The purpose of accurately determining the video classification result is achieved by adjusting the kernel parameters.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
An embodiment of the present invention provides a computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the video classification method.
The embodiment of the invention provides a processor, which is used for running a program, wherein the video classification method is executed when the program runs.
Fig. 5 is a schematic diagram of an electronic device provided according to an embodiment of the present application. As shown in fig. 5, an embodiment of the present invention provides an electronic device 50, which includes a processor 501, a memory 502, and a program stored in the memory and running on the processor, and when the processor executes the program, the processor implements the following steps: acquiring a video to be classified; determining a key frame and a sampling frame of a video to be classified; merging the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain an image frame sequence; and determining a classification result of the video to be classified based on the image frame sequence.
Optionally, the processor executes the program to implement the following steps: determining a plurality of image frames of a video to be classified; extracting a plurality of image frames to obtain sampling frames; clustering the plurality of image frames to determine a key frame.
Optionally, the processor executes the program to implement the following steps: according to a preset frame number interval, extracting an image frame corresponding to a specified frame number from a plurality of image frames as a sampling frame; or extracting the image frame corresponding to the designated time from the plurality of image frames as a sampling frame according to a preset time interval.
Optionally, the processor executes the program to implement the following steps: determining the playing time of the key frame in the video to be classified as a first playing time; determining the playing time of the sampling frame in the video to be classified as a second playing time; determining the arrangement sequence of the key frames and the sampling frames according to the time sequence of the first playing time and the second playing time; and determining the image frame sequence according to the arrangement sequence of the key frames and the sampling frames.
Optionally, the processor executes the program to implement the following steps: inputting a plurality of preset image frames in an image frame sequence into a preset convolutional neural network model, and determining a feature matrix and probability of each preset image frame, wherein the preset image frames comprise: the method comprises the steps that a key frame and a sampling frame are obtained by training a preset convolution neural network model according to a sample image with calibrated characteristics; and determining a classification result of the video to be classified according to the characteristic matrix and the probability of a plurality of preset image frames in the image frame sequence.
Optionally, the processor executes the program to implement the following steps: determining a feature result of each predetermined image frame according to the product of the feature matrix and the probability; sorting the feature results of the plurality of predetermined image frames according to their order in the image frame sequence to obtain a feature result sequence; and inputting the feature result sequence into a preset recurrent neural network model to determine the classification result of the video to be classified, wherein the preset recurrent neural network model is trained on sample videos with calibrated classification results and the feature result sequences corresponding to the sample videos.
Optionally, the processor executes the program to implement the following steps: after a plurality of preset image frames in an image frame sequence are input into a preset convolutional neural network model and a characteristic matrix and a probability of each preset image frame are determined, the probability of a key frame is adjusted to be a preset value; sample frames with deletion probability lower than preset threshold in image frame sequence
Alternatively, the electronic device herein may be a server, a PC, a PAD, a mobile phone, or the like.
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device: acquiring a video to be classified; determining a key frame and a sampling frame of a video to be classified; merging the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain an image frame sequence; and determining a classification result of the video to be classified based on the image frame sequence.
Alternatively, it is adapted to perform a procedure when executed on a data processing device, which initializes the following method steps: determining a plurality of image frames of a video to be classified; extracting a plurality of image frames to obtain sampling frames; clustering the plurality of image frames to determine a key frame.
Alternatively, it is adapted to perform a procedure when executed on a data processing device, which initializes the following method steps: according to a preset frame number interval, extracting an image frame corresponding to a specified frame number from a plurality of image frames as a sampling frame; or extracting the image frame corresponding to the designated time from the plurality of image frames as a sampling frame according to a preset time interval.
Alternatively, it is adapted to perform a procedure when executed on a data processing device, which initializes the following method steps: determining the playing time of the key frame in the video to be classified as a first playing time; determining the playing time of the sampling frame in the video to be classified as a second playing time; determining the arrangement sequence of the key frames and the sampling frames according to the time sequence of the first playing time and the second playing time; and determining the image frame sequence according to the arrangement sequence of the key frames and the sampling frames.
Alternatively, it is adapted to perform a procedure when executed on a data processing device, which initializes the following method steps: inputting a plurality of predetermined image frames in the image frame sequence into a preset convolutional neural network model, and determining a feature matrix and a probability of each predetermined image frame, wherein the predetermined image frames comprise the key frames and the sampling frames, and the preset convolutional neural network model is trained on sample images with calibrated features; and determining the classification result of the video to be classified according to the feature matrices and probabilities of the plurality of predetermined image frames in the image frame sequence.
Alternatively, it is adapted to perform a procedure when executed on a data processing device, which initializes the following method steps: determining a feature result of each predetermined image frame according to the product of the feature matrix and the probability; sorting the feature results of the plurality of predetermined image frames according to their order in the image frame sequence to obtain a feature result sequence; and inputting the feature result sequence into a preset recurrent neural network model to determine the classification result of the video to be classified, wherein the preset recurrent neural network model is trained on sample videos with calibrated classification results and the feature result sequences corresponding to the sample videos.
Alternatively, it is adapted to perform a procedure when executed on a data processing device, which initializes the following method steps: after a plurality of preset image frames in an image frame sequence are input into a preset convolutional neural network model and a characteristic matrix and a probability of each preset image frame are determined, the probability of a key frame is adjusted to be a preset value; and deleting the sampling frames with the probability lower than a preset threshold value in the image frame sequence.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of video classification, comprising:
acquiring a video to be classified;
determining key frames and sampling frames of the video to be classified;
merging the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain an image frame sequence;
determining a classification result of the video to be classified based on the image frame sequence.
2. The method of claim 1, wherein determining the key frames and the sampling frames of the video to be classified comprises:
determining a plurality of image frames of the video to be classified;
extracting frames from the plurality of image frames to obtain the sampling frames;
clustering the plurality of image frames to determine the key frames.
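By way of illustration only (the claims do not prescribe a particular clustering algorithm), the following Python sketch selects key frames with k-means over downscaled pixel features. It assumes OpenCV and scikit-learn are available; the function name, feature choice, and cluster count are hypothetical rather than taken from the specification.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def key_frames_by_clustering(frames, n_clusters=8):
        """One plausible reading of claim 2: cluster the decoded frames and
        keep the frame closest to each cluster centre as a key frame."""
        n_clusters = min(n_clusters, len(frames))
        # Downscale and flatten each frame into a comparable feature vector.
        feats = np.stack([
            cv2.resize(f, (32, 32)).reshape(-1).astype(np.float32) / 255.0
            for f in frames
        ])
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(feats)
        # Keep the single frame nearest each cluster centre, in play order.
        key_idx = sorted(
            int(np.argmin(np.linalg.norm(feats - c, axis=1)))
            for c in km.cluster_centers_
        )
        return key_idx

Keeping one representative per cluster yields a small set of visually distinct frames, which is the role the key frames play in claims 4 to 7.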
3. The method of claim 2, wherein extracting frames from the plurality of image frames to obtain the sampling frames comprises:
extracting, at a predetermined frame-number interval, the image frames at designated frame positions from the plurality of image frames as the sampling frames; or
extracting, at a predetermined time interval, the image frames at designated play times from the plurality of image frames as the sampling frames.
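A minimal sketch of the frame-number-interval branch of claim 3, assuming OpenCV's VideoCapture; the interval value is illustrative. The time-interval branch follows by converting seconds to frames with the video's frame rate, e.g. round(interval_seconds * fps).

    import cv2

    def sample_frames(video_path, frame_interval=30):
        """Take every frame_interval-th frame as a sampling frame."""
        cap = cv2.VideoCapture(video_path)
        sampled, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % frame_interval == 0:
                sampled.append((idx, frame))  # keep the index for later merging
            idx += 1
        cap.release()
        return sampled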
4. The method of claim 1, wherein merging the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain the image frame sequence comprises:
determining the playing time of the key frames in the video to be classified as a first playing time;
determining the playing time of the sampling frames in the video to be classified as a second playing time;
determining the arrangement order of the key frames and the sampling frames according to the time order of the first playing time and the second playing time;
and determining the image frame sequence according to the arrangement order of the key frames and the sampling frames.
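Since play time grows monotonically with frame index, the merge of claim 4 reduces to a sort on frame index. A sketch, assuming both selection paths return (frame_index, frame) pairs and that a frame picked by both paths should appear only once:

    def merge_by_play_time(key_frames, sampled_frames):
        """Merge key frames and sampling frames into one sequence ordered
        by play time (frame index is proportional to play time)."""
        merged = {}
        for idx, frame in list(key_frames) + list(sampled_frames):
            merged.setdefault(idx, frame)  # deduplicate doubly selected frames
        return [merged[i] for i in sorted(merged)]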
5. The method of claim 1, wherein determining the classification result of the video to be classified based on the image frame sequence comprises:
inputting a plurality of predetermined image frames in the image frame sequence into a preset convolutional neural network model, and determining a feature matrix and a probability for each predetermined image frame, wherein the predetermined image frames comprise the key frames and the sampling frames, and the preset convolutional neural network model is obtained by training on sample images with calibrated features;
and determining the classification result of the video to be classified according to the feature matrices and the probabilities of the plurality of predetermined image frames in the image frame sequence.
6. The method of claim 5, wherein determining the classification result of the video to be classified according to the feature matrices and the probabilities of the plurality of predetermined image frames in the image frame sequence comprises:
determining a feature result for each predetermined image frame as the product of its feature matrix and its probability;
ordering the feature results of the plurality of predetermined image frames according to the order of the predetermined image frames in the image frame sequence to obtain a feature result sequence;
and inputting the feature result sequence into a preset recurrent neural network model to determine the classification result of the video to be classified, wherein the preset recurrent neural network model is obtained by training on sample videos with calibrated classification results and the feature result sequences corresponding to the sample videos.
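Claims 5 and 6 together describe a two-stage model: a convolutional network that emits a feature matrix and a probability for each frame, the per-frame product of the two ordered into a feature result sequence, and a recurrent network over that sequence. A PyTorch sketch under those assumptions follows; the backbone layers, dimensions, and the choice of a GRU are illustrative stand-ins, not the patent's trained models.

    import torch
    import torch.nn as nn

    class FrameEncoder(nn.Module):
        """Stand-in CNN: a feature vector and a confidence per frame (claim 5)."""
        def __init__(self, feat_dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.feat = nn.Linear(32, feat_dim)
            self.prob = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())

        def forward(self, x):                  # x: (T, 3, H, W)
            h = self.backbone(x)
            return self.feat(h), self.prob(h)  # per-frame features and probabilities

    class VideoClassifier(nn.Module):
        """Probability-weighted frame features, in play order, fed to a GRU (claim 6)."""
        def __init__(self, feat_dim=128, n_classes=10):
            super().__init__()
            self.encoder = FrameEncoder(feat_dim)
            self.rnn = nn.GRU(feat_dim, 64, batch_first=True)
            self.head = nn.Linear(64, n_classes)

        def forward(self, frames):             # frames: (T, 3, H, W), time-ordered
            feats, probs = self.encoder(frames)
            weighted = feats * probs           # feature result = feature x probability
            _, h = self.rnn(weighted.unsqueeze(0))
            return self.head(h[-1])            # class scores for the whole video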
7. The method of claim 5, wherein after inputting the plurality of predetermined image frames in the image frame sequence into the preset convolutional neural network model and determining the feature matrix and the probability for each predetermined image frame, the method further comprises:
adjusting the probability of the key frames to a preset value;
and deleting, from the image frame sequence, the sampling frames whose probability is below a predetermined threshold.
8. A video classification apparatus, comprising:
an acquiring unit, configured to acquire a video to be classified;
a first determining unit, configured to determine key frames and sampling frames of the video to be classified;
a merging unit, configured to merge the key frames and the sampling frames according to the time sequence of the key frames and the sampling frames in the video to be classified to obtain an image frame sequence;
a second determining unit, configured to determine a classification result of the video to be classified based on the image frame sequence.
9. A processor, wherein the processor is configured to run a program which, when running, performs the video classification method according to any one of claims 1 to 7.
10. An electronic device comprising one or more processors and memory storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the video classification method of any of claims 1-7.
CN202210720251.1A 2022-06-23 2022-06-23 Video classification method and device, processor and electronic equipment Pending CN115049963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210720251.1A CN115049963A (en) 2022-06-23 2022-06-23 Video classification method and device, processor and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210720251.1A CN115049963A (en) 2022-06-23 2022-06-23 Video classification method and device, processor and electronic equipment

Publications (1)

Publication Number Publication Date
CN115049963A true CN115049963A (en) 2022-09-13

Family

ID=83163856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210720251.1A Pending CN115049963A (en) 2022-06-23 2022-06-23 Video classification method and device, processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN115049963A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205768A (en) * 2022-09-16 2022-10-18 山东百盟信息技术有限公司 Video classification method based on resolution self-adaptive network

Similar Documents

Publication Publication Date Title
CN109740670B (en) Video classification method and device
RU2693906C2 (en) Rule-based analysis of video importance
CN108875676B (en) Living body detection method, device and system
CN110602526B (en) Video processing method, video processing device, computer equipment and storage medium
Bondi et al. Training strategies and data augmentations in cnn-based deepfake video detection
US10423852B1 (en) Text image processing using word spacing equalization for ICR system employing artificial neural network
CN111462183A (en) Behavior identification method and system based on attention mechanism double-current network
CN111209897B (en) Video processing method, device and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN110505498A (en) Processing, playback method, device and the computer-readable medium of video
CN111027507A (en) Training data set generation method and device based on video data identification
CN111741331B (en) Video clip processing method, device, storage medium and equipment
KR102185979B1 (en) Method and apparatus for determining type of movement of object in video
CN112383824A (en) Video advertisement filtering method, device and storage medium
CN115049963A (en) Video classification method and device, processor and electronic equipment
CN112804558A (en) Video splitting method, device and equipment
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN113313635A (en) Image processing method, model training method, device and equipment
CN110019951B (en) Method and equipment for generating video thumbnail
CN113569824B (en) Model processing method, related device, storage medium and computer program product
CN115292528A (en) Intelligent operation method, equipment and storage medium for new media video
CN111353330A (en) Image processing method, image processing device, electronic equipment and storage medium
CN110489592B (en) Video classification method, apparatus, computer device and storage medium
CN112085025B (en) Object segmentation method, device and equipment
CN115004245A (en) Target detection method, target detection device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination