CN111859023A - Video classification method, device, equipment and computer readable storage medium

Video classification method, device, equipment and computer readable storage medium

Info

Publication number
CN111859023A
Authority
CN
China
Prior art keywords
video
information
feature extraction
pooling
dimensional
Prior art date
Legal status
Granted
Application number
CN202010531316.9A
Other languages
Chinese (zh)
Other versions
CN111859023B (en)
Inventor
乔宇
王亚立
李先航
周志鹏
邹静
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010531316.9A priority Critical patent/CN111859023B/en
Publication of CN111859023A publication Critical patent/CN111859023A/en
Priority to PCT/CN2020/134995 priority patent/WO2021248859A1/en
Application granted granted Critical
Publication of CN111859023B publication Critical patent/CN111859023B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application belongs to the field of image processing and discloses a video classification method, device, equipment and computer-readable storage medium. The video classification method comprises: acquiring a video to be classified; and inputting the video to be classified into a trained video classification model for processing and outputting a classification result of the video to be classified. The video classification model comprises a feature extraction layer and a full connection layer: the feature extraction layer extracts spatial feature information through two-dimensional convolution, extracts time feature information through pooling, and fuses the spatial feature information and the time feature information to output fused feature information; the full connection layer performs full connection processing on the fused feature information to obtain the classification result. Because the feature information of the time dimension of the video to be classified is obtained through pooling, and the two-dimensional convolution greatly reduces the number of convolution parameters to be calculated compared with three-dimensional convolution kernel calculation, the method and device help to reduce the calculation amount of video classification.

Description

Video classification method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a video classification method, apparatus, device, and computer-readable storage medium.
Background
In order to facilitate the management of images, image contents can be identified and classified by means of deep learning. In recent years, with the great breakthroughs of convolutional neural networks in image classification tasks, the accuracy of image classification by two-dimensional convolutional neural networks has even exceeded that of human classification.
While the two-dimensional convolutional neural network can accurately classify images, it can also be applied to the classification of videos composed of images. Since video data has one more time dimension than a still picture, in order to extract information of the time dimension in a video, a three-dimensional convolution kernel that includes the time dimension is generally adopted to extract features simultaneously in time and space. However, when convolution calculation is performed with a three-dimensional convolution kernel, additional parameters are introduced compared with two-dimensional convolution calculation, resulting in an increased amount of calculation.
Disclosure of Invention
In view of this, embodiments of the present application provide a video classification method, an apparatus, a device, and a computer-readable storage medium, so as to solve the problem in the prior art that performing video classification through three-dimensional convolution calculation introduces additional parameters compared with two-dimensional convolution calculation, resulting in an increased amount of calculation.
A first aspect of an embodiment of the present application provides a video classification method, where the method includes:
acquiring a video to be classified, wherein the video to be classified comprises a plurality of video frames;
inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified; the video classification model comprises a feature extraction layer and a full connection layer, the feature extraction layer is used for extracting the spatial feature information of the video frames through two-dimensional convolution, extracting the time feature information of the video frames through pooling, and fusing the spatial feature information and the time feature information to output fused feature information, and the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the feature extraction layer includes N feature extraction sublayers, where N is greater than or equal to 1, input information of a first feature extraction sublayer of the N feature extraction sublayers is the plurality of video frames, output information of a previous feature extraction sublayer is input information of a next feature extraction sublayer, and output information of an nth feature extraction sublayer is fusion feature information output by the feature extraction layer; each feature extraction sub-layer of the N feature extraction sub-layers comprises a large receptive field context feature extraction branch and a small receptive field core feature extraction branch, and each feature extraction sub-layer of the N feature extraction sub-layers processes input information, including:
performing pooling processing on the input information through a large receptive field context characteristic extraction branch, and extracting time characteristic information of the input information;
performing two-dimensional convolution processing on the input information through a small receptive field core feature extraction branch, and extracting spatial feature information of the input information;
and fusing the time characteristic information extracted by the large receptive field context characteristic extraction branch and the space characteristic information extracted by the small receptive field core characteristic extraction branch to obtain output information.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the performing pooling processing on the input information through a large receptive field context feature extraction branch to extract temporal feature information of the input information includes:
performing three-dimensional pooling processing on the input information through a large receptive field context characteristic extraction branch to obtain pooled information;
and performing two-dimensional convolution processing on the pooled information through a large receptive field context feature extraction branch to obtain time feature information.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, performing three-dimensional pooling processing on the input information through a large receptive field context feature extraction branch to obtain pooled information, where the three-dimensional pooling processing includes:
performing pooling processing on the input information through a three-dimensional pooling kernel {t, K, K} in the large receptive field context feature extraction branch to obtain pooled information, where t is the size of the kernel in the time direction, t is smaller than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the size of the three-dimensional pooling kernel is the size of the pooled pixel region selected in a single pooling calculation.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, among the N three-dimensional pooling kernels included in the feature extraction layer, the sizes of the N three-dimensional pooling kernels are completely the same, or the sizes of the N three-dimensional pooling kernels are completely different, or the sizes of some of the N three-dimensional pooling kernels are the same, and the three-dimensional pooling kernels are the sizes of the pooled pixels selected in a single pooling calculation.
With reference to the third or fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, the sizes of the N three-dimensional pooling kernels being completely different comprises:
gradually increasing the size of the three-dimensional pooling kernel along the order in which the feature information is extracted.
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, the gradually increasing the size of the three-dimensional pooling kernel comprises:
gradually increasing the size of the three-dimensional pooling kernel in the time direction;
or gradually increasing the size of the three-dimensional pooling kernel in the two-dimensional space where the video frame is located;
or gradually increasing both the size of the three-dimensional pooling kernel in the time direction and its size in the two-dimensional space where the video frame is located.
With reference to the second possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, the convolution parameters of the two-dimensional convolution processing in the large receptive field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution processing in the small receptive field core feature extraction branch.
With reference to the first aspect, in an eighth possible implementation manner of the first aspect, the fusing the spatial feature information and the temporal feature information to output fused feature information includes:
and superposing the image of the spatial characteristic information and the image of the temporal characteristic information to generate the fusion characteristic information.
In a second aspect, an embodiment of the present application provides a video classification apparatus, including:
a to-be-classified video acquisition unit, configured to acquire a video to be classified, where the video to be classified comprises a plurality of video frames; and
a classification unit, configured to input the video to be classified into a trained video classification model for processing and output a classification result of the video to be classified; the video classification model comprises a feature extraction layer and a full connection layer, the feature extraction layer is used for extracting the spatial feature information of the video frames through two-dimensional convolution, extracting the time feature information of the video frames through pooling, and fusing the spatial feature information and the time feature information to output fused feature information, and the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result.
A third aspect of embodiments of the present application provides a video classification apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to make the video classification apparatus implement the video classification method according to any one of the first aspect.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the video classification method according to any one of the first aspects.
Compared with the prior art, the embodiments of the application have the following advantages. The classification model extracts spatial feature information of the plurality of video frames in the video to be classified through two-dimensional convolution, extracts time feature information of the plurality of video frames through pooling, fuses the time feature information and the spatial feature information, and obtains the classification result through the full connection layer. Because the time feature information of the video to be classified can be obtained through pooling, compared with three-dimensional convolution kernel calculation, the adopted two-dimensional convolution greatly reduces the number of convolution parameters to be calculated while retaining the time feature information, which helps to reduce the calculation amount of video classification. In addition, the video classification method and device can be inserted into any two-dimensional convolutional network to classify videos, which helps to improve the diversity and universality of the video classification method.
Drawings
Fig. 1 is a schematic view of a video classification application scene provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of prior art video classification using three-dimensional convolution;
fig. 3 is a schematic flowchart illustrating an implementation process of a video classification method according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating an implementation of a video classification method provided in an embodiment of the present application;
Fig. 5 is a schematic diagram illustrating an implementation of video classification according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating an implementation of another video classification provided by an embodiment of the present application;
fig. 7 is a schematic diagram of a video classification apparatus according to an embodiment of the present application;
fig. 8 is a schematic diagram of a video classification device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
With the rise of video data, video classification techniques are required to be used in more and more scenes. The video classification method provided by the embodiment of the application is used for carrying out classification management on the videos, so that the convenience of video use can be effectively improved.
For example, in the field of intelligent monitoring, collected monitoring videos are classified through the video classification technology to judge whether the video content is abnormal. The video classification method is insensitive to the speed at which actions change between frames and can effectively model actions of different durations; by classifying the monitoring videos through such modeling, it can help a user quickly find key monitoring information, or send an abnormality alert to monitoring personnel in time so that they can promptly handle abnormalities in the monitoring videos.
For another example, when a large number of videos are stored on a device, they can be classified through the video classification technology into videos of different scenes, different moods, different styles and the like, so that a user can conveniently and quickly find the required video.
For example, in intelligent sports training or video-assisted refereeing, the videos include sports with faster motions, such as shooting, gymnastics or speed skating, and sports with slower motions, such as yoga. Because the video classification method is insensitive to motion speed and duration, it can classify the motions in such sports videos.
For another example, as shown in fig. 1, in a video entertainment platform, the platform server receives a video shot by a user and uploaded through terminal A, and classifies the uploaded video to obtain the category of the video uploaded by terminal A. As the number of uploaded videos increases, the number of videos in each category also increases. When another terminal, such as terminal B, browses videos, the category of the video browsed by terminal B is obtained from the pre-classification result. The platform can then search for other videos of the same category according to the category of the video browsed by terminal B and recommend them to terminal B, thereby improving the user experience of browsing videos.
However, in currently used video classification algorithms, as shown in fig. 2, a three-dimensional convolution kernel containing time information, such as a temporal convolution kernel of 3 × 1, is selected to perform the convolution operation on the video to be classified. The three-dimensional convolution kernel covers the width W, the height H and the duration T of the image; when performing convolution calculation, it therefore includes parameter calculations in the time dimension in addition to the parameter calculations of spatial features, such as those over the width W and height H dimensions of the image shown in fig. 2. Compared with a conventional two-dimensional convolution kernel, the three-dimensional convolution kernel adds parameter calculations in the time dimension, introduces a large number of parameters, and increases the calculation amount of video classification.
In order to reduce the amount of calculation in video classification calculation, an embodiment of the present application provides a video classification method, as shown in fig. 3, where the video classification method includes:
in step S301, a video to be classified is acquired, and the video to be classified includes a plurality of video frames.
The video to be classified in the embodiment of the application can be a video stored in a user terminal, a video collected by a monitoring device or a video uploaded by a platform user and received by a video entertainment platform. When the video is the video collected by the monitoring equipment, the video collected in real time can be divided into a plurality of sub-video segments according to a preset time period, and the collected sub-video segments are classified, so that whether the sub-video segments are abnormal or not is judged.
The video to be classified comprises a plurality of video frames which are sequentially arranged according to a time sequence. From the video to be classified, spatial information of the width W and height H of each video frame can be determined. According to the time interval between the video frames and the initial playing time, the playing time corresponding to each video frame can be determined.
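As an illustration of the frame-timing relation just described, the following snippet is a hypothetical helper (not part of the patent text; the frame rate is an assumption) that derives the playing time of each video frame from the initial playing time and the inter-frame interval.

```python
# Hypothetical helper: derive per-frame playing times from the start time and
# the time interval between frames, as described above.
def frame_timestamps(num_frames: int, start_time: float, frame_interval: float):
    """Return the playing time, in seconds, of each video frame."""
    return [start_time + i * frame_interval for i in range(num_frames)]

# e.g. a 25 fps clip starting at t = 0 s: frames at 0.00 s, 0.04 s, 0.08 s, ...
print(frame_timestamps(5, 0.0, 1 / 25))
```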
In step S302, the video to be classified is input into a trained video classification model for processing, and a classification result of the video to be classified is output; the video classification model comprises a feature extraction layer and a full connection layer, the feature extraction layer is used for extracting the spatial feature information of the video frames through two-dimensional convolution, extracting the time feature information of the video frames through pooling, and fusing the spatial feature information and the time feature information to output fused feature information, and the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result.
The feature extraction layer may include a large receptive field context feature extraction branch and a small receptive field core feature extraction branch. The large receptive field context feature extraction branch is used for extracting time feature information, or spatio-temporal feature information that includes the time feature information; here, the context features correspond to the time feature information. The large receptive field can be obtained by cascading a plurality of feature extraction sublayers, or by gradually increasing the size of the three-dimensional pooling kernel. The small receptive field core feature extraction branch is used for extracting the spatial feature information of the two-dimensional plane of each video frame in the video to be classified. The feature extraction layer is further used for fusing the extracted time feature information and spatial feature information to obtain fused feature information. That is, through the dual-branch structure, the context information extracted by the large receptive field context feature extraction branch and the core features extracted by the small receptive field core feature extraction branch can both be effectively acquired.
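A minimal sketch of one such dual-branch feature extraction sub-layer is given below in PyTorch. It is an illustration under assumed kernel sizes and tensor shapes, not the patent's reference implementation: the small branch applies a 2D convolution (written as a 3D convolution with temporal extent 1), the big branch applies padded 3D max pooling followed by the same convolution module (shared weights), and the two outputs are fused by point-wise addition.

```python
import torch
import torch.nn as nn


class SmallBigBlock(nn.Module):
    """Sketch of a feature extraction sub-layer with a small-receptive-field
    core branch and a large-receptive-field context branch (assumed sizes)."""

    def __init__(self, in_channels: int, out_channels: int, pool_kernel=(3, 3, 3)):
        super().__init__()
        # 2D convolution written as a 3D convolution with temporal extent 1,
        # so the same module (same weights) can serve both branches.
        self.conv2d = nn.Conv3d(in_channels, out_channels,
                                kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # padded 3D pooling over (time, height, width) keeps the feature size
        self.pool3d = nn.MaxPool3d(kernel_size=pool_kernel, stride=1,
                                   padding=tuple(k // 2 for k in pool_kernel))

    def forward(self, x):                       # x: (batch, C, T, H, W)
        spatial = self.conv2d(x)                # small branch: spatial features
        temporal = self.conv2d(self.pool3d(x))  # big branch: shared weights
        return spatial + temporal               # point-wise fusion


x = torch.randn(2, 3, 8, 56, 56)                # 8 frames of 56 x 56 RGB images
print(SmallBigBlock(3, 64)(x).shape)            # torch.Size([2, 64, 8, 56, 56])
```

The point-wise addition at the end corresponds to the fusion strategy described later in this description.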
In a possible implementation manner, in the video classification model, the feature extraction layer may include N feature extraction sublayers, where N is greater than or equal to 1.
For example, the feature extraction layer may include 1 feature extraction sublayer, fused feature information is output through one feature extraction sublayer, and the full connection processing is performed on the fused feature information through the full connection layer, so as to obtain the classification result.
When N is greater than or equal to 2, the output information of the previous-stage feature extraction sublayer is used as the input information of the next-stage feature extraction sublayer. For example, the fused feature information output by the i-th feature extraction sublayer is used as the input information of the (i + 1)-th feature extraction sublayer. The fused feature information output by the i-th feature extraction sublayer combines time feature information and spatial feature information, and the (i + 1)-th feature extraction sublayer can further extract feature information from it through pooling, where i is greater than or equal to 1 and less than N.
The fused feature information is feature information obtained by fusing temporal feature information and spatial feature information. The fusion process may refer to superposition of feature information. For example, the image corresponding to the temporal feature information and the image corresponding to the spatial feature information may be subjected to the pixel superimposition processing.
In order to match the sizes of the images corresponding to the time feature information and the spatial feature information at the time of fusion, when pooling the input information, the image size of the pooling output information may be kept consistent with the image size of the pooling input information.
In one implementation, padding may be performed on the input information, that is, the input feature image or video frame is padded in the time dimension or in the spatial dimensions, so that after the pooling kernel performs pooling on the padded input information, the size of the obtained output information is consistent with the size of the unpadded input information.
For example, when the size of the input information is n, the size of the pooling kernel is f, the step size is s, the padding size is p, and the size of the output information is o, the relation
o = (n + p - f) / s + 1, with the division rounded down,
can be used to calculate the required padding size.
For example, for the pooling operation with a pooling kernel size of 3 × 3 and a step size of 1, the padding parameter padding size may be 2 to obtain the same size of the output information and the input information.
The size of the pooling kernel is 3 × 3, which means that the pooling kernel has a dimension of 3 × 3 in the two-dimensional plane of the pooled image, and the unit may be a pixel or other predetermined length unit. When the length of the time dimension is 3, the unit may be a video duration, for example, a 3-second video duration, and the number of video frames corresponding to the video duration may be determined by the video duration. Of course, the definition of the three-dimensional pooling kernel is not limited to this, and the size of the pooling kernel in the time dimension can also be directly determined by the number of video frames.
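As a small worked example of the relation above (a sketch, assuming p denotes the total padding added along one dimension), the padding needed for a same-size pooling output can be computed as follows.

```python
# Total padding p such that the pooled output size o equals the input size n,
# following o = (n + p - f) / s + 1 from the description above (assumption:
# p is the total padding along the dimension, split over the two sides).
def same_size_padding(n: int, f: int, s: int = 1) -> int:
    return (n - 1) * s + f - n

print(same_size_padding(n=56, f=3, s=1))  # 2, i.e. one pixel on each side
```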
The two-dimensional convolution refers to convolution performed on two dimensions, namely width and height, of a plane where the image of the video frame is located. The selected convolution kernel is a convolution kernel with a size of two-dimensional space.
When spatial feature information is extracted by two-dimensional convolution, the extraction may be completed based on a convolution kernel of a predetermined fixed size. Of course, the spatial feature information may also be extracted by using an existing neural network model, such as a convolutional neural network of the LeNet architecture, the AlexNet architecture, the ResNet architecture, the GoogLeNet architecture, the VGGNet architecture, or the like. In this way, in the process of extracting the spatial feature information, the spatial feature information of the video frames in the video to be classified, namely the feature information contained in the width W and height H of each video frame, is obtained without changing the recognition capability of the convolutional neural network for the video frames.
The video classification method can be embedded into any two-dimensional convolutional network, so that the effect of a three-dimensional convolutional network in acquiring time feature information is achieved without optimizing specific hardware or a deep learning platform and without a specific network design, which can effectively improve the universality of the video classification method.
Moreover, compared with currently used plug-and-play video recognition modules, such as the Temporal Shift Module (TSM) and non-local neural network video recognition modules, the method helps to reduce the amount of calculation in the classification process while ensuring the accuracy of the classification result.
In a possible implementation manner, three-dimensional pooling processing can be performed on the input information through the large receptive field context feature extraction branch to obtain pooled information, and then two-dimensional convolution processing can be performed on the pooled information through the large receptive field context feature extraction branch to obtain time feature information.
For example, in the structural diagram of the video classification method shown in fig. 4, the two-dimensional convolution performs convolution operation on a two-dimensional plane where the image based on a single video frame is located, and on the premise that the extraction complexity of the feature information of the two-dimensional image is not increased, the spatial feature information of each frame image of the video to be classified is obtained, that is, the feature information of the width W and the height H dimension of each video frame is obtained.
In one implementation, the convolution kernel of the two-dimensional convolution may be expressed as: { C1, C2, 1, K, K }, wherein C1 denotes the number of channels of the input feature image, C2 denotes the number of channels of the output feature image, the position of "1" in the convolution kernel denotes the time dimension of the convolution kernel, 1 denotes that the convolution kernel does not extend in the time dimension, that is, each time two-dimensional convolution is performed, the convolution is performed only on the image of the same video frame, and K denotes the size of the convolution kernel in the two-dimensional space where the video frame is located.
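A hedged illustration of this kernel definition in PyTorch notation (the channel counts and K below are assumptions): a 3D convolution whose temporal extent is 1 behaves exactly as a per-frame 2D convolution.

```python
import torch
import torch.nn as nn

# {C1, C2, 1, K, K} with assumed C1 = 64, C2 = 128, K = 3
conv = nn.Conv3d(in_channels=64, out_channels=128,
                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
frames = torch.randn(1, 64, 8, 56, 56)   # (batch, C1, T, H, W)
out = conv(frames)                       # each frame is convolved independently
print(out.shape)                         # torch.Size([1, 128, 8, 56, 56])
```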
The time characteristic information is extracted through pooling, and the pooling can comprise a maximum pooling mode, an average pooling mode or a global average pooling mode. For example, when the operation of maximum pooling is selected, the pixels that need pooling may be selected according to the pooling kernel, and the pixel with the largest pixel value may be selected as the pooled pixel value.
In one implementation, the three-dimensional pooling kernel may be represented as {t, K, K}, where t represents the size of the pooling kernel in the time direction and K represents the size of the pooling kernel in the two-dimensional space where the image is located. Specifically, t may be 3, or t may be T (the video length, or the number of video frames corresponding to the video duration). Since the pooling operation does not require convolution calculation and only needs to compare the magnitudes of values, the amount of calculation required is very small.
For different selected sizes of the parameter t in the time direction, the number of corresponding video frames in the pooling process is different. According to the setting of the pooling step length, the same video frame can be used as the object pooled by different pooling cores. When the value of K in the pooling kernel is greater than 1, it indicates that the pooling kernel also pools a plurality of pixels or regions in the two-dimensional space. In order to facilitate subsequent fusion, during pooling, pooling operation with padding can be adopted to fill the edges of the pooled images, so that the size consistency of the images of the input information and the output information before and after pooling is ensured.
After pooling, the pooled output information is convolved. The pooled output information has fused the spatio-temporal information of an adjacent t × K × K region, and a convolution operation is then performed on it in a two-dimensional convolution mode to obtain the time feature information of the plurality of video frames.
In one implementation, the convolution operations of the small receptive field core feature extraction branch and the large receptive field context feature extraction branch may share parameters, that is, use the same convolution parameters. Therefore, the time feature information can be obtained without introducing new convolution parameters for calculating the feature information of the time dimension, which reduces the calculation amount of the video classification model.
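The following self-contained sketch (all sizes are assumptions) illustrates the two points above: padded 3D max pooling mixes adjacent spatio-temporal information without any learned parameters, and reusing the same convolution module on the pooled output adds no new convolution parameters.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 64, 8, 56, 56)                        # (batch, C, T, H, W)
pool = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))
shared_conv = nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1))

spatial_feat = shared_conv(feat)         # small receptive field core branch
temporal_feat = shared_conv(pool(feat))  # context branch reuses the same weights
print(temporal_feat.shape)               # torch.Size([1, 64, 8, 56, 56])
print(sum(p.numel() for p in shared_conv.parameters()))  # weights counted once
```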
In the N three-dimensional pooling kernels included in the N feature extraction sublayers, the sizes of any two three-dimensional pooling kernels may be different, or the sizes of the N three-dimensional pooling kernels may be the same, or the sizes of some three-dimensional pooling kernels may be the same, and the sizes of some three-dimensional pooling kernels are different.
In a possible implementation manner, the three-dimensional pooling kernel used in the three-dimensional pooling of the large receptive field context feature extraction branch may adopt time dimensions of different sizes or space dimensions of different sizes.
For example, adjusting the size of the pooling kernel used for the three-dimensional pooling may include adjusting the size of a time dimension or a time direction in the three-dimensional pooling kernel, or the size of the three-dimensional pooling kernel in a two-dimensional space where a video frame is located, or the size of the three-dimensional pooling kernel in a time dimension and a space dimension, to obtain three-dimensional pooling kernels of different sizes, and obtaining corresponding spatio-temporal feature information through calculation by using the three-dimensional pooling kernels of different sizes, where the spatio-temporal feature information includes time feature information.
In a possible implementation manner, the pooled feature image can be obtained by gradually increasing the size of the pooling kernel, including gradually increasing its size in the time dimension, gradually increasing its size in the two-dimensional space where the video frame is located, or increasing both at the same time. In this way, the time feature information of features with different durations, obtained by pooling with different pooling kernels, is gradually fused, and spatio-temporal feature information with finer granularity is obtained.
When the temporal feature information is extracted, as shown in fig. 4, because the images corresponding to the spatial feature information and the temporal feature information are convolved by using the same convolution parameter, the information represented by the corresponding points of the images of the temporal feature information and the spatial feature information has spatial consistency, that is, the spatial feature information and the temporal feature information have the same size, and a strategy of adding the spatial feature information point by point can be adopted to obtain the fusion feature information.
In the video classification method, the fused feature information is obtained by fusing the spatial feature information and the time feature information, where the spatial feature information is the spatial features of the video frames extracted by two-dimensional convolution and the time feature information is the spatio-temporal features of the images extracted by pooling. The full connection layer integrates the fused feature information and classifies the video to be classified according to the integrated fused feature information to obtain the video classification result. For example, the fused feature information may be fully connected according to preset full connection layer weight coefficients, and the video classification result may be determined by comparing the calculation result with a preset classification standard.
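A minimal classification-head sketch is shown below; the channel count, the number of classes, and the use of global average pooling before the full connection layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

num_classes = 10                          # assumed number of video categories
fused = torch.randn(2, 256, 8, 7, 7)      # fused feature information (B, C, T, H, W)

pooled = nn.AdaptiveAvgPool3d(1)(fused).flatten(1)  # (B, 256)
fc = nn.Linear(256, num_classes)                    # full connection layer
scores = fc(pooled)                                 # class scores per video
print(scores.shape, scores.argmax(dim=1))           # torch.Size([2, 10]) and predictions
```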
Because the calculation of the convolution parameters of the time dimension is not needed to be added in the video classification process, the spatio-temporal feature information of the video to be classified can be effectively obtained only by carrying out simple pooling operation, the number of calculated parameters is favorably reduced, and the complexity of video classification calculation is reduced.
In a possible implementation manner of the present application, the video classification model may include two or more feature extraction sublayers, through which two or more spatio-temporal feature images can be extracted (the video to be classified can itself be regarded as one such spatio-temporal feature image). For example, in the video classification implementation diagram shown in fig. 5, the feature extraction layer includes two feature extraction sublayers; in the embodiments of the present application, a feature extraction sublayer may be referred to as a SmallBig unit for short. As shown in fig. 5, the two feature extraction sublayers are SmallBig unit 1 and SmallBig unit 2: the fused feature information extracted by SmallBig unit 1 of the previous feature extraction sublayer is used as the input of SmallBig unit 2 of the next feature extraction sublayer, the full connection layer performs video classification according to the fused feature information obtained by SmallBig unit 2 of the second feature extraction sublayer, and the category to which the video belongs is output.
Specifically, as shown in fig. 5, the video to be classified is input to SmallBig unit 1 of the first-stage feature extraction sublayer, and a first convolution operation of two-dimensional convolution is performed on the plurality of video frames contained in it to obtain the spatial feature information contained in the plurality of video frames. A first pooling operation is performed on the plurality of video frames of the video to be classified in the time dimension by using a three-dimensional pooling kernel with a preset time-length parameter. For the pooled image, a second convolution operation of two-dimensional convolution is performed by sharing the convolution parameters of the first convolution operation, that is, by using the convolution parameters of the first convolution operation, to obtain the time feature information. The spatial feature information and the time feature information are then fused to obtain the fused feature information: because the images corresponding to the time feature information and the spatial feature information have consistent sizes, the corresponding pixel points of the two images are added pixel by pixel to obtain fused feature information that includes both spatial features and spatio-temporal features, and the fused feature information may include multiple frames of images.
The fused feature information is input to SmallBig unit 2 of the second-stage feature extraction sublayer. A third convolution operation is performed on the image of each channel in the fused feature information to further obtain the spatial feature information in the fused feature information of SmallBig unit 1. A second pooling operation is performed, in the time dimension and according to the time order of the channels, on the images of the channels in the fused feature information output by SmallBig unit 1; a fourth convolution operation is then performed on the pooled information obtained by the second pooling operation to further extract the time feature information of the images in the fused feature information of SmallBig unit 1. The fourth convolution operation and the third convolution operation use the same convolution parameters.
Of course, the number of SmallBig units, namely feature extraction sublayers, is not limited to two and may also be three or more. Fig. 6 is a schematic diagram of video classification through three feature extraction sublayers provided by an embodiment of the present application; on the basis of fig. 5, SmallBig unit 3 of a third-stage feature extraction sublayer is added. The fused feature information output by SmallBig unit 1 of the first-stage feature extraction sublayer is processed by the second-stage convolution operation and the second-stage pooling operation respectively, and after the time feature information and the spatial feature information of SmallBig unit 2 are fused, the fused feature information output by SmallBig unit 2 of the second-stage feature extraction sublayer is obtained. SmallBig unit 3 of the third-stage feature extraction sublayer processes the fused feature information output by SmallBig unit 2 of the second-stage feature extraction sublayer through two-dimensional convolution and pooling respectively, further extracts time feature information and spatial feature information, and fuses them to obtain the fused feature information output by SmallBig unit 3 of the third-stage feature extraction sublayer.
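The cascade of sub-layers described above can be sketched as follows; the unit definition repeats the block sketched earlier so that the snippet stands alone, and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn


class SmallBigUnit(nn.Module):
    """One feature extraction sub-layer: shared 2D conv + padded 3D pooling."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.pool = nn.MaxPool3d((3, 3, 3), stride=1, padding=(1, 1, 1))

    def forward(self, x):
        return self.conv(x) + self.conv(self.pool(x))


# SmallBig unit 1 -> SmallBig unit 2 -> SmallBig unit 3, as in figs. 5 and 6:
# the fused output of each unit is the input of the next one.
model = nn.Sequential(SmallBigUnit(3, 64), SmallBigUnit(64, 64), SmallBigUnit(64, 64))
print(model(torch.randn(1, 3, 8, 56, 56)).shape)  # torch.Size([1, 64, 8, 56, 56])
```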
In a possible implementation manner, the feature extraction layer is further configured to superimpose the video to be classified and the fusion feature information output by the feature extraction layer to form residual connection to update the fusion feature information.
For example, for the video classification model shown in fig. 6, in SmallBig unit 3 of the third-stage feature extraction sublayer, the fused data includes the time feature information and spatial feature information calculated by SmallBig unit 3, and the video to be classified is also superimposed on them. Fusing the video to be classified with the time feature information and spatial feature information extracted by the third-stage feature extraction sublayer forms a residual connection structure, so that during training the newly added parameters do not affect the parameters of the original pre-trained image network, which helps to preserve the pre-training effect of the image network; introducing the residual also accelerates convergence and improves the training efficiency of the video classification model.
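A sketch of this residual connection is given below; the exact placement of the skip connection and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn


class ResidualSmallBigUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.pool = nn.MaxPool3d((3, 3, 3), stride=1, padding=(1, 1, 1))

    def forward(self, x):
        fused = self.conv(x) + self.conv(self.pool(x))  # fused feature information
        return x + fused                                # residual connection to the input


x = torch.randn(1, 64, 8, 28, 28)
print(ResidualSmallBigUnit(64)(x).shape)  # torch.Size([1, 64, 8, 28, 28])
```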
As shown in fig. 6, the first-level feature extraction subunit performs convolution with a first convolution kernel and pooling with a first pooling kernel, the second-level feature extraction subunit performs convolution with a second convolution kernel and pooling with a second pooling kernel, and the third-level feature extraction subunit performs convolution with a third convolution kernel and pooling with a third pooling kernel.
In a possible implementation manner, the first convolution kernel used in the two-dimensional convolution and the third convolution kernel used in the third convolution operation are smaller than the second convolution kernel used in the second convolution operation. In one implementation, as shown in fig. 6, the first convolution kernel and the third convolution kernel are 1 × 1 in size, and the second convolution kernel is 1 × 3 in size. The fusion of the plurality of channels and the space-time information can be completed through the first convolution kernel and the third convolution kernel. And the second convolution kernel can be used for completing the extraction of the space-time characteristics.
In a possible implementation, the first pooling kernel and the second pooling kernel may be smaller than the third pooling kernel used by the third pooling operation. In one implementation, as shown in fig. 6, the first pooling kernel and the second pooling kernel are 3 × 3 in size, and the third pooling kernel is 3 × T in size, where T may be the video duration or the number of video frames corresponding to the video duration: when T is the video duration, T is a length of time, and when T is the number of video frames corresponding to the video duration, T is a frame count. Through the first pooling kernel and the second pooling kernel, pooled values, such as the maximum, of 9 pixel points in the stereo space formed by three adjacent frames can be captured. Through the third pooling kernel, the temporal features of the video frames over the entire video length can be extracted. By gradually increasing the temporal receptive field in the time dimension and combining it with the spatial features learned by convolution, the output fused feature information has a global temporal receptive field. In addition, SmallBig unit 1 and SmallBig unit 3 add two further local receptive fields in space, so the spatial receptive field of the whole module is also increased.
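A hedged sketch of the progressively enlarged pooling kernels discussed above: the first two units pool over three adjacent frames and a 3 × 3 neighbourhood, while the third pools over the whole clip length T in the time direction. Repeating the globally pooled map along the time axis so that it can be fused point by point is an assumption of this sketch, not something stated in the text.

```python
import torch
import torch.nn as nn

T = 8                                      # assumed number of frames in the clip
feat = torch.randn(1, 64, T, 56, 56)

# units 1 and 2: local spatio-temporal pooling, padded to keep the size
local_pool = nn.MaxPool3d((3, 3, 3), stride=1, padding=(1, 1, 1))
# unit 3: pooling over the entire clip length in the time direction
global_pool = nn.MaxPool3d((T, 3, 3), stride=1, padding=(0, 1, 1))

local_out = local_pool(feat)                              # (1, 64, T, 56, 56)
global_out = global_pool(feat).expand(-1, -1, T, -1, -1)  # repeated over time
print(local_out.shape, global_out.shape)
```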
In practical applications, the video classification system can be trained with optimization algorithms such as stochastic gradient descent (SGD), and the data set can be mainstream video task data. According to the experimental results of training on such data sets, within this network structure the video classification method disclosed in the application provides higher precision, faster convergence and better robustness. Compared with the most advanced networks at present, video classification with only 8 input frames is more accurate than the non-local-R50 (non-local R50 network) with 32 input frames, and it reaches the same precision as the non-local-R50 with 128 input frames while using 4.9 times fewer giga floating-point operations (GFLOPs). In addition, under the same GFLOPs, the performance of the video classification method disclosed in the application with 8 input frames is better than that of the current most advanced SlowFast-R50 (fast-slow combined R50) network with 36 input frames. These results indicate that the video classification model described in the application is an accurate and efficient video classification model.
In addition, the application also provides a video classification model training method, which comprises the following steps: obtaining a sample video in a sample video set and a sample classification result of the sample video, wherein the sample video comprises a plurality of video frames; extracting spatial feature information in a sample video through two-dimensional convolution; extracting time characteristic information in the sample video through pooling; fusing the spatial characteristic information and the time characteristic information to obtain fused characteristic information, and carrying out full-connection processing on the fused characteristic information to obtain a model classification result; and correcting parameters of the two-dimensional convolution according to the model classification result and the sample classification result, and returning to the step of extracting spatial feature information in the sample video through the two-dimensional convolution until the model classification result and the sample classification result meet preset conditions to obtain a trained video classification model.
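A hedged training-loop sketch for the training method above is shown below; it uses stochastic gradient descent as mentioned earlier, and the stand-in model, the toy batch of sample videos, the labels and the hyper-parameters are all placeholders rather than values from the application.

```python
import torch
import torch.nn as nn

# stand-in classifier: a per-frame 2D-style convolution followed by pooling and a full connection layer
model = nn.Sequential(
    nn.Conv3d(3, 8, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

sample_videos = torch.randn(4, 3, 8, 56, 56)   # a toy batch of sample videos
sample_labels = torch.randint(0, 10, (4,))     # their sample classification results

for step in range(3):                          # a few illustrative iterations
    optimizer.zero_grad()
    model_result = model(sample_videos)        # model classification result
    loss = criterion(model_result, sample_labels)
    loss.backward()                            # gradients used to correct the convolution parameters
    optimizer.step()
    print(step, loss.item())
```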
The structure of this video classification model is consistent with the neural network model adopted by the video classification method shown in fig. 3, and repeated description is omitted here.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 7 is a schematic diagram of a video classification apparatus according to an embodiment of the present application, where the video classification apparatus includes:
a to-be-classified video acquiring unit 701 configured to acquire a to-be-classified video, where the to-be-classified video includes a plurality of video frames;
a classifying unit 702, configured to input the video to be classified into a trained video classification model for processing, and output a classification result of the video to be classified; the video classification model comprises a feature extraction layer and a full connection layer, the feature extraction layer is used for extracting the spatial feature information of the video frames through two-dimensional convolution, extracting the time feature information of the video frames through pooling, and fusing the spatial feature information and the time feature information to output fused feature information, and the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result.
The video classification apparatus shown in fig. 7 corresponds to the video classification method shown in fig. 3. By the video classification device, the video classification method described in any of the above embodiments can be performed.
Fig. 8 is a schematic diagram of a video classification apparatus according to an embodiment of the present application. As shown in fig. 8, the video classification device 8 of this embodiment includes: a processor 80, a memory 81 and a computer program 82, such as a video classification program, stored in said memory 81 and executable on said processor 80. The processor 80 implements the steps in the various video classification method embodiments described above when executing the computer program 82. Alternatively, the processor 80 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 82.
Illustratively, the computer program 82 may be partitioned into one or more modules/units that are stored in the memory 81 and executed by the processor 80 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 82 in the video classification device 8.
The video classification device 8 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing device. The video classification device may include, but is not limited to, a processor 80, a memory 81. It will be appreciated by those skilled in the art that fig. 8 is merely an example of a video classification device 8 and does not constitute a limitation of the video classification device 8 and may include more or fewer components than shown, or some components may be combined, or different components, for example the video classification device may also include input output devices, network access devices, buses, etc.
The Processor 80 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the video classification device 8, such as a hard disk or a memory of the video classification device 8. The memory 81 may also be an external storage device of the video classification device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the video classification device 8. Further, the memory 81 may also include both an internal storage unit of the video classification device 8 and an external storage device. The memory 81 is used for storing the computer program and other programs and data required by the video classification apparatus. The memory 81 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed or illustrated in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the apparatus/terminal device embodiments described above are merely illustrative: the division into modules or units is only a logical division, and other divisions are possible in actual implementation; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate parts may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated modules/units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the above embodiments may be implemented by a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content the computer-readable medium may contain can be appropriately expanded or limited according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and replacements do not cause the corresponding technical solutions to depart in essence from the spirit and scope of the embodiments of the present application, and they shall fall within the protection scope of the present application.

Claims (12)

1. A method for video classification, the method comprising:
acquiring a video to be classified, wherein the video to be classified comprises a plurality of video frames;
inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified; wherein the video classification model comprises a feature extraction layer and a fully connected layer, the feature extraction layer is configured to extract spatial feature information of the video frames through two-dimensional convolution, extract temporal feature information of the video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is configured to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
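For illustration only, a minimal sketch of the model of claim 1 is given below in PyTorch; the class name TinyVideoClassifier, the channel and class counts, the choice of average pooling, and the global average taken before the fully connected layer are assumptions made for this sketch, not details taken from the application.

import torch
import torch.nn as nn

class TinyVideoClassifier(nn.Module):
    def __init__(self, channels=3, num_classes=10):
        super().__init__()
        # Spatial feature information: per-frame two-dimensional convolution.
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal feature information: pooling across neighbouring frames (average pooling assumed).
        self.temporal_pool = nn.AvgPool3d(kernel_size=(3, 3, 3), stride=1, padding=1)
        # Fully connected layer mapping the fused features to class scores.
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, video):                      # video: (batch, channels, frames, height, width)
        b, c, f, h, w = video.shape
        frames = video.transpose(1, 2).reshape(b * f, c, h, w)
        spatial = self.spatial_conv(frames).reshape(b, f, c, h, w).transpose(1, 2)
        temporal = self.temporal_pool(video)       # same shape as the input clip
        fused = spatial + temporal                 # fused feature information
        return self.fc(fused.mean(dim=(2, 3, 4)))  # global average, then fully connected processing

scores = TinyVideoClassifier()(torch.randn(2, 3, 8, 32, 32))   # 2 clips of 8 RGB frames
print(scores.shape)                                            # torch.Size([2, 10])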
2. The method according to claim 1, wherein the feature extraction layer comprises N feature extraction sublayers, N being greater than or equal to 1; input information of the first of the N feature extraction sublayers is the plurality of video frames, output information of each feature extraction sublayer is input information of the next feature extraction sublayer, and output information of the Nth feature extraction sublayer is the fused feature information output by the feature extraction layer; each of the N feature extraction sublayers comprises a large receptive-field context feature extraction branch and a small receptive-field core feature extraction branch, and each of the N feature extraction sublayers processes its input information by:
performing pooling processing on the input information through the large receptive-field context feature extraction branch to extract temporal feature information of the input information;
performing two-dimensional convolution processing on the input information through the small receptive-field core feature extraction branch to extract spatial feature information of the input information; and
fusing the temporal feature information extracted by the large receptive-field context feature extraction branch and the spatial feature information extracted by the small receptive-field core feature extraction branch to obtain output information.
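For illustration only, one such feature extraction sublayer and the chaining of N sublayers might be sketched as follows in PyTorch; the class name FeatureExtractionSublayer, the number of sublayers, the channel width, the use of average pooling, and the two-dimensional convolution applied after pooling (anticipating claim 3) are assumptions made for this sketch.

import torch
import torch.nn as nn

class FeatureExtractionSublayer(nn.Module):
    """Small receptive-field core branch (2D convolution) fused with a large
    receptive-field context branch (3D pooling followed by 2D convolution)."""
    def __init__(self, channels, t=3, k=3):
        super().__init__()
        self.core_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.context_pool = nn.AvgPool3d(kernel_size=(t, k, k), stride=1,
                                         padding=(t // 2, k // 2, k // 2))
        self.context_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        frames = x.transpose(1, 2).reshape(b * f, c, h, w)
        core = self.core_conv(frames)                                   # spatial feature information
        context = self.context_pool(x).transpose(1, 2).reshape(b * f, c, h, w)
        context = self.context_conv(context)                            # temporal feature information
        fused = core + context                                          # fuse the two branches
        return fused.reshape(b, f, c, h, w).transpose(1, 2)

# The output of each sublayer is the input of the next; the Nth output is the
# fused feature information produced by the feature extraction layer.
feature_extraction_layer = nn.Sequential(*[FeatureExtractionSublayer(channels=3) for _ in range(4)])
out = feature_extraction_layer(torch.randn(2, 3, 8, 32, 32))
print(out.shape)                                   # torch.Size([2, 3, 8, 32, 32])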
3. The method of claim 2, wherein performing pooling processing on the input information through the large receptive-field context feature extraction branch to extract temporal feature information of the input information comprises:
performing three-dimensional pooling processing on the input information through the large receptive-field context feature extraction branch to obtain pooled information; and
performing two-dimensional convolution processing on the pooled information through the large receptive-field context feature extraction branch to obtain the temporal feature information.
4. The method of claim 3, wherein performing three-dimensional pooling processing on the input information through the large receptive-field context feature extraction branch to obtain pooled information comprises:
performing pooling processing on the input information through a three-dimensional pooling kernel {t, K, K} in the large receptive-field context feature extraction branch to obtain the pooled information, where t is the kernel size in the time direction, t being smaller than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel is the size of the pooled pixel region selected in a single pooling computation.
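For illustration only, a {t, K, K} pooling kernel could be expressed as follows; average pooling and the padding that keeps the output the same size as the input are assumptions made for this sketch.

import torch
import torch.nn as nn

t, K = 3, 3                            # t: kernel size along time (no larger than the number of frames); K: spatial kernel size
pool = nn.AvgPool3d(kernel_size=(t, K, K), stride=1, padding=(t // 2, K // 2, K // 2))

clip = torch.randn(1, 64, 8, 28, 28)   # (batch, channels, frames, height, width)
pooled = pool(clip)                    # each output value aggregates a t x K x K neighbourhood
print(pooled.shape)                    # torch.Size([1, 64, 8, 28, 28])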
5. The method according to claim 4, wherein the feature extraction layer comprises N three-dimensional pooling kernels, the sizes of the N three-dimensional pooling kernels being all identical, all different, or partially identical, and a three-dimensional pooling kernel being the size of the pooled pixel region selected in a single pooling computation.
6. The method of claim 4 or 5, wherein the sizes of the N three-dimensional pooling kernels not being all identical comprises:
gradually increasing the size of the three-dimensional pooling kernel following the order in which feature information is extracted.
7. The method of claim 6, wherein gradually increasing the size of the three-dimensional pooling kernel comprises:
gradually increasing the size of the three-dimensional pooling kernel in the time direction;
or gradually increasing the size of the three-dimensional pooling kernel in the two-dimensional space where the video frame is located;
or gradually increasing both the size of the three-dimensional pooling kernel in the time direction and its size in the two-dimensional space where the video frame is located.
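For illustration only, hypothetical {t, K, K} schedules over four sublayers, corresponding to the three options of claim 7, might look like the following; the concrete sizes are invented for the example and are not taken from the application.

# Each list gives the (t, K, K) pooling kernel per feature extraction sublayer, shallow to deep.
schedules = {
    "grow_time_only":      [(1, 3, 3), (3, 3, 3), (5, 3, 3), (7, 3, 3)],
    "grow_space_only":     [(3, 3, 3), (3, 5, 5), (3, 7, 7), (3, 9, 9)],
    "grow_time_and_space": [(1, 3, 3), (3, 5, 5), (5, 7, 7), (7, 9, 9)],
}
for name, kernels in schedules.items():
    print(name, kernels)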
8. The method according to claim 3, wherein the convolution parameters of the two-dimensional convolution processing in the large receptive-field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution processing in the small receptive-field core feature extraction branch.
9. The method of claim 1, wherein fusing the spatial feature information and the temporal feature information to output fused feature information comprises:
superimposing the image of the spatial feature information and the image of the temporal feature information to generate the fused feature information.
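For illustration only, the superposition of claim 9 can be read as element-wise addition of the spatial and temporal feature maps; the shapes below are placeholders chosen for the example.

import torch

spatial_features = torch.randn(2, 64, 8, 28, 28)        # from the two-dimensional convolution branch
temporal_features = torch.randn(2, 64, 8, 28, 28)       # from the pooling branch
fused_features = spatial_features + temporal_features   # superimpose the two feature maps
print(fused_features.shape)                             # torch.Size([2, 64, 8, 28, 28])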
10. An apparatus for video classification, the apparatus comprising:
a to-be-classified video acquisition unit, configured to acquire a video to be classified, wherein the video to be classified comprises a plurality of video frames; and
a classification unit, configured to input the video to be classified acquired by the to-be-classified video acquisition unit into a trained video classification model for processing, and to output a classification result of the video to be classified; wherein the video classification model comprises a feature extraction layer and a fully connected layer, the feature extraction layer is configured to extract spatial feature information of the video frames through two-dimensional convolution, extract temporal feature information of the video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is configured to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
11. A video classification device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, causes the video classification device to carry out the steps of the method according to any one of claims 1 to 9.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202010531316.9A 2020-06-11 2020-06-11 Video classification method, apparatus, device and computer readable storage medium Active CN111859023B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010531316.9A CN111859023B (en) 2020-06-11 2020-06-11 Video classification method, apparatus, device and computer readable storage medium
PCT/CN2020/134995 WO2021248859A1 (en) 2020-06-11 2020-12-09 Video classification method and apparatus, and device, and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111859023A true CN111859023A (en) 2020-10-30
CN111859023B CN111859023B (en) 2024-05-03

Family

ID=72986143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010531316.9A Active CN111859023B (en) 2020-06-11 2020-06-11 Video classification method, apparatus, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111859023B (en)
WO (1) WO2021248859A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677704B (en) * 2022-02-23 2024-03-26 西北大学 Behavior recognition method based on three-dimensional convolution and space-time feature multi-level fusion
CN115130539A (en) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 Classification model training method, data classification device and computer equipment
CN115223079A (en) * 2022-06-30 2022-10-21 海信集团控股股份有限公司 Video classification method and device
CN116824641B (en) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10706148B2 (en) * 2017-12-18 2020-07-07 Paypal, Inc. Spatial and temporal convolution networks for system calls based process monitoring
CN109670446B (en) * 2018-12-20 2022-09-13 泉州装备制造研究所 Abnormal behavior detection method based on linear dynamic system and deep network
CN110781830B (en) * 2019-10-28 2023-03-10 西安电子科技大学 SAR sequence image classification method based on space-time joint convolution
CN110766096B (en) * 2019-10-31 2022-09-23 北京金山云网络技术有限公司 Video classification method and device and electronic equipment
CN111859023B (en) * 2020-06-11 2024-05-03 中国科学院深圳先进技术研究院 Video classification method, apparatus, device and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN107292247A (en) * 2017-06-05 2017-10-24 浙江理工大学 A kind of Human bodys' response method and device based on residual error network
CN107463949A (en) * 2017-07-14 2017-12-12 北京协同创新研究院 A kind of processing method and processing device of video actions classification
CN108304926A (en) * 2018-01-08 2018-07-20 中国科学院计算技术研究所 A kind of pond computing device and method suitable for neural network
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN110032926A (en) * 2019-02-22 2019-07-19 哈尔滨工业大学(深圳) A kind of video classification methods and equipment based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Du Tran et al.: "A Closer Look at Spatiotemporal Convolutions for Action Recognition", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
Chen Chen et al.: "Research on task-related few-shot deep learning methods for image classification" (任务相关的图像小样本深度学习分类方法研究), Journal of Integration Technology (集成技术), vol. 9, no. 3 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021248859A1 (en) * 2020-06-11 2021-12-16 中国科学院深圳先进技术研究院 Video classification method and apparatus, and device, and computer readable storage medium
CN112580696A (en) * 2020-12-03 2021-03-30 星宏传媒有限公司 Advertisement label classification method, system and equipment based on video understanding
CN112597824A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Behavior recognition method and device, electronic equipment and storage medium
WO2022183805A1 (en) * 2021-03-05 2022-09-09 深圳先进技术研究院 Video classification method, apparatus, and device
CN113536898A (en) * 2021-05-31 2021-10-22 大连民族大学 Full-scale feature capture type time convolution network, video motion segmentation method, computer system, and medium
CN113536898B (en) * 2021-05-31 2023-08-29 大连民族大学 Comprehensive feature capturing type time convolution network, video motion segmentation method, computer system and medium

Also Published As

Publication number Publication date
WO2021248859A9 (en) 2022-02-10
CN111859023B (en) 2024-05-03
WO2021248859A1 (en) 2021-12-16

Similar Documents

Publication Publication Date Title
CN111859023B (en) Video classification method, apparatus, device and computer readable storage medium
CN109522874B (en) Human body action recognition method and device, terminal equipment and storage medium
EP4156017A1 (en) Action recognition method and apparatus, and device and storage medium
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN110826596A (en) Semantic segmentation method based on multi-scale deformable convolution
CN112070044B (en) Video object classification method and device
WO2022121485A1 (en) Image multi-tag classification method and apparatus, computer device, and storage medium
WO2023174098A1 (en) Real-time gesture detection method and apparatus
CN106664467A (en) Real time video summarization
CN110781980B (en) Training method of target detection model, target detection method and device
CN114663593B (en) Three-dimensional human body posture estimation method, device, equipment and storage medium
CN112200041B (en) Video motion recognition method and device, storage medium and electronic equipment
CN114494981B (en) Action video classification method and system based on multi-level motion modeling
CN109859166B (en) Multi-column convolutional neural network-based parameter-free 3D image quality evaluation method
CN111444365B (en) Image classification method, device, electronic equipment and storage medium
CN111539290A (en) Video motion recognition method and device, electronic equipment and storage medium
CN113065645A (en) Twin attention network, image processing method and device
CN109614933A (en) A kind of motion segmentation method based on certainty fitting
CN110163052B (en) Video action recognition method and device and machine equipment
CN114519877A (en) Face recognition method, face recognition device, computer equipment and storage medium
CN112950640A (en) Video portrait segmentation method and device, electronic equipment and storage medium
CN111242068B (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN110852224B (en) Expression recognition method and related device
CN112926472A (en) Video classification method, device and equipment
CN110633630B (en) Behavior identification method and device and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant