CN111859023B - Video classification method, apparatus, device and computer readable storage medium - Google Patents


Info

Publication number
CN111859023B
CN111859023B
Authority
CN
China
Prior art keywords
video
information
feature extraction
pooling
dimensional
Prior art date
Legal status
Active
Application number
CN202010531316.9A
Other languages
Chinese (zh)
Other versions
CN111859023A
Inventor
乔宇
王亚立
李先航
周志鹏
邹静
Current Assignee
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010531316.9A
Publication of CN111859023A
Priority to PCT/CN2020/134995 (WO2021248859A1)
Application granted
Publication of CN111859023B


Classifications

    • G06F 16/75: Information retrieval of video data; Clustering; Classification
    • G06F 18/24: Pattern recognition; Analysing; Classification techniques
    • G06N 3/045: Neural networks; Architecture; Combinations of networks
    • G06N 3/08: Neural networks; Learning methods


Abstract

The application belongs to the field of image processing and discloses a video classification method, a video classification apparatus, a video classification device, and a computer readable storage medium. The video classification method comprises: obtaining a video to be classified; and inputting the video to be classified into a trained video classification model for processing and outputting a classification result of the video to be classified. The video classification model comprises a feature extraction layer and a full connection layer: the feature extraction layer extracts spatial feature information through two-dimensional convolution, extracts time feature information through pooling, and fuses the spatial feature information and the time feature information to output fused feature information; the full connection layer performs full connection processing on the fused feature information to obtain the classification result. Compared with three-dimensional convolution kernel calculation, the embodiments of the application obtain the feature information of the time dimension of the video to be classified through pooling, and the two-dimensional convolution that is adopted greatly reduces the number of convolution parameters to be calculated, which helps reduce the amount of computation for video classification.

Description

Video classification method, apparatus, device and computer readable storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a video classification method, apparatus, device, and computer readable storage medium.
Background
To facilitate image management, image content may be identified and categorized by way of deep learning. In recent years, with the significant breakthrough of convolutional neural networks in image classification tasks, the accuracy of classifying images through two-dimensional convolutional neural networks even exceeds the accuracy of human classification.
Since two-dimensional convolutional neural networks can classify images accurately, they can also be applied to classifying videos, which are composed of image frames. However, video data has one more dimension, time, than still pictures, so in order to extract temporal information from a video, a three-dimensional convolution kernel that includes the time dimension is generally used to extract features in time and space simultaneously. Convolution with a three-dimensional kernel, however, introduces additional parameters compared with two-dimensional convolution, which increases the amount of computation.
Disclosure of Invention
In view of this, embodiments of the present application provide a video classification method, apparatus, device, and computer readable storage medium, so as to solve the problem in the prior art that when video classification is performed by convolution calculation through a three-dimensional convolution kernel, additional parameters are added relative to two-dimensional convolution calculation, resulting in an increase in calculation amount.
A first aspect of an embodiment of the present application provides a video classification method, including:
Acquiring a video to be classified, wherein the video to be classified comprises a plurality of video frames;
Inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified, wherein the video classification model comprises a feature extraction layer and a full connection layer; the feature extraction layer is used for extracting spatial feature information of the plurality of video frames through two-dimensional convolution, extracting time feature information of the plurality of video frames through pooling, and fusing the spatial feature information and the time feature information to output fused feature information; and the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the feature extraction layer includes N feature extraction sublayers, N is greater than or equal to 1, input information of a first feature extraction sublayer among the N feature extraction sublayers is the plurality of video frames, output information of a previous feature extraction sublayer is input information of a next feature extraction sublayer, and output information of an nth feature extraction sublayer is fusion feature information output by the feature extraction layer; each of the N feature extraction sublayers includes a large receptive field context feature extraction branch and a small receptive field core feature extraction branch, and the processing of the input information by each of the N feature extraction sublayers includes:
Carrying out pooling treatment on the input information through a large receptive field context feature extraction branch, and extracting time feature information of the input information;
Carrying out two-dimensional convolution processing on the input information through a small receptive field core feature extraction branch, and extracting spatial feature information of the input information;
And fusing the time characteristic information extracted by the large receptive field context characteristic extraction branch and the space characteristic information extracted by the small receptive field core characteristic extraction branch to obtain output information.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the pooling processing is performed on the input information through a large receptive field context feature extraction branch, and extracting time feature information of the input information includes:
carrying out three-dimensional pooling treatment on the input information through a large receptive field context feature extraction branch to obtain pooling information;
And carrying out two-dimensional convolution processing on the pooled information through a large receptive field context feature extraction branch to obtain time feature information.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, performing three-dimensional pooling processing on the input information through a large receptive field context feature extraction branch to obtain pooled information, where the pooling processing includes:
carrying out pooling processing on the input information through the three-dimensional pooling kernel {t, K, K} in the large receptive field context feature extraction branch to obtain the pooled information, wherein t is the size of the kernel in the time direction and is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel defines the pooling pixels selected in a single pooling calculation.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, among the N three-dimensional pooling kernels included in the feature extraction layer, the sizes of the N three-dimensional pooling kernels are all identical, or are all different, or some of the N three-dimensional pooling kernels have the same size, where a three-dimensional pooling kernel defines the pooling pixels selected in a single pooling calculation.
With reference to the third or fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, making the sizes of the N three-dimensional pooling kernels all different includes:
gradually increasing the size of the three-dimensional pooling kernel following the order in which the feature information is extracted.
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, gradually increasing the size of the three-dimensional pooling kernel includes:
gradually increasing the size of the three-dimensional pooling kernel in the time direction;
or gradually increasing the size of the three-dimensional pooling kernel in the two-dimensional space where the video frame is located;
or gradually increasing both the size of the three-dimensional pooling kernel in the time direction and its size in the two-dimensional space where the video frame is located.
With reference to the second possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, a convolution parameter of the two-dimensional convolution process in the large receptive field context feature extraction branch is the same as a convolution parameter of the two-dimensional convolution process in the small receptive field core feature extraction branch.
With reference to the first aspect, in an eighth possible implementation manner of the first aspect, fusing the spatial feature information and the temporal feature information to output fused feature information includes:
and superposing the image of the spatial characteristic information and the image of the time characteristic information to generate the fusion characteristic information.
In a second aspect, an embodiment of the present application provides a video classification apparatus, including:
a video-to-be-classified obtaining unit, configured to obtain a video to be classified, wherein the video to be classified comprises a plurality of video frames;
a classification unit, configured to input the video to be classified into a trained video classification model for processing and to output a classification result of the video to be classified, wherein the video classification model comprises a feature extraction layer and a full connection layer; the feature extraction layer is used for extracting spatial feature information of the plurality of video frames through two-dimensional convolution, extracting time feature information of the plurality of video frames through pooling, and fusing the spatial feature information and the time feature information to output fused feature information; and the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result.
A third aspect of the embodiments of the present application provides a video classification device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, causes the video classification device to implement the video classification method according to any one of the first aspect.
A fourth aspect of embodiments of the present application provides a computer readable storage medium storing a computer program which when executed by a processor implements the video classification method of any of the first aspects.
Compared with the prior art, the embodiment of the application has the beneficial effects that: according to the method, spatial feature information of a plurality of video frames in the video to be classified is extracted through two-dimensional convolution by the classification model, time feature information of the plurality of video frames is extracted through pooling, the time feature information and the spatial feature information are fused, and a classification result is obtained through a full connection layer. Because the time characteristic information of the video to be classified can be obtained through pooling, compared with three-dimensional convolution kernel calculation, the method and the device can greatly reduce the calculation of convolution parameters by adopting a two-dimensional convolution calculation mode while retaining the time characteristic information, and are beneficial to reducing the calculation amount of video classification. The embodiment of the application can be inserted into any two-dimensional convolution network to classify the video, thereby being beneficial to improving the diversity and the universality of the video classification method.
Drawings
Fig. 1 is a schematic view of a video classification application scenario provided in an embodiment of the present application;
Fig. 2 is a schematic diagram of prior art video classification using three-dimensional convolution;
Fig. 3 is a schematic flow chart of an implementation of a video classification method according to an embodiment of the present application;
Fig. 4 is a schematic implementation diagram of a video classification method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of an implementation of video classification according to an embodiment of the present application;
Fig. 6 is a schematic diagram of an implementation of yet another video classification according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a video classification apparatus according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a video classification device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to illustrate the technical scheme of the application, the following description is made by specific examples.
With the rise of video data, video classification techniques are required for use in more and more scenes. The video classification method provided by the embodiment of the application is used for classifying and managing the videos, so that the convenience of video use can be effectively improved.
For example, in the field of intelligent surveillance, collected surveillance videos are classified by a video classification technique to judge whether the video content is abnormal. The video classification method provided by the application is insensitive to how fast actions change from frame to frame and can effectively model actions of different durations. Classifying surveillance videos with such a model can help a user quickly find key monitoring information, or promptly send an abnormality prompt to monitoring personnel so that they can handle the abnormality in the surveillance video in time.
For example, when a large number of videos are stored on a device, they can be classified by the video classification technique into videos of different scenes, different moods, different styles, and other categories, so that a user can quickly find the videos they need.
For example, in intelligent sports training or video-assisted refereeing, the videos include sports with faster actions, such as basketball, gymnastics, or speed skating, and sports with slower actions, such as yoga. Because the video classification method is insensitive to action speed and action duration, the actions in such sports videos can be classified reliably.
For another example, as shown in fig. 1, in the video entertainment platform, the platform server receives the video shot by the user and uploaded by the terminal a, and performs classification processing on the uploaded video to obtain the category of the video uploaded by the terminal a. As the number of uploaded videos increases, so does the number of videos for the same category. When other terminals, such as terminal B browse, the video category browsed by terminal B is obtained through the pre-classification result. The platform can search other videos in the same category according to the category of the video browsed by the terminal B and recommend the videos to the terminal B, so that the use experience of the user for browsing the videos is improved.
However, in the video classification algorithms commonly used at present, as shown in fig. 2, a three-dimensional convolution kernel containing time information, such as a 3×1×1 temporal convolution kernel, is selected to perform the convolution operation on the video to be classified. The three-dimensional convolution covers the width W and height H of the image as well as the time length T, so when the convolution is calculated, parameters for the time dimension are computed in addition to the parameters for the spatial dimensions, i.e. the dimensions of the width W and height H shown in fig. 2. Compared with a traditional two-dimensional convolution kernel, the three-dimensional convolution kernel therefore adds a large number of parameters for the time dimension and increases the amount of computation for video classification.
In order to reduce the calculation amount during the calculation of video classification, an embodiment of the present application provides a video classification method, as shown in fig. 3, where the video classification method includes:
In step S301, a video to be classified is acquired, the video to be classified including a plurality of video frames.
The video to be classified in the embodiment of the application can be a video stored in a user terminal, a video collected by monitoring equipment or a video uploaded by a platform user received by a video entertainment platform. When the video is collected by the monitoring equipment, the video collected in real time can be divided into a plurality of sub-video segments according to a preset time period, and the collected sub-video segments are classified, so that whether the abnormality exists in the sub-video segments or not is judged.
The video to be classified comprises a plurality of video frames, and the video frames are sequentially arranged according to a time sequence. From the video to be classified, spatial information of the width W and the height H of each video frame can be determined. According to the time interval between the video frames and the initial playing time, the playing time corresponding to each video frame can be determined.
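As an illustration only (not part of the patent text), the sketch below shows one common way of arranging such a clip for processing: a tensor with batch, channel, time, height, and width axes, together with per-frame timestamps derived from the frame interval. The frame count, resolution, and 25 fps frame rate are assumed values.

```python
# Illustrative sketch; shapes and the 25 fps frame rate are assumptions.
import torch

num_frames, height, width = 8, 224, 224
clip = torch.randn(1, 3, num_frames, height, width)      # (batch, channels C, time T, H, W)

start_time, frame_interval = 0.0, 1.0 / 25                # assumed start time and frame rate
timestamps = [start_time + i * frame_interval for i in range(num_frames)]
print(clip.shape, timestamps[:3])
```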
In step S302, the video to be classified is input into a trained video classification model for processing, and a classification result of the video to be classified is output. The video classification model comprises a feature extraction layer and a full connection layer: the feature extraction layer is used for extracting spatial feature information of the plurality of video frames through two-dimensional convolution, extracting time feature information of the plurality of video frames through pooling, and fusing the spatial feature information and the time feature information to output fused feature information; the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result.
The feature extraction layer may include a large receptive field context feature extraction branch and a small receptive field core feature extraction branch. The large receptive field context feature extraction branch is used for extracting time feature information, or space-time feature information that includes the time feature information; here the context features correspond to the time feature information. The large receptive field can be obtained by cascading a plurality of feature extraction sublayers, or by gradually increasing the size of the three-dimensional pooling kernel. The small receptive field core feature extraction branch is used for extracting spatial feature information in the two-dimensional plane of each video frame of the video to be classified. The feature extraction layer is also used for fusing the extracted time feature information and spatial feature information to obtain the fused feature information. That is, with this dual-branch structure, the context information extracted by the large receptive field context extraction branch and the core features extracted by the small receptive field core feature extraction branch can both be obtained effectively.
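As a rough illustration of this dual-branch idea (a sketch under assumptions, not the patented implementation; the class name, channel counts, and kernel sizes are invented for the example), the small branch below applies a frame-wise two-dimensional convolution, the big branch applies three-dimensional max pooling followed by the same convolution, and the two outputs are fused by point-wise addition:

```python
import torch
import torch.nn as nn

class SmallBigUnit(nn.Module):
    """Dual-branch sketch: small receptive field (2D conv) + large receptive field (3D pooling)."""
    def __init__(self, in_ch, out_ch, k=3, t=3):
        super().__init__()
        # 2D convolution written as a 3D conv whose time extent is 1
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                              padding=(0, k // 2, k // 2), bias=False)
        # 3D pooling over a {t, k, k} neighborhood; padding keeps the output size unchanged
        self.pool = nn.MaxPool3d(kernel_size=(t, k, k), stride=1,
                                 padding=(t // 2, k // 2, k // 2))

    def forward(self, x):                  # x: (batch, C, T, H, W)
        small = self.conv(x)               # spatial features of each frame
        big = self.conv(self.pool(x))      # temporal/context features, same weights
        return small + big                 # point-wise fusion

x = torch.randn(2, 3, 8, 56, 56)
print(SmallBigUnit(3, 64)(x).shape)        # torch.Size([2, 64, 8, 56, 56])
```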
In a possible implementation manner, in the video classification model, the feature extraction layer may include N feature extraction sub-layers, where N is greater than or equal to 1.
For example, the feature extraction layer may include a single feature extraction sub-layer: the fused feature information is output by this sub-layer, and the classification result is obtained by performing full connection processing on the fused feature information in the full connection layer.
When N is greater than or equal to 2, the output information of the feature extraction sub-layer of a previous stage is used as the input information of the feature extraction sub-layer of the next stage. For example, the fused feature information output by the i-th feature extraction sub-layer is used as the input information of the (i+1)-th feature extraction sub-layer. The i-th feature extraction sub-layer outputs fused feature information in which time feature information and spatial feature information have been fused, and the (i+1)-th feature extraction sub-layer can further extract feature information from it through pooling, where i is greater than or equal to 1 and less than N.
The fused feature information refers to feature information obtained by fusing time feature information and spatial feature information. The fusion process may be a superposition of feature information; for example, the image corresponding to the time feature information and the image corresponding to the spatial feature information may be superimposed pixel by pixel.
In order that the images corresponding to the time feature information and the spatial feature information have the same size at the time of fusion, the pooling process may be arranged so that the image sizes of its input information and its output information coincide.
In one implementation, the input information may be subjected to padding, that is, the input feature image or video frame is padded in the time dimension, and optionally also in the spatial dimensions, so that the output information obtained after the padded input information is pooled by the pooling kernel has the same size as the unpadded input information.
For example, if the size of the input information is n, the size of the pooling kernel is f, the stride is s, the total padding is p, and the size of the output information is o, the relation

o = floor((n + p - f) / s) + 1

can be used to calculate the amount of padding needed.
For example, for a pooling kernel of size 3 × 3 × 3 with a stride of 1, in order for the output information to have the same size as the input information, the total padding may be chosen to be 2 in each dimension, i.e. 1 on each side.
Here, a pooling kernel of size 3 × 3 × 3 means that the extent of the pooling kernel in the two-dimensional plane of the pooled image is 3 × 3, where the unit may be pixels or another predetermined unit of length, and its length in the time dimension is 3, where the unit may be a video duration, for example 3 seconds of video, from which the corresponding number of video frames can be determined. Of course, the definition of the three-dimensional pooling kernel is not limited to this; the size of the pooling kernel in the time dimension may also be specified directly as a number of video frames.
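A small helper (an illustration consistent with the relation above; the function name is our assumption) that solves o = floor((n + p - f) / s) + 1 for the total padding p that keeps o equal to n:

```python
def total_padding(n: int, f: int, s: int = 1) -> int:
    """Total padding p (both sides combined, per dimension) so the pooled output size equals n."""
    return (n - 1) * s + f - n

# For a kernel extent of 3 and a stride of 1, two padded positions per dimension are needed,
# i.e. one on each side, regardless of the input size n.
print(total_padding(n=56, f=3, s=1))   # 2
```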
The two-dimensional convolution refers to convolution performed in the plane of the video frame image, i.e. over the two dimensions of width and height, and the convolution kernel is therefore a kernel in this two-dimensional space.
When spatial feature information is extracted by two-dimensional convolution, the extraction can be completed with a convolution kernel of a preset, fixed size. Alternatively, an existing neural network model may be used to extract the spatial feature information, for example a convolutional neural network with a LeNet, AlexNet, ResNet, GoogLeNet, or VGGNet architecture. Therefore, in the process of extracting spatial feature information, the spatial feature information of the video frames in the video to be classified, i.e. the feature information in the width W and height H dimensions of the video frames, is obtained without changing the recognition capability of the convolutional neural network for the video frames.
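For illustration only, one common way to reuse such an existing two-dimensional backbone frame by frame is to fold the time axis into the batch axis; the sketch below uses a torchvision ResNet-18 trunk purely as a stand-in for the architectures named above, with assumed input sizes:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Keep the convolutional trunk only (drop the final average pooling and classifier).
trunk = nn.Sequential(*list(resnet18(weights=None).children())[:-2])

clip = torch.randn(2, 3, 8, 112, 112)                          # (B, C, T, H, W)
b, c, t, h, w = clip.shape
frames = clip.transpose(1, 2).reshape(b * t, c, h, w)          # fold time into the batch
feats = trunk(frames)                                          # per-frame spatial features
feats = feats.reshape(b, t, *feats.shape[1:]).transpose(1, 2)  # back to (B, C', T, H', W')
print(feats.shape)                                             # e.g. torch.Size([2, 512, 8, 4, 4])
```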
Because the video classification method can be inserted into any two-dimensional convolutional network, it achieves the effect of a three-dimensional convolutional network in acquiring time feature information without requiring special hardware, optimization of a deep learning platform, or a dedicated network design, which effectively improves the universality of the video classification method.
Compared with currently used plug-and-play video recognition modules, including the temporal shift module (TSM) and the non-local neural network (non-local) video recognition module, the method helps reduce the amount of computation in the classification process while ensuring the accuracy of the classification result.
In a possible implementation manner, three-dimensional pooling processing can be performed on the input information through a large receptive field context feature extraction branch to obtain pooled information, and then two-dimensional convolution processing is performed on the pooled information through the large receptive field context feature extraction branch to obtain time feature information.
For example, in the schematic structural diagram of the video classification method shown in fig. 4, the two-dimensional convolution performs a convolution operation based on a two-dimensional plane where the images of the single video frame are located, so that the spatial feature information of each frame image of the video to be classified is obtained on the premise that the complexity of extracting the feature information of the two-dimensional image is not increased, that is, the feature information of the width W and the height H dimensions of each video frame is obtained.
In one implementation, the convolution kernel of the two-dimensional convolution may be expressed as {C1, C2, 1, K}, where C1 is the number of channels of the input feature image, C2 is the number of channels of the output feature image, the "1" indicates that the kernel does not extend in the time dimension, i.e. each two-dimensional convolution operates only on the image of a single video frame, and K is the size of the convolution kernel in the two-dimensional space of the video frame.
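One way to realize such a kernel in practice (an assumption for illustration, not the only option) is a three-dimensional convolution whose time extent is fixed to 1, so that each video frame is convolved independently:

```python
import torch
import torch.nn as nn

C1, C2, K = 64, 128, 3                            # assumed channel counts and kernel size
frame_wise_conv = nn.Conv3d(C1, C2, kernel_size=(1, K, K), padding=(0, K // 2, K // 2))

x = torch.randn(2, C1, 8, 28, 28)                 # (B, C1, T, H, W)
print(frame_wise_conv(x).shape)                   # torch.Size([2, 128, 8, 28, 28])
```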
Time feature information is extracted through pooling, where the pooling may be, for example, maximum pooling, average pooling, or global average pooling. For example, when maximum pooling is selected, the pixels to be pooled are selected according to the pooling kernel, and the pixel with the largest value among them is taken as the pooled value.
In one implementation, the three-dimensional pooling kernel may be expressed as {t, K, K}, where t is the size of the pooling kernel in the time direction and K is its size in the two-dimensional space of the image. Specifically, t may be set to 3, or to T (the video length, i.e. the number of video frames or images corresponding to the video duration). Since the pooling operation requires no convolution calculation but only comparisons of values, the required amount of computation is very small.
For different values of the time-direction size parameter t, the number of video frames covered in one pooling operation also differs. Depending on the pooling stride, the same video frame may be covered by different pooling kernels. When the K value in the pooling kernel is greater than 1, the pooling kernel also pools several pixels, i.e. a region, in the two-dimensional space at the same time. To facilitate subsequent fusion, pooling with padding can be adopted to pad the edges of the pooled images, ensuring that the image sizes of the input information and the output information before and after pooling are consistent.
After the pooling process, convolution is performed on the pooled output. Because the pooled output already aggregates the space-time information of the neighboring t by K by K region, applying a convolution to it in the two-dimensional convolution manner yields the time feature information of the plurality of video frames.
In one implementation, the convolution operations of the small receptive field core feature extraction branch and the large receptive field context feature extraction branch may share the same convolution parameters. Therefore, when time feature information is extracted, no new convolution parameters for the time dimension need to be introduced; the time feature information is acquired without increasing the number of parameters, which reduces the amount of computation of the video classification model.
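The parameter sharing can be sketched as applying one and the same convolution module to both branches (all sizes below are assumptions); the pooled branch then yields temporal information without any additional convolution parameters:

```python
import torch
import torch.nn as nn

conv = nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False)
pool = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))

x = torch.randn(1, 64, 8, 28, 28)
small = conv(x)              # small receptive field branch: spatial features
big = conv(pool(x))          # large receptive field branch: reuses the same weights
print(small.shape == big.shape,                     # identical sizes, ready for fusion
      sum(p.numel() for p in conv.parameters()))    # convolution parameters are counted once
```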
Among the N three-dimensional pooling kernels included in the N feature extraction sublayers, any two kernels may differ in size, all N kernels may have the same size, or some kernels may have the same size while others differ.
In a possible implementation manner, the three-dimensional pooling core adopted by the three-dimensional pooling processing of the large receptive field context feature extraction branch can adopt time dimensions with different sizes or space dimensions with different sizes.
For example, adjusting the size of the pooling core adopted by the three-dimensional pooling may include adjusting the size of a time dimension or a time direction in the three-dimensional pooling core, or adjusting the size of the dimension of the three-dimensional pooling core in a two-dimensional space where a video frame is located, or adjusting the size of the three-dimensional pooling core in the time dimension and the space dimension to obtain three-dimensional pooling cores with different sizes, and calculating to obtain corresponding space-time feature information through the three-dimensional pooling cores with different sizes, where the space-time feature information includes time feature information.
In a possible implementation manner, the pooled feature images can be obtained by gradually increasing the size of the pooling kernel: increasing its size in the time dimension, or increasing its size in the two-dimensional space where the video frame is located, or increasing both at the same time. In this way, the time feature information obtained by pooling with the different kernels is fused step by step, yielding space-time feature information with finer granularity.
When the time feature information is extracted, as shown in fig. 4, the same convolution parameters are used for the convolution operations that produce the spatial feature information and the time feature information. The corresponding points of the two feature images therefore represent spatially consistent information, and their sizes match, so the fused feature information can be obtained by adding the spatial feature information and the time feature information point by point.
The fused feature information is obtained by fusing the spatial feature information and the time feature information, where the spatial feature information captures the spatial features of the video frames extracted by two-dimensional convolution and the time feature information captures the spatial and temporal features of the images extracted by pooling. The fused feature information therefore contains both the spatial and the spatio-temporal features of the images in the video to be classified. The full connection layer synthesizes the fused feature information and classifies the video to be classified accordingly to obtain the video classification result. For example, the fused feature information may be processed by a full connection calculation with preset full connection layer weight coefficients, and the video classification result may be determined by comparing the calculation result with a preset classification criterion.
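A minimal sketch of this fusion and classification step (the feature sizes, class count, and head design are assumptions): the two feature maps are added point by point, globally averaged, and passed through a full connection layer whose largest output is taken as the category:

```python
import torch
import torch.nn as nn

num_classes = 10                                     # assumed number of video categories
fc = nn.Linear(64, num_classes)

spatial = torch.randn(1, 64, 8, 28, 28)              # from the two-dimensional convolution branch
temporal = torch.randn(1, 64, 8, 28, 28)             # from the pooling branch (same size)

fused = spatial + temporal                           # point-by-point fusion
pooled = nn.AdaptiveAvgPool3d(1)(fused).flatten(1)   # (1, 64) global summary of fused features
logits = fc(pooled)                                  # full connection processing
print(logits.argmax(dim=1))                          # predicted category index
```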
Because the calculation of convolution parameters with time dimension is not needed in the video classification process, the space-time characteristic information of the video to be classified can be effectively obtained only by simple pooling operation, the calculation parameter quantity is reduced, and the video classification calculation complexity is reduced.
In a possible implementation manner of the present application, the video classification model may include two or more feature extraction sublayers, through which two or more space-time feature images are extracted (the video to be classified itself being regarded as one such space-time feature image). For example, in the video classification implementation diagram shown in fig. 5, the feature extraction layer includes two feature extraction sublayers; in the embodiments of the present application, a feature extraction sublayer may be referred to simply as a SmallBig unit. As shown in fig. 5, the feature extraction layer of the video classification model includes two feature extraction sublayers, SmallBig unit 1 and SmallBig unit 2. The fused feature information extracted by the preceding sublayer SmallBig unit 1 is used as the input of the next-stage sublayer SmallBig unit 2, and the full connection layer performs video classification on the fused feature information obtained by SmallBig unit 2 and outputs the category to which the video belongs.
Specifically, as shown in fig. 5, the video to be classified is input to the first-stage feature extraction sublayer SmallBig unit 1, and a first convolution operation, a two-dimensional convolution, is performed on the plurality of video frames to obtain the spatial feature information they contain. A first pooling operation is performed on the video frames in the time dimension, in which a three-dimensional pooling kernel with a preset duration parameter pools the plurality of video frames of the video to be classified. The pooled images are then convolved in a second convolution operation, a two-dimensional convolution that shares the convolution parameters of the first convolution operation, to obtain the time feature information. The spatial feature information and the time feature information are then fused: since the images corresponding to the two kinds of feature information have the same size, the corresponding pixel points of the two images are added to obtain fused feature information that includes both the spatial and the space-time features and may comprise multiple frames of images.
The fused feature information is input to the second-stage feature extraction sublayer SmallBig unit 2. A third convolution operation on the image of each channel of the fused feature information further extracts the spatial feature information within the fused feature information from SmallBig unit 1. A second pooling operation is performed in the time dimension, following the temporal order of the channels, on the images of the several channels in the fused feature information output by SmallBig unit 1, and a fourth convolution operation is performed on the pooled information obtained by the second pooling operation to further extract the time feature information of the images in the fused feature information from SmallBig unit 1. The fourth convolution operation and the third convolution operation use the same convolution parameters.
Of course, the number of feature extraction sublayers (SmallBig units) may also be three or more. Fig. 6 is a schematic diagram of video classification implemented with three feature extraction sublayers according to an embodiment of the present application; on the basis of fig. 5, a third-stage feature extraction sublayer, SmallBig unit 3, is added. The fused feature information output by the first-stage sublayer SmallBig unit 1 is processed by the second-stage convolution and pooling operations, and the second-stage sublayer SmallBig unit 2 fuses the resulting time feature information and spatial feature information to obtain its own fused feature information. The third-stage sublayer SmallBig unit 3 processes the fused feature information output by SmallBig unit 2 through two-dimensional convolution and pooling, further extracts time feature information and spatial feature information, and fuses them into the fused feature information output by SmallBig unit 3.
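Written functionally, the two-unit cascade of fig. 5 can be sketched as below (channel counts, kernel sizes, and the ten-class head are assumptions); the fused output of the first unit is the input of the second, and the final fused features go to the full connection layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def smallbig(x, conv, pool):
    return conv(x) + conv(pool(x))                    # fuse spatial and temporal features

pool = nn.MaxPool3d((3, 3, 3), stride=1, padding=1)
conv1 = nn.Conv3d(3, 32, (1, 3, 3), padding=(0, 1, 1), bias=False)
conv2 = nn.Conv3d(32, 64, (1, 3, 3), padding=(0, 1, 1), bias=False)
fc = nn.Linear(64, 10)

clip = torch.randn(1, 3, 8, 56, 56)
u1 = smallbig(clip, conv1, pool)                      # SmallBig unit 1
u2 = smallbig(u1, conv2, pool)                        # SmallBig unit 2
logits = fc(F.adaptive_avg_pool3d(u2, 1).flatten(1))  # full connection layer
print(logits.shape)                                   # torch.Size([1, 10])
```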
In a possible implementation manner, the feature extraction layer is further configured to superimpose the video to be classified with the fusion feature information output by the feature extraction layer, and form a residual connection to update the fusion feature information.
For example, for the video classification model shown in fig. 6, the data fused in the third-stage feature extraction sublayer SmallBig unit 3 includes the time feature information and spatial feature information calculated by that sublayer, and the video to be classified is additionally superimposed on them, forming a residual connection structure. With this structure, the newly added parameters do not disturb the parameters of the original pre-trained image network during training, which helps preserve the benefit of image-network pre-training; introducing the residual also helps accelerate convergence and improves the training efficiency of the video classification model.
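As an illustration of such a residual connection (a sketch under assumptions; the 1x1x1 projection used to match channel counts is our addition for the example), the block input is added to the fused output:

```python
import torch
import torch.nn as nn

class ResidualSmallBig(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, t=3):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, (1, k, k), padding=(0, k // 2, k // 2), bias=False)
        self.pool = nn.MaxPool3d((t, k, k), stride=1, padding=(t // 2, k // 2, k // 2))
        # 1x1x1 projection (assumed) so the input can be added when channel counts differ
        self.proj = nn.Identity() if in_ch == out_ch else nn.Conv3d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        fused = self.conv(x) + self.conv(self.pool(x))   # dual-branch fusion as before
        return fused + self.proj(x)                      # residual connection with the input

x = torch.randn(1, 3, 8, 56, 56)
print(ResidualSmallBig(3, 64)(x).shape)                  # torch.Size([1, 64, 8, 56, 56])
```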
As shown in fig. 6, the feature extraction sublayer of the first stage uses a first convolution kernel for convolution and a first pooling kernel for pooling, the second stage uses a second convolution kernel and a second pooling kernel, and the third stage uses a third convolution kernel and a third pooling kernel.
In a possible implementation, the first convolution kernel and the third convolution kernel employed by the two-dimensional convolution are smaller than the size of the second convolution kernel employed by the second convolution operation. In one implementation, as shown in fig. 6, the first convolution kernel and the third convolution kernel are 1×1 in size, and the second convolution kernel is 1×3×3 in size. The fusion of the channels and the space-time information can be completed through the first convolution kernel and the third convolution kernel. Through the second convolution kernel, extraction of spatio-temporal features may be performed.
In a possible implementation, the first and second pooling kernels may be smaller than the third pooling kernel employed by the third pooling operation. In one implementation, as shown in fig. 6, the first and second pooling kernels are 3×3×3 in size, and the third pooling kernel is 3×3×T in size, where T may be the video duration, or the number of video frames corresponding to the video duration. When T is the video duration, T is a length of time; when T is the number of video frames corresponding to the video duration, T is a frame count. With the first and second pooling kernels, the pooled value, for example the maximum, over the 9 pixel points of the spatial window within three adjacent frames can be captured. With the third pooling kernel, temporal features spanning the video frames of the whole video length can be extracted. By gradually enlarging the temporal receptive field in the time dimension and combining it with the spatial features learned by convolution, the output fused feature information has a global temporal receptive field. Moreover, SmallBig unit 1 and SmallBig unit 3 add two spatially local receptive fields, so the spatial receptive field of the overall module is also increased.
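One possible instantiation of these kernel sizes is sketched below (assumptions: the channel counts, an odd clip length T = 9 so that symmetric same-size padding stays simple in this sketch, and PyTorch's (time, height, width) kernel ordering):

```python
import torch
import torch.nn as nn

T = 9                                                    # assumed (odd) number of frames in the clip
conv_1 = nn.Conv3d(64, 64, kernel_size=1)                # first unit: 1x1 convolution
conv_2 = nn.Conv3d(64, 64, (1, 3, 3), padding=(0, 1, 1)) # second unit: 1x3x3 convolution
conv_3 = nn.Conv3d(64, 64, kernel_size=1)                # third unit: 1x1 convolution
pool_12 = nn.MaxPool3d((3, 3, 3), stride=1, padding=1)   # units 1 and 2: three adjacent frames
pool_3 = nn.MaxPool3d((T, 3, 3), stride=1,
                      padding=(T // 2, 1, 1))            # unit 3: pooling over the whole clip length

x = torch.randn(1, 64, T, 28, 28)
for m in (conv_1, conv_2, conv_3, pool_12, pool_3):
    assert m(x).shape == x.shape                         # every stage preserves the feature size
print("all kernel configurations keep the (C, T, H, W) size")
```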
In practical application, the video classification system of the application can be trained with optimization algorithms such as stochastic gradient descent (SGD), and mainstream video task datasets can be used. Experimental results of training on such datasets show that, with this network structure, the video classification method provided by the application offers higher accuracy, faster convergence, and better robustness. Compared with the current state-of-the-art networks, its video classification using only 8 input frames outperforms a non-local-R50 network using 32 frames, and it matches the accuracy of a 128-frame non-local-R50 while using 4.9 times fewer giga floating-point operations (GFlops). In addition, at the same GFlops, the video classification method of the application with 8 input frames outperforms the current state-of-the-art SlowFast-R50 network with 36 input frames. These results indicate that the video classification model used for video classification in the present application is an accurate and efficient video classification model.
In addition, the application also provides a video classification model training method, which comprises: acquiring sample videos in a sample video set and the sample classification result of each sample video, wherein a sample video comprises a plurality of video frames; extracting spatial feature information from the sample video through two-dimensional convolution; extracting time feature information from the sample video through pooling; fusing the spatial feature information and the time feature information to obtain fused feature information, and performing full connection processing on the fused feature information to obtain a model classification result; and correcting the parameters of the two-dimensional convolution according to the model classification result and the sample classification result, and returning to the step of extracting spatial feature information from the sample video through two-dimensional convolution, until the model classification result and the sample classification result meet a preset condition, thereby obtaining the trained video classification model.
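A minimal training-loop sketch matching this procedure (everything here is a placeholder assumption: the tiny stand-in model, the random data, the five classes, and the SGD hyper-parameters):

```python
import torch
import torch.nn as nn

model = nn.Sequential(                           # stand-in for the video classification model
    nn.Conv3d(3, 16, (1, 3, 3), padding=(0, 1, 1)),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(16, 5),                            # 5 assumed sample categories
)
criterion = nn.CrossEntropyLoss()                # compares model and sample classification results
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for step in range(3):                            # placeholder loop over sample videos
    clips = torch.randn(4, 3, 8, 56, 56)         # (batch, C, T, H, W) sample videos
    labels = torch.randint(0, 5, (4,))           # sample classification results
    loss = criterion(model(clips), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(float(loss))
```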
The structure of the video classification model is the same as that of the neural network model adopted by the video classification method described above, and the description will not be repeated here.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Fig. 7 is a schematic diagram of a video classification apparatus according to an embodiment of the present application, where the video classification apparatus includes:
a video to be classified acquisition unit 701, configured to acquire a video to be classified, where the video to be classified includes a plurality of video frames;
The classification unit 702 is configured to input the video to be classified into a trained video classification model for processing and to output a classification result of the video to be classified. The video classification model comprises a feature extraction layer and a full connection layer: the feature extraction layer is used for extracting spatial feature information of the plurality of video frames through two-dimensional convolution, extracting time feature information of the plurality of video frames through pooling, and fusing the spatial feature information and the time feature information to output fused feature information; the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result.
The video classification apparatus shown in fig. 7 corresponds to the video classification method shown in fig. 3. By the video classification apparatus, the video classification method described in any of the above embodiments can be performed.
Fig. 8 is a schematic diagram of a video classification device according to an embodiment of the present application. As shown in fig. 8, the video classification device 8 of this embodiment includes: a processor 80, a memory 81, and a computer program 82, such as a video classification program, stored in the memory 81 and executable on the processor 80. The processor 80, when executing the computer program 82, implements the steps of the video classification method embodiments described above; alternatively, when executing the computer program 82, the processor 80 implements the functions of the modules/units of the apparatus embodiments described above.
By way of example, the computer program 82 may be partitioned into one or more modules/units that are stored in the memory 81 and executed by the processor 80 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions for describing the execution of the computer program 82 in the video classification device 8.
The video classification device 8 may be a computing device such as a desktop computer, a notebook computer, a palm computer, and a cloud server. The video classification device may include, but is not limited to, a processor 80, a memory 81. It will be appreciated by those skilled in the art that fig. 8 is merely an example of video classification device 8 and is not intended to be limiting of video classification device 8, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the video classification device may also include input and output devices, network access devices, buses, etc.
The processor 80 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the video classification device 8, such as a hard disk or memory of the video classification device 8. The memory 81 may also be an external storage device of the video classification device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card provided on the video classification device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the video classification device 8. The memory 81 is used to store the computer program and other programs and data required by the video classification device, and may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program. The computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (11)

1. A method of video classification, the method comprising:
Acquiring a video to be classified, wherein the video to be classified comprises a plurality of video frames;
Inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified; the video classification model comprises a feature extraction layer and a full connection layer, wherein the feature extraction layer is used for extracting spatial feature information of the plurality of video frames through two-dimensional convolution, extracting temporal feature information of the plurality of video frames through pooling, and fusing the spatial feature information and the temporal feature information to output fused feature information, and the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result;
The feature extraction layer comprises N feature extraction sublayers, where N is greater than or equal to 1; the input information of the first feature extraction sublayer among the N feature extraction sublayers is the plurality of video frames, the output information of each preceding feature extraction sublayer is the input information of the next feature extraction sublayer, and the output information of the Nth feature extraction sublayer is the fused feature information output by the feature extraction layer; each of the N feature extraction sublayers includes a large receptive field context feature extraction branch and a small receptive field core feature extraction branch, and the processing of the input information by each of the N feature extraction sublayers includes:
Performing pooling processing on the input information through the large receptive field context feature extraction branch to extract the temporal feature information of the input information;
Performing two-dimensional convolution processing on the input information through the small receptive field core feature extraction branch to extract the spatial feature information of the input information;
And fusing the temporal feature information extracted by the large receptive field context feature extraction branch and the spatial feature information extracted by the small receptive field core feature extraction branch to obtain the output information.
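To make the structure recited in claim 1 concrete, the following is a minimal PyTorch sketch of a single feature extraction sublayer with its two branches. It is an illustration only: the channel sizes, the use of average pooling, the 3×3 convolution kernels, and fusion by element-wise addition are assumptions, not details fixed by the claim.

```python
import torch
import torch.nn as nn

class FeatureExtractionSublayer(nn.Module):
    """One feature extraction sublayer: a large receptive field context branch
    (3D pooling followed by 2D convolution) and a small receptive field core
    branch (2D convolution only), fused by element-wise addition."""

    def __init__(self, in_channels, out_channels, pool_kernel=(3, 3, 3)):
        super().__init__()
        # Context branch: three-dimensional pooling with kernel {t, K, K}, then 2D convolution.
        self.context_pool = nn.AvgPool3d(kernel_size=pool_kernel, stride=1,
                                         padding=tuple(k // 2 for k in pool_kernel))
        self.context_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Core branch: plain 2D convolution on each frame.
        self.core_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def _conv2d_per_frame(self, conv, x):
        # x: (N, C, T, H, W) -> apply a 2D convolution to every frame independently.
        n, c, t, h, w = x.shape
        y = conv(x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w))
        return y.reshape(n, t, *y.shape[1:]).permute(0, 2, 1, 3, 4)

    def forward(self, x):
        # Temporal (context) feature information: pooling, then 2D convolution.
        temporal = self._conv2d_per_frame(self.context_conv, self.context_pool(x))
        # Spatial (core) feature information: 2D convolution only.
        spatial = self._conv2d_per_frame(self.core_conv, x)
        # Fuse by superposition (here read as addition) to obtain the output information.
        return temporal + spatial
```

With an input of shape (N, C, T, H, W) and odd pooling kernel sizes, the output keeps the temporal and spatial resolution, so sublayers can be chained exactly as claim 1 describes.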
2. The method of claim 1, wherein performing pooling processing on the input information through the large receptive field context feature extraction branch to extract the temporal feature information of the input information comprises:
performing three-dimensional pooling processing on the input information through the large receptive field context feature extraction branch to obtain pooled information;
And performing two-dimensional convolution processing on the pooled information through the large receptive field context feature extraction branch to obtain the temporal feature information.
3. The method of claim 2, wherein performing three-dimensional pooling processing on the input information through the large receptive field context feature extraction branch to obtain the pooled information comprises:
And performing pooling processing on the input information through a three-dimensional pooling kernel {t, K, K} in the large receptive field context feature extraction branch to obtain the pooled information, wherein t is the size of the kernel in the time direction, t is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel is the size of the pixel region selected in a single pooling calculation.
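A minimal sketch of the {t, K, K} pooling in claim 3, using `torch.nn.functional.avg_pool3d`; the tensor shape and the values t = 3, K = 3 are assumed example values, not values required by the claim.

```python
import torch
import torch.nn.functional as F

# 8 frames of 56x56 feature maps with 64 channels (example values only).
frames = torch.randn(1, 64, 8, 56, 56)            # (batch, channels, T, H, W)

# t = 3 pools over three adjacent frames (t <= video length in frames);
# K = 3 pools over a 3x3 neighbourhood in the plane of each frame.
pooled = F.avg_pool3d(frames, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))

print(pooled.shape)                               # torch.Size([1, 64, 8, 56, 56])
```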
4. The method according to claim 3, wherein, among the N three-dimensional pooling kernels included in the feature extraction layer, the sizes of the N three-dimensional pooling kernels are all identical, or all different, or partially identical, the three-dimensional pooling kernel being the size of the pixel region selected in a single pooling calculation.
5. The method of claim 3 or 4, wherein the sizes of the N three-dimensional pooling kernels being not completely identical comprises:
gradually increasing the size of the three-dimensional pooling kernel along the order in which the feature information is extracted.
6. The method of claim 5, wherein gradually increasing the size of the three-dimensional pooling kernel comprises:
gradually increasing the size of the three-dimensional pooling kernel in the time direction;
or gradually increasing the size of the three-dimensional pooling kernel in the two-dimensional space where the video frame is located;
or gradually increasing both the size of the three-dimensional pooling kernel in the time direction and its size in the two-dimensional space where the video frame is located.
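The three growth options in claim 6 can be written down as pooling-kernel schedules for the N sublayers; a sketch with assumed concrete sizes (the claims do not fix the numbers):

```python
# Each tuple is one sublayer's {t, K, K} pooling kernel, ordered by the sequence
# in which feature information is extracted (first sublayer first).
grow_time_only      = [(3, 3, 3), (5, 3, 3), (7, 3, 3)]   # enlarge only the time direction t
grow_space_only     = [(3, 3, 3), (3, 5, 5), (3, 7, 7)]   # enlarge only the spatial size K
grow_time_and_space = [(3, 3, 3), (5, 5, 5), (7, 7, 7)]   # enlarge both together
```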
7. The method of claim 2, wherein the convolution parameters of the two-dimensional convolution process in the large receptive field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution process in the small receptive field core feature extraction branch.
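One way to realise claim 7 is to let both branches call the same two-dimensional convolution module, so their convolution parameters are identical by construction. The sketch below assumes this shared-module reading as well as the tensor shapes and the per-frame helper; none of these details are fixed by the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)   # one set of parameters for both branches

def conv2d_per_frame(conv, x):
    # x: (N, C, T, H, W); apply the 2D convolution to every frame independently.
    n, c, t, h, w = x.shape
    y = conv(x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w))
    return y.reshape(n, t, *y.shape[1:]).permute(0, 2, 1, 3, 4)

x = torch.randn(1, 64, 8, 56, 56)
pooled = F.avg_pool3d(x, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))
temporal = conv2d_per_frame(shared_conv, pooled)   # context branch: pooling, then the shared conv
spatial = conv2d_per_frame(shared_conv, x)         # core branch: the shared conv only
```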
8. The method of claim 1, wherein fusing the spatial feature information and the temporal feature information to output fused feature information comprises:
and superposing the image of the spatial feature information and the image of the temporal feature information to generate the fused feature information.
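Claim 8's "superposing" is read here as element-wise addition of the two feature maps; concatenation along the channel dimension would be another possible reading, so treat the snippet below as an assumption.

```python
import torch

temporal = torch.randn(1, 64, 8, 56, 56)   # temporal feature information (example shape)
spatial = torch.randn(1, 64, 8, 56, 56)    # spatial feature information (same shape)
fused = temporal + spatial                  # superposition -> fused feature information
```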
9. A video classification device, the device comprising:
The video acquisition unit is used for acquiring a video to be classified, wherein the video to be classified comprises a plurality of video frames;
The classification unit is used for inputting the video to be classified acquired by the video acquisition unit into a trained video classification model for processing and outputting a classification result of the video to be classified; the video classification model comprises a feature extraction layer and a full connection layer, wherein the feature extraction layer is used for extracting spatial feature information of the plurality of video frames through two-dimensional convolution, extracting temporal feature information of the plurality of video frames through pooling, and fusing the spatial feature information and the temporal feature information to output fused feature information, and the full connection layer is used for performing full connection processing on the fused feature information output by the feature extraction layer to obtain the classification result; the feature extraction layer comprises N feature extraction sublayers, where N is greater than or equal to 1; the input information of the first feature extraction sublayer among the N feature extraction sublayers is the plurality of video frames, the output information of each preceding feature extraction sublayer is the input information of the next feature extraction sublayer, and the output information of the Nth feature extraction sublayer is the fused feature information output by the feature extraction layer; each of the N feature extraction sublayers includes a large receptive field context feature extraction branch and a small receptive field core feature extraction branch, and the processing of the input information by each of the N feature extraction sublayers includes: performing pooling processing on the input information through the large receptive field context feature extraction branch to extract temporal feature information of the input information; performing two-dimensional convolution processing on the input information through the small receptive field core feature extraction branch to extract spatial feature information of the input information; and fusing the temporal feature information extracted by the large receptive field context feature extraction branch and the spatial feature information extracted by the small receptive field core feature extraction branch to obtain the output information.
10. A video classification device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, causes the video classification device to carry out the steps of the method according to any one of claims 1 to 8.
11. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 8.
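Putting the pieces together, here is a hedged end-to-end sketch of the model in claims 1, 9 and 10: N chained feature extraction sublayers followed by the full connection layer. It reuses the FeatureExtractionSublayer class from the sketch after claim 1; the channel widths, the pooling-kernel schedule, the global average pooling before the classifier, and the number of classes are all assumptions.

```python
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # N = 3 feature extraction sublayers with a gradually growing pooling kernel.
        self.features = nn.Sequential(
            FeatureExtractionSublayer(3, 64, pool_kernel=(3, 3, 3)),
            FeatureExtractionSublayer(64, 128, pool_kernel=(5, 3, 3)),
            FeatureExtractionSublayer(128, 256, pool_kernel=(5, 5, 5)),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)       # collapse (T, H, W) before the classifier
        self.fc = nn.Linear(256, num_classes)     # the full connection layer

    def forward(self, frames):                    # frames: (N, 3, T, H, W)
        fused = self.features(frames)             # fused feature information
        return self.fc(self.pool(fused).flatten(1))

model = VideoClassifier(num_classes=10)
logits = model(torch.randn(2, 3, 16, 112, 112))   # two clips of 16 frames each
print(logits.shape)                               # torch.Size([2, 10])
```

The classification result for each clip is then the arg-max over the output scores.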
CN202010531316.9A 2020-06-11 2020-06-11 Video classification method, apparatus, device and computer readable storage medium Active CN111859023B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010531316.9A CN111859023B (en) 2020-06-11 2020-06-11 Video classification method, apparatus, device and computer readable storage medium
PCT/CN2020/134995 WO2021248859A1 (en) 2020-06-11 2020-12-09 Video classification method and apparatus, and device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010531316.9A CN111859023B (en) 2020-06-11 2020-06-11 Video classification method, apparatus, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111859023A CN111859023A (en) 2020-10-30
CN111859023B true CN111859023B (en) 2024-05-03

Family

ID=72986143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010531316.9A Active CN111859023B (en) 2020-06-11 2020-06-11 Video classification method, apparatus, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111859023B (en)
WO (1) WO2021248859A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859023B (en) * 2020-06-11 2024-05-03 中国科学院深圳先进技术研究院 Video classification method, apparatus, device and computer readable storage medium
CN112580696A (en) * 2020-12-03 2021-03-30 星宏传媒有限公司 Advertisement label classification method, system and equipment based on video understanding
CN112597824A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN112926472A (en) * 2021-03-05 2021-06-08 深圳先进技术研究院 Video classification method, device and equipment
CN113536898B (en) * 2021-05-31 2023-08-29 大连民族大学 Comprehensive feature capturing type time convolution network, video motion segmentation method, computer system and medium
CN114677704B (en) * 2022-02-23 2024-03-26 西北大学 Behavior recognition method based on three-dimensional convolution and space-time feature multi-level fusion
CN115130539A (en) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 Classification model training method, data classification device and computer equipment
CN115223079A (en) * 2022-06-30 2022-10-21 海信集团控股股份有限公司 Video classification method and device
CN116824641B (en) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292247A (en) * 2017-06-05 2017-10-24 浙江理工大学 A kind of Human bodys' response method and device based on residual error network
CN107463949A (en) * 2017-07-14 2017-12-12 北京协同创新研究院 A kind of processing method and processing device of video actions classification
CN108304926A (en) * 2018-01-08 2018-07-20 中国科学院计算技术研究所 A kind of pond computing device and method suitable for neural network
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN110032926A (en) * 2019-02-22 2019-07-19 哈尔滨工业大学(深圳) A kind of video classification methods and equipment based on deep learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
US10706148B2 (en) * 2017-12-18 2020-07-07 Paypal, Inc. Spatial and temporal convolution networks for system calls based process monitoring
CN109670446B (en) * 2018-12-20 2022-09-13 泉州装备制造研究所 Abnormal behavior detection method based on linear dynamic system and deep network
CN110781830B (en) * 2019-10-28 2023-03-10 西安电子科技大学 SAR sequence image classification method based on space-time joint convolution
CN110766096B (en) * 2019-10-31 2022-09-23 北京金山云网络技术有限公司 Video classification method and device and electronic equipment
CN111859023B (en) * 2020-06-11 2024-05-03 中国科学院深圳先进技术研究院 Video classification method, apparatus, device and computer readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292247A (en) * 2017-06-05 2017-10-24 浙江理工大学 A kind of Human bodys' response method and device based on residual error network
CN107463949A (en) * 2017-07-14 2017-12-12 北京协同创新研究院 A kind of processing method and processing device of video actions classification
CN108304926A (en) * 2018-01-08 2018-07-20 中国科学院计算技术研究所 A kind of pond computing device and method suitable for neural network
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN110032926A (en) * 2019-02-22 2019-07-19 哈尔滨工业大学(深圳) A kind of video classification methods and equipment based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Closer Look at Spatiotemporal Convolutions for Action Recognition; Du Tran et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; full text *
Research on task-relevant few-shot deep learning methods for image classification; Chen Chen et al.; Journal of Integration Technology; Vol. 9, No. 3; full text *

Also Published As

Publication number Publication date
WO2021248859A9 (en) 2022-02-10
CN111859023A (en) 2020-10-30
WO2021248859A1 (en) 2021-12-16

Similar Documents

Publication Publication Date Title
CN111859023B (en) Video classification method, apparatus, device and computer readable storage medium
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
CN109816011A (en) Generate the method and video key frame extracting method of portrait parted pattern
CN111539290B (en) Video motion recognition method and device, electronic equipment and storage medium
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
CN109614933B (en) Motion segmentation method based on deterministic fitting
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
Ma et al. Ppt: token-pruned pose transformer for monocular and multi-view human pose estimation
CN112257526B (en) Action recognition method based on feature interactive learning and terminal equipment
CN114663593B (en) Three-dimensional human body posture estimation method, device, equipment and storage medium
EP3757874A1 (en) Action recognition method and apparatus
TWI761813B (en) Video analysis method and related model training methods, electronic device and storage medium thereof
CN112070044A (en) Video object classification method and device
CN113065645A (en) Twin attention network, image processing method and device
CN112597824A (en) Behavior recognition method and device, electronic equipment and storage medium
CN114519877A (en) Face recognition method, face recognition device, computer equipment and storage medium
CN108702463A (en) A kind of image processing method, device and terminal
CN112950640A (en) Video portrait segmentation method and device, electronic equipment and storage medium
CN113869282A (en) Face recognition method, hyper-resolution model training method and related equipment
CN111177460B (en) Method and device for extracting key frame
CN111242068B (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN110633630B (en) Behavior identification method and device and terminal equipment
CN108460768B (en) Video attention object segmentation method and device for hierarchical time domain segmentation
CN112183359B (en) Method, device and equipment for detecting violent content in video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant