WO2021248859A1 - Video classification method and apparatus, and device, and computer readable storage medium - Google Patents

Video classification method and apparatus, and device, and computer readable storage medium

Info

Publication number
WO2021248859A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
pooling
information
feature extraction
feature information
Prior art date
Application number
PCT/CN2020/134995
Other languages
French (fr)
Chinese (zh)
Other versions
WO2021248859A9 (en)
Inventor
乔宇
王亚立
李先航
周志鹏
邹静
Original Assignee
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Publication of WO2021248859A1 publication Critical patent/WO2021248859A1/en
Publication of WO2021248859A9 publication Critical patent/WO2021248859A9/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application belongs to the field of image processing, and particularly relates to video classification methods, devices, equipment, and computer-readable storage media.
  • The embodiments of the present application provide a video classification method, apparatus, device, and computer-readable storage medium, to solve the problem that performing video classification with a conventional three-dimensional convolution kernel adds extra parameters compared with two-dimensional convolution, which increases the amount of calculation.
  • A first aspect of the embodiments of the present application provides a video classification method, the method including:
  • acquiring a video to be classified, where the video to be classified includes a plurality of video frames; and inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified;
  • wherein the video classification model includes a feature extraction layer and a fully connected layer; the feature extraction layer is used to extract spatial feature information of the multiple video frames through two-dimensional convolution, extract temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information; and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • The feature extraction layer includes N feature extraction sublayers, N ≥ 1; the input information of the first of the N feature extraction sublayers is the multiple video frames, the output information of the previous feature extraction sublayer is the input information of the next feature extraction sublayer, and the output information of the Nth feature extraction sublayer is the fused feature information output by the feature extraction layer.
  • Each of the N feature extraction sublayers includes a large receptive field context feature extraction branch and a small receptive field core feature extraction branch, and the processing of the input information by each of the N feature extraction sublayers includes:
  • pooling the input information through the large receptive field context feature extraction branch to extract temporal feature information of the input information; performing two-dimensional convolution on the input information through the small receptive field core feature extraction branch to extract spatial feature information of the input information; and fusing the temporal feature information extracted by the large receptive field context feature extraction branch and the spatial feature information extracted by the small receptive field core feature extraction branch to obtain the output information.
  • Pooling the input information through the large receptive field context feature extraction branch to extract the temporal feature information of the input information includes: performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain pooled information, and performing two-dimensional convolution on the pooled information through the large receptive field context feature extraction branch to obtain the temporal feature information.
  • Performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain the pooled information includes: pooling the input information with the three-dimensional pooling kernel {t, K, K} in the large receptive field context feature extraction branch to obtain the pooled information, where t is the size of the kernel in the time direction and is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel defines the size of the pooling pixels selected in a single pooling calculation.
  • Among the N three-dimensional pooling kernels included in the feature extraction layer, the N three-dimensional pooling kernels may all have the same size, or may all have different sizes, or some of the N three-dimensional pooling kernels may have the same size; the three-dimensional pooling kernel defines the size of the pooling pixels selected in a single pooling calculation.
  • The N three-dimensional pooling kernels having different sizes may include: gradually increasing the size of the three-dimensional pooling kernel as feature information is extracted.
  • Gradually increasing the size of the three-dimensional pooling kernel includes: gradually increasing the size of the three-dimensional pooling kernel in the time direction; or gradually increasing its size in the dimensions of the two-dimensional space where the video frame is located; or gradually increasing both.
  • The convolution parameters of the two-dimensional convolution in the large receptive field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution in the small receptive field core feature extraction branch.
  • The size of the image corresponding to the input information of the pooling process is consistent with the size of the image corresponding to the output information of the pooling process.
  • The input feature image or video frame is padded in the time dimension or the space dimension, so that after the pooling kernel pools the padded input information, the size of the image corresponding to the obtained output information is consistent with the size of the image corresponding to the original input information.
  • fusing the spatial feature information and the temporal feature information to output fused feature information includes:
  • the image of the spatial feature information is superimposed on the image of the temporal feature information to generate the fusion feature information.
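  • A minimal sketch of one such feature extraction sublayer is given below. It is an illustration only, not part of the original disclosure: it assumes a PyTorch-style implementation, max pooling for the large receptive field branch, and a single shared {C1, C2, 1, K, K} convolution for both branches; all channel numbers and sizes are example values.

```python
import torch
import torch.nn as nn

class SmallBigUnit(nn.Module):
    """Illustrative sketch of a feature extraction sublayer (assumptions noted above).

    Input/output layout: (batch, channels, time, height, width).
    """
    def __init__(self, in_ch, out_ch, pool_t=3, pool_k=3, conv_k=3):
        super().__init__()
        # Small receptive field core branch: 2D convolution only, kernel {C1, C2, 1, K, K}.
        self.conv2d = nn.Conv3d(in_ch, out_ch,
                                kernel_size=(1, conv_k, conv_k),
                                padding=(0, conv_k // 2, conv_k // 2),
                                bias=False)
        # Large receptive field context branch: 3D pooling with kernel {t, K, K},
        # stride 1 and padding chosen so the output keeps the input size.
        self.pool3d = nn.MaxPool3d(kernel_size=(pool_t, pool_k, pool_k),
                                   stride=1,
                                   padding=(pool_t // 2, pool_k // 2, pool_k // 2))

    def forward(self, x):
        spatial = self.conv2d(x)                 # spatial feature information
        temporal = self.conv2d(self.pool3d(x))   # pooling, then the shared 2D convolution
        return spatial + temporal                # point-wise fusion of the two branches

unit = SmallBigUnit(3, 64)
out = unit(torch.randn(1, 3, 8, 56, 56))         # -> torch.Size([1, 64, 8, 56, 56])
```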
  • an embodiment of the present application provides a video classification device, the device including:
  • a to-be-classified video obtaining unit configured to obtain a to-be-classified video, where the to-be-classified video includes a plurality of video frames;
  • a classification unit, configured to input the video to be classified into a trained video classification model for processing and output the classification result of the video to be classified; wherein the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract the spatial feature information of the multiple video frames through two-dimensional convolution, extract the temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • A third aspect of the embodiments of the present application provides a video classification device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the video classification device is caused to implement the video classification method according to any one of the first aspect.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the video classification method according to any one of the first aspect is implemented.
  • Compared with the prior art, this application uses the classification model to extract the spatial feature information of multiple video frames in the video to be classified through two-dimensional convolution, extracts the temporal feature information of the multiple video frames through pooling, fuses the temporal feature information and the spatial feature information, and obtains the classification result through the fully connected layer.
  • Since the temporal feature information of the video to be classified can be obtained through pooling, compared with three-dimensional convolution, the two-dimensional convolution adopted in this application retains the temporal feature information while greatly reducing the computation of convolution parameters, which helps reduce the amount of calculation for video classification.
  • In addition, the method can be inserted into any two-dimensional convolutional network to classify videos, which helps improve the diversity and versatility of video classification methods.
  • FIG. 1 is a schematic diagram of a video classification application scenario provided by an embodiment of the present application
  • Fig. 2 is a schematic diagram of video classification using three-dimensional convolution in the prior art
  • FIG. 3 is a schematic diagram of the implementation process of a video classification method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the implementation of a video classification method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the implementation of a video classification provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another implementation of video classification provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • Fig. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • video classification technology is used to classify the collected surveillance video to determine whether there is an abnormality in the video content.
  • the video classification method described in this application is not sensitive to the speed of the action frame change, and can effectively model the actions of different durations.
  • The surveillance video can be classified through this modeling, which can help the user quickly find key surveillance information, or send abnormal reminders to the monitoring personnel in time, so that the monitoring personnel can deal with abnormalities in the surveillance video promptly.
  • the video classification technology can be used to classify a large number of videos into different scenes, different moods, and different types of videos, so that users can quickly find the videos they need.
  • In scenarios such as smart sports training or video-assisted refereeing, the videos include faster-moving sports, such as shooting, gymnastics, or speed skating, and slower-moving sports, such as yoga.
  • The video classification method described in this application is not sensitive to the speed or duration of the motion, so the actions in such sports videos can be classified.
  • the platform server receives the self-photographed video uploaded by terminal A, and classifies the uploaded video to obtain the category of the video uploaded by terminal A.
  • As the number of uploaded videos increases, the number of videos in the same category also increases.
  • When another terminal, such as terminal B, browses a video, the category of the video browsed by terminal B is obtained from the pre-classification result.
  • the platform can search for other videos in the same category and recommend them to terminal B according to the category of the video browsed by terminal B, so as to improve the user experience of browsing the video.
  • a three-dimensional convolution kernel including time information is selected, such as a 3*1*1 temporal convolution kernel, and the video to be classified is convolved.
  • The three-dimensional convolution kernel spans the width W and height H of the image as well as the time T.
  • the three-dimensional convolution kernel increases the parameter calculation of the time dimension, adds a large number of parameters, and increases the amount of calculation for video classification.
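  • To make that overhead concrete, the following sketch (a hypothetical illustration with example channel numbers, not taken from this application) compares the weight count of a temporal 3D convolution kernel with that of a purely spatial 2D kernel; pooling, by contrast, adds no learnable weights at all.

```python
# Hypothetical illustration of the parameter overhead of 3D vs. 2D convolution.
c_in, c_out = 64, 64          # input / output channels (example values)
k_spatial = 3                 # spatial kernel size K
k_temporal = 3                # temporal kernel size t

params_2d = c_in * c_out * 1 * k_spatial * k_spatial           # kernel {C1, C2, 1, K, K}
params_3d = c_in * c_out * k_temporal * k_spatial * k_spatial  # kernel {C1, C2, t, K, K}

print(params_2d)              # 36864
print(params_3d)              # 110592, i.e. t times more weights than the 2D kernel
```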
  • the video classification method includes:
  • In step S301, a video to be classified is obtained, and the video to be classified includes multiple video frames.
  • the video to be classified in the embodiment of the present application may be a video stored in a user terminal, a video collected by a monitoring device, or a video uploaded by a platform user received by a video entertainment platform.
  • the video is a video collected by a monitoring device
  • The video collected in real time can be divided into several sub video segments according to a preset time period, and the collected sub video segments can be classified to determine whether there is an abnormality in the sub video segments.
  • the video to be classified includes multiple video frames, and the multiple video frames are sequentially arranged in a time sequence. According to the video to be classified, the spatial information of the width W and the height H of each video frame can be determined. According to the time interval between video frames and the initial playback time, the playback time corresponding to each video frame can be determined.
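  • As a minimal sketch of how such a clip might be represented for the model (the tensor layout, frame count, and frame rate below are assumptions for illustration, not requirements of this application):

```python
import torch

# A clip of T video frames, each an H x W RGB image, arranged in time order.
T, C, H, W = 8, 3, 224, 224                     # example values only
clip = torch.randn(1, C, T, H, W)               # (batch, channels, time, height, width)

# Given a frame interval and an initial playback time, the timestamp of frame i is:
start_time, frame_interval = 0.0, 1.0 / 25      # e.g. 25 fps, assumed for illustration
timestamps = [start_time + i * frame_interval for i in range(T)]
```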
  • In step S302, the video to be classified is input into a trained video classification model for processing, and the classification result of the video to be classified is output; wherein the video classification model includes a feature extraction layer and a fully connected layer.
  • The feature extraction layer is used to extract the spatial feature information of the multiple video frames through two-dimensional convolution, extract the temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • the feature extraction layer may include a large receptive field context feature extraction branch and a small receptive field core feature extraction branch.
  • The large receptive field context feature extraction branch is used to extract temporal feature information, or more generally spatiotemporal information that includes the temporal feature information; the context feature referred to here is this temporal feature information.
  • the large receptive field can be obtained by cascading multiple feature extraction sub-layers, or by gradually increasing the size of the three-dimensional pooling core.
  • the small receptive field core feature extraction branch is used to extract the spatial feature information of the two-dimensional plane in each video frame in the video to be classified.
  • the feature extraction layer is also used to fuse the extracted temporal feature information and spatial feature information to obtain fused feature information. That is, through the dual-branch structure, the context information extracted by the context extraction branch of the large receptive field and the core features extracted by the core feature extraction branch of the small receptive field can be effectively obtained.
  • the feature extraction layer may include N feature extraction sublayers, where N is greater than or equal to 1.
  • the feature extraction layer may include one feature extraction sub-layer, and the fused feature information is output through one feature extraction sub-layer, and the fused feature information is fully connected through the fully connected layer to obtain the classification result.
  • The output information of the feature extraction sublayer of the previous level is used as the input information of the feature extraction sublayer of the next level.
  • the fusion feature information output by the i-th feature extraction sub-layer is used as the input information of the i+1-th feature extraction sub-layer.
  • the fused feature information output by the i-th feature extraction sublayer is fused with time feature information and spatial feature information, and the i+1th feature extraction sublayer can further extract feature information through pooling.
  • i is greater than or equal to 1 and less than N.
  • the fusion feature information refers to the feature information after the time feature information and the space feature information are fused.
  • The fusion processing may refer to the superposition of feature information. For example, the image corresponding to the temporal feature information and the image corresponding to the spatial feature information may be subjected to pixel-wise superposition.
  • The input information of the pooling process can be made consistent in size with the output information of the pooling process.
  • To achieve this, the input information can be padded, that is, the input feature image or video frame is padded in the time dimension and, where needed, also in the space dimension, so that when the pooling kernel pools the padded input information, the size of the output information obtained is consistent with the size of the unpadded input information.
  • For example, with stride s the output size of a pooling operation follows the usual relation out = floor((in + 2p - k)/s) + 1, where p is the per-side padding; for a pooling kernel of size k = 3 and stride 1, a total padding of 2 (p = 1 on each side) can be selected so that the output size equals the input size.
  • For example, a pooling kernel of size 3*3*3 means that the size of the pooling kernel in the two-dimensional plane where the image to be pooled is located is 3*3, where the unit can be pixels or another predetermined length unit, and its length in the time dimension is 3, where the unit may be video duration, for example a video duration of 3 seconds.
  • the number of video frames corresponding to the video duration can be determined by the video duration.
  • the definition of the three-dimensional pooling core may not be limited to this, and the size of the pooling core in the time dimension can also be determined directly by the number of video frames.
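  • A short check of this same-size pooling behavior is sketched below; the framework, sizes, and the max-pooling choice are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def pooled_size(n, k, p, s=1):
    # Standard pooling output-size relation: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# Kernel 3, stride 1, one pixel/frame of padding on each side (total padding of 2).
assert pooled_size(8, k=3, p=1) == 8        # time dimension unchanged
assert pooled_size(56, k=3, p=1) == 56      # spatial dimensions unchanged

pool = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))
x = torch.randn(1, 64, 8, 56, 56)
assert pool(x).shape == x.shape             # output size matches input size
```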
  • the two-dimensional convolution refers to the convolution performed on the dimensions of the plane where the image of the video frame is located, that is, the two dimensions of width and height.
  • The size of the selected convolution kernel is its size in the two-dimensional space.
  • the spatial feature information can be extracted based on a predetermined fixed-size convolution kernel.
  • Existing neural network models can also be used, such as LeNet-based, AlexNet-based, ResNet-based, GoogLeNet-based, and VGGNet-based convolutional neural networks, to extract the spatial feature information. Therefore, in the process of extracting spatial feature information, there is no need to change the ability of the convolutional neural network to recognize video frames in order to obtain the spatial feature information included in the video frames of the video to be classified.
  • Since any two-dimensional convolutional network can be inserted into the video classification method described in this application, the effect of a three-dimensional convolutional network on collecting temporal feature information is achieved without optimization for specific hardware or deep learning platforms and without relying on a specific network design, so the versatility of the video classification method described in this application can be effectively improved.
  • The input information can be three-dimensionally pooled through the large receptive field context feature extraction branch to obtain pooled information, and then two-dimensional convolution processing can be performed on the pooled information through the large receptive field context feature extraction branch to obtain the temporal feature information.
  • the two-dimensional convolution is based on the convolution operation on the two-dimensional plane where the image of a single video frame is located, without increasing the feature information of the two-dimensional image.
  • the spatial feature information of each frame of the video to be classified is obtained, that is, the feature information of the width W and height H dimensions of each video frame is obtained.
  • The convolution kernel of the two-dimensional convolution can be expressed as {C1, C2, 1, K, K}, where C1 represents the number of channels of the input feature image and C2 represents the number of channels of the output feature image.
  • The position where the "1" is located indicates the time dimension of the convolution kernel, and "1" means that the convolution kernel is not extended in the time dimension, that is, each two-dimensional convolution only convolves the image of a single video frame.
  • K represents the size of the convolution kernel in the two-dimensional space where the video frame is located.
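  • The {C1, C2, 1, K, K} kernel can be read as an ordinary 2D convolution applied to every frame independently. The sketch below (an illustration assuming PyTorch, not part of the original disclosure) verifies that a 3D convolution with a time extent of 1 matches a frame-by-frame 2D convolution with the same weights.

```python
import torch
import torch.nn as nn

C1, C2, K, T = 3, 8, 3, 5   # example channel counts, kernel size, and frame count
conv3d = nn.Conv3d(C1, C2, kernel_size=(1, K, K), padding=(0, K // 2, K // 2), bias=False)
conv2d = nn.Conv2d(C1, C2, kernel_size=K, padding=K // 2, bias=False)
conv2d.weight.data.copy_(conv3d.weight.data.squeeze(2))   # same weights: {C1, C2, 1, K, K} -> {C1, C2, K, K}

x = torch.randn(1, C1, T, 32, 32)
out3d = conv3d(x)
# Apply the 2D convolution to each of the T frames separately and restack along time.
out2d = torch.stack([conv2d(x[:, :, t]) for t in range(T)], dim=2)
print(torch.allclose(out3d, out2d, atol=1e-5))             # True: no temporal mixing
```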
  • the time feature information is extracted through pooling, and the pooling processing may include pooling processing methods such as maximum pooling, average pooling, or global average pooling. For example, when the maximum pooling operation is selected, the pixels to be pooled can be selected according to the pooling kernel, and the pixel with the largest pixel value can be selected as the pixel value after pooling.
  • The three-dimensional pooling kernel can be expressed as {t, K, K}, where t represents the size of the pooling kernel in the time direction, and K represents the size of the pooling kernel in the two-dimensional space where the image is located.
  • When t takes different values, the number of video frames covered by a single pooling operation is also different.
  • The same video frame can serve as an object pooled by different pooling kernels.
  • When the K value in the pooling kernel is greater than 1, the pooling kernel also pools multiple pixels or regions in the two-dimensional space.
  • a pooling operation with padding can be used to fill the edges of the pooled image to ensure the consistency of the image size of the input information and output information before and after pooling.
  • convolution processing is performed on the output information of the pooling process.
  • The pooled output information fuses spatiotemporal information of size t*K*K over adjacent time and space; convolution is then performed on the pooled output information through the two-dimensional convolution to obtain the temporal feature information of the multiple video frames.
  • the convolution operation of the small receptive field core feature extraction branch and the large receptive field context feature extraction branch may use the same convolution parameter to perform the convolution operation in a manner of sharing parameters.
  • The sizes of any two of the three-dimensional pooling kernels may be different, or the sizes of the N three-dimensional pooling kernels may all be the same, or the sizes of some of the three-dimensional pooling kernels may be the same while others differ.
  • The three-dimensional pooling kernels used in the three-dimensional pooling of the large receptive field context feature extraction branch may adopt different sizes in the time dimension or different sizes in the space dimensions.
  • Adjusting the size of the pooling kernel used in the three-dimensional pooling may include adjusting the size of the three-dimensional pooling kernel in the time dimension (the time direction), or its size in the dimensions of the two-dimensional space where the video frame is located, or its size in both the time and space dimensions, to obtain three-dimensional pooling kernels of different sizes.
  • The corresponding spatiotemporal feature information is calculated accordingly, and the spatiotemporal feature information includes the temporal feature information.
  • The size of the pooling kernel can be increased gradually, which may include gradually increasing the size of the pooling kernel in the time dimension, or gradually increasing its size in the two-dimensional space where the video frame is located, or increasing both at the same time, to obtain the pooled feature image; in this way, the temporal feature information of different durations obtained from different pooling kernels is gradually merged to obtain finer-grained spatiotemporal feature information.
  • Since the images corresponding to the spatial feature information and the temporal feature information use the same convolution parameters for the convolution operations, the information represented by corresponding points in the image of the spatial feature information and the image of the temporal feature information is spatially consistent; that is, the spatial feature information and the temporal feature information have the same size, and a strategy of point-by-point addition in space can be adopted to obtain the fused feature information.
  • The fused feature information is obtained by fusing the spatial feature information and the temporal feature information: the spatial feature information captures the spatial features of the video frames through two-dimensional convolution, and the temporal feature information captures the temporal and spatial features of the images through pooling, so the fused feature information includes both the spatial features and the spatiotemporal features of the images in the video to be classified. The fused feature information is synthesized through a fully connected layer, and the video to be classified is classified according to the synthesized fused feature information to obtain a video classification result. For example, a fully connected calculation may be performed on the fused feature information according to preset weight coefficients of the fully connected layer, and the video classification result may be determined by comparing the calculation result with a preset classification standard.
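  • A minimal sketch of how the fused feature information might be turned into a classification result by the fully connected layer is given below; pooling the features to a vector before the linear layer, and the class count, are implementation assumptions for illustration rather than requirements stated here.

```python
import torch
import torch.nn as nn

num_classes = 10                                   # example value
fused = torch.randn(1, 64, 8, 56, 56)              # fused feature information (B, C, T, H, W)

# Collapse time and space, then classify with a fully connected layer.
head = nn.Sequential(nn.AdaptiveAvgPool3d(1),      # global average pooling over T, H, W
                     nn.Flatten(),                 # (B, C)
                     nn.Linear(64, num_classes))   # fully connected classification
logits = head(fused)
category = logits.argmax(dim=1)                    # predicted class index
```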
  • The video classification model may include two or more feature extraction layers, through which two or more spatiotemporal feature images can be extracted; the video to be classified can itself be regarded as a kind of spatiotemporal feature image.
  • the feature extraction layer includes two feature extraction sublayers.
  • The feature extraction sublayer may be referred to as a SmallBig unit for short.
  • The feature extraction layer in the video classification model includes two feature extraction sublayers, namely SmallBig unit 1 and SmallBig unit 2; the fused feature information extracted by the previous feature extraction sublayer, SmallBig unit 1, can be used as the input of the next-level feature extraction sublayer, SmallBig unit 2.
  • the fully connected layer performs video classification and outputs the category of the video.
  • The video to be classified is input to the first-level feature extraction sublayer SmallBig unit 1, and a first two-dimensional convolution operation is performed on the multiple video frames included therein to obtain the spatial feature information included in the multiple video frames.
  • A first pooling operation is performed on the video frames of the video to be classified in the time dimension, that is, the multiple video frames included in the video to be classified are pooled using a three-dimensional pooling kernel with a predetermined duration parameter.
  • The convolution parameters of the first convolution operation are then used to perform a second two-dimensional convolution operation on the pooled image to obtain the temporal feature information.
  • The spatial feature information is fused with the temporal feature information to obtain the fused feature information.
  • the corresponding pixels of the image corresponding to the spatial feature information and the temporal feature information are pixel-added to obtain the fusion feature information including the spatial feature and the temporal feature
  • the fusion feature information may include multiple frames of images.
  • the fusion feature information is input to the second-level feature extraction sublayer SmallBig unit 2, and the image of each channel in the fusion feature information is subjected to the third convolution operation to further obtain the spatial features in the fusion feature information of the SmallBig unit 1.
  • a second pooling operation is performed on the fusion feature information in the time dimension according to the time sequence of the channels, and the pooling information obtained by the second pooling operation Perform the fourth convolution operation to further extract the time feature information of multiple images in the fusion feature information of the SmallBig unit 1.
  • the fourth convolution operation and the third convolution operation use the same convolution parameters.
  • Fig. 6 is a schematic diagram of implementing video classification through three feature extraction sublayers provided by an embodiment of the application.
  • a third-level feature extraction sublayer SmallBig unit 3 is added.
  • The fused feature information output by the first-level feature extraction sublayer SmallBig unit 1 is processed by the second-level feature extraction sublayer SmallBig unit 2, and the fused feature information output by the second-level feature extraction sublayer SmallBig unit 2 is obtained through its fusion processing.
  • the third-level feature extraction sub-layer SmallBig unit 3 separately processes the fusion feature information output by the second-level feature extraction sub-layer SmallBig unit 2 through two-dimensional convolution and pooling, and further extracts temporal feature information and space The feature information is fused to obtain the fused feature information output by the SmallBig unit 3 of the third-level feature extraction sublayer.
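  • The cascade of Fig. 5 and Fig. 6 could be sketched as below, reusing the illustrative SmallBigUnit class from the earlier sketch; the channel widths and class count are example values, not values given in this application.

```python
import torch.nn as nn

# Output of each unit becomes the input of the next; the classification head follows.
# SmallBigUnit and the channel widths below come from the earlier illustrative sketch.
backbone = nn.Sequential(
    SmallBigUnit(3, 64),     # SmallBig unit 1: video frames -> fused feature information
    SmallBigUnit(64, 64),    # SmallBig unit 2: refines the fused feature information
    SmallBigUnit(64, 64),    # SmallBig unit 3 (the Fig. 6 variant)
)
classifier = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 10))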
  • the feature extraction layer is also used to superimpose the to-be-classified video with the fusion feature information output by the feature extraction layer to form a residual connection to update the fusion feature information.
  • The fused data, which includes the temporal feature information and the spatial feature information calculated by the third-level feature extraction sublayer SmallBig unit 3, is superimposed with the video to be classified; that is, the temporal feature information and spatial feature information extracted by the third-level feature extraction sublayer are merged with the input to form the residual connection structure. In this way, the newly added parameters will not affect the parameters of the original pre-trained image network during training, which helps preserve the pre-training effect of the image network, and the introduction of the residual helps speed up convergence and improve the training efficiency of the video classification model.
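  • One possible way to express this residual connection is sketched below (again an illustration only; the 1x1x1 projection used when the channel counts differ is an assumption, not something stated in this application).

```python
import torch
import torch.nn as nn

class ResidualSmallBig(nn.Module):
    """Adds the block input back onto the fused output, forming a residual connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unit = SmallBigUnit(in_ch, out_ch)   # illustrative unit from the earlier sketch
        # 1x1x1 projection so the input can be added when channel counts differ (assumption).
        self.proj = nn.Identity() if in_ch == out_ch else nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.proj(x) + self.unit(x)        # residual: input superimposed on fused features
```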
  • The convolution kernel used by the first-level feature extraction subunit is the first convolution kernel, and its pooling uses the first pooling kernel.
  • The second-level feature extraction subunit uses the second convolution kernel for convolution and the second pooling kernel for pooling.
  • The third-level feature extraction subunit uses the third convolution kernel for convolution and the third pooling kernel for pooling.
  • The first convolution kernel and the third convolution kernel are smaller in size than the second convolution kernel used in the second convolution operation.
  • For example, the sizes of the first convolution kernel and the third convolution kernel are 1*1*1, and the size of the second convolution kernel is 1*3*3.
  • The first pooling kernel and the second pooling kernel may be smaller than the third pooling kernel used in the third pooling operation.
  • For example, the sizes of the first pooling kernel and the second pooling kernel are 3*3*3, and the size of the third pooling kernel is T*3*3.
  • T can be the video duration or the number of video frames corresponding to the video duration; similarly, t may be given as a duration or as a number of video frames.
  • In this way, the temporal characteristics of the video frames over the entire video length can be extracted.
  • The output fused feature information therefore has a global temporal receptive field.
  • two spatially local receptive fields have been added, so that the spatial receptive field of the overall module has also been increased.
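  • The kernel sizes described above could be summarized by the illustrative configuration below; the dictionary layout is purely for readability, and the temporal extent T of the third pooling kernel follows the statement that it spans the entire video length.

```python
# Illustrative summary of the kernel sizes described above; values are as stated in the text.
T = 8  # temporal length of the input, e.g. number of input frames (example value)
config = [
    {"level": 1, "conv_kernel": (1, 1, 1), "pool_kernel": (3, 3, 3)},
    {"level": 2, "conv_kernel": (1, 3, 3), "pool_kernel": (3, 3, 3)},
    {"level": 3, "conv_kernel": (1, 1, 1), "pool_kernel": (T, 3, 3)},  # global temporal pooling
]
```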
  • the video classification system described in this application can be trained using optimization algorithms such as stochastic gradient descent (SGD), and the data set can be mainstream video task data.
  • The video classification method described in this application can provide higher accuracy, faster convergence, and better robustness, and is comparable to the current state of the art.
  • For example, video classification and recognition with only 8 frames of input is better than the 32-frame Nonlocal-R50 (non-local R50 network), and it uses 4.9 times fewer floating-point operations (GFLOPs) than the 128-frame Nonlocal-R50 while achieving the same accuracy.
  • The performance of the 8-frame input of the video classification method described in this application is also better than the current state-of-the-art SlowFast-R50 network (a combined fast-and-slow R50 network) with 36-frame input.
  • the present application also provides a method for training a video classification model.
  • The method includes: obtaining a sample video in a sample video set and a sample classification result of the sample video, the sample video including a plurality of video frames; extracting the spatial feature information in the sample video through two-dimensional convolution; extracting the temporal feature information in the sample video through pooling; fusing the spatial feature information and the temporal feature information to obtain fused feature information, and performing fully connected processing on the fused feature information to obtain a model classification result; and comparing the model classification result with the sample classification result, correcting the parameters of the two-dimensional convolution, and returning to the step of extracting the spatial feature information in the sample video through two-dimensional convolution, until the difference between the model classification result and the sample classification result meets a preset condition, thereby obtaining the trained video classification model.
  • The structure of the video classification model is consistent with the neural network model adopted by the video classification method shown in FIG. 3, and will not be repeated here.
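  • A compact sketch of this training procedure is given below; it assumes PyTorch, cross-entropy loss, and the illustrative model pieces from the earlier sketches, and apart from SGD (which is named above) none of these specific choices are mandated by this application.

```python
import torch
import torch.nn as nn

model = nn.Sequential(backbone, classifier)               # illustrative model from the earlier sketches
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(sample_clip, sample_label):
    """One update: compare model and sample classification results, correct the 2D conv parameters."""
    optimizer.zero_grad()
    model_result = model(sample_clip)                      # model classification result
    loss = criterion(model_result, sample_label)           # difference from the sample classification result
    loss.backward()
    optimizer.step()                                       # corrects the two-dimensional convolution parameters
    return loss.item()

# Repeat over the sample video set until the loss meets the preset condition.
loss = train_step(torch.randn(2, 3, 8, 56, 56), torch.tensor([1, 0]))
```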
  • FIG. 7 is a schematic diagram of a video classification device provided by an embodiment of the application, and the video classification device includes:
  • the video to be classified acquisition unit 701 is configured to acquire a video to be classified, where the video to be classified includes a plurality of video frames;
  • the classification unit 702 is configured to input the video to be classified into a trained video classification model for processing, and output the classification result of the video to be classified; wherein, the video classification model includes a feature extraction layer and a fully connected layer, so The feature extraction layer is used for extracting the spatial feature information of the multiple video frames through two-dimensional convolution, and extracting the temporal feature information of the multiple video frames through pooling, and fusing the spatial feature information and temporal feature information Output fusion feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • the video classification device described in FIG. 7 corresponds to the video classification method shown in FIG. 3. With the video classification device, the video classification method described in any of the above embodiments can be executed.
  • Fig. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • the video classification device 8 of this embodiment includes a processor 80, a memory 81, and a computer program 82 stored in the memory 81 and running on the processor 80, such as a video classification program.
  • the processor 80 executes the computer program 82, the steps in the foregoing embodiments of the video classification method are implemented.
  • the processor 80 executes the computer program 82, the function of each module/unit in the foregoing device embodiments is realized.
  • The computer program 82 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete this application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the video classification device 8.
  • the video classification device 8 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the video classification device may include, but is not limited to, a processor 80 and a memory 81.
  • FIG. 8 is only an example of the video classification device 8 and does not constitute a limitation on the video classification device 8. It may include more or fewer components than shown in the figure, or combine certain components, or use different components.
  • For example, the video classification device may also include input and output devices, network access devices, buses, and so on.
  • The so-called processor 80 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 81 may be an internal storage unit of the video classification device 8, for example, a hard disk or a memory of the video classification device 8.
  • The memory 81 may also be an external storage device of the video classification device 8, such as a plug-in hard disk equipped on the video classification device 8, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the memory 81 may also include both an internal storage unit of the video classification device 8 and an external storage device.
  • the memory 81 is used to store the computer program and other programs and data required by the video classification device.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • the disclosed device/terminal device and method may be implemented in other ways.
  • the device/terminal device embodiments described above are only illustrative.
  • The division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • All or part of the processes in the methods of the above embodiments of this application can also be completed by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium. When the program is executed by the processor, it can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signal, telecommunications signal, and software distribution media, etc.
  • The content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

Abstract

A video classification method and apparatus, and a device, and a computer readable storage medium. The video classification method comprises: obtaining a video to be classified, said video comprising multiple video frames (S301); and inputting said video to a trained video classification model for processing, and outputting a classification result of said video, wherein the video classification model comprises a feature extraction layer and a fully connected layer, the feature extraction layer is used for extracting spatial feature information by means of a two-dimensional convolution, extracting temporal feature information by means of pooling, and fusing the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used for performing full connection processing on the fused feature information to obtain the classification result (S302). According to the method, with respect to calculation of a three-dimensional convolution kernel, feature information of a temporal dimension of said video is obtained by means of pooling, and the used two-dimensional convolution can greatly reduce the calculation of convolution parameters, thereby facilitating reduction of the computational complexity of video classification.

Description

Video classification method, apparatus, device and computer-readable storage medium

Technical Field

This application belongs to the field of image processing, and in particular relates to a video classification method, apparatus, device, and computer-readable storage medium.

Background

To facilitate image management, image content can be recognized and classified through deep learning. In recent years, as convolutional neural networks have made major breakthroughs in image classification tasks, the accuracy of image classification by two-dimensional convolutional neural networks has even exceeded that of human classification.

While two-dimensional convolutional neural networks are used to classify images accurately, they can also be applied to the classification of videos composed of images. Since video data has one more time dimension than static pictures, a three-dimensional convolution kernel that includes the time dimension is usually used to extract features in time and space simultaneously in order to capture the temporal information in the video. However, convolution with a three-dimensional kernel introduces extra parameters compared with two-dimensional convolution, resulting in an increased amount of calculation.
Technical Problem

In view of this, the embodiments of the present application provide a video classification method, apparatus, device, and computer-readable storage medium, to solve the problem in the prior art that performing video classification through convolution with a three-dimensional convolution kernel adds extra parameters compared with two-dimensional convolution, which increases the amount of calculation.
Technical Solutions

To solve the above technical problem, the technical solutions adopted in the embodiments of this application are as follows:

A first aspect of the embodiments of the present application provides a video classification method, the method including:

acquiring a video to be classified, where the video to be classified includes a plurality of video frames;

inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified, where the video classification model includes a feature extraction layer and a fully connected layer; the feature extraction layer is used to extract spatial feature information of the multiple video frames through two-dimensional convolution, extract temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information; and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.

With reference to the first aspect, in a first possible implementation of the first aspect, the feature extraction layer includes N feature extraction sublayers, N ≥ 1; the input information of the first of the N feature extraction sublayers is the multiple video frames, the output information of the previous feature extraction sublayer is the input information of the next feature extraction sublayer, and the output information of the Nth feature extraction sublayer is the fused feature information output by the feature extraction layer; each of the N feature extraction sublayers includes a large receptive field context feature extraction branch and a small receptive field core feature extraction branch, and the processing of the input information by each of the N feature extraction sublayers includes:

pooling the input information through the large receptive field context feature extraction branch to extract temporal feature information of the input information;

performing two-dimensional convolution on the input information through the small receptive field core feature extraction branch to extract spatial feature information of the input information;

fusing the temporal feature information extracted by the large receptive field context feature extraction branch and the spatial feature information extracted by the small receptive field core feature extraction branch to obtain the output information.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, pooling the input information through the large receptive field context feature extraction branch to extract the temporal feature information of the input information includes:

performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain pooled information;

performing two-dimensional convolution on the pooled information through the large receptive field context feature extraction branch to obtain the temporal feature information.

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain the pooled information includes:

pooling the input information with a three-dimensional pooling kernel {t, K, K} in the large receptive field context feature extraction branch to obtain the pooled information, where t is the size of the kernel in the time direction and is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel defines the size of the pooling pixels selected in a single pooling calculation.

With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, among the N three-dimensional pooling kernels included in the feature extraction layer, the N three-dimensional pooling kernels are all the same size, or the N three-dimensional pooling kernels are all of different sizes, or some of the N three-dimensional pooling kernels are the same size; the three-dimensional pooling kernel defines the size of the pooling pixels selected in a single pooling calculation.

With reference to the third or fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the N three-dimensional pooling kernels being of completely different sizes includes:

gradually increasing the size of the three-dimensional pooling kernel as feature information is extracted.

With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, gradually increasing the size of the three-dimensional pooling kernel includes:

gradually increasing the size of the three-dimensional pooling kernel in the time direction;

or gradually increasing the size of the three-dimensional pooling kernel in the dimensions of the two-dimensional space where the video frame is located;

or gradually increasing both the size of the three-dimensional pooling kernel in the time direction and its size in the dimensions of the two-dimensional space where the video frame is located.
With reference to the second possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the convolution parameters of the two-dimensional convolution in the large receptive field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution in the small receptive field core feature extraction branch.

With reference to the first aspect, in an eighth possible implementation of the first aspect, the image size corresponding to the input information of the pooling process is consistent with the image size corresponding to the output information of the pooling process.

With reference to the eighth possible implementation of the first aspect, in a ninth possible implementation of the first aspect, the input feature image or video frame is padded in the time dimension or the space dimension, so that after the pooling kernel pools the padded input information, the image size corresponding to the obtained output information is consistent with the image size corresponding to the input information.

With reference to the first aspect, in a tenth possible implementation of the first aspect, fusing the spatial feature information and the temporal feature information to output the fused feature information includes:

superimposing the image of the spatial feature information and the image of the temporal feature information to generate the fused feature information.
According to a second aspect, an embodiment of the present application provides a video classification apparatus, the apparatus including:
a to-be-classified video obtaining unit, configured to obtain a video to be classified, where the video to be classified includes a plurality of video frames; and
a classification unit, configured to input the video to be classified into a trained video classification model for processing and output a classification result of the video to be classified, where the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is configured to extract spatial feature information of the plurality of video frames through two-dimensional convolution, extract temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is configured to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
A third aspect of the embodiments of the present application provides a video classification device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, causes the video classification device to implement the video classification method according to any one of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the video classification method according to any one of the first aspect.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: the classification model extracts the spatial feature information of the plurality of video frames in the video to be classified through two-dimensional convolution, extracts the temporal feature information of the plurality of video frames through pooling, fuses the temporal feature information and the spatial feature information, and obtains the classification result through the fully connected layer. Since the temporal feature information of the video to be classified is obtained through pooling, the two-dimensional convolution calculation adopted in the present application retains the temporal feature information while greatly reducing the number of convolution parameters compared with three-dimensional convolution kernel calculation, which helps reduce the amount of computation for video classification. Moreover, the embodiments of the present application can be inserted into any two-dimensional convolutional network to classify videos, which helps improve the diversity and generality of video classification methods.
Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings required in the description of the embodiments or exemplary technologies. Obviously, the accompanying drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of a video classification application scenario provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of video classification using three-dimensional convolution in the prior art;
FIG. 3 is a schematic flowchart of the implementation of a video classification method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the implementation of a video classification method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an implementation of video classification provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of another implementation of video classification provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a video classification apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application.
Embodiments of the Present Invention
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth to provide a thorough understanding of the embodiments of the present application. However, it should be clear to a person skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary details do not obscure the description of the present application.
To illustrate the technical solutions described in the present application, specific embodiments are described below.
With the rise of video data, video classification technology is needed in more and more scenarios. Classifying and managing videos by the video classification method described in the embodiments of the present application can effectively improve the convenience of video use.
For example, in the field of intelligent surveillance, video classification technology is used to classify collected surveillance videos and determine whether the video content is abnormal. The video classification method described in the present application is insensitive to how fast the action changes between frames and can effectively model actions of different durations. Classifying surveillance videos through this modeling can help users quickly locate key surveillance information, or send abnormality alerts to monitoring personnel in time, so that the monitoring personnel can handle abnormalities in the surveillance videos promptly.
For example, when a large number of videos are stored on a device, video classification technology can classify them into videos of different scenes, different moods, different styles, and so on, so that users can quickly find the videos they need.
For example, in intelligent sports training or video-assisted refereeing, there are sports videos with faster actions, such as basketball shooting, gymnastics, or speed skating, as well as sports videos with slower actions, such as yoga. Because the video classification method described in the present application is insensitive to the speed and duration of motion, it can be used to classify the actions in such sports videos.
For another example, as shown in FIG. 1, in a video entertainment platform, the platform server receives a self-shot video uploaded by terminal A and classifies the uploaded video to obtain the category of the video uploaded by terminal A. As the number of uploaded videos grows, the number of videos in each category also grows. When another terminal, such as terminal B, browses videos, the category of the video browsed by terminal B is obtained from the pre-classification results. The platform can then search for other videos of the same category according to the category of the video browsed by terminal B and recommend them to terminal B, improving the user's video browsing experience.
However, in the currently common video classification algorithms, as shown in FIG. 2, a three-dimensional convolution kernel containing temporal information, such as a 3*1*1 temporal convolution kernel, is selected to perform the convolution operation on the video to be classified. The three-dimensional convolution kernel covers the width W and height H of the image as well as the duration T. During the convolution calculation, in addition to the computation of the spatial parameters, such as those in the width W and height H dimensions of the image shown in FIG. 2, parameters in the time dimension must also be computed. Compared with a conventional two-dimensional convolution kernel, the three-dimensional convolution kernel adds parameter computation in the time dimension, introduces a large number of additional parameters, and increases the amount of computation for video classification.
To reduce the amount of computation in video classification, an embodiment of the present application provides a video classification method. As shown in FIG. 3, the video classification method includes:
In step S301, a video to be classified is obtained, where the video to be classified includes a plurality of video frames.
The video to be classified in the embodiments of the present application may be a video stored in a user terminal, a video collected by a monitoring device, or a video uploaded by a platform user and received by a video entertainment platform. When the video is a video collected by a monitoring device, the video collected in real time may be divided into several sub video segments according to a preset time period, and the collected sub video segments are classified to determine whether an abnormality exists in the sub video segments.
The video to be classified includes a plurality of video frames, and the plurality of video frames are arranged in chronological order. From the video to be classified, the spatial information of each video frame, namely its width W and height H, can be determined. From the time interval between video frames and the initial playback time, the playback time corresponding to each video frame can be determined.
In step S302, the video to be classified is input into a trained video classification model for processing, and a classification result of the video to be classified is output, where the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract the spatial feature information of the plurality of video frames through two-dimensional convolution, extract the temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
The feature extraction layer may include a large receptive field context feature extraction branch and a small receptive field core feature extraction branch. The large receptive field context feature extraction branch is used to extract temporal feature information, or spatio-temporal feature information that includes temporal feature information; the context features here are the temporal feature information. The large receptive field can be obtained by cascading multiple feature extraction sub-layers, or by gradually increasing the size of the three-dimensional pooling kernel. The small receptive field core feature extraction branch is used to extract the two-dimensional spatial feature information of each video frame in the video to be classified. The feature extraction layer is also used to fuse the extracted temporal feature information and spatial feature information to obtain the fused feature information. That is, through the dual-branch structure, both the context information extracted by the large receptive field context feature extraction branch and the core features extracted by the small receptive field core feature extraction branch can be obtained effectively.
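Purely by way of illustration (and not as part of the disclosed embodiments), the dual-branch structure described above can be sketched in PyTorch-style code; the class name SmallBigBlock, the tensor layout (batch, channels, T, H, W), and the default kernel sizes are assumptions made for this sketch:

```python
import torch
import torch.nn as nn

class SmallBigBlock(nn.Module):
    """Illustrative sketch of one dual-branch feature extraction sub-layer.

    Tensors are assumed to have shape (batch, channels, T, H, W). The small
    receptive field branch applies a per-frame 2D convolution (kernel 1*K*K);
    the large receptive field branch first pools over a t*K*K spatio-temporal
    neighborhood and then reuses the SAME convolution weights.
    """
    def __init__(self, c_in, c_out, k=3, t=3):
        super().__init__()
        # One convolution shared by both branches (parameter sharing).
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=(1, k, k),
                              padding=(0, k // 2, k // 2), bias=False)
        # 3D pooling kernel {t, K, K}; the padding keeps the output the same
        # size as the input so the two branches can be added point by point.
        self.pool = nn.MaxPool3d(kernel_size=(t, k, k), stride=1,
                                 padding=(t // 2, k // 2, k // 2))

    def forward(self, x):
        spatial = self.conv(x)               # small receptive field branch
        temporal = self.conv(self.pool(x))   # large receptive field branch
        return spatial + temporal            # point-wise fusion

clip = torch.randn(2, 64, 8, 56, 56)         # 2 clips, 64 channels, 8 frames
fused = SmallBigBlock(64, 64)(clip)          # same T, H, W as the input
```

Stacking several such blocks, with the fused output of one block feeding the next, yields the cascade of feature extraction sub-layers described below.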
In a possible implementation, in the video classification model, the feature extraction layer may include N feature extraction sub-layers, where N is greater than or equal to 1.
For example, the feature extraction layer may include one feature extraction sub-layer, the fused feature information is output by this feature extraction sub-layer, and the fully connected layer performs fully connected processing on the fused feature information to obtain the classification result.
When N is greater than or equal to 2, the output information of the previous feature extraction sub-layer (the sub-layer of the previous level) serves as the input information of the next feature extraction sub-layer (the sub-layer of the next level). For example, the fused feature information output by the i-th feature extraction sub-layer serves as the input information of the (i+1)-th feature extraction sub-layer. The fused feature information output by the i-th feature extraction sub-layer combines temporal feature information and spatial feature information, and the (i+1)-th feature extraction sub-layer can further extract feature information from it through pooling, where i is greater than or equal to 1 and less than N.
The fused feature information refers to feature information obtained by fusing the temporal feature information and the spatial feature information. The fusion processing may be a superposition of feature information; for example, the image corresponding to the temporal feature information and the image corresponding to the spatial feature information may be superimposed pixel by pixel.
To make the image sizes corresponding to the temporal feature information and the spatial feature information consistent at fusion time, when the input information is pooled, the image size corresponding to the input information of the pooling processing can be kept the same as the image size corresponding to the output information of the pooling processing.
In one implementation, the input information may be padded, that is, the input feature image or video frame is padded in the time dimension, and optionally also in the spatial dimensions, so that after the pooling kernel pools the padded input information, the size of the obtained output information is the same as the size of the unpadded input information.
For example, once the size of the input information is n, the size of the pooling kernel is f, the stride is s, the padding size is p, and the size of the output information is o, the required padding can be calculated from the formula:
o = floor((n + p − f) / s) + 1, that is, p = (o − 1) * s + f − n, with p denoting the total amount of padding added along the dimension concerned,
to determine the amount of padding that needs to be filled in.
For example, for a pooling operation with a 3*3*3 pooling kernel and a stride of 1, in order for the output information to have the same size as the input information, a padding size of 2 can be chosen, that is, one element of padding on each side of each dimension.
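As a quick, illustrative check of this relation (a sketch assuming p denotes the total padding along one dimension; the helper name required_padding is not from the original disclosure):

```python
def required_padding(n, f, s):
    """Total padding p such that the pooled output size o equals the input
    size n, using o = (n + p - f) // s + 1, i.e. p = (o - 1) * s + f - n."""
    o = n  # we want the output size to match the input size
    return (o - 1) * s + f - n

# For a 3*3*3 pooling kernel with stride 1, every dimension needs p = 2
# in total (one element on each side), independent of the input size:
print(required_padding(8, 3, 1))    # -> 2  (time dimension, e.g. 8 frames)
print(required_padding(56, 3, 1))   # -> 2  (spatial dimension, e.g. 56 pixels)
```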
Here, a pooling kernel size of 3*3*3 means that the pooling kernel has a size of 3*3 in the two-dimensional plane of the pooled image, where the unit may be pixels or another predetermined length unit, and a length of 3 in the time dimension, where the unit may be video duration, for example 3 seconds of video; the number of video frames corresponding to that duration can be determined from the video duration. Of course, the definition of the three-dimensional pooling kernel is not limited to this; the size of the pooling kernel in the time dimension may also be specified directly by the number of video frames.
The two-dimensional convolution refers to convolution performed over the dimensions of the plane in which the image of a video frame lies, namely the width and height dimensions. The selected convolution kernel is a convolution kernel in two-dimensional space.
When spatial feature information is extracted by two-dimensional convolution, the extraction can be completed based on a predetermined convolution kernel of fixed size. Of course, an existing neural network model may also be selected, for example a convolutional neural network with the LeNet, AlexNet, ResNet, GoogLeNet, or VGGNet architecture, to extract the spatial feature information. Therefore, in the process of extracting the spatial feature information, there is no need to change the frame recognition capability of the convolutional neural network itself in order to obtain the spatial feature information of the video frames in the video to be classified, that is, the feature information of the video frames in the width W and height H dimensions.
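For illustration only, one common way to apply an off-the-shelf 2D backbone frame by frame is to fold the time dimension into the batch dimension; the use of torchvision's ResNet-50 below is an assumption made for the example, not a requirement of the method:

```python
import torch
import torchvision

# Fold the time dimension into the batch dimension so that an unmodified
# 2D backbone processes every frame of the clip independently.
backbone = torchvision.models.resnet50()        # any 2D CNN could be used here
clip = torch.randn(2, 3, 8, 224, 224)           # (batch, C, T, H, W)
b, c, t, h, w = clip.shape
frames = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
features = backbone(frames)                     # per-frame spatial features
features = features.reshape(b, t, -1)           # (batch, T, feature_dim)
```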
Since any two-dimensional convolutional network can be inserted into the video classification method described in the present application, the effect of a three-dimensional convolutional network in capturing temporal feature information can be achieved without requiring specialized hardware or deep learning platform optimizations, and without relying on a specific network design, which effectively improves the generality of the video classification method described in the present application.
Moreover, compared with currently used plug-and-play video recognition modules, such as the temporal shift module (TSM) and the non-local neural network (nonlocal) video recognition module, this helps reduce the amount of computation in the classification process while ensuring the accuracy of the classification results.
In a possible implementation, the input information may first be subjected to three-dimensional pooling processing by the large receptive field context feature extraction branch to obtain pooling information, and the pooling information may then be subjected to two-dimensional convolution processing by the large receptive field context feature extraction branch to obtain the temporal feature information.
For example, in the schematic structural diagram of the video classification method shown in FIG. 4, the two-dimensional convolution performs the convolution operation on the two-dimensional plane in which the image of a single video frame lies, and obtains the spatial feature information of each frame of the video to be classified, that is, the feature information in the width W and height H dimensions of each video frame, without increasing the complexity of extracting feature information from the two-dimensional image.
In one implementation, the convolution kernel of the two-dimensional convolution can be expressed as {C1, C2, 1, K, K}, where C1 denotes the number of channels of the input feature image, C2 denotes the number of channels of the output feature image, the "1" marks the time dimension of the convolution kernel and means that the kernel does not extend in the time dimension, that is, each two-dimensional convolution operates only on the image of a single video frame, and K denotes the size of the convolution kernel in the two-dimensional space of the video frame.
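In a PyTorch-style sketch (an assumption made for illustration), such a kernel can be realized as a 3D convolution whose temporal extent is 1, so it acts as a per-frame 2D convolution on a (batch, C, T, H, W) tensor:

```python
import torch
import torch.nn as nn

C1, C2, K = 64, 128, 3
# {C1, C2, 1, K, K}: temporal extent 1, so each frame is convolved separately.
conv2d_on_video = nn.Conv3d(C1, C2, kernel_size=(1, K, K),
                            padding=(0, K // 2, K // 2), bias=False)

x = torch.randn(2, C1, 8, 56, 56)   # (batch, C1, T, H, W)
y = conv2d_on_video(x)              # (2, C2, 8, 56, 56): only W and H are mixed
```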
The temporal feature information is extracted through pooling, and the pooling processing may include max pooling, average pooling, global average pooling, or other pooling methods. For example, when max pooling is selected, the pixels to be pooled are selected according to the pooling kernel, and the pixel with the largest value is taken as the pooled value.
In one implementation, the three-dimensional pooling kernel can be expressed as {t, K, K}, where t denotes the size of the pooling kernel in the time direction and K denotes the size of the pooling kernel in the two-dimensional space of the image. In particular, t = 3 or t = T (the video length, or the number of video frames or images corresponding to the video duration) may be set. Since the pooling operation does not require convolution calculations and only needs numerical comparisons, the amount of computation it requires is very small.
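A corresponding illustrative sketch of the {t, K, K} pooling kernel, again assuming a PyTorch-style API and same-size padding (shown here for t = 3; for t = T the kernel would span the whole clip in the time direction):

```python
import torch
import torch.nn as nn

t, K = 3, 3
# {t, K, K} pooling kernel: only value comparisons, no learnable parameters.
pool = nn.MaxPool3d(kernel_size=(t, K, K), stride=1,
                    padding=(t // 2, K // 2, K // 2))

x = torch.randn(2, 64, 8, 56, 56)   # (batch, C, T, H, W)
y = pool(x)                         # same shape as x thanks to the padding
```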
Different choices of the time-direction size parameter t correspond to different numbers of video frames involved in the pooling process. Depending on the setting of the pooling stride, the same video frame may be pooled by different pooling kernels. When the K value of the pooling kernel is greater than 1, the pooling kernel also pools multiple pixels or regions in the two-dimensional space. To facilitate the subsequent fusion, a pooling operation with padding can be used to pad the edges of the pooled image, ensuring that the image sizes of the input information and the output information are consistent before and after pooling.
After the pooling processing, convolution processing is performed on the output information of the pooling processing. The pooled output information combines spatio-temporal information of size t*K*K from neighboring positions in time and space; a convolution operation is then performed on the pooled output information by means of two-dimensional convolution to obtain the temporal feature information of the plurality of video frames.
In one implementation, the convolution operations of the small receptive field core feature extraction branch and of the large receptive field context feature extraction branch can share parameters, that is, use the same convolution parameters. As a result, extracting the temporal feature information does not require introducing new convolution parameters for computing features in the time dimension, so the temporal feature information can be obtained without increasing the number of computed parameters, reducing the amount of computation of the video classification model.
Among the N three-dimensional pooling kernels included in the N feature extraction sub-layers, any two three-dimensional pooling kernels may differ in size, or the N three-dimensional pooling kernels may all be of the same size, or some of the three-dimensional pooling kernels may be of the same size while others differ.
In a possible implementation, the three-dimensional pooling kernels used in the three-dimensional pooling processing of the large receptive field context feature extraction branch may have different sizes in the time dimension or different sizes in the spatial dimensions.
For example, adjusting the size of the pooling kernel used for the three-dimensional pooling may include adjusting the size of the three-dimensional pooling kernel in the time dimension (the time direction), or its size in the two-dimensional space of the video frame, or its size in both the time and spatial dimensions, to obtain three-dimensional pooling kernels of different sizes; through these pooling kernels of different sizes, the corresponding spatio-temporal feature information, which includes the temporal feature information, is calculated.
In a possible implementation, the size of the pooling kernel may be increased step by step, including gradually increasing the size of the pooling kernel in the time dimension, or gradually increasing its size in the two-dimensional space of the video frame, or increasing both at the same time, to obtain the pooled feature images, so that the temporal feature information of different durations obtained by pooling with the different pooling kernels is progressively fused into finer-grained spatio-temporal feature information.
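A minimal illustrative sketch of such progressive enlargement in the time direction follows; the concrete sizes 3, 5, 7 are assumptions chosen for the example, not values from the disclosure:

```python
import torch.nn as nn

# Pooling kernels that grow with depth in the time direction only
# (3 -> 5 -> 7 frames); odd sizes keep the same-size padding simple.
pools = nn.ModuleList([
    nn.MaxPool3d(kernel_size=(t, 3, 3), stride=1, padding=(t // 2, 1, 1))
    for t in (3, 5, 7)
])
```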
When the temporal feature information is extracted, as shown in FIG. 4, since the images corresponding to the spatial feature information and to the temporal feature information are obtained with the same convolution parameters, the information represented by corresponding points of the two feature maps is spatially consistent, that is, the spatial feature information and the temporal feature information have the same size, and a point-by-point addition strategy in space can be used to obtain the fused feature information.
The fused feature information is obtained by fusing the spatial feature information and the temporal feature information; the spatial feature information captures the spatial features of the video frames through two-dimensional convolution, and the temporal feature information captures the spatio-temporal features of the images through pooling, so the fused feature information contains both the spatial features and the spatio-temporal features of the images in the video to be classified. The fused feature information is synthesized by the fully connected layer, and the video to be classified is classified according to the synthesized fused feature information to obtain the video classification result. For example, a fully connected computation may be performed on the fused feature information according to preset fully connected layer weight coefficients, and the computation result may be compared with preset classification criteria to determine the video classification result.
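An illustrative sketch of the point-wise fusion and the fully connected classification head follows; the global average pooling step, the tensor shapes, and the class count of 400 are assumptions made for the example:

```python
import torch
import torch.nn as nn

num_classes = 400                               # illustrative number of categories
spatial_feat = torch.randn(2, 256, 8, 7, 7)     # output of the 2D convolution branch
temporal_feat = torch.randn(2, 256, 8, 7, 7)    # output of the pooling branch

fused = spatial_feat + temporal_feat            # point-by-point fusion (same shape)
pooled = fused.mean(dim=(2, 3, 4))              # average over T, H, W -> (2, 256)
fc = nn.Linear(256, num_classes)                # fully connected classification layer
pred = fc(pooled).argmax(dim=1)                 # predicted category for each clip
```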
Since the video classification process does not require additional convolution parameter calculations in the time dimension, and only a simple pooling operation is needed to effectively obtain the spatio-temporal feature information of the video to be classified, the number of parameters to compute is reduced and the computational complexity of video classification is lowered.
In a possible implementation of the present application, the video classification model may include two or more feature extraction layers, through which two or more spatio-temporal feature images can be extracted (the video to be classified being one kind of spatio-temporal feature image). For example, in the schematic diagram of video classification shown in FIG. 5, the feature extraction layer includes two feature extraction sub-layers; in the embodiments of the present application, a feature extraction sub-layer may be referred to as a SmallBig unit for short. As shown in FIG. 5, the feature extraction layer of the video classification model includes two feature extraction sub-layers, SmallBig unit 1 and SmallBig unit 2; the fused feature information extracted by the earlier sub-layer, SmallBig unit 1, serves as the input of the next-level sub-layer, SmallBig unit 2, and based on the fused feature information obtained by the second-level sub-layer, SmallBig unit 2, the fully connected layer classifies the video and outputs the category to which the video belongs.
Specifically, as shown in FIG. 5, the video to be classified is input into the first-level feature extraction sub-layer, SmallBig unit 1, and a first convolution operation of two-dimensional convolution is performed on the plurality of video frames it contains to obtain the spatial feature information of the plurality of video frames. A first pooling operation in the time dimension is then performed on the video frames of the video to be classified, that is, the plurality of video frames are pooled using a three-dimensional pooling kernel with a predetermined duration parameter. The pooled image is further convolved with a second two-dimensional convolution operation whose convolution parameters are shared with those of the first convolution operation, to obtain the temporal feature information. The spatial feature information and the temporal feature information are then fused to obtain the fused feature information: since the image sizes corresponding to the temporal feature information and the spatial feature information are consistent, the corresponding pixels of the images of the spatial feature information and the temporal feature information are added to obtain the fused feature information, which contains both spatial and spatio-temporal features and may comprise multiple frames of images.
The fused feature information is input into the second-level feature extraction sub-layer, SmallBig unit 2. A third convolution operation is applied to the image of each channel of the fused feature information to further obtain the spatial feature information in the fused feature information of SmallBig unit 1. For the images of the multiple channels in the fused feature information output by SmallBig unit 1, a second pooling operation is performed on the fused feature information in the time dimension according to the chronological order of the channels, and a fourth convolution operation is performed on the pooling information obtained by the second pooling operation to further extract the temporal feature information of the multiple images in the fused feature information of SmallBig unit 1, where the fourth convolution operation uses the same convolution parameters as the third convolution operation.
Of course, the number of SmallBig feature extraction sub-layers is not limited to this and may also be three or more. FIG. 6 is a schematic diagram, provided by an embodiment of the present application, of video classification implemented with three feature extraction sub-layers; on the basis of FIG. 5, a third-level feature extraction sub-layer, SmallBig unit 3, is added. The fused feature information output by the first-level sub-layer, SmallBig unit 1, is processed by the second-level convolution operation and the second-level pooling operation respectively, and the second-level sub-layer, SmallBig unit 2, fuses the resulting temporal feature information and spatial feature information to obtain the fused feature information output by SmallBig unit 2. The third-level sub-layer, SmallBig unit 3, processes the fused feature information output by SmallBig unit 2 through two-dimensional convolution and pooling respectively, further extracts temporal feature information and spatial feature information, and fuses them to obtain the fused feature information output by SmallBig unit 3.
In a possible implementation, the feature extraction layer is also used to superimpose the video to be classified onto the fused feature information output by the feature extraction layer, forming a residual connection to update the fused feature information.
For example, in the video classification model shown in FIG. 6, in the third-level feature extraction sub-layer, SmallBig unit 3, the data being fused includes not only the temporal feature information and the spatial feature information computed by SmallBig unit 3 but also the superimposed video to be classified; fusing the video to be classified with the temporal and spatial feature information extracted by the third-level sub-layer forms a residual connection structure, so that during training the newly added parameters do not affect the parameters of the original pre-trained image network, which helps preserve the pre-training effect of the image network, and introducing the residual helps speed up convergence and improve the training efficiency of the video classification model.
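Expressed as a minimal sketch (assuming the block input and the fused output have matching shapes):

```python
# Residual connection around a feature extraction unit: the input is added to
# the fused output, so the newly added branch refines rather than replaces it.
def forward_with_residual(block, x):
    return x + block(x)   # requires block(x) and x to have the same shape
```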
As shown in FIG. 6, the convolution kernel used by the first-level feature extraction sub-unit is the first convolution kernel and its pooling uses the first pooling kernel; the convolution kernel used by the second-level feature extraction sub-unit is the second convolution kernel and its pooling uses the second pooling kernel; the convolution kernel used by the third-level feature extraction sub-unit is the third convolution kernel and its pooling uses the third pooling kernel.
In a possible implementation, the first convolution kernel used by the two-dimensional convolution and the third convolution kernel used by the third convolution operation are smaller than the second convolution kernel used by the second convolution operation. In one implementation, as shown in FIG. 6, the first convolution kernel and the third convolution kernel have a size of 1*1*1, and the second convolution kernel has a size of 1*3*3. The first and third convolution kernels fuse information across the multiple channels and across space-time; the second convolution kernel is used to extract the spatio-temporal features.
In a possible implementation, the first pooling kernel and the second pooling kernel may be smaller than the third pooling kernel used by the third pooling operation. In one implementation, as shown in FIG. 6, the first pooling kernel and the second pooling kernel have a size of 3*3*3, and the third pooling kernel has a size of 3*3*T, where T may be the video duration or the number of video frames corresponding to the video duration; when T is the video duration, t is a duration, and when T is the number of video frames, t is a number of frames. The first and second pooling kernels capture the pooled value, for example the maximum, over the 9-pixel spatial neighborhood within each of three adjacent frames; the third pooling kernel extracts the temporal features of the video frames over the entire video length. By gradually enlarging the temporal receptive field in the time dimension and combining it with the spatial features learned by convolution, the output fused feature information acquires a global temporal receptive field. In addition, SmallBig unit 1 and SmallBig unit 3 each add a local spatial receptive field, so the spatial receptive field of the module as a whole is also enlarged.
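The three cascaded sub-layers with these kernel sizes can be sketched as follows. This is an illustrative PyTorch-style reading of FIG. 6; the class names, the channel widths, and the realization of the 3*3*T pooling as a per-frame 3*3 spatial max followed by a max over the whole clip are assumptions of the sketch, not the disclosed implementation:

```python
import torch
import torch.nn as nn

class SmallBigUnit(nn.Module):
    """One sub-layer: a convolution shared between the small branch (raw input)
    and the big branch (pooled input), fused by point-wise addition."""
    def __init__(self, c_in, c_out, conv_k=(1, 1, 1), pool_k=(3, 3, 3)):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, conv_k,
                              padding=tuple(k // 2 for k in conv_k), bias=False)
        self.pool = nn.MaxPool3d(pool_k, stride=1,
                                 padding=tuple(k // 2 for k in pool_k))

    def forward(self, x):
        return self.conv(x) + self.conv(self.pool(x))

class SmallBigBottleneck(nn.Module):
    """Three cascaded units: 1*1*1 conv + 3*3*3 pool, 1*3*3 conv + 3*3*3 pool,
    1*1*1 conv + {T,3,3} pool, with a residual connection around the block."""
    def __init__(self, c, c_mid):
        super().__init__()
        self.unit1 = SmallBigUnit(c, c_mid, (1, 1, 1), (3, 3, 3))
        self.unit2 = SmallBigUnit(c_mid, c_mid, (1, 3, 3), (3, 3, 3))
        self.conv3 = nn.Conv3d(c_mid, c, (1, 1, 1), bias=False)
        self.pool3_spatial = nn.MaxPool3d((1, 3, 3), stride=1, padding=(0, 1, 1))

    def forward(self, x):
        y = self.unit2(self.unit1(x))
        # {T,3,3} pooling realized as a 3*3 spatial max per frame followed by a
        # max over the whole clip length, broadcast back over T.
        glob = self.pool3_spatial(y).max(dim=2, keepdim=True).values.expand_as(y)
        y = self.conv3(y) + self.conv3(glob)
        return x + y   # residual connection with the block input

# Example: an 8-frame clip with 64 channels at 56x56 resolution.
clip = torch.randn(2, 64, 8, 56, 56)
out = SmallBigBottleneck(c=64, c_mid=16)(clip)   # same shape as the input
```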
In practical applications, the video classification system described in the present application can be trained with optimization algorithms such as stochastic gradient descent (SGD), and mainstream video task datasets can be used. Experimental results of training on such datasets show that, with this network structure, the video classification method described in the present application provides higher accuracy, faster convergence, and better robustness. Compared with the current state-of-the-art networks, our video classification with only 8 input frames outperforms Nonlocal-R50 (the non-local R50 network) with 32 input frames at higher accuracy, and matches the accuracy of Nonlocal-R50 with 128 input frames while using 4.9 times fewer floating-point operations (GFLOPs). In addition, at the same GFLOPs, the 8-frame input of the video classification method described in the present application outperforms the current state-of-the-art SlowFast-R50 network with 36-frame input. These results show that the video classification model for video classification described in the present application is an accurate and efficient video classification model.
In addition, the present application also provides a video classification model training method, the method including: obtaining sample videos in a sample video set and the sample classification results of the sample videos, each sample video including a plurality of video frames; extracting the spatial feature information of the sample video through two-dimensional convolution; extracting the temporal feature information of the sample video through pooling; fusing the spatial feature information and the temporal feature information to obtain fused feature information, and performing fully connected processing on the fused feature information to obtain a model classification result; and correcting the parameters of the two-dimensional convolution according to the model classification result and the sample classification result, and returning to the step of extracting the spatial feature information of the sample video through two-dimensional convolution, until the model classification result and the sample classification result satisfy a preset condition, thereby obtaining the trained video classification model.
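An illustrative training sketch of this procedure (stochastic gradient descent with a cross-entropy loss; the function name and hyper-parameters are assumptions, and model and loader stand for any model with the structure above and any (clip, label) data loader):

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, epochs=10, lr=0.01):
    """Minimal training sketch: stochastic gradient descent with a
    cross-entropy loss, repeated until the preset stopping condition is met."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clips, labels in loader:           # clips: (batch, C, T, H, W)
            logits = model(clips)              # model classification result
            loss = criterion(logits, labels)   # compare with the sample labels
            optimizer.zero_grad()
            loss.backward()                    # gradients for the conv parameters
            optimizer.step()                   # correct the parameters
```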
The structure of the video classification model is the same as that of the neural network model used in the video classification method shown in FIG. 3 and is not repeated here.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
FIG. 7 is a schematic diagram of a video classification apparatus provided by an embodiment of the present application. The video classification apparatus includes:
a to-be-classified video obtaining unit 701, configured to obtain a video to be classified, where the video to be classified includes a plurality of video frames; and
a classification unit 702, configured to input the video to be classified into a trained video classification model for processing and output a classification result of the video to be classified, where the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is configured to extract the spatial feature information of the plurality of video frames through two-dimensional convolution, extract the temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is configured to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
The video classification apparatus described in FIG. 7 corresponds to the video classification method shown in FIG. 3, and the video classification apparatus can execute the video classification method described in any of the foregoing embodiments.
FIG. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application. As shown in FIG. 8, the video classification device 8 of this embodiment includes a processor 80, a memory 81, and a computer program 82, such as a video classification program, stored in the memory 81 and executable on the processor 80. When the processor 80 executes the computer program 82, the steps of the foregoing embodiments of the video classification method are implemented; alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the foregoing apparatus embodiments are implemented.
Exemplarily, the computer program 82 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the video classification device 8.
The video classification device 8 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The video classification device may include, but is not limited to, the processor 80 and the memory 81. A person skilled in the art will understand that FIG. 8 is merely an example of the video classification device 8 and does not constitute a limitation on it; the device may include more or fewer components than shown, combine certain components, or use different components; for example, the video classification device may also include input and output devices, network access devices, buses, and so on.
The so-called processor 80 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the video classification device 8, for example a hard disk or memory of the video classification device 8. The memory 81 may also be an external storage device of the video classification device 8, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the video classification device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the video classification device 8. The memory 81 is used to store the computer program and other programs and data required by the video classification device. The memory 81 may also be used to temporarily store data that has been output or is to be output.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the division of the functional units and modules above is only an example; in practical applications, the above functions may be allocated to different functional units and modules as required, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from one another and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the foregoing system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
A person of ordinary skill in the art will realize that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
在本申请所提供的实施例中,应该理解到,所揭露的装置/终端设备和方法,可以通过其它的方式实现。例如,以上所描述的装置/终端设备实施例仅仅是示意性的,例如,所述 模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。In the embodiments provided in this application, it should be understood that the disclosed device/terminal device and method may be implemented in other ways. For example, the device/terminal device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation, such as multiple units. Or components can be combined or integrated into another system, or some features can be omitted or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application may also be completed by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately added or deleted according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (14)

  1. A video classification method, characterized in that the method comprises:
    acquiring a video to be classified, the video to be classified comprising a plurality of video frames;
    inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified; wherein the video classification model comprises a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract spatial feature information of the plurality of video frames through two-dimensional convolution, extract temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  2. The method according to claim 1, characterized in that the feature extraction layer comprises N feature extraction sub-layers, N ≥ 1; the input information of the first of the N feature extraction sub-layers is the plurality of video frames, the output information of a preceding feature extraction sub-layer is the input information of the following feature extraction sub-layer, and the output information of the N-th feature extraction sub-layer is the fused feature information output by the feature extraction layer; each of the N feature extraction sub-layers comprises a large-receptive-field context feature extraction branch and a small-receptive-field core feature extraction branch, and the processing of input information by each of the N feature extraction sub-layers comprises:
    performing pooling processing on the input information through the large-receptive-field context feature extraction branch to extract temporal feature information of the input information;
    performing two-dimensional convolution processing on the input information through the small-receptive-field core feature extraction branch to extract spatial feature information of the input information;
    fusing the temporal feature information extracted by the large-receptive-field context feature extraction branch and the spatial feature information extracted by the small-receptive-field core feature extraction branch to obtain output information.
  3. The method according to claim 2, characterized in that performing pooling processing on the input information through the large-receptive-field context feature extraction branch to extract the temporal feature information of the input information comprises:
    performing three-dimensional pooling processing on the input information through the large-receptive-field context feature extraction branch to obtain pooling information;
    performing two-dimensional convolution processing on the pooling information through the large-receptive-field context feature extraction branch to obtain the temporal feature information.
  4. The method according to claim 3, characterized in that performing three-dimensional pooling processing on the input information through the large-receptive-field context feature extraction branch to obtain the pooling information comprises:
    performing pooling processing on the input information through a three-dimensional pooling kernel {t, K, K} in the large-receptive-field context feature extraction branch to obtain the pooling information, where t is the size of the kernel in the time direction, t is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel is the size of the pooling pixels selected in a single pooling calculation.
  5. The method according to claim 4, characterized in that, among the N three-dimensional pooling kernels included in the feature extraction layer, the N three-dimensional pooling kernels are all of the same size, or all of different sizes, or some of the N three-dimensional pooling kernels are of the same size, wherein a three-dimensional pooling kernel is the size of the pooling pixels selected in a single pooling calculation.
  6. The method according to claim 4 or 5, characterized in that the N three-dimensional pooling kernels being all of different sizes comprises:
    gradually increasing the size of the three-dimensional pooling kernel in the order in which feature information is extracted.
  7. The method according to claim 6, characterized in that gradually increasing the size of the three-dimensional pooling kernel comprises:
    gradually increasing the size of the three-dimensional pooling kernel in the time direction;
    or gradually increasing the size of the three-dimensional pooling kernel in the dimensions of the two-dimensional space where the video frames are located;
    or gradually increasing both the size of the three-dimensional pooling kernel in the time direction and its size in the dimensions of the two-dimensional space where the video frames are located.
  8. The method according to claim 3, characterized in that the convolution parameters of the two-dimensional convolution processing in the large-receptive-field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution processing in the small-receptive-field core feature extraction branch.
  9. The method according to claim 3, characterized in that the size of the image corresponding to the input information of the pooling processing is the same as the size of the image corresponding to the output information of the pooling processing.
  10. The method according to claim 9, characterized in that the input feature image or video frame is padded in the time dimension or the spatial dimension, so that after the pooling kernel performs pooling processing on the padded input information, the size of the image corresponding to the obtained output information is the same as the size of the image corresponding to the input information.
  11. The method according to claim 1, characterized in that fusing the spatial feature information and the temporal feature information to output the fused feature information comprises:
    superimposing the image of the spatial feature information and the image of the temporal feature information to generate the fused feature information.
  12. A video classification apparatus, characterized in that the apparatus comprises:
    a to-be-classified video acquisition unit, configured to acquire a video to be classified, the video to be classified comprising a plurality of video frames;
    a classification unit, configured to input the video to be classified acquired by the to-be-classified video acquisition unit into a trained video classification model for processing, and to output a classification result of the video to be classified; wherein the video classification model comprises a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract spatial feature information of the plurality of video frames through two-dimensional convolution, extract temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  13. A video classification device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the video classification device implements the steps of the method according to any one of claims 1 to 11.
  14. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 11 are implemented.
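
For ease of understanding only, the following is a minimal sketch, in PyTorch, of one possible implementation of the feature extraction sub-layer described in claims 2 to 4, 8 and 11: a small-receptive-field core branch that applies a two-dimensional convolution to each video frame, and a large-receptive-field context branch that applies a three-dimensional pooling kernel {t, K, K} followed by the same two-dimensional convolution, the two outputs being fused by superposition. The class name, the choice of average pooling, the channel sizes, and all hyperparameters below are assumptions introduced for illustration and are not specified by the claims.

```python
import torch.nn as nn


class FeatureExtractionSubLayer(nn.Module):
    """Hypothetical sketch of one feature extraction sub-layer: a
    small-receptive-field core branch (2D convolution per frame) fused by
    superposition with a large-receptive-field context branch (3D pooling
    followed by the same 2D convolution)."""

    def __init__(self, in_channels, out_channels, pool_t=3, pool_k=3):
        super().__init__()
        # 2D convolution whose parameters are shared by both branches (claim 8).
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Three-dimensional pooling kernel {t, K, K}; average pooling is an
        # assumption, and the padding keeps the pooled output the same size as
        # the input (claims 4, 9 and 10; pool_t and pool_k assumed odd).
        self.pool3d = nn.AvgPool3d(
            kernel_size=(pool_t, pool_k, pool_k),
            stride=1,
            padding=(pool_t // 2, pool_k // 2, pool_k // 2),
        )

    def _conv_per_frame(self, x):
        # x: (batch, channels, time, height, width); apply the 2D convolution
        # to every frame independently, then restore the 5D layout.
        n, c, t, h, w = x.shape
        y = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        y = self.conv2d(y)
        return y.reshape(n, t, -1, h, w).permute(0, 2, 1, 3, 4)

    def forward(self, x):
        # Small-receptive-field core branch: spatial feature information.
        spatial = self._conv_per_frame(x)
        # Large-receptive-field context branch: temporal feature information
        # via 3D pooling followed by the shared 2D convolution (claim 3).
        temporal = self._conv_per_frame(self.pool3d(x))
        # Fusion by superimposing the two feature maps (claim 11).
        return spatial + temporal
```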
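
Building on the sub-layer sketch above, the following sketch shows, again only as an illustration under stated assumptions, how N such sub-layers might be stacked with gradually increasing three-dimensional pooling kernels (claims 5 to 7) before a fully connected layer produces the classification result (claim 1). The global average pooling before the fully connected layer, the specific kernel sizes and channel counts, and the number of classes are assumptions, not features recited in the claims.

```python
import torch
import torch.nn as nn


class VideoClassifier(nn.Module):
    """Hypothetical sketch of the overall model: stacked feature extraction
    sub-layers with gradually growing 3D pooling kernels, followed by a
    fully connected classification layer."""

    def __init__(self, num_classes,
                 channels=(3, 16, 32, 64),  # assumed channel widths
                 pool_ts=(1, 3, 5),         # time-direction kernel grows with depth
                 pool_ks=(3, 3, 5)):        # spatial kernel grows with depth
        super().__init__()
        self.features = nn.Sequential(*[
            FeatureExtractionSubLayer(channels[i], channels[i + 1],
                                      pool_t=pool_ts[i], pool_k=pool_ks[i])
            for i in range(len(channels) - 1)
        ])
        self.fc = nn.Linear(channels[-1], num_classes)

    def forward(self, frames):
        # frames: (batch, channels, time, height, width).
        fused = self.features(frames)
        # Global average over time and space (an assumption) before the
        # fully connected layer that yields the classification result.
        pooled = fused.mean(dim=(2, 3, 4))
        return self.fc(pooled)


# Hypothetical usage: two 8-frame RGB clips at 112x112 resolution.
clips = torch.randn(2, 3, 8, 112, 112)
logits = VideoClassifier(num_classes=10)(clips)
print(logits.shape)  # torch.Size([2, 10])
```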
PCT/CN2020/134995 2020-06-11 2020-12-09 Video classification method and apparatus, and device, and computer readable storage medium WO2021248859A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010531316.9A CN111859023A (en) 2020-06-11 2020-06-11 Video classification method, device, equipment and computer readable storage medium
CN202010531316.9 2020-06-11

Publications (2)

Publication Number Publication Date
WO2021248859A1 true WO2021248859A1 (en) 2021-12-16
WO2021248859A9 WO2021248859A9 (en) 2022-02-10

Family

ID=72986143

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134995 WO2021248859A1 (en) 2020-06-11 2020-12-09 Video classification method and apparatus, and device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111859023A (en)
WO (1) WO2021248859A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130539A (en) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 Classification model training method, data classification device and computer equipment
CN116824641A (en) * 2023-08-29 2023-09-29 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium
WO2024001139A1 (en) * 2022-06-30 2024-01-04 海信集团控股股份有限公司 Video classification method and apparatus and electronic device
CN114677704B (en) * 2022-02-23 2024-03-26 西北大学 Behavior recognition method based on three-dimensional convolution and space-time feature multi-level fusion

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859023A (en) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 Video classification method, device, equipment and computer readable storage medium
CN112580696A (en) * 2020-12-03 2021-03-30 星宏传媒有限公司 Advertisement label classification method, system and equipment based on video understanding
CN112597824A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN112926472A (en) * 2021-03-05 2021-06-08 深圳先进技术研究院 Video classification method, device and equipment
CN113536898B (en) * 2021-05-31 2023-08-29 大连民族大学 Comprehensive feature capturing type time convolution network, video motion segmentation method, computer system and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN109670446A (en) * 2018-12-20 2019-04-23 泉州装备制造研究所 Anomaly detection method based on linear dynamic system and depth network
US20190188379A1 (en) * 2017-12-18 2019-06-20 Paypal, Inc. Spatial and temporal convolution networks for system calls based process monitoring
CN110766096A (en) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 Video classification method and device and electronic equipment
CN110781830A (en) * 2019-10-28 2020-02-11 西安电子科技大学 SAR sequence image classification method based on space-time joint convolution
CN111859023A (en) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 Video classification method, device, equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN107292247A (en) * 2017-06-05 2017-10-24 浙江理工大学 A kind of Human bodys' response method and device based on residual error network
CN107463949B (en) * 2017-07-14 2020-02-21 北京协同创新研究院 Video action classification processing method and device
CN108304926B (en) * 2018-01-08 2020-12-29 中国科学院计算技术研究所 Pooling computing device and method suitable for neural network
CN110032926B (en) * 2019-02-22 2021-05-11 哈尔滨工业大学(深圳) Video classification method and device based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190188379A1 (en) * 2017-12-18 2019-06-20 Paypal, Inc. Spatial and temporal convolution networks for system calls based process monitoring
CN109670446A (en) * 2018-12-20 2019-04-23 泉州装备制造研究所 Anomaly detection method based on linear dynamic system and depth network
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN110781830A (en) * 2019-10-28 2020-02-11 西安电子科技大学 SAR sequence image classification method based on space-time joint convolution
CN110766096A (en) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 Video classification method and device and electronic equipment
CN111859023A (en) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 Video classification method, device, equipment and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677704B (en) * 2022-02-23 2024-03-26 西北大学 Behavior recognition method based on three-dimensional convolution and space-time feature multi-level fusion
CN115130539A (en) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 Classification model training method, data classification device and computer equipment
WO2024001139A1 (en) * 2022-06-30 2024-01-04 海信集团控股股份有限公司 Video classification method and apparatus and electronic device
CN116824641A (en) * 2023-08-29 2023-09-29 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium
CN116824641B (en) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN111859023A (en) 2020-10-30
WO2021248859A9 (en) 2022-02-10

Similar Documents

Publication Publication Date Title
WO2021248859A1 (en) Video classification method and apparatus, and device, and computer readable storage medium
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
WO2020199693A1 (en) Large-pose face recognition method and apparatus, and device
EP4156017A1 (en) Action recognition method and apparatus, and device and storage medium
WO2020177513A1 (en) Image processing method, device and apparatus, and storage medium
WO2022121485A1 (en) Image multi-tag classification method and apparatus, computer device, and storage medium
CN112070044B (en) Video object classification method and device
CN111368672A (en) Construction method and device for genetic disease facial recognition model
WO2021018251A1 (en) Image classification method and device
CN106803054B (en) Faceform's matrix training method and device
CN113537254B (en) Image feature extraction method and device, electronic equipment and readable storage medium
CN110222718A (en) The method and device of image procossing
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN111177460B (en) Method and device for extracting key frame
CN111401267B (en) Video pedestrian re-identification method and system based on self-learning local feature characterization
CN113159200B (en) Object analysis method, device and storage medium
CN113033448B (en) Remote sensing image cloud-removing residual error neural network system, method and equipment based on multi-scale convolution and attention and storage medium
WO2021073311A1 (en) Image recognition method and apparatus, computer-readable storage medium and chip
WO2024041108A1 (en) Image correction model training method and apparatus, image correction method and apparatus, and computer device
WO2023142550A1 (en) Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product
CN115205613A (en) Image identification method and device, electronic equipment and storage medium
CN109389089B (en) Artificial intelligence algorithm-based multi-person behavior identification method and device
CN112183359A (en) Violent content detection method, device and equipment in video
CN112949571A (en) Method for identifying age, and training method and device of age identification model
CN112183299B (en) Pedestrian attribute prediction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20940364

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.07.2023)