WO2022183805A1 - 视频分类方法、装置及设备 - Google Patents

视频分类方法、装置及设备 Download PDF

Info

Publication number
WO2022183805A1
Authority
WO
WIPO (PCT)
Prior art keywords
convolution kernel
feature
video
image
dimension
Prior art date
Application number
PCT/CN2021/137912
Other languages
English (en)
French (fr)
Inventor
乔宇
黎昆昌
李先航
王亚立
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 filed Critical 深圳先进技术研究院
Publication of WO2022183805A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application belongs to the field of image processing, and in particular, relates to a video classification method, apparatus and device.
  • Video classification is a fundamental task in video understanding because it can not only extract semantic information from videos, but also provide general video representations for other tasks, such as video behavior detection, localization, etc.
  • 2D convolutional neural networks are the most direct and effective video classification methods, including two-stream approaches and sparse temporal sampling strategies. However, these methods cannot fully learn spatiotemporal interaction information and discriminate complex human behaviors poorly. 3D convolutional neural networks, on the other hand, involve very large amounts of parameters and computation, are difficult to deploy in real environments, and more parameters have also been shown to be more prone to overfitting.
  • In view of this, the embodiments of the present application provide a video classification method, apparatus and device, to solve the problems in the prior art that 2D convolutional neural networks used for video classification cannot fully learn spatiotemporal interaction information, while 3D convolutional neural networks involve large amounts of parameters and computation and are difficult to deploy.
  • a first aspect of the embodiments of the present application provides a video classification method, and the method includes:
  • The video classification model includes a feature extraction layer and a fully connected layer.
  • The feature extraction layer is used to decompose the image of the video to be classified in the channel dimension, determine a decomposed image comprising multiple channel sub-dimensions, determine the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decompose each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and, according to the temporal convolution kernel and the spatial convolution kernel, perform serial convolution processing on the decomposed image in the channel dimension to extract the feature image of the video to be classified. The fully connected layer is used to perform full-connection processing on the feature image extracted by the feature extraction layer to obtain the classification result.
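  • As an illustration of the overall pipeline just described (a feature extraction layer followed by a fully connected layer), the following minimal PyTorch sketch uses a placeholder backbone in place of the tensorized feature extraction layer, which is sketched further below; class and parameter names are illustrative, not from the original disclosure.

```python
# Minimal end-to-end sketch: feature extraction followed by a fully connected
# classification layer. The backbone here is a placeholder stand-in.
import torch
import torch.nn as nn


class VideoClassifier(nn.Module):
    def __init__(self, in_channels: int = 3, num_classes: int = 400):
        super().__init__()
        self.features = nn.Sequential(              # stand-in feature extraction layer
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),                # global spatiotemporal pooling
        )
        self.fc = nn.Linear(64, num_classes)        # fully connected layer

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        feat = self.features(clip).flatten(1)       # (N, 64)
        return self.fc(feat)                        # (N, num_classes) class scores


if __name__ == "__main__":
    clip = torch.randn(1, 3, 16, 224, 224)          # N, C, T, H, W
    logits = VideoClassifier()(clip)
    print(logits.argmax(dim=1))                     # predicted class index
```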
  • Decomposing the image of the video to be classified in the channel dimension, determining the decomposed image comprising multiple channel sub-dimensions, determining the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decomposing each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and performing serial convolution processing on the decomposed image in the channel dimension according to the temporal and spatial convolution kernels to extract the feature image of the video to be classified includes:
  • Performing channel decomposition according to the channel feature C of the video frames of the video to be classified, and determining a decomposed image with parameters C1×C2×…×CK×T×H×W, where the channel feature C = C1×C2×…×CK, T is the time dimension of the decomposed image, H is its height, W is its width, and K is the number of sub-dimensions into which the channel dimension is decomposed;
  • Determining the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions: the first sub-dimension convolution kernel C1×1×…×1×t×h×w, the second sub-dimension convolution kernel 1×C2×…×1×t×h×w, …, the K-th sub-dimension convolution kernel 1×1×…×CK×t×h×w, where t is the time dimension of the convolution kernel, h is the height of the convolution kernel, and w is the width of the convolution kernel;
  • Decomposing the first sub-dimension convolution kernel into a first temporal convolution kernel and a first spatial convolution kernel, and convolving the decomposed image with them to obtain the first feature image; decomposing the i-th sub-dimension convolution kernel into an i-th temporal convolution kernel and an i-th spatial convolution kernel, and convolving the (i-1)-th feature image with them to obtain the i-th feature image, where i is greater than 1 and the K-th feature image is the feature image of the video to be classified.
  • Decomposing the i-th sub-dimension convolution kernel into the i-th temporal convolution kernel and the i-th spatial convolution kernel, and convolving the (i-1)-th feature image with them to obtain the i-th feature image, includes:
  • Extracting the i-th spatial feature through the i-th spatial convolution kernel and the i-th temporal feature through the i-th temporal convolution kernel, and fusing the i-th spatial feature and the i-th temporal feature to obtain the i-th feature image.
  • The first spatial convolution kernel and the first temporal convolution kernel convolve the decomposed image to obtain the first feature image;
  • The (i+1)-th spatial convolution kernel and the (i+1)-th temporal convolution kernel convolve the i-th feature image to obtain the (i+1)-th feature image;
  • The K-th feature image is the feature image of the video to be classified;
  • K is the number of sub-dimensions into which the channel dimension is decomposed.
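  • The cascaded, per-sub-dimension convolution described above can be sketched as follows for K = 2 channel sub-dimensions. This is one plausible reading, written in PyTorch for illustration: each sub-dimension kernel mixes only its own channel sub-axis (the other sub-axis is folded into the batch), the temporal (t×1×1) and spatial (1×h×w) branches are fused by summation, and the two sub-dimension stages are applied serially. The class name TensorizedSTConv and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn


class TensorizedSTConv(nn.Module):
    """Cascaded per-sub-dimension convolutions on a (N, C, T, H, W) tensor, C = C1 * C2."""

    def __init__(self, c1: int, c2: int, t: int = 3, h: int = 3, w: int = 3):
        super().__init__()
        self.c1, self.c2 = c1, c2
        # Sub-dimension 1: mixes only the C1 axis; factorized into a temporal
        # (t x 1 x 1) branch and a spatial (1 x h x w) branch.
        self.temp1 = nn.Conv3d(c1, c1, (t, 1, 1), padding=(t // 2, 0, 0), bias=False)
        self.spat1 = nn.Conv3d(c1, c1, (1, h, w), padding=(0, h // 2, w // 2), bias=False)
        # Sub-dimension 2: mixes only the C2 axis, factorized the same way.
        self.temp2 = nn.Conv3d(c2, c2, (t, 1, 1), padding=(t // 2, 0, 0), bias=False)
        self.spat2 = nn.Conv3d(c2, c2, (1, h, w), padding=(0, h // 2, w // 2), bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, tt, hh, ww = x.shape
        assert c == self.c1 * self.c2, "channel dim must factor as C1 * C2"
        x = x.reshape(n, self.c1, self.c2, tt, hh, ww)   # channel axis viewed as (C1, C2)

        # First sub-dimension: fold C2 into the batch, convolve over C1.
        y = x.permute(0, 2, 1, 3, 4, 5).reshape(n * self.c2, self.c1, tt, hh, ww)
        y = self.temp1(y) + self.spat1(y)                # fuse branches by summation
        y = y.reshape(n, self.c2, self.c1, tt, hh, ww).permute(0, 2, 1, 3, 4, 5)

        # Second sub-dimension: fold C1 into the batch, convolve over C2.
        z = y.reshape(n * self.c1, self.c2, tt, hh, ww)
        z = self.temp2(z) + self.spat2(z)
        z = z.reshape(n, self.c1, self.c2, tt, hh, ww)

        return z.reshape(n, c, tt, hh, ww)               # back to (N, C, T, H, W)


if __name__ == "__main__":
    clip = torch.randn(2, 4, 8, 56, 56)                  # N, C = 4 = 2 x 2, T, H, W
    out = TensorizedSTConv(c1=2, c2=2)(clip)
    print(out.shape)                                     # torch.Size([2, 4, 8, 56, 56])
```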
  • Before extracting the i-th spatial feature through the i-th spatial convolution kernel, the method further includes: performing average pooling in the time dimension on the image whose features are to be extracted.
  • Before extracting the i-th temporal feature through the i-th temporal convolution kernel, the method further includes: performing average pooling in the spatial dimension on the image whose features are to be extracted.
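  • A short sketch of this optional pre-pooling, assuming an (N, C, T, H, W) tensor layout: the time axis is average-pooled before the spatial convolution, and the spatial axes before the temporal convolution.

```python
import torch

x = torch.randn(2, 4, 8, 56, 56)                    # image whose features are to be extracted

x_for_spatial = x.mean(dim=2, keepdim=True)         # pool T  -> (N, C, 1, H, W)
x_for_temporal = x.mean(dim=(3, 4), keepdim=True)   # pool H, W -> (N, C, T, 1, 1)

print(x_for_spatial.shape, x_for_temporal.shape)
```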
  • Fusing the i-th spatial feature and the i-th temporal feature to obtain the i-th feature image may include: fusing the i-th spatial feature and the i-th temporal feature to obtain a fused image, and activating the fused image through a channel attention module to obtain the i-th feature image.
  • Alternatively, fusing the i-th spatial feature and the i-th temporal feature to obtain the i-th feature image may include: fusing the i-th spatial feature and the i-th temporal feature to obtain a fused image, and performing feature interaction processing on the fused image to obtain the i-th feature image.
  • Before the feature extraction layer performs feature extraction, the method may further include performing channel reduction processing on the video to be classified; after the feature extraction layer performs feature extraction, channel recovery processing may also be performed on the feature image.
  • A second aspect of the embodiments of the present application provides a video classification apparatus, and the apparatus includes: a to-be-classified video acquisition unit, configured to acquire the video to be classified; and
  • a classification unit, configured to input the video to be classified into a trained video classification model for processing and output the classification result of the video to be classified. The video classification model includes a feature extraction layer and a fully connected layer. The feature extraction layer is used to decompose the image of the video to be classified in the channel dimension, determine a decomposed image comprising multiple channel sub-dimensions, determine the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decompose each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and, according to the temporal and spatial convolution kernels, perform serial convolution processing on the decomposed image in the channel dimension to extract the feature image of the video to be classified. The fully connected layer is used to perform full-connection processing on the extracted feature image to obtain the classification result.
  • A third aspect of the embodiments of the present application provides a video classification device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the method according to any one of the first aspect.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to any one of the first aspect.
  • Compared with the prior art, the embodiments of the present application have the following beneficial effects: the present application decomposes the image of the video to be classified in the channel dimension, determines a decomposed image comprising multiple channel sub-dimensions, determines the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, and decomposes each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel. Convolving with the temporal and spatial convolution kernels of the sub-dimension convolution kernels of multiple channel sub-dimensions helps reduce the amount of convolution computation, and performing serial convolution processing on the decomposed image in the channel dimension helps enlarge the receptive field while increasing channel interaction, which helps improve the accuracy of video classification.
  • FIG. 1 is a schematic diagram of an implementation scenario of a video classification method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an implementation flowchart of a video classification method provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a feature extraction layer provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another feature extraction layer provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a video classification model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a video classification apparatus provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • Video classification is a fundamental task in video understanding because it can not only extract semantic information from videos, but also provide general video representations for other tasks such as video action detection, localization, etc.
  • 2D convolutional neural networks are the most direct and effective video classification methods, including two-stream methods and sparse temporal sampling strategies.
  • However, 2D convolutional neural networks cannot fully learn spatiotemporal interaction information and discriminate complex human behaviors poorly. Researchers therefore turned to 3D convolution to learn long-range temporal relationships.
  • However, 3D convolution-based methods are both parameter-intensive and computationally intensive, which makes them difficult to deploy in real-world environments, and more parameters have also been shown to be more prone to overfitting.
  • the implementation scenario of the video classification method may be as shown in FIG. 1 .
  • the implementation scenario of the video classification includes a video terminal and a server.
  • the video terminal may be an intelligent terminal, or may be a video collection device.
  • the smart terminal may be a smart phone, a tablet computer, a computer, and the like.
  • The video collection device may be a network camera or the like; video is uploaded through the network camera, classified and analyzed, and real-time early warnings are given for the monitored scene.
  • The intelligent terminal may acquire the video to be classified through a camera configured on the intelligent terminal, or may receive the video to be classified from other devices or servers through the network.
  • the video terminal sends the video to be classified to the server through the network, and the server can classify the video to be classified through the video classification method described in this application, and obtain the classification result of the video to be classified.
  • the server can push the video to different clients for playing or viewing according to the video classification result.
  • The video classification method described in this application can also be deployed on a video terminal, which processes the video to be classified and determines its classification result.
  • The video classification method described in this application has low computational cost and excellent recognition ability, which gives the video recognition system a wide range of commercial applications, including but not limited to: accurate short-video recommendation and e-commerce live streaming, since the method models both long and short videos well and can better match video content to user profiles; tasks with strict real-time requirements such as sports, where fast and accurate classification allows an athlete to receive feedback on an action almost immediately, for example through a wearable device during training; and the security field, where timeliness and accurate capture of abnormal actions are emphasized and the method fits these needs well.
  • FIG. 2 is a schematic flowchart of the implementation of a video classification method proposed in an embodiment of the present application, which is described in detail as follows:
  • the subject that obtains the video to be classified may be a video terminal or a server for video classification.
  • the video to be classified may be the video of the monitoring scene collected in real time by the video terminal, or may be the video shot or edited by the video terminal.
  • the video to be classified includes a plurality of video frames.
  • the plurality of video frames are sequenced according to the time sequence of the video frames.
  • the video to be classified includes multiple channel dimensions.
  • the channel dimensions include, but are not limited to, color channels.
  • The video to be classified is input into a trained video classification model for processing, and the classification result of the video to be classified is output.
  • The video classification model includes a feature extraction layer and a fully connected layer. The feature extraction layer is used to decompose the image of the video to be classified in the channel dimension, determine a decomposed image comprising multiple channel sub-dimensions, determine the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decompose each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and, according to the temporal and spatial convolution kernels, perform serial convolution processing on the decomposed image in the channel dimension to extract the feature image of the video to be classified.
  • The fully connected layer is configured to perform full-connection processing on the feature image extracted by the feature extraction layer to obtain the classification result.
  • Specifically, the video classification model can be trained in advance with sample videos whose results have been labelled (calibrated), producing the model's computed results.
  • The calibrated results are compared with the computed results, and if the difference between the two is smaller than a preset difference threshold, the video classification model is considered trained.
  • The difference between the two may be measured as the proportion of samples for which the model's computed result differs from the pre-calibrated result; training is complete when this proportion is smaller than a predetermined ratio.
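  • A compact sketch of this training procedure, with a toy model, toy labelled sample videos and an assumed stopping criterion (the mismatch rate against the calibrated labels falling below the preset threshold); none of the concrete values come from the original disclosure.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 10))  # toy classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
clips = torch.randn(16, 3, 8, 32, 32)                 # labelled sample videos (toy data)
labels = torch.randint(0, 10, (16,))
threshold = 0.2                                       # preset difference threshold

for epoch in range(100):
    optimizer.zero_grad()
    logits = model(clips)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    error_rate = (logits.argmax(dim=1) != labels).float().mean().item()
    if error_rate < threshold:                        # training considered complete
        break
```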
  • the feature extraction layer in the video classification model is used to extract feature images included in the video to be classified.
  • To avoid the large number of parameters brought by 3D (or higher-dimensional) convolution over the video to be classified, the feature extraction layer performs tensorization in the channel dimension: the channel dimension of the video to be classified is decomposed into multiple sub-dimensions whose product equals the channel dimension, C = C1×C2×…×CK.
  • K is the number of channel sub-dimensions, and C1, C2, …, CK are the values of the sub-dimensions. For example, a channel dimension of 4 can be decomposed as 4 = 2×2, giving two sub-dimensions with value 2.
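  • The factorization itself can be computed by a small helper such as the one below; the near-equal-factor strategy is an assumption, since the disclosure only requires that the sub-dimension values multiply back to the channel dimension C.

```python
def factorize_channels(c, k):
    """Split channel count c into k integer factors whose product is c."""
    factors = []
    remaining = c
    for i in range(k, 1, -1):
        target = round(remaining ** (1 / i)) or 1
        while remaining % target:        # walk down to the nearest divisor
            target -= 1
        factors.append(target)
        remaining //= target
    factors.append(remaining)
    return factors


assert factorize_channels(4, 2) == [2, 2]      # 4 = 2 x 2, as in the example above
assert factorize_channels(64, 3) == [4, 4, 4]  # 64 = 4 x 4 x 4
```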
  • Suppose the parameters of the image corresponding to the video to be classified are C×T×H×W, where C is the channel dimension, T is the time dimension, H is the height and W is the width of the image. Channel decomposition C = C1×C2×…×CK yields a decomposed image with parameters C1×C2×…×CK×T×H×W.
  • For any one of the K sub-dimensions, the convolution kernel corresponding to that sub-dimension can be determined accordingly.
  • The first sub-dimension convolution kernel corresponding to the first sub-dimension is C1×1×…×1×t×h×w, the second sub-dimension convolution kernel corresponding to the second sub-dimension is 1×C2×…×1×t×h×w, …, and the K-th sub-dimension convolution kernel corresponding to the K-th sub-dimension is 1×1×…×CK×t×h×w, where t is the time dimension of the convolution kernel, h is the height of the convolution kernel, and w is the width of the convolution kernel.
  • When decomposing the sub-dimension convolution kernels into temporal and spatial convolution kernels, the K sub-dimension convolution kernels can be decomposed separately to obtain the temporal and spatial convolution kernels of the different sub-dimensions.
  • For example, the first sub-dimension convolution kernel C1×1×…×1×t×h×w can be decomposed into the first temporal convolution kernel C1×1×…×1×t×1×1 and the first spatial convolution kernel C1×1×…×1×1×h×w.
  • The second sub-dimension convolution kernel 1×C2×…×1×t×h×w can be decomposed into the second temporal convolution kernel 1×C2×…×1×t×1×1 and the second spatial convolution kernel 1×C2×…×1×1×h×w.
  • The K-th sub-dimension convolution kernel can be decomposed into the K-th temporal convolution kernel 1×1×…×CK×t×1×1 and the K-th spatial convolution kernel 1×1×…×CK×1×h×w.
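  • A back-of-the-envelope comparison illustrates the parameter savings of the channel and spatiotemporal decompositions, using the small example C = 4 = 2×2 and a 3×3×3 kernel; the per-sub-dimension count assumes Ci-in/Ci-out temporal and spatial branches as in the sketch above.

```python
# Illustrative parameter count: dense C -> C 3D kernel vs. the decomposed kernels.
C, t, h, w = 4, 3, 3, 3
C1, C2 = 2, 2

full_3d = C * C * t * h * w                       # dense C->C kernel: 4*4*27 = 432 weights
per_subdim = lambda ci: ci * ci * (t + h * w)     # temporal (t) branch + spatial (h*w) branch
tensorized = per_subdim(C1) + per_subdim(C2)      # 48 + 48 = 96 weights

print(full_3d, tensorized, full_3d / tensorized)  # 432 96 4.5 -> about 4.5x fewer weights
```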
  • Performing serial convolution on the decomposed image in the channel dimension means that the i-th sub-dimension convolution kernel performs convolution to obtain the i-th feature image, the i-th feature image is used as the convolution input of the (i+1)-th sub-dimension convolution kernel, and the (i+1)-th sub-dimension convolution kernel performs convolution to obtain the (i+1)-th feature image, where i is greater than 1.
  • The first sub-dimension convolution kernel performs convolution on the decomposed image to obtain the first feature image.
  • The convolution processing for each sub-dimension convolution kernel includes convolution with the temporal convolution kernel of that sub-dimension and convolution with the spatial convolution kernel of that sub-dimension.
  • For example, when the j-th sub-dimension convolution kernel (1 ≤ j ≤ K) is applied, convolution with the j-th temporal convolution kernel obtained by decomposing the j-th sub-dimension convolution kernel yields the j-th temporal feature image, and convolution with the j-th spatial convolution kernel yields the j-th spatial feature image.
  • The j-th temporal feature image and the j-th spatial feature image are then fused, for example by summation, to obtain the j-th feature image, which serves as the convolution input of the (j+1)-th sub-dimension convolution kernel or, when j equals K, as the K-th feature image.
  • FIG. 3 is a schematic diagram of a feature image extraction layer provided by an embodiment of the present application.
  • As shown in FIG. 3, the parameters of the input image corresponding to the video to be classified are C×T×H×W with channel number C=4, time dimension T=3, height H=3 and width W=3. The channel dimension is decomposed into two sub-dimensions, C1=2 and C2=2, and the two sub-dimension convolution kernels are determined accordingly: the first sub-dimension convolution kernel C1×1×t×h×w and the second sub-dimension convolution kernel 1×C2×t×h×w.
  • the decomposed images are subjected to convolution operations in series, that is, the output of the previous convolution operation is used as the input of the subsequent convolution operation, and the output of the last convolution operation obtains the feature image.
  • For either of the first sub-dimension convolution kernel (the C1 tensor) and the second sub-dimension convolution kernel (the C2 tensor), performing the convolution operation includes a spatiotemporal decomposition of the sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel.
  • For example, the first sub-dimension convolution kernel C1×1×t×h×w can be decomposed into the first temporal convolution kernel C1×1×t×1×1 and the first spatial convolution kernel C1×1×1×h×w, and the second sub-dimension convolution kernel 1×C2×t×h×w can be decomposed into the second temporal convolution kernel 1×C2×t×1×1 and the second spatial convolution kernel 1×C2×1×h×w.
  • Convolution operations are then performed with the temporal convolution kernel and the spatial convolution kernel obtained by decomposing the sub-dimension convolution kernel, yielding a temporal feature image and a spatial feature image.
  • the feature image of the sub-dimension convolution kernel can be obtained by fusing the temporal feature image and the spatial feature image, such as summation.
  • the feature image output by the last sub-dimension convolution kernel is subjected to dimensionality reduction processing to obtain a feature image that is consistent with the dimensional features of the input image.
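  • The FIG. 3 example can be walked through with concrete shapes as follows; full t×h×w kernels stand in for the factorized ones, and the reshapes show how the output of the first sub-dimension convolution feeds the second.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 3, 3, 3)                     # C x T x H x W with C = 4, T = H = W = 3
x = x.reshape(1, 2, 2, 3, 3, 3)                    # channel axis viewed as C1 x C2 = 2 x 2

conv_sub1 = nn.Conv3d(2, 2, kernel_size=3, padding=1)   # stands in for the C1 kernel
conv_sub2 = nn.Conv3d(2, 2, kernel_size=3, padding=1)   # stands in for the C2 kernel

f1 = conv_sub1(x.transpose(1, 2).reshape(2, 2, 3, 3, 3))      # first feature image
f1 = f1.reshape(1, 2, 2, 3, 3, 3).transpose(1, 2)             # back to (N, C1, C2, T, H, W)
f2 = conv_sub2(f1.reshape(2, 2, 3, 3, 3))                     # second (final) feature image
print(f2.reshape(1, 4, 3, 3, 3).shape)             # back to the input's C x T x H x W layout
```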
  • FIG. 4 is a schematic diagram of implementation of yet another feature extraction layer provided by an embodiment of the present application.
  • As shown in FIG. 4, for the image input to the feature extraction module, channel dimension reduction can first be performed, for example by a 1×1 convolution that lowers the channel dimension of the input image.
  • Channel tensorization, i.e. channel decomposition, is then performed on the channel-reduced image to obtain channels of K sub-dimensions.
  • For each sub-dimension, the convolution kernel corresponding to that sub-dimension channel is determined, giving the temporal convolution kernel and the spatial convolution kernel of that sub-dimension channel: convolution with the temporal convolution kernel yields the temporal feature image, and convolution with the spatial convolution kernel yields the spatial feature image.
  • In a possible implementation, after the temporal feature image is obtained through the temporal convolution kernel, average pooling can be performed over space to obtain a temporal attention that retains the temporal information.
  • After the spatial feature image is obtained through the spatial convolution kernel, average pooling can be performed over time to obtain a spatial attention that retains the spatial information. The spatial attention and the temporal attention are then fused to obtain a fused image.
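  • One plausible sketch of this pooling-based co-attention, assuming broadcast addition as the (unspecified) fusion of the two attentions:

```python
import torch

temporal_feat = torch.randn(2, 64, 8, 56, 56)          # output of a temporal convolution
spatial_feat = torch.randn(2, 64, 8, 56, 56)           # output of a spatial convolution

temporal_att = temporal_feat.mean(dim=(3, 4), keepdim=True)   # keep T, pool H and W
spatial_att = spatial_feat.mean(dim=2, keepdim=True)          # keep H and W, pool T

fused = temporal_att + spatial_att                      # broadcasts to (2, 64, 8, 56, 56)
print(fused.shape)
```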
  • the fused image can also be activated through the channel attention module to further activate the channel features.
  • After the channel attention module activates the fused image, the activated image may further undergo interaction processing.
  • For example, the activated image can be processed with a 1×1×1 convolution kernel, which is conducive to sufficient interaction of the features.
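  • A sketch of these two follow-up steps, with a squeeze-and-excitation style gate standing in for the unspecified channel attention module, followed by a 1×1×1 convolution for cross-channel interaction:

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                    # squeeze to (N, C, 1, 1, 1)
            nn.Conv3d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1),
            nn.Sigmoid(),                               # per-channel activation in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)                         # re-weight (activate) the channels


fused = torch.randn(2, 64, 8, 56, 56)
activated = ChannelAttention(64)(fused)
interacted = nn.Conv3d(64, 64, kernel_size=1)(activated)   # 1 x 1 x 1 feature interaction
print(interacted.shape)
```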
  • For the K sub-dimensions, the convolution feature extraction of each sub-dimension can be performed in series.
  • The first sub-dimension convolution kernel performs feature extraction on the decomposed image obtained by channel decomposition to obtain the first feature image, the second sub-dimension convolution kernel performs feature extraction on the first feature image to obtain the second feature image, and so on, until the K-th feature image is obtained.
  • After the K-th feature image is obtained, channel restoration processing may be performed on it so that its channel size is consistent with the channel size of the input image; for example, the K-th feature image can be processed with a 1×1 convolution kernel to recover the channels.
  • After the channels are restored, the resulting feature image can be summed with the input image to obtain the output image.
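  • The surrounding bottleneck (channel reduction, core convolution, channel recovery, residual sum with the input) can be sketched as follows; the core convolution is a placeholder for the sub-dimension convolutions sketched earlier, and the channel widths are assumptions.

```python
import torch
import torch.nn as nn


class BottleneckWrapper(nn.Module):
    def __init__(self, channels: int = 64, reduced: int = 16):
        super().__init__()
        self.reduce = nn.Conv3d(channels, reduced, kernel_size=1)          # channel reduction
        self.core = nn.Conv3d(reduced, reduced, kernel_size=3, padding=1)  # placeholder core
        self.restore = nn.Conv3d(reduced, channels, kernel_size=1)         # channel recovery

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.restore(self.core(self.reduce(x)))
        return x + y                                    # residual sum with the input image


print(BottleneckWrapper()(torch.randn(1, 64, 8, 28, 28)).shape)
```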
  • As shown in the video classification model of FIG. 5, the feature extraction layer can be used to replace some modules in the backbone of a 2D convolutional neural network.
  • The convolutional neural network backbone can include multiple layers (four layers in FIG. 5); within each layer, original residual modules can be replaced at intervals by the feature extraction layer, with the parameters of the new layers randomly initialized. The positions and number of replacements can be chosen based on experimental statistics to achieve a better balance between accuracy and efficiency. The input video frames are modeled spatiotemporally layer by layer and then fed to the fully connected layer, which outputs the classification result of the video, for example the classification of the behaviors contained in the video.
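  • A minimal sketch of this replacement strategy: walk through a backbone stage and swap every other block for the new layer, leaving its parameters randomly initialized; the toy backbone and the interval of one are assumptions.

```python
import torch.nn as nn

stage = nn.ModuleList([nn.Identity() for _ in range(4)])          # toy "residual blocks"

for idx in range(len(stage)):
    if idx % 2 == 1:                                               # replace at intervals
        stage[idx] = nn.Conv3d(64, 64, kernel_size=3, padding=1)   # stand-in new layer

print(stage)
```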
  • Through the spatiotemporal decomposition and channel decomposition of the feature extraction layer, the embodiments of the present application greatly reduce the computation of convolution, and efficient modelling of spatiotemporal information can be achieved by simply replacing the original 3×3 spatial convolution.
  • The present application operates in three different dimensions, time, space and channel, enlarging the receptive field while realizing interaction across all channels.
  • In addition, the co-attention module corresponding to the spatiotemporal convolution and the channel attention module help activate different channel features, and inserting a 1×1×1 convolution in the sub-dimension convolution kernels is conducive to sufficient interaction of the features.
  • In terms of training cost, the SlowFast-based method requires 128 V100 graphics cards and 256 training epochs to train one model, so its training cost is extremely high.
  • Training the video classification method described in this application requires only 8 2080Ti graphics cards, each costing roughly one tenth of a V100, and the number of training epochs is reduced to 100; the training cost of this method is therefore a great advantage over the previous method.
  • First, this application can take advantage of the 2D model and obtain an initialization model from ImageNet pre-training, which greatly accelerates the convergence of the video classification model in this application.
  • Second, our sampling strategy can achieve higher accuracy with fewer frames, which is also inseparable from the model's strong ability to recognize long videos; for the same video, the video classification method described in this application can accurately determine its category using fewer frames.
  • In terms of accuracy, while keeping the computational cost low, this application achieves the relatively highest accuracy among current mainstream algorithms on two widely used datasets, Kinetics and Something-Something. This is remarkable because the two datasets represent two completely different types of video: Kinetics represents background- and object-related videos, while Something-Something represents temporally dependent videos. This shows that the model not only has excellent recognition ability on everyday videos (typified by YouTube), but can also, for professional videos such as sports games, capture the distinguishing characteristics of rapidly changing and similar actions and make accurate judgments.
  • Experimental comparisons show that, compared with existing video classification methods such as CSN and R(2+1)D, the video classification method described in this application has a larger spatiotemporal receptive field, can more accurately locate the correct region of interest and the spatiotemporal positions of behaviors and interacting objects, and therefore produces more accurate judgments.
  • FIG. 6 is a schematic diagram of a video classification apparatus provided by an embodiment of the present application. As shown in FIG. 6 , the apparatus includes:
  • To-be-classified video acquisition unit 601, configured to obtain the to-be-classified video
  • The classification unit 602 is configured to input the video to be classified into a trained video classification model for processing and output the classification result of the video to be classified. The video classification model includes a feature extraction layer and a fully connected layer. The feature extraction layer is used to decompose the image of the video to be classified in the channel dimension, determine a decomposed image comprising multiple channel sub-dimensions, determine the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decompose each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and, according to the temporal and spatial convolution kernels, perform serial convolution processing on the decomposed image in the channel dimension to extract the feature image of the video to be classified.
  • The fully connected layer is configured to perform full-connection processing on the feature image extracted by the feature extraction layer to obtain the classification result.
  • the video classification apparatus shown in FIG. 6 corresponds to the video classification method shown in FIG. 2 .
  • FIG. 7 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • the video classification device 7 of this embodiment includes: a processor 70 , a memory 71 , and a computer program 72 , such as a video classification program, stored in the memory 71 and executable on the processor 70 .
  • When the processor 70 executes the computer program 72, the steps in each of the foregoing video classification method embodiments are implemented.
  • Alternatively, when the processor 70 executes the computer program 72, the functions of the modules/units in the foregoing apparatus embodiments are implemented.
  • The computer program 72 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 71 and executed by the processor 70 to complete the present application.
  • the one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 72 in the video classification device 7 .
  • the video classification device 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, and a cloud server.
  • the video classification device may include, but is not limited to, a processor 70 and a memory 71 .
  • FIG. 7 is only an example of the video classification device 7 and does not constitute a limitation on the video classification device 7; it may include more or fewer components than shown, combine certain components, or use different components. For example, the video classification device may also include input and output devices, network access devices, buses, and the like.
  • the so-called processor 70 may be a central processing unit (Central Processing Unit, CPU), and may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 71 may be an internal storage unit of the video classification device 7 , such as a hard disk or a memory of the video classification device 7 .
  • the memory 71 can also be an external storage device of the video classification device 7, such as a plug-in hard disk equipped on the video classification device 7, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the memory 71 may also include both an internal storage unit of the video classification device 7 and an external storage device.
  • the memory 71 is used to store the computer program and other programs and data required by the video classification apparatus.
  • the memory 71 may also be used to temporarily store data that has been output or will be output.
  • the disclosed apparatus/terminal device and method may be implemented in other manners.
  • the apparatus/terminal device embodiments described above are only illustrative.
  • The division of the modules or units is only a logical functional division; in actual implementation there may be other ways of division, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated modules/units if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the present application can implement all or part of the processes in the methods of the above embodiments, and it can also be completed by instructing the relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium.
  • the computer program When executed by a processor, the steps of each of the above method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form, and the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, removable hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electric carrier signal, telecommunication signal and software distribution medium, etc.
  • The content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video classification method, apparatus and device. The method includes: acquiring a video to be classified; and inputting the video to be classified into a trained video classification model for processing. A feature extraction layer decomposes the image of the video to be classified in the channel dimension, determines a decomposed image comprising multiple channel sub-dimensions, determines the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decomposes each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and, according to the temporal and spatial convolution kernels, performs serial convolution processing on the decomposed image in the channel dimension to extract the feature image of the video to be classified for classification. Convolving with the temporal and spatial convolution kernels of the sub-dimension convolution kernels of multiple channel sub-dimensions helps reduce the amount of convolution computation, helps enlarge the receptive field and the interaction between channels, and helps improve the accuracy of video classification.

Description

视频分类方法、装置及设备 技术领域
本申请属于图像处理领域,尤其涉及视频分类方法、装置及设备。
背景技术
随着云计算和边缘计算的高速发展,人们越来越习惯于参加社交平台并且生活在摄像机之下。同时,安防和运输等行业采集了大量包含丰富信息的视频,这些视频包括了人们的日常行为、交通出行等。这些蕴含在视频中的信息亟需高效的视频理解,这是在云和边缘设备进行算法部署的关键一步。而视频分类是视频理解中的一项基础任务,因为它不仅可以从视频中提取语义信息,而且可以为其它任务,如视频行为检测、定位等提供通用的视频表征。
随着深度神经网络在图片分类中的巨大成功,视频分类已经从传统手提特征方法转移到了基于深度学习的方法。在视频分类任务上,深度学习尚未取得类似图像识别的巨大成功。2D卷积神经网络是最直接有效的视频分类方法,包括如双流方法,以及稀疏时间采样策略。但是,这些方法无法充分地学习时空交互信息,对复杂人类行为判别能力较差。如果使用3D卷积神经网络的方法,参数量和运算量都很大,难以在现实环境部署,并且更多的参数也被证明更容易过拟合。
技术问题
有鉴于此,本申请实施例提供了一种视频分类方法、装置及设备,以解决现有技术中使用2D卷积神经网络进行视频分类,无法充分学习时空交互信息,采用3D卷积神经网络,参数量和运算量大,难以实现的问题。
技术解决方案
本申请实施例的第一方面提供了一种视频分类方法,所述方法包括:
获取待分类视频;
将所述待分类视频输入到已训练的视频分类模型中处理,输出所述待分类视频的分类结果;其中,所述视频分类模型包括特征提取层和全连接层,所述特征提取层用于将待分类视频的图像在通道维度进行分解,确定包括多个通道维度的分解图像,以及确定多个通道子维度分别对应的子维度卷积核,将所述子维度卷积核分解为时间卷积核和空间卷积核,根据所述时间卷积核和所述空间卷积核,在通道维度上对所述分解图像进行串联卷积处理,提取所述待分类视频的特征图像,所述全连接层用于根据所述特征提取层所提取的特征图像进行全连接处理,得到所述分类结果。
结合第一方面,在第一方面的第一种可能实现方式中,将待分类视频的图像在通道维度进行分解,确定包括多个通道维度的分解图像,以及确定多个通道子维度分别对应的子维度卷积核,将所述子维度卷积核分解为时间卷积核和空间卷积核,根据所述时间卷积核和所述空间卷积核,在通道维度上对所述分解图像进行串联卷积处理,提取所述待分类视频的特征图像,包括:
根据所述待分类视频的视频帧的通道特征C进行通道分解,确定参数信息为C 1×C 2×…×C K×T×H×W的分解图像,其中,通道特征C=C 1×C 2×…×C K,T为分解图像的时间维度,H为分解图像的高度,W为分解图像的宽度,K为通道维度分解的子维度数量;
确定多个通道子维度分别对应的子维度卷积核包括:第一子维度卷积核C 1×1×…×1×t×h×w,第二子维度卷积核1×C 2×…×1×t×h×w,…,第K子维度卷积核1×1×…×C K×t×h×w,其中,t为卷积核时间维度,h为卷积核高度,w为卷积核宽度;
将所述第一子维度卷积核分解为第一时间卷积核和第一空间卷积核,通过所述第一时间卷积核和第一空间卷积核对所述分解图像进行卷积处理,得到第一特征图像,将所述i子维度卷积核分解为第i时间卷积核和第i空间卷积核,通过所述第i时间卷积核和第i空间卷积核对第i-1特征图像进行卷积处理,得到第i特征图像,其中i大于1,第K特征图像为所述待分类视频的特征图像。
结合第一方面的第一种可能实现方式,在第一方面的第二种可能实现方式中,将所述i子维度卷积核分解为第i时间卷积核和第i空间卷积核,通过所述第i时间卷积核和第i空间卷积核对第i-1特征图像进行卷积处理,得到第i特征图像,包括:
将第i子维度卷积核分解为第i空间卷积核和第i时间卷积核;
通过所述第i空间卷积核提取第i空间特征,根据所述第i时间卷积核提取第i时间特征;
将所述第i空间特征和所述第i时间特征融合,得到第i特征图像;
其中,第一空间卷积核和第一时间卷积核对所述分解图像进行卷积,得到第一特征图像,第i+1空间卷积核和第i+1时间卷积核对第i特征图像进行卷积,得到第i+1特征图像,第K特征图像为所述待分类视频的特征图像,K为通道维度分解的子维度数量。
结合第一方面的第二种可能实现方式,在第一方面的第三种可能实现方式中,在通过所述第i空间卷积核提取第i空间特征之前,所述方法还包括:
在时间维度对待提取特征的图像进行平均池化处理;
在根据所述第i时间卷积核提取第i时间特征之前,所述方法还包括:
在空间维度对待提取特征的图像进行平均池化处理。
结合第一方面的第二种可能实现方式,在第一方面的第四种可能实现方式中,将所述第i空间特征和所述第i时间特征融合,得到第i特征图像,包括:
将所述第i空间特征和所述第i时间特征融合,得到融合图像;
通过通道注意力模块对所述融合图像进行激活处理,得到第i特征图像。
结合第一方面的第二种可能实现方式,在第一方面的第四种可能实现方式中,将所述第i空间特征和所述第i时间特征融合,得到第i特征图像,包括:
将所述第i空间特征和所述第i时间特征融合,得到融合图像;
对所述融合图像进行特征交互处理,得到第i特征图像。
结合第一方面,在第一方面的第六种可能实现方式中,在所述特征提取层进行特征提取之前,还包括对所述待分类视频进行降低通道处理,在所述特征提取层进行特征提取之后,还包括对所述特征图像进行通道恢复处理。
本申请实施例的第二方面提供了一种视频分类方法,所述装置包括:
待分类视频获取单元,用于获取待分类视频;
分类单元,用于将所述待分类视频输入到已训练的视频分类模型中处理,输出所述待分类视频的分类结果;其中,所述视频分类模型包括特征提取层和全连接层,所述特征提取层用于将待分类视频的图像在通道维度进行分解,确定包括多个通道维度的分解图像,以及确定多个通道子维度分别对应的子维度卷积核,将所述子维度卷积核分解为时间卷积核和空间卷积核,根据所述时间卷积核和所述空间卷积核,在通道维度上对所述分解图像进行串联卷积处理,提取所述待分类视频的特征图像,所述全连接层用于根据所述特征提取层所提取的特征图像进行全连接处理,得到所述分类结果。
本申请实施例的第三方面提供了一种视频分类设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如第一方面任一项所述方法的步骤。
本申请实施例的第四方面提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如第一方面任一项所述方法的步骤。
有益效果
本申请实施例与现有技术相比存在的有益效果是:本申请通过对待分类视频的图像进行通道维度的分解,确定包括多个通道维度的分解图像,并确定多个通道子维度分别对应的子维度卷积核,通过子维度卷积核分解得到时间卷积核和空间卷积核,基于多个通道维度的子维度卷积核的时间卷积核和空间卷积核进行卷积,有利于降低卷积计算量,并且在通道维度上对分解图像进行串联的卷积处理,有利于增大感受野的同时,还能够增加通道的交互,有利于提高视频分类的精度。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种视频分类方法实施场景示意图;
图2是本申请实施例提供的一种视频分类方法的实现流程示意图;
图3是本申请实施例提供的一种特征提取层示意图;
图4是本申请实施例提供的又一特征提取层示意图;
图5为本申请实施例提供的一种视频分类模型示意图中;
图6是本申请实施例提供的一种视频分类装置的示意图;
图7是本申请实施例提供的视频分类设备的示意图。
本发明的最佳实施方式
以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、技术之类的具体细节,以便透彻理解本申请实施例。然而,本领域的技术人员应当清楚,在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中,省略对众所周知的系统、装置、电路以及方法的详细说明,以免不必要的细节妨碍本申请的描述。
为了说明本申请所述的技术方案,下面通过具体实施例来进行说明。
随着云计算和边缘计算的高速发展,万物互联的智能时代正在加速到来。同时,许多的工业如安防和运输业,采集了大量包含丰富信息的视频,这些视频包括了人们的日常行为、交通出行等。这些蕴含在视频中的信息亟需高效的视频理解,这是在云和边缘设备进行算法部署的关键一步。而视频分类是视频理解中的一项基础任务,因为它不仅可以从视频中提取语义信息,而且可以为其他任务如视频行为检测、定位等提供通用的视频表征。
随着深度神经网络在图片分类中的巨大成功,视频分类已经从原本的传统手提特征方法转移到了基于深度学习的方法上来,然而在视频分类任务上,深度学习尚未取得类似图像识别的巨大成功。2D卷积神经网络是最直接有效的视频分类方法,包括双流方法以及稀疏时间采样策略。
但是,2D卷积神经网络无法充分地学习时空交互信息,对复杂人类行为判别能力较差。因此,研究人员转而利用3D来学习长时间关系,然而,基于3D卷积的方法参数量和运算量都很大,难以在现实环境部署,更多的参数也被证明更容易过拟合。
基于上述问题,本申请实施例提出了一种视频分类方法,该视频分类方法的实现场景可以如图1所示,视频分类的实现场景包括视频终端和服务器。其中,视频终端可以为智能终端,也可以为视频采集设备。比如,智能终端可以为智能手机、平板电脑、计算机等。所述视频采集设备可以为网络摄像头等,通过网络摄像头上传视频,对视频进行分类解析,对监控场景进行实时预警。其中,智能终端可以通过智能终端所配置的摄像头采集得到待分类视频,或者也可以通过网络从其它设备或服务器中接收待分类视频。视频终端通过网络将待分类视频发送至服务器,服务器可以通过本申请所述的视频分类方法,对待分类视频进行分类处理,得到待分类视频的分类结果。服务器可以根据视频分类结果,将视频推送至不同的用户端进行播放或查看。
当然,不局限于此,还可以将本申请所述的视频分类方法布局于视频终端,通过视频终端对待分类视频分类处理,确定待分类视频的分类结果。
本申请所述的视频分类方法,具有低廉的计算成本和出色的识别能力,使得我们的视频识别系统的商业应用十分广泛,包括但不限于:
第一,精准推送短视频,电商直播。由于我们的视频分类方法对于长时和短时视频都具有优秀的建模能力,所以我们可以更加精准的识别出用户喜欢的视频内容精准的刻画用户画像达到精准投送广告的目的。
第二,对于一些实时性要求很高的任务,比方说体育比赛,我们的视频分类方法的计算速度更快也更准,因此可以在很短的时间内给予运动员反馈,例如运动员在训练时可以通过可穿戴设备及时获得他刚刚动作的反馈。
第三,我们的视频分类方法同样在安防领域有着很好的前景,因为安防强调时效性,以及异常动作捕捉的准确性,我们的视频分类方法非常契合这些需求。
图2为本申请实施例提出的一种视频分类方法的实现流程示意图,详述如下:
在S201中,获取待分类视频。
本申请实施例中获取待分类视频的主体,可以为视频终端,也可以为用于视频分类的服务器。
所述待分类视频,可以为视频终端实时采集的监控场景的视频,也可以为视频终端所拍摄或剪辑的视频。
所述待分类视频包括多个视频帧。多个视频帧按照视频帧的时间先后排序。所述待分类视频包括多个通道维度。所述通道维度包括但不限于颜色通道。
在S202中,将所述待分类视频输入到已训练的视频分类模型中处理,输出所述待分类视频的分类结果;其中,所述视频分类模型包括特征提取层和全连接层,所述特征提取层用于将待分类视频的图像在通道维度进行分解,确定包括多个通道维度的分解图像,以及确定多个通道子维度分别对应的子维度卷积核,将所述子维度卷积核分解为时间卷积核和空间卷积核,根据所述时间卷积核和所述空间卷积核,在通道维度上对所述分解图像进行串联卷积处理,提取所述待分类视频的特征图像,所述全连接层用于根据所述特征提取层所提取的特征图像进行全连接处理,得到所述分类结果。
具体的,所述视频分类模型,可以预先通过已标定结果的样本视频进行训练,得到视频分类模型的计算结果。将标定结果与计算结果进行比较,如果两者的差异小于预先设定的差异阈值,则认为所述视频分类模型已训练完成。其中,两者的差异可以为视频分类模型的计算结果与预先的标定结果出现不同的几率小于预定的比例。
所述视频分类模型中的特征提取层,用于提取待分类视频中包括的特征图像。为了避免对待分类视频进行3D或者3D以上维度的卷积计算所带来的大量参数的缺陷,所述特征提取层在通道维度进行张量化处理,将待分类视频在通道维度进行分解,得到通道维度所对应的多个子维度。比如,可以将通道维度C分解为多个子维度的乘积,表示为:C=C 1×C 2×…×C K。其中,K为通道子维度的数量,C 1、C 2…C K分别表示子维度的数值。比如,假设通道维度的数值为4,可以分解为:4=2×2,即可以得到两个子维度,且子维度的数值为2。
假设待分类视频对应的图像的参数为C×T×H×W,其中,C为待分类视频对应的图像的通道维度,T为待分类视频对应的图像的时间维度,H为待分类视频对应的图像的高度,W为待分类视频对应的图像的宽度。通过通道分解C=C 1×C 2×…×C K,得到参数信息为:C 1×C 2×…×C K×T×H×W的分解图像。
对于K个子维度中的任意一个子维度,可以相应的确定该子维度所对应的卷积核。比如,第一子维度对应的第一子维度卷积核为C 1×1×…×1×t×h×w,第二子维度对应的第二子维度卷积核1×C 2×…×1×t×h×w,…,第K子维度对应的第K子维度卷积核1×1×…×C K×t×h×w,其中,t为卷积核的时间维度,h为卷积核的高度,w为卷积核的宽度。
将子维度卷积核分解为时间卷积核和空间卷积核时,可以对K个子维度卷积核分别进行分解,得到不同子维度的时间卷积核和空间卷积核。比如,第一子维度卷积核C 1×1×…×1×t×h×w,可以分解为第一时间卷积核C 1×1×…×1×t×1×1和第一空间卷积核C 1×1×…×1×1×h×w。第二子维度对应的第二子维度卷积核1×C 2×…×1×t×h×w,可以分解为第二时间卷积核1×C 2×…×1×t×1×1和第二空间卷积核1×C 2×…×1×1×h×w。第K子维度卷积核可以分类为第K时间卷积核1×1×…×C K×t×1×1和第K空间卷积核1×1×…×C K×1×h×w。
其中,在通道维度上对所确定的分解图像进行串联卷积,可以指第i子维度卷积核进行卷积处理得到第i特征图像,将第i特征图像作为第i+1子维度卷积核的卷积对象,第i+1子维度卷积核进行卷积处理得到第i+1特征图像,i大于1。且第一子维度卷积核对分解图像进行卷积处理,得到第一特征图像。
对于每个子维度卷积核进行卷积处理时,分别包括基于该子维度的时间卷积核的卷积处理,以及基于该子维度的空间卷积核的卷积处理。
比如,对于第j(1大于或等于j小于或等于K)子维度卷积核进行卷积处理时,通过第j子维度卷积核中分解得到的第j时间卷积核进行卷积处理,得到第j时间特征图像,通过第j空间卷积核进行卷积处理,得到第j空间特征图像。将第j时间特征图像,以及第j空间特征图像进行融合处理,比如可以通过求和的方式融合处理,得到第j特征图像。可以将第j特征图像作为第j+1子维度卷积核的卷积对象,或者输出为第K特征图像(当j+1=K)。
图3为本申请实施例提供的一种特征图像提取层的示意图,如图3所示,待分类视频对应输入的图像的参数信息为C×T×H×W,且通道数C=4,时间维度T=3,高度H=3,宽度=3。经过通道分解,将通道维度分解为两个子维度,分别为子维度C 1=2和子维度C 2=2。根据通道维度的分解,确定两个子维度卷积核分别为:第一子维度卷积核C 1×1×t×h×w和第二子维度卷积核:1×C 2×t×h×w。通过两个子维度卷积核对分解图像进行串联的卷积操作,即前一卷积操作的输出,作为后一卷积操作的输入,最后一个卷积操作的输出得到特征图像。
对于第一子维度卷积核(C 1张量)和第二子维度卷积核(C 2张量)中的任意子维度卷积核,在执行卷积操作时,包括将子维度卷积核进行时空分解的操作,将子维度卷积核分解为时间卷积核和空间卷积核。
比如,第一子维度卷积核C 1×1×t×h×w可以分解为第一时间卷积核C 1×1×t×1×1和第一空间卷积核C 1×1×1×h×w,第二子维度卷积核:1×C 2×t×h×w可以分解为第二时间卷积核1×C 2×t×1×1和第二空间卷积核1×C 2×1×h×w。
对子维度卷积核分解得到的时间卷积核和空间卷积核,分别进行卷积操作,得到时间特征图像和空间特征图像。将时间特征图像和空间特征图像融合处理,比如求和处理,可以得到该子维度卷积核的特征图像。将最后一个子维度卷积核输出的特征图像进行降维处理,可得到与输入的图像的维度特征一致的特征图像。
图4为本申请实施例提供的一种又一特征提取层的实现示意图。如图4所示,对于输入特征提取模块的图像,可以对其进行通道的降维处理,比如,可以通过1×1卷积的方式,降低输入的图像的通道的维度。对降低了通道维度的图像,进行通道张量化处理,即进行通道分解,得到K个子维度的通道。对于每个子维度,分别确定子维度通道对应的卷积核,得到子维度通道的时间卷积核和空间卷积核,通过时间卷积核的卷积,可以得到时间特征图像,通过空间卷积核的卷积,可以得到空间特征图像。
在可能的实现方式中,如图4所示,通过时间卷积核得到时间特征图像后,可在空间上进行平均池化,得到保留有时间信息的时间注意力。通过空间卷积核得到空间特征图像后,可在时间上进行平均池化,得到保留有空间信息的空间注意力。然后将空间注意力与时间注意力融合,得到融合图像。
在可能的实现方式中,还可以通过通道注意力模块对融合图像进行激活处理,进一步激活通道的特征。
在所述通道注意力模块对融合图像进行激活处理后,还可以进一步对激活处理后的图像进行交互处理。比如,可以通过1×1×1的卷积核对激活处理后的图像进行交互处理,从而有利于特征的充分交互。
对于K个子维度,可以通过串联的方式,执行每个子维度的卷积特征提取操作。第一子维度卷积核对进行了通道分解后的分解图像进行特征提取处理,得到第一特征图像,第二子维度卷积核对第一特征图像进行特征提取处理,得到第二特征图像,依此重复,直到获取第K特征图像。
在获取到第K特征图像后,可以对获取的特征图像进行降通道处理,使得降通道处理后的图像,与输入的图像的通道尺寸一致。比如,可以通过1×1卷积核对第K特征图像进行通道恢复处理。
在恢复通道后,可以将恢复通道后的特征图像,与输入的图像求和,得到输出图像。
在本申请实施例中,如图5所示的视频分类模型示意图中,所述特征提取层可以用于替换2D卷积神经网络骨架中的部分模块。比如,卷积神经网络骨架中可以包括多层(图5中包括四层),可以将每一层间隔替换原本的残差模块为特征提取层,并对其中的参数进行随机初始化。替换的位置和数量可以根据实验统计得出,从而在准确率和效率上取得更好的平衡。输入的视频帧经逐层的时空建模后,输入到全连接层,通过全连接层输出视频的分类结果,包括如视频中包括的行为的分类结果等。
本申请实施例通过特征提取层的时空分解和通道分解,极大的减少了卷积的计算量,通过简单替换原来的3×3空间卷积即可实现对时空信息的高效建模。并且本申请在时间、空间和通道三个不同的维度进行操作,增大感受野的同时,实现了所有通道的交互。另外,本申请实施例还可通过与时空卷积操作对应的协同注意力模块,以及通道注意力模块,有利于激活不同的通道特征。并且在子维度卷积核中插入1×1×1卷积,有利于特征的充分交互。
在卷积计算量方面,本申请所述视频分类方法,与基于SlowFast方法相比,训练一个模型所需要的显卡为128张V-100,所需训练的轮次为256轮,可以看出,该方法的训练成本极高。而本申请所述的视频分类方法训练仅仅需要8张2080Ti显卡,单张价格为V100的十分之一左右,训练轮次减少为100轮次,所以,本申请所述的视频分类方法的训练成本,相比较之前方法有极大的优势。
本申请的训练成本上的优势,主要在于本申请对于视频任务特别的模型设计:
第一,本申请可以利用2D模型的优势,从imageNetPre-train得到初始化模型,这很大程度的加速了本申请中的视频分类模型的收敛。
第二,我们在2D模型的基础上,将专门为视频任务设计的特征提取层嵌入,该特征提取层本身计算成本远远低于传统的3D卷积,甚至低于普通的2D卷积,这使得我们的模型在训练速度上比2D模型更快。在实际部署上,本申请所述的视频分类系统所需要的时间成本和计算成本非常低,这主要是来源于:1、视频分类模型本身的高效性,在相同的条件下,我们模型的理论速度比TSN,TSM这些之前非常高效的模型还要快20%以上,实际运行速度我们的模型会更快。第二,我们特有的采样策略可以利用更少的帧数来达到更高的准确率,这同样与我们模型的对于长时间视频的识别能力强分不开,所以本申请所述的视频分类方法对于同样一个视频利用更少的帧就可以准确的判断是什么类别。
在准确率方面,本申请在兼顾计算成本低廉的同时,在两个广泛应用的数据集kinetics和something-something上,我们都取得的目前主流算法中相对最高的准确率。这是非常难得的,因为这两个数据集代表两种完全不同的视频类型,kinetcis为代表的是背景和物体相关的视频,而something则是时序相关的视频。这说明了我们的模型不仅拥有着出色的日常视频(以YouTube为首)的识别能力,对于一些专业的视频(体育比赛)我们的模型也能捕捉快速变化而相似的动作之间不同的特点,进而做出准确的判断。
经过实验对比,将本申请所述的视频分类方法,与现有的视频分类方法,包括如CSN、R(2+1)D等方法相比,本申请所述视频分类方法,具有更大的时空感受野,可以更为准确的定位正确的关注区域,更准确地定位行为和交互的物体的时空位置,得到更为准确的判断。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
图6为本申请实施例提供的一种视频分类装置的示意图,如图6所示,该装置包括:
待分类视频获取单元601,用于获取待分类视频;
分类单元602,用于将所述待分类视频输入到已训练的视频分类模型中处理,输出所述待分类视频的分类结果;其中,所述视频分类模型包括特征提取层和全连接层,所述特征提取层用于将待分类视频的图像在通道维度进行分解,确定包括多个通道维度的分解图像,以及确定多个通道子维度分别对应的子维度卷积核,将所述子维度卷积核分解为时间卷积核和空间卷积核,根据所述时间卷积核和所述空间卷积核,在通道维度上对所述分解图像进行串联卷积处理,提取所述待分类视频的特征图像,所述全连接层用于根据所述特征提取层所提取的特征图像进行全连接处理,得到所述分类结果。
图6所示的视频分类装置,与图2所示的视频分类方法对应。
图7是本申请一实施例提供的视频分类设备的示意图。如图7所示,该实施例的视频分类设备7包括:处理器70、存储器71以及存储在所述存储器71中并可在所述处理器70上运行的计算机程序72,例如视频分类程序。所述处理器70执行所述计算机程序72时实现上述各个视频分类方法实施例中的步骤。或者,所述处理器70执行所述计算机程序72时实现上述各装置实施例中各模块/单元的功能。
示例性的,所述计算机程序72可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器71中,并由所述处理器70执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述所述计算机程序72在所述视频分类设备7中的执行过程。
所述视频分类设备7可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述视频分类设备可包括,但不仅限于,处理器70、存储器71。本领域技术人员可以理解,图7仅仅是视频分类设备7的示例,并不构成对视频分类设备7的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如所述视频分类设备还可以包括输入输出设备、网络接入设备、总线等。
所称处理器70可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器 (Digital Signal Processor,DSP)、专用集成电路 (Application Specific Integrated Circuit,ASIC)、现场可编程门阵列 (Field-Programmable Gate Array,FPGA) 或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
所述存储器71可以是所述视频分类设备7的内部存储单元,例如视频分类设备7的硬盘或内存。所述存储器71也可以是所述视频分类设备7的外部存储设备,例如所述视频分类设备7上配备的插接式硬盘,智能存储卡(Smart Media Card, SMC),安全数字(Secure Digital, SD)卡,闪存卡(Flash Card)等。进一步地,所述存储器71还可以既包括所述视频分类设备7的内部存储单元也包括外部存储设备。所述存储器71用于存储所述计算机程序以及所述视频分类设备所需的其他程序和数据。所述存储器71还可以用于暂时地存储已经输出或者将要输出的数据。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的实施例中,应该理解到,所揭露的装置/终端设备和方法,可以通过其它的方式实现。例如,以上所描述的装置/终端设备实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,也可以通过计算机程序指令相关的硬件来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,所述计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,计算机可读介质不包括是电载波信号和电信信号。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (10)

  1. A video classification method, characterized in that the method comprises:
    acquiring a video to be classified;
    inputting the video to be classified into a trained video classification model for processing, and outputting the classification result of the video to be classified; wherein the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is used to decompose the image of the video to be classified in the channel dimension, determine a decomposed image comprising multiple channel sub-dimensions, determine the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decompose each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and, according to the temporal convolution kernel and the spatial convolution kernel, perform serial convolution processing on the decomposed image in the channel dimension to extract the feature image of the video to be classified; and the fully connected layer is used to perform full-connection processing on the feature image extracted by the feature extraction layer to obtain the classification result.
  2. The method according to claim 1, characterized in that decomposing the image of the video to be classified in the channel dimension, determining the decomposed image comprising multiple channel sub-dimensions, determining the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decomposing each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and performing serial convolution processing on the decomposed image in the channel dimension according to the temporal and spatial convolution kernels to extract the feature image of the video to be classified comprises:
    performing channel decomposition according to the channel feature C of the video frames of the video to be classified, and determining a decomposed image with parameters C1×C2×…×CK×T×H×W, where the channel feature C = C1×C2×…×CK, T is the time dimension of the decomposed image, H is the height of the decomposed image, W is the width of the decomposed image, and K is the number of sub-dimensions into which the channel dimension is decomposed;
    determining the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions: the first sub-dimension convolution kernel C1×1×…×1×t×h×w, the second sub-dimension convolution kernel 1×C2×…×1×t×h×w, …, the K-th sub-dimension convolution kernel 1×1×…×CK×t×h×w, where t is the time dimension of the convolution kernel, h is the height of the convolution kernel, and w is the width of the convolution kernel;
    decomposing the first sub-dimension convolution kernel into a first temporal convolution kernel and a first spatial convolution kernel, and performing convolution on the decomposed image with the first temporal convolution kernel and the first spatial convolution kernel to obtain a first feature image; decomposing the i-th sub-dimension convolution kernel into an i-th temporal convolution kernel and an i-th spatial convolution kernel, and performing convolution on the (i-1)-th feature image with the i-th temporal convolution kernel and the i-th spatial convolution kernel to obtain the i-th feature image, where i is greater than 1 and the K-th feature image is the feature image of the video to be classified.
  3. The method according to claim 2, characterized in that decomposing the i-th sub-dimension convolution kernel into the i-th temporal convolution kernel and the i-th spatial convolution kernel, and performing convolution on the (i-1)-th feature image with the i-th temporal convolution kernel and the i-th spatial convolution kernel to obtain the i-th feature image, comprises:
    decomposing the i-th sub-dimension convolution kernel into the i-th spatial convolution kernel and the i-th temporal convolution kernel;
    extracting the i-th spatial feature through the i-th spatial convolution kernel, and extracting the i-th temporal feature according to the i-th temporal convolution kernel;
    fusing the i-th spatial feature and the i-th temporal feature to obtain the i-th feature image, where the K-th feature image is the feature image of the video to be classified and K is the number of sub-dimensions into which the channel dimension is decomposed.
  4. The method according to claim 3, characterized in that before extracting the i-th spatial feature through the i-th spatial convolution kernel, the method further comprises:
    performing average pooling in the time dimension on the image whose features are to be extracted;
    and before extracting the i-th temporal feature according to the i-th temporal convolution kernel, the method further comprises:
    performing average pooling in the spatial dimension on the image whose features are to be extracted.
  5. The method according to claim 3, characterized in that fusing the i-th spatial feature and the i-th temporal feature to obtain the i-th feature image comprises:
    fusing the i-th spatial feature and the i-th temporal feature to obtain a fused image;
    and activating the fused image through a channel attention module to obtain the i-th feature image.
  6. The method according to claim 3, characterized in that fusing the i-th spatial feature and the i-th temporal feature to obtain the i-th feature image comprises:
    fusing the i-th spatial feature and the i-th temporal feature to obtain a fused image;
    and performing feature interaction processing on the fused image to obtain the i-th feature image.
  7. The method according to claim 1, characterized in that before the feature extraction layer performs feature extraction, the method further comprises performing channel reduction processing on the video to be classified, and after the feature extraction layer performs feature extraction, the method further comprises performing channel recovery processing on the feature image.
  8. A video classification apparatus, characterized in that the apparatus comprises:
    a to-be-classified video acquisition unit, configured to acquire a video to be classified;
    a classification unit, configured to input the video to be classified into a trained video classification model for processing and output the classification result of the video to be classified; wherein the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is used to decompose the image of the video to be classified in the channel dimension, determine a decomposed image comprising multiple channel sub-dimensions, determine the sub-dimension convolution kernels corresponding to the multiple channel sub-dimensions respectively, decompose each sub-dimension convolution kernel into a temporal convolution kernel and a spatial convolution kernel, and, according to the temporal convolution kernel and the spatial convolution kernel, perform serial convolution processing on the decomposed image in the channel dimension to extract the feature image of the video to be classified; and the fully connected layer is used to perform full-connection processing on the feature image extracted by the feature extraction layer to obtain the classification result.
  9. A video classification device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2021/137912 2021-03-05 2021-12-14 视频分类方法、装置及设备 WO2022183805A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110245391.3 2021-03-05
CN202110245391.3A CN112926472A (zh) 2021-03-05 2021-03-05 视频分类方法、装置及设备

Publications (1)

Publication Number Publication Date
WO2022183805A1 true WO2022183805A1 (zh) 2022-09-09

Family

ID=76173469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137912 WO2022183805A1 (zh) 2021-03-05 2021-12-14 视频分类方法、装置及设备

Country Status (2)

Country Link
CN (1) CN112926472A (zh)
WO (1) WO2022183805A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926472A (zh) * 2021-03-05 2021-06-08 深圳先进技术研究院 视频分类方法、装置及设备
CN113869182B (zh) * 2021-09-24 2024-05-31 北京理工大学 一种视频异常检测网络及其训练方法
CN116168334A (zh) * 2023-04-26 2023-05-26 深圳金三立视频科技股份有限公司 一种视频行为分类的方法及终端
CN117434452B (zh) * 2023-12-08 2024-03-05 珠海市嘉德电能科技有限公司 锂电池充放电检测方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN110163080A (zh) * 2019-04-02 2019-08-23 腾讯科技(深圳)有限公司 人脸关键点检测方法及装置、存储介质和电子设备
CN110929622A (zh) * 2019-11-15 2020-03-27 腾讯科技(深圳)有限公司 视频分类方法、模型训练方法、装置、设备及存储介质
CN111859023A (zh) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 视频分类方法、装置、设备及计算机可读存储介质
CN112926472A (zh) * 2021-03-05 2021-06-08 深圳先进技术研究院 视频分类方法、装置及设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN110163080A (zh) * 2019-04-02 2019-08-23 腾讯科技(深圳)有限公司 人脸关键点检测方法及装置、存储介质和电子设备
CN110929622A (zh) * 2019-11-15 2020-03-27 腾讯科技(深圳)有限公司 视频分类方法、模型训练方法、装置、设备及存储介质
CN111859023A (zh) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 视频分类方法、装置、设备及计算机可读存储介质
CN112926472A (zh) * 2021-03-05 2021-06-08 深圳先进技术研究院 视频分类方法、装置及设备

Also Published As

Publication number Publication date
CN112926472A (zh) 2021-06-08

Similar Documents

Publication Publication Date Title
Sun et al. Lattice long short-term memory for human action recognition
WO2022183805A1 (zh) 视频分类方法、装置及设备
Liu et al. Learning spatio-temporal representations for action recognition: A genetic programming approach
Dong et al. Vehicle type classification using a semisupervised convolutional neural network
CN105512289B (zh) 基于深度学习和哈希的图像检索方法
CN111310676A (zh) 基于CNN-LSTM和attention的视频动作识别方法
Sahoo et al. On an algorithm for human action recognition
Huang et al. Identification of the source camera of images based on convolutional neural network
Funk et al. Beyond planar symmetry: Modeling human perception of reflection and rotation symmetries in the wild
CN111488805B (zh) 一种基于显著性特征提取的视频行为识别方法
CN109919011A (zh) 一种基于多时长信息的动作视频识别方法
CN107330392A (zh) 视频场景标注装置与方法
CN111539290A (zh) 视频动作识别方法、装置、电子设备及存储介质
CN110232318A (zh) 穴位识别方法、装置、电子设备及存储介质
Chandran et al. Missing child identification system using deep learning and multiclass SVM
CN110222572A (zh) 跟踪方法、装置、电子设备及存储介质
CN111553419A (zh) 一种图像识别方法、装置、设备以及可读存储介质
Li et al. Multi-scale sparse network with cross-attention mechanism for image-based butterflies fine-grained classification
CN112183240A (zh) 一种基于3d时间流和并行空间流的双流卷积行为识别方法
CN111898703A (zh) 多标签视频分类方法、模型训练方法、装置及介质
CN113269224A (zh) 一种场景图像分类方法、系统及存储介质
CN114282059A (zh) 视频检索的方法、装置、设备及存储介质
Wang et al. Basketball shooting angle calculation and analysis by deeply-learned vision model
Li et al. Fast recognition of pig faces based on improved Yolov3
Hu et al. Lightweight multi-scale network with attention for facial expression recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21928876

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21928876

Country of ref document: EP

Kind code of ref document: A1