WO2021248859A1 - Video classification method and apparatus, device, and computer-readable storage medium - Google Patents

Video classification method and apparatus, device, and computer-readable storage medium

Info

Publication number
WO2021248859A1
WO2021248859A1 PCT/CN2020/134995 CN2020134995W
Authority
WO
WIPO (PCT)
Prior art keywords
video
pooling
information
feature extraction
feature information
Prior art date
Application number
PCT/CN2020/134995
Other languages
English (en)
Chinese (zh)
Other versions
WO2021248859A9 (fr)
Inventor
乔宇
王亚立
李先航
周志鹏
邹静
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2021248859A1 publication Critical patent/WO2021248859A1/fr
Publication of WO2021248859A9 publication Critical patent/WO2021248859A9/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application belongs to the field of image processing, and in particular relates to a video classification method, apparatus, device, and computer-readable storage medium.
  • The embodiments of the present application provide a video classification method, apparatus, device, and computer-readable storage medium, to solve the problem that conventional video classification using a three-dimensional convolution kernel adds extra parameters compared with two-dimensional convolution, which increases the amount of computation.
  • A first aspect of the embodiments of the present application provides a video classification method, the method including: obtaining a video to be classified, where the video to be classified includes a plurality of video frames; and inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified.
  • The video classification model includes a feature extraction layer and a fully connected layer. The feature extraction layer is used to extract the spatial feature information of the multiple video frames through two-dimensional convolution, extract the temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information. The fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • The feature extraction layer includes N feature extraction sublayers, N ≥ 1. The input information of the first of the N feature extraction sublayers is the multiple video frames; the output information of a previous feature extraction sublayer is the input information of the next feature extraction sublayer; and the output information of the N-th feature extraction sublayer is the fused feature information output by the feature extraction layer.
  • Each of the N feature extraction sublayers includes a large receptive field context feature extraction branch and a small receptive field core feature extraction branch.
  • In each of the N feature extraction sublayers, the temporal feature information extracted by the large receptive field context feature extraction branch and the spatial feature information extracted by the small receptive field core feature extraction branch are fused to obtain the output information.
  • Pooling the input information through the large receptive field context feature extraction branch to extract the temporal feature information of the input information includes: performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain pooled information.
  • Specifically, the input information is pooled by a three-dimensional pooling kernel {t, K, K} in the large receptive field context feature extraction branch to obtain the pooled information, where t is the size of the pooling kernel in the time direction, t is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel determines the pooled pixels selected in a single pooling calculation (as illustrated in the sketch below).
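  • A minimal illustration of such a three-dimensional pooling kernel {t, K, K}, written as a PyTorch-style sketch (the (N, C, T, H, W) tensor layout and the concrete sizes are illustrative assumptions, not values taken from the application):

```python
import torch
import torch.nn as nn

# A batch of videos: (batch, channels, T frames, height, width); sizes are illustrative.
video = torch.randn(1, 3, 8, 56, 56)

# Three-dimensional max pooling with kernel {t, K, K} = {3, 3, 3}.
# With stride 1, padding of 1 per side keeps the output the same size as the input.
pool3d = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))

pooled = pool3d(video)
print(pooled.shape)  # torch.Size([1, 3, 8, 56, 56]) -- temporal and spatial sizes preserved
```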
  • The sizes of the N three-dimensional pooling kernels may be completely the same, completely different, or the same for only some of the N three-dimensional pooling kernels; the three-dimensional pooling kernel determines the pooled pixels selected in a single pooling calculation.
  • Making the sizes of the N three-dimensional pooling kernels completely different includes: gradually increasing the size of the three-dimensional pooling kernel from one sublayer to the next.
  • Gradually increasing the size of the three-dimensional pooling kernel includes increasing its size in the time dimension, in the two-dimensional space where the video frame is located, or in both.
  • The convolution parameters of the two-dimensional convolution processing in the large receptive field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution processing in the small receptive field core feature extraction branch.
  • The size of the image corresponding to the input information of the pooling process is consistent with that of the output information of the pooling process.
  • To this end, the input feature image or video frame is padded in the time dimension and/or the space dimension, so that after the pooling kernel pools the padded input information, the size of the image corresponding to the obtained output information is consistent with the size of the image corresponding to the input information.
  • fusing the spatial feature information and the temporal feature information to output fused feature information includes:
  • the image of the spatial feature information is superimposed on the image of the temporal feature information to generate the fusion feature information.
  • A second aspect of the embodiments of the present application provides a video classification device, the device including:
  • a to-be-classified video obtaining unit, configured to obtain a video to be classified, where the video to be classified includes a plurality of video frames; and
  • a classification unit, configured to input the video to be classified into a trained video classification model for processing and output a classification result of the video to be classified; wherein the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract the spatial feature information of the multiple video frames through two-dimensional convolution, extract the temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • A third aspect of the embodiments of the present application provides a video classification device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the video classification device implements the video classification method according to any one of the first aspect.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the video classification method according to any one of the first aspect is implemented.
  • In this application, the classification model extracts the spatial feature information of the multiple video frames in the video to be classified through two-dimensional convolution, extracts the temporal feature information of the multiple video frames through pooling, fuses the temporal feature information with the spatial feature information, and obtains the classification result through the fully connected layer. Because the temporal feature information of the video to be classified is obtained through pooling, the two-dimensional convolution adopted by this application greatly reduces the calculation of convolution parameters compared with calculation using a three-dimensional convolution kernel, while still retaining the temporal feature information, which helps reduce the amount of computation for video classification.
  • In addition, any two-dimensional convolutional network can be plugged in to classify videos, which helps improve the diversity and versatility of video classification methods.
  • FIG. 1 is a schematic diagram of a video classification application scenario provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of video classification using three-dimensional convolution in the prior art;
  • FIG. 3 is a schematic diagram of the implementation process of a video classification method provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of an implementation of a video classification method provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of an implementation of video classification provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of another implementation of video classification provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a video classification device provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • Video classification technology is used to classify collected surveillance video to determine whether there is an abnormality in the video content.
  • The video classification method described in this application is not sensitive to the speed at which actions change between frames, and can effectively model actions of different durations.
  • Surveillance video can therefore be classified through this modeling, which helps the user quickly find key surveillance information, or sends abnormality reminders to monitoring personnel in time so that they can deal with abnormalities in the surveillance video promptly.
  • Video classification technology can also be used to sort a large number of videos into different scenes, different moods, and different types, so that users can quickly find the videos they need.
  • In smart sports training or video-assisted refereeing, the videos include faster-moving sports, such as shooting, gymnastics, or speed skating, and slower-moving sports, such as yoga.
  • The video classification method described in this application is not sensitive to the speed and duration of the motion, and can classify the motion in such sports videos.
  • In another scenario, a platform server receives a self-shot video uploaded by terminal A and classifies the uploaded video to obtain the category of the video uploaded by terminal A.
  • As the number of uploaded videos increases, the number of videos in the same category also increases.
  • When another terminal, such as terminal B, browses a video, the category of the video browsed by terminal B is obtained from the pre-computed classification result.
  • The platform can then search for other videos in the same category as the video browsed by terminal B and recommend them to terminal B, improving the user experience of browsing videos.
  • In the conventional approach, a three-dimensional convolution kernel containing time information is selected, such as a 3*1*1 temporal convolution kernel, and the video to be classified is convolved with it.
  • The three-dimensional convolution kernel spans the width W and height H of the image as well as the duration T.
  • The three-dimensional convolution kernel thus adds parameter calculations for the time dimension, introduces a large number of extra parameters, and increases the amount of computation for video classification, as the parameter-count sketch below illustrates.
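  • To make that overhead concrete, the following sketch (an assumed, PyTorch-style illustration; the channel counts are arbitrary) compares the number of weights in a plain two-dimensional convolution with a three-dimensional convolution whose kernel is extended by 3 in the time dimension:

```python
import torch.nn as nn

c_in, c_out, k = 64, 64, 3

conv2d = nn.Conv2d(c_in, c_out, kernel_size=k)          # spatial only: K x K
conv3d = nn.Conv3d(c_in, c_out, kernel_size=(3, k, k))  # temporal extent 3: 3 x K x K

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(conv2d))  # 64*64*3*3 + 64   = 36928
print(params(conv3d))  # 64*64*3*3*3 + 64 = 110656 -- roughly three times as many weights
```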
  • the video classification method includes:
  • In step S301, a video to be classified is obtained, and the video to be classified includes multiple video frames.
  • The video to be classified in the embodiments of the present application may be a video stored in a user terminal, a video collected by a monitoring device, or a video uploaded by a platform user and received by a video entertainment platform.
  • When the video is collected by a monitoring device, the video collected in real time can be divided into several sub-video segments according to a preset time period, and the collected sub-video segments can be classified to determine whether there is an abnormality in each sub-video segment.
  • The video to be classified includes multiple video frames, which are arranged sequentially in time order. From the video to be classified, the spatial information of each video frame, namely its width W and height H, can be determined. From the time interval between video frames and the initial playback time, the playback time corresponding to each video frame can be determined.
  • In step S302, the video to be classified is input into a trained video classification model for processing, and the classification result of the video to be classified is output; the video classification model includes a feature extraction layer and a fully connected layer.
  • The feature extraction layer is used to extract the spatial feature information of the multiple video frames through two-dimensional convolution, extract the temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information; the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • The feature extraction layer may include a large receptive field context feature extraction branch and a small receptive field core feature extraction branch.
  • The large receptive field context feature extraction branch is used to extract contextual feature information, which includes the temporal feature information of the video to be classified.
  • The large receptive field can be obtained by cascading multiple feature extraction sublayers, or by gradually increasing the size of the three-dimensional pooling kernel.
  • The small receptive field core feature extraction branch is used to extract the spatial feature information of the two-dimensional plane of each video frame in the video to be classified.
  • The feature extraction layer is also used to fuse the extracted temporal feature information and spatial feature information to obtain the fused feature information. That is, through the dual-branch structure, both the context information extracted by the large receptive field context feature extraction branch and the core features extracted by the small receptive field core feature extraction branch can be obtained effectively, as sketched below.
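  • The dual-branch structure can be sketched as follows (a minimal, assumed PyTorch-style implementation of one feature extraction sublayer: the small receptive field branch applies a per-frame two-dimensional convolution, the large receptive field branch applies three-dimensional pooling followed by the same convolution with shared parameters, and the two results are fused by point-wise addition; the channel and tensor sizes are illustrative, not taken from the application):

```python
import torch
import torch.nn as nn

class DualBranchSublayer(nn.Module):
    """One feature extraction sublayer: a small receptive field core branch plus
    a large receptive field context branch, fused by point-wise addition."""
    def __init__(self, channels, pool_kernel=(3, 3, 3)):
        super().__init__()
        # Two-dimensional convolution written as a 3D conv with temporal extent 1,
        # so it runs on (N, C, T, H, W) tensors but convolves each frame independently.
        self.shared_conv = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        pad = tuple(k // 2 for k in pool_kernel)  # keep the pooled output the same size
        self.pool3d = nn.MaxPool3d(pool_kernel, stride=1, padding=pad)

    def forward(self, x):                            # x: (N, C, T, H, W)
        spatial = self.shared_conv(x)                # small branch: per-frame spatial features
        temporal = self.shared_conv(self.pool3d(x))  # big branch: 3D pooling, then the SAME conv (shared weights)
        return spatial + temporal                    # point-wise fusion of the two branches

x = torch.randn(2, 64, 8, 56, 56)
print(DualBranchSublayer(64)(x).shape)  # torch.Size([2, 64, 8, 56, 56])
```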
  • The feature extraction layer may include N feature extraction sublayers, where N is greater than or equal to 1.
  • For example, the feature extraction layer may include a single feature extraction sublayer; the fused feature information is output by that sublayer and then fully connected through the fully connected layer to obtain the classification result.
  • When there are several sublayers, the output information of the previous-level feature extraction sublayer is used as the input information of the next-level feature extraction sublayer.
  • That is, the fused feature information output by the i-th feature extraction sublayer is used as the input information of the (i+1)-th feature extraction sublayer.
  • The fused feature information output by the i-th feature extraction sublayer already combines temporal feature information and spatial feature information, and the (i+1)-th feature extraction sublayer can further extract feature information from it through pooling.
  • Here, i is greater than or equal to 1 and less than N.
  • The fused feature information refers to the feature information obtained after the temporal feature information and the spatial feature information are fused.
  • The fusion processing may refer to superposition of feature information; for example, the image corresponding to the temporal feature information and the image corresponding to the spatial feature information may be superimposed pixel by pixel.
  • The size of the image corresponding to the input information of the pooling process can be made consistent with that of the output information of the pooling process.
  • To achieve this, the input information can be padded: the input feature image or video frame is padded in the time dimension and, where needed, also in the space dimension, so that after the pooling kernel pools the padded input information, the size of the obtained output information is consistent with the size of the unpadded input information.
  • The output size can be computed from the input size, the kernel size, the stride, and the padding (see the sketch below); for example, for a pooling kernel of size 3 with stride 1, a total padding of 2 can be selected.
  • When the size of the pooling kernel is 3*3*3, the size of the pooling kernel in the two-dimensional plane where the image to be pooled is located is 3*3, where the unit can be pixels or another predetermined length unit.
  • Its size in the time dimension is 3, where the unit may be video duration, for example a duration of 3 seconds.
  • The number of video frames corresponding to that video duration can then be determined from the video duration.
  • Of course, the definition of the three-dimensional pooling kernel is not limited to this; the size of the pooling kernel in the time dimension can also be specified directly as a number of video frames.
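  • The size relation used here is the standard one for pooling and convolution; a small helper (not specific to this application) makes the padding choice explicit:

```python
def pooled_size(n, kernel, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2*padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

# With stride 1 and a kernel of size 3, a total padding of 2 (1 per side) preserves the size:
print(pooled_size(8, kernel=3, stride=1, padding=1))   # 8  (time dimension, t = 3)
print(pooled_size(56, kernel=3, stride=1, padding=1))  # 56 (spatial dimension, K = 3)
```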
  • The two-dimensional convolution refers to convolution performed in the plane where the image of a video frame lies, that is, over the two dimensions of width and height.
  • The size of the selected convolution kernel is its size in this two-dimensional space.
  • The spatial feature information can be extracted with a convolution kernel of a predetermined, fixed size.
  • Existing neural network models can also be used to extract the spatial feature information, such as LeNet-based, AlexNet-based, ResNet-based, GoogLeNet-based, or VGGNet-based convolutional neural networks. In the process of extracting spatial feature information, there is therefore no need to change the ability of the convolutional neural network to recognize video frames in order to obtain the feature information contained in the video frames of the video to be classified.
  • Since any two-dimensional convolutional network can be plugged into the video classification method described in this application, the effect of a three-dimensional convolutional network in collecting temporal feature information is achieved without special hardware or deep learning platform optimizations and without a specific network design, which effectively improves the versatility of the video classification method described in this application.
  • The input information can first be three-dimensionally pooled through the large receptive field context feature extraction branch to obtain pooled information, and the pooled information can then be processed by two-dimensional convolution through the large receptive field context feature extraction branch to obtain the temporal feature information.
  • The two-dimensional convolution is a convolution operation on the two-dimensional plane where the image of a single video frame is located, and it does not introduce feature information beyond that two-dimensional image.
  • In this way, the spatial feature information of each frame of the video to be classified is obtained, that is, the feature information along the width W and height H of each video frame.
  • The convolution kernel of the two-dimensional convolution can be expressed as {C1, C2, 1, K, K}, where C1 is the number of channels of the input feature image, C2 is the number of channels of the output feature image, the "1" indicates the time dimension of the convolution kernel and means that the kernel is not expanded in the time dimension, that is, each two-dimensional convolution only convolves the image of one video frame at a time, and K is the size of the convolution kernel in the two-dimensional space where the video frame is located (see the sketch below).
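  • Such a {C1, C2, 1, K, K} kernel can be written directly as a three-dimensional convolution whose temporal extent is 1; a short, assumed PyTorch-style sketch (illustrative channel counts) shows that its parameter count is identical to an ordinary K x K two-dimensional convolution:

```python
import torch
import torch.nn as nn

C1, C2, K = 64, 128, 3

# {C1, C2, 1, K, K}: temporal extent 1, so each frame is convolved independently.
conv = nn.Conv3d(C1, C2, kernel_size=(1, K, K), padding=(0, K // 2, K // 2))

x = torch.randn(1, C1, 8, 56, 56)                  # (N, C1, T, H, W)
print(conv(x).shape)                               # torch.Size([1, 128, 8, 56, 56])
print(sum(p.numel() for p in conv.parameters()))   # 64*128*1*3*3 + 128 = 73856, same as a 2D 3x3 conv
```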
  • The temporal feature information is extracted through pooling, and the pooling processing may use methods such as max pooling, average pooling, or global average pooling. For example, when max pooling is selected, the pixels to be pooled are chosen according to the pooling kernel, and the pixel with the largest value is taken as the pooled value.
  • The three-dimensional pooling kernel can be expressed as {t, K, K}, where t is the size of the pooling kernel in the time direction and K is the size of the pooling kernel in the two-dimensional space where the image is located.
  • When t takes different values, the number of video frames covered by one pooling operation also differs.
  • The same video frame can serve as an object pooled by different pooling kernels.
  • When the K value of the pooling kernel is greater than 1, the pooling kernel also pools multiple pixels or regions in the two-dimensional space.
  • A pooling operation with padding can be used to pad the edges of the pooled image, ensuring that the image sizes of the input information and the output information are consistent before and after pooling.
  • Convolution processing is then performed on the output information of the pooling process.
  • The pooled output information already fuses spatio-temporal information over a neighbourhood of size t*K*K in adjacent time and space; a two-dimensional convolution is then applied to the pooled output information to obtain the temporal feature information of the multiple video frames. The different pooling operators are compared in the sketch below.
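  • The pooling operators mentioned above can all serve as the three-dimensional pooling step of the large receptive field branch; a brief, assumed PyTorch-style comparison (illustrative tensor sizes):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 8, 56, 56)                             # (N, C, T, H, W)

max_pool = nn.MaxPool3d((3, 3, 3), stride=1, padding=1)       # keeps the largest value in each t*K*K window
avg_pool = nn.AvgPool3d((3, 3, 3), stride=1, padding=1)       # averages each t*K*K window
global_avg = nn.AdaptiveAvgPool3d(1)                          # global average over all of T, H, W

print(max_pool(x).shape)    # torch.Size([1, 64, 8, 56, 56])
print(avg_pool(x).shape)    # torch.Size([1, 64, 8, 56, 56])
print(global_avg(x).shape)  # torch.Size([1, 64, 1, 1, 1])
```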
  • The convolution operations of the small receptive field core feature extraction branch and the large receptive field context feature extraction branch may share parameters, that is, use the same convolution parameters.
  • Among the N three-dimensional pooling kernels, the sizes of any two pooling kernels may differ, the sizes of all N pooling kernels may be the same, or the sizes of some pooling kernels may be the same while the others differ.
  • The three-dimensional pooling kernels used in the three-dimensional pooling of the large receptive field context feature extraction branch may differ in the size of the time dimension or in the size of the space dimensions.
  • Adjusting the size of the pooling kernel used in the three-dimensional pooling may include adjusting its size in the time dimension (the time direction), its size in the two-dimensional space where the video frame is located, or both, so as to obtain three-dimensional pooling kernels of different sizes.
  • For each kernel, the corresponding spatio-temporal feature information is calculated, and the spatio-temporal feature information includes the temporal feature information.
  • The size of the pooling kernel can be increased gradually: its size in the time dimension, its size in the two-dimensional space where the video frame is located, or both can be increased to obtain the pooled feature image. In this way, the temporal feature information of features of different durations, obtained by pooling with different pooling kernels, is gradually merged, yielding finer-grained spatio-temporal feature information (see the sketch below).
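  • Gradually enlarging the pooling kernel from one sublayer to the next, so that later sublayers see a larger temporal and spatial context, could look like this (an assumed sketch reusing the DualBranchSublayer above; the specific kernel progression is illustrative):

```python
import torch.nn as nn

# Pooling kernels grow in the time dimension, the space dimensions, or both.
pool_kernels = [(3, 3, 3), (5, 3, 3), (5, 5, 5)]   # illustrative progression

cascade = nn.Sequential(
    *[DualBranchSublayer(64, pool_kernel=k) for k in pool_kernels]  # sublayer i feeds sublayer i+1
)
```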
  • Because the images corresponding to the spatial feature information and the temporal feature information are produced with the same convolution parameters, the information represented by corresponding points in the two feature maps is spatially consistent; that is, the spatial feature information and the temporal feature information have the same size, and a point-by-point spatial addition strategy can be adopted to obtain the fused feature information.
  • The fused feature information is obtained by fusing the spatial feature information and the temporal feature information: the spatial feature information captures the spatial features of the video frames through two-dimensional convolution, and the temporal feature information captures the spatio-temporal features of the images through pooling, so the fused feature information includes both the spatial features and the spatio-temporal features of the images in the video to be classified. The fused feature information is synthesized through a fully connected layer, and the video to be classified is classified according to the synthesized fused feature information to obtain the video classification result. For example, a fully connected calculation may be performed on the fused feature information according to preset weight coefficients of the fully connected layer, and the video classification result may be determined by comparing the calculation result with a preset classification criterion, as sketched below.
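  • A fully connected classification step fed by the fused feature information could be sketched as follows (assumed; the global pooling before the linear layer and the number of classes are illustrative choices, not specified by the application):

```python
import torch
import torch.nn as nn

num_classes = 400                       # illustrative
fused = torch.randn(2, 64, 8, 56, 56)   # fused feature information, (N, C, T, H, W)

head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),            # collapse T, H, W
    nn.Flatten(),                       # (N, C)
    nn.Linear(64, num_classes),         # fully connected layer with preset weight coefficients
)

logits = head(fused)
probs = logits.softmax(dim=1)           # compare against the classification criterion / take the argmax
print(probs.shape)                      # torch.Size([2, 400])
```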
  • The video classification model may include two or more feature extraction layers, and two or more spatio-temporal feature images can be extracted through the two or more feature extraction layers; the video to be classified can itself be regarded as a kind of spatio-temporal feature image.
  • The feature extraction layer includes two feature extraction sublayers.
  • A feature extraction sublayer may be referred to as a SmallBig unit for short.
  • The feature extraction layer in the video classification model includes two feature extraction sublayers, namely SmallBig unit 1 and SmallBig unit 2; the fused feature information extracted by the previous feature extraction sublayer, SmallBig unit 1, is used as the input of the next-level feature extraction sublayer, SmallBig unit 2.
  • The fully connected layer then performs the video classification and outputs the category of the video.
  • The video to be classified is input to the first-level feature extraction sublayer SmallBig unit 1, and a first two-dimensional convolution operation is performed on the multiple video frames included therein to obtain the spatial feature information contained in the multiple video frames.
  • A first pooling operation is performed in the time dimension on the multiple video frames of the video to be classified, using a three-dimensional pooling kernel with a predetermined duration parameter.
  • The convolution parameters of the first convolution operation are then used to perform a second two-dimensional convolution operation on the pooled image to obtain the temporal feature information.
  • The spatial feature information is fused with the temporal feature information to obtain the fused feature information.
  • Specifically, the corresponding pixels of the images corresponding to the spatial feature information and the temporal feature information are added pixel by pixel to obtain fused feature information that includes both the spatial features and the temporal features.
  • The fused feature information may include multiple frames of images.
  • The fused feature information is input to the second-level feature extraction sublayer SmallBig unit 2, and a third convolution operation is performed on the image of each channel of the fused feature information to further extract the spatial features in the fused feature information output by SmallBig unit 1.
  • A second pooling operation is performed on the fused feature information in the time dimension according to the time order of the channels, and a fourth convolution operation is performed on the pooling information obtained by the second pooling operation to further extract the temporal feature information of the multiple images in the fused feature information output by SmallBig unit 1.
  • The fourth convolution operation and the third convolution operation use the same convolution parameters.
  • FIG. 6 is a schematic diagram of implementing video classification through three feature extraction sublayers, provided by an embodiment of the present application.
  • A third-level feature extraction sublayer, SmallBig unit 3, is added.
  • The fused feature information output by the first-level feature extraction sublayer SmallBig unit 1 is processed and fused by the second-level feature extraction sublayer SmallBig unit 2 to obtain the fused feature information output by SmallBig unit 2.
  • The third-level feature extraction sublayer SmallBig unit 3 separately processes the fused feature information output by the second-level feature extraction sublayer SmallBig unit 2 through two-dimensional convolution and pooling, further extracts temporal feature information and spatial feature information, and fuses them to obtain the fused feature information output by SmallBig unit 3.
  • The feature extraction layer is also used to superimpose the video to be classified with the fused feature information output by the feature extraction layer to form a residual connection and thus update the fused feature information.
  • That is, the temporal feature information and the spatial feature information calculated by the third-level feature extraction sublayer SmallBig unit 3 are fused and then superimposed with the video to be classified to form the residual connection structure, so that the newly added parameters do not disturb the parameters of the original pre-trained image network during training, which helps preserve the pre-training effect of the image network; introducing the residual also helps speed up convergence and improves the training efficiency of the video classification model. A sketch of such a residual block follows.
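  • Such a residual connection can be sketched as follows (assumed; it reuses the DualBranchSublayer above and requires the block input and output to have the same shape, which the padded, stride-1 sketches preserve):

```python
import torch.nn as nn

class ResidualSmallBigBlock(nn.Module):
    """Stack of SmallBig-style sublayers with a residual (skip) connection."""
    def __init__(self, channels, num_sublayers=3):
        super().__init__()
        self.sublayers = nn.Sequential(
            *[DualBranchSublayer(channels) for _ in range(num_sublayers)]
        )

    def forward(self, x):
        return x + self.sublayers(x)   # superimpose the input on the fused feature information
```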
  • The convolution kernel used by the first-level feature extraction sublayer is the first convolution kernel, and its pooling uses the first pooling kernel.
  • The convolution kernel used by the second-level feature extraction sublayer is the second convolution kernel, and its pooling uses the second pooling kernel.
  • The convolution kernel used by the third-level feature extraction sublayer is the third convolution kernel, and its pooling uses the third pooling kernel.
  • The size of the first convolution kernel used in the two-dimensional convolution and of the third convolution kernel used in the third convolution operation may be smaller than the size of the second convolution kernel used in the second convolution operation.
  • For example, the sizes of the first convolution kernel and the third convolution kernel are 1*1*1, and the size of the second convolution kernel is 1*3*3.
  • The first pooling kernel and the second pooling kernel may be smaller than the third pooling kernel used in the third pooling operation.
  • For example, the sizes of the first pooling kernel and the second pooling kernel are 3*3*3, while the third pooling kernel is larger, extending over the entire temporal length T in the time dimension.
  • T can be the video duration or the number of video frames corresponding to the video duration.
  • Likewise, t can be expressed either as a duration or as a number of video frames.
  • In this way, the temporal characteristics over the entire video length can be extracted.
  • The output fused feature information therefore has a global temporal receptive field.
  • In addition, two spatially local receptive fields are stacked, so the spatial receptive field of the overall module is also enlarged.
  • The video classification system described in this application can be trained using optimization algorithms such as stochastic gradient descent (SGD), and the data set can be a mainstream video task data set.
  • The video classification method described in this application provides higher accuracy, faster convergence, and better robustness, and is competitive with the current state of the art.
  • For example, with only 8 input frames, the video classification and recognition of this application outperforms the 32-frame Nonlocal-R50 (non-local R50 network), and it uses 4.9 times fewer giga floating-point operations (GFLOPs) than the 128-frame Nonlocal-R50 while achieving the same accuracy.
  • The 8-frame input of the video classification method described in this application also performs better than the current state-of-the-art 36-frame SlowFast-R50 network, which combines a fast pathway and a slow pathway.
  • The present application also provides a method for training a video classification model.
  • The method includes: obtaining a sample video in a sample video set and a sample classification result of the sample video, the sample video including a plurality of video frames; extracting the spatial feature information in the sample video through two-dimensional convolution; extracting the temporal feature information in the sample video through pooling; fusing the spatial feature information and the temporal feature information to obtain fused feature information, and performing fully connected processing on the fused feature information to obtain a model classification result; and, according to the model classification result and the sample classification result, correcting the parameters of the two-dimensional convolution and returning to the step of extracting the spatial feature information in the sample video through two-dimensional convolution, until the model classification result and the sample classification result meet a preset condition, so as to obtain the trained video classification model, as sketched below.
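  • A bare-bones training loop along these lines (an assumed sketch: the data loader, model, cross-entropy loss, and SGD hyper-parameters are placeholders, not values given by the application):

```python
import torch
import torch.nn as nn

def train(model, loader, num_epochs=10, lr=0.01):
    """Fit the video classification model on (video, label) batches with SGD."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(num_epochs):
        for videos, labels in loader:          # videos: (N, C, T, H, W), labels: (N,)
            logits = model(videos)             # model classification result
            loss = criterion(logits, labels)   # compare with the sample classification result
            optimizer.zero_grad()
            loss.backward()                    # gradients for the two-dimensional convolution parameters
            optimizer.step()                   # correct the parameters and repeat
    return model
```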
  • the structure of the video classification model is consistent with the neural network model adopted by the video classification method shown in FIG. 2, and will not be repeated here.
  • FIG. 7 is a schematic diagram of a video classification device provided by an embodiment of the application, and the video classification device includes:
  • the video to be classified acquisition unit 701 is configured to acquire a video to be classified, where the video to be classified includes a plurality of video frames;
  • The classification unit 702 is configured to input the video to be classified into a trained video classification model for processing and output the classification result of the video to be classified; the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract the spatial feature information of the multiple video frames through two-dimensional convolution, extract the temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • the video classification device described in FIG. 7 corresponds to the video classification method shown in FIG. 3. With the video classification device, the video classification method described in any of the above embodiments can be executed.
  • Fig. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • the video classification device 8 of this embodiment includes a processor 80, a memory 81, and a computer program 82 stored in the memory 81 and running on the processor 80, such as a video classification program.
  • When the processor 80 executes the computer program 82, the steps in the foregoing embodiments of the video classification method are implemented.
  • Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the foregoing device embodiments are realized.
  • the computer program 82 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete This application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the video classification device 8.
  • the video classification device 8 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the video classification device may include, but is not limited to, a processor 80 and a memory 81.
  • FIG. 8 is only an example of the video classification device 8 and does not constitute a limitation on the video classification device 8; the device may include more or fewer components than shown in the figure, combine certain components, or use different components.
  • For example, the video classification device may also include input and output devices, network access devices, buses, and so on.
  • the so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Ready-made programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 81 may be an internal storage unit of the video classification device 8, for example, a hard disk or a memory of the video classification device 8.
  • The memory 81 may also be an external storage device of the video classification device 8, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the video classification device 8.
  • the memory 81 may also include both an internal storage unit of the video classification device 8 and an external storage device.
  • the memory 81 is used to store the computer program and other programs and data required by the video classification device.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • the disclosed device/terminal device and method may be implemented in other ways.
  • the device/terminal device embodiments described above are only illustrative.
  • The division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • this application implements all or part of the processes in the above-mentioned embodiments and methods, and can also be completed by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium. When the program is executed by the processor, it can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signal, telecommunications signal, and software distribution media, etc.
  • The content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of legislation and patent practice in the jurisdiction.
  • For example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a video classification method and apparatus, a device, and a computer-readable storage medium. The video classification method comprises: obtaining a video to be classified, the video comprising multiple video frames (S301); and inputting the video to be classified into a trained video classification model for processing and outputting a classification result of the video, the video classification model comprising a feature extraction layer and a fully connected layer, the feature extraction layer being used to extract spatial feature information by means of two-dimensional convolution, extract temporal feature information by means of pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer being used to perform fully connected processing on the fused feature information to obtain the classification result (S302). According to the method, compared with calculation using a three-dimensional convolution kernel, feature information of the temporal dimension of the video to be classified is obtained by pooling, and the two-dimensional convolution used can considerably reduce the calculation of convolution parameters, which helps reduce the computational complexity of video classification.
PCT/CN2020/134995 2020-06-11 2020-12-09 Video classification method and apparatus, device, and computer-readable storage medium WO2021248859A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010531316.9A CN111859023B (zh) 2020-06-11 2020-06-11 视频分类方法、装置、设备及计算机可读存储介质
CN202010531316.9 2020-06-11

Publications (2)

Publication Number Publication Date
WO2021248859A1 true WO2021248859A1 (fr) 2021-12-16
WO2021248859A9 WO2021248859A9 (fr) 2022-02-10

Family

ID=72986143

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134995 WO2021248859A1 (fr) 2020-06-11 2020-12-09 Procédé et appareil de classification vidéo, ainsi que dispositif et support de stockage lisible par ordinateur

Country Status (2)

Country Link
CN (1) CN111859023B (fr)
WO (1) WO2021248859A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130539A (zh) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 分类模型训练、数据分类方法、装置和计算机设备
CN115243031A (zh) * 2022-06-17 2022-10-25 合肥工业大学智能制造技术研究院 一种基于质量注意力机制的视频时空特征优化方法、系统、电子设备及存储介质
CN116824641A (zh) * 2023-08-29 2023-09-29 卡奥斯工业智能研究院(青岛)有限公司 姿态分类方法、装置、设备和计算机存储介质
WO2024001139A1 (fr) * 2022-06-30 2024-01-04 海信集团控股股份有限公司 Procédé et appareil de classification de vidéo et dispositif électronique
CN117668719A (zh) * 2023-11-14 2024-03-08 深圳大学 一种自适应阈值的隧道监测数据异常检测方法
CN114677704B (zh) * 2022-02-23 2024-03-26 西北大学 一种基于三维卷积的时空特征多层次融合的行为识别方法
CN118214958A (zh) * 2024-05-21 2024-06-18 环球数科集团有限公司 一种基于超分融合的视频合成系统

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859023B (zh) * 2020-06-11 2024-05-03 中国科学院深圳先进技术研究院 视频分类方法、装置、设备及计算机可读存储介质
CN112580696A (zh) * 2020-12-03 2021-03-30 星宏传媒有限公司 一种基于视频理解的广告标签分类方法、系统及设备
CN112597824A (zh) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 行为识别方法、装置、电子设备和存储介质
CN112926472A (zh) * 2021-03-05 2021-06-08 深圳先进技术研究院 视频分类方法、装置及设备
CN113536898B (zh) * 2021-05-31 2023-08-29 大连民族大学 全面特征捕捉型时间卷积网络、视频动作分割方法、计算机系统和介质
CN113569811A (zh) * 2021-08-30 2021-10-29 创泽智能机器人集团股份有限公司 一种行为识别方法及相关装置
CN114529761B (zh) * 2022-01-29 2024-10-15 腾讯科技(深圳)有限公司 基于分类模型的视频分类方法、装置、设备、介质及产品
CN118214922B (zh) * 2024-05-17 2024-08-30 环球数科集团有限公司 一种使用CNNs滤波器捕获视频空间和时间特征的系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635790A (zh) * 2019-01-28 2019-04-16 杭州电子科技大学 一种基于3d卷积的行人异常行为识别方法
CN109670446A (zh) * 2018-12-20 2019-04-23 泉州装备制造研究所 基于线性动态系统和深度网络的异常行为检测方法
US20190188379A1 (en) * 2017-12-18 2019-06-20 Paypal, Inc. Spatial and temporal convolution networks for system calls based process monitoring
CN110766096A (zh) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 视频分类方法、装置及电子设备
CN110781830A (zh) * 2019-10-28 2020-02-11 西安电子科技大学 基于空-时联合卷积的sar序列图像分类方法
CN111859023A (zh) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 视频分类方法、装置、设备及计算机可读存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN107292247A (zh) * 2017-06-05 2017-10-24 浙江理工大学 一种基于残差网络的人体行为识别方法及装置
CN107463949B (zh) * 2017-07-14 2020-02-21 北京协同创新研究院 一种视频动作分类的处理方法及装置
CN108304926B (zh) * 2018-01-08 2020-12-29 中国科学院计算技术研究所 一种适用于神经网络的池化计算装置及方法
CN110032926B (zh) * 2019-02-22 2021-05-11 哈尔滨工业大学(深圳) 一种基于深度学习的视频分类方法以及设备

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190188379A1 (en) * 2017-12-18 2019-06-20 Paypal, Inc. Spatial and temporal convolution networks for system calls based process monitoring
CN109670446A (zh) * 2018-12-20 2019-04-23 泉州装备制造研究所 基于线性动态系统和深度网络的异常行为检测方法
CN109635790A (zh) * 2019-01-28 2019-04-16 杭州电子科技大学 一种基于3d卷积的行人异常行为识别方法
CN110781830A (zh) * 2019-10-28 2020-02-11 西安电子科技大学 基于空-时联合卷积的sar序列图像分类方法
CN110766096A (zh) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 视频分类方法、装置及电子设备
CN111859023A (zh) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 视频分类方法、装置、设备及计算机可读存储介质

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677704B (zh) * 2022-02-23 2024-03-26 西北大学 一种基于三维卷积的时空特征多层次融合的行为识别方法
CN115130539A (zh) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 分类模型训练、数据分类方法、装置和计算机设备
CN115243031A (zh) * 2022-06-17 2022-10-25 合肥工业大学智能制造技术研究院 一种基于质量注意力机制的视频时空特征优化方法、系统、电子设备及存储介质
WO2024001139A1 (fr) * 2022-06-30 2024-01-04 海信集团控股股份有限公司 Procédé et appareil de classification de vidéo et dispositif électronique
CN116824641A (zh) * 2023-08-29 2023-09-29 卡奥斯工业智能研究院(青岛)有限公司 姿态分类方法、装置、设备和计算机存储介质
CN116824641B (zh) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 姿态分类方法、装置、设备和计算机存储介质
CN117668719A (zh) * 2023-11-14 2024-03-08 深圳大学 一种自适应阈值的隧道监测数据异常检测方法
CN118214958A (zh) * 2024-05-21 2024-06-18 环球数科集团有限公司 一种基于超分融合的视频合成系统

Also Published As

Publication number Publication date
WO2021248859A9 (fr) 2022-02-10
CN111859023B (zh) 2024-05-03
CN111859023A (zh) 2020-10-30

Similar Documents

Publication Publication Date Title
WO2021248859A1 (fr) Procédé et appareil de classification vidéo, ainsi que dispositif et support de stockage lisible par ordinateur
WO2021043168A1 (fr) Procédé d'entraînement de réseau de ré-identification de personnes et procédé et appareil de ré-identification de personnes
CN109492612B (zh) 基于骨骼点的跌倒检测方法及其跌倒检测装置
WO2020199693A1 (fr) Procédé et appareil de reconnaissance faciale de grande pose et dispositif associé
EP4156017A1 (fr) Procédé et appareil de reconnaissance d'action, dispositif et support de stockage
CN110503076B (zh) 基于人工智能的视频分类方法、装置、设备和介质
WO2022121485A1 (fr) Procédé et appareil de classification multi-étiquettes d'image, dispositif informatique et support de stockage
CN112070044B (zh) 一种视频物体分类方法及装置
CN108830211A (zh) 基于深度学习的人脸识别方法及相关产品
WO2021073311A1 (fr) Procédé et appareil de reconnaissance d'image, support d'enregistrement lisible par ordinateur et puce
WO2023179429A1 (fr) Procédé et appareil de traitement de données vidéo, dispositif électronique et support de stockage
WO2023174098A1 (fr) Procédé et appareil de détection de geste en temps réel
CN111368672A (zh) 一种用于遗传病面部识别模型的构建方法及装置
CN110222718B (zh) 图像处理的方法及装置
WO2023168998A1 (fr) Procédé et appareil d'identification de clip vidéo, dispositif, et support de stockage
CN106803054B (zh) 人脸模型矩阵训练方法和装置
CN113159200B (zh) 对象分析方法、装置及存储介质
CN113033448B (zh) 一种基于多尺度卷积和注意力的遥感影像去云残差神经网络系统、方法、设备及存储介质
CN111401267B (zh) 基于自学习局部特征表征的视频行人再识别方法及系统
CN111177460A (zh) 提取关键帧的方法及装置
CN112183359A (zh) 视频中的暴力内容检测方法、装置及设备
CN113139490B (zh) 一种图像特征匹配方法、装置、计算机设备及存储介质
WO2022222519A1 (fr) Procédé et appareil de génération d'image de défaut
CN112183299B (zh) 行人属性预测方法、装置、电子设备及存储介质
CN109389089B (zh) 基于人工智能算法的多人行为识别方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20940364

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20940364

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.07.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20940364

Country of ref document: EP

Kind code of ref document: A1