WO2021248859A1 - Video classification method and apparatus, and device, and computer readable storage medium - Google Patents

Video classification method and apparatus, and device, and computer readable storage medium

Info

Publication number
WO2021248859A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
pooling
information
feature extraction
feature information
Prior art date
Application number
PCT/CN2020/134995
Other languages
French (fr)
Chinese (zh)
Other versions
WO2021248859A9 (en)
Inventor
乔宇
王亚立
李先航
周志鹏
邹静
Original Assignee
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Publication of WO2021248859A1 publication Critical patent/WO2021248859A1/en
Publication of WO2021248859A9 publication Critical patent/WO2021248859A9/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application belongs to the field of image processing, and particularly relates to video classification methods, devices, equipment, and computer-readable storage media.
  • The embodiments of the present application provide a video classification method, apparatus, device, and computer-readable storage medium, to solve the problem that performing video classification with a conventional three-dimensional convolution kernel adds extra parameters compared with two-dimensional convolution, which increases the amount of calculation.
  • A first aspect of the embodiments of the present application provides a video classification method, the method including:
  • acquiring a video to be classified, where the video to be classified includes a plurality of video frames; and inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified;
  • wherein the video classification model includes a feature extraction layer and a fully connected layer; the feature extraction layer is used to extract spatial feature information of the multiple video frames through two-dimensional convolution, extract temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information; and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • The feature extraction layer includes N feature extraction sublayers, N ≥ 1; the input information of the first of the N feature extraction sublayers is the multiple video frames, the output information of the previous feature extraction sublayer is the input information of the next feature extraction sublayer, and the output information of the Nth feature extraction sublayer is the fused feature information output by the feature extraction layer.
  • Each of the N feature extraction sublayers includes a large receptive field context feature extraction branch and a small receptive field core feature extraction branch, and the processing of the input information by each of the N feature extraction sublayers includes:
  • pooling the input information through the large receptive field context feature extraction branch to extract temporal feature information of the input information; performing two-dimensional convolution on the input information through the small receptive field core feature extraction branch to extract spatial feature information of the input information; and fusing the temporal feature information extracted by the large receptive field context feature extraction branch and the spatial feature information extracted by the small receptive field core feature extraction branch to obtain the output information.
  • Pooling the input information through the large receptive field context feature extraction branch to extract the temporal feature information of the input information includes: performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain pooled information, and performing two-dimensional convolution on the pooled information through the large receptive field context feature extraction branch to obtain the temporal feature information.
  • Performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain the pooled information includes: pooling the input information with the three-dimensional pooling kernel {t, K, K} in the large receptive field context feature extraction branch to obtain the pooled information, where t is the size of the kernel in the time direction and is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel defines the size of the pooling pixels selected in a single pooling calculation.
  • Among the N three-dimensional pooling kernels included in the feature extraction layer, the N three-dimensional pooling kernels may all have the same size, or may all have different sizes, or some of the N three-dimensional pooling kernels may have the same size; the three-dimensional pooling kernel defines the size of the pooling pixels selected in a single pooling calculation.
  • The N three-dimensional pooling kernels having different sizes may include: gradually increasing the size of the three-dimensional pooling kernel as feature information is extracted.
  • Gradually increasing the size of the three-dimensional pooling kernel includes: gradually increasing the size of the three-dimensional pooling kernel in the time direction; or gradually increasing its size in the dimensions of the two-dimensional space where the video frame is located; or gradually increasing both.
  • The convolution parameters of the two-dimensional convolution in the large receptive field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution in the small receptive field core feature extraction branch.
  • The size of the image corresponding to the input information of the pooling process is consistent with the size of the image corresponding to the output information of the pooling process.
  • The input feature image or video frame is padded in the time dimension or the space dimension, so that after the pooling kernel pools the padded input information, the size of the image corresponding to the obtained output information is consistent with the size of the image corresponding to the original input information.
  • fusing the spatial feature information and the temporal feature information to output fused feature information includes:
  • the image of the spatial feature information is superimposed on the image of the temporal feature information to generate the fusion feature information.
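  • A minimal sketch of one such feature extraction sublayer is given below. It is an illustration only, not part of the original disclosure: it assumes a PyTorch-style implementation, max pooling for the large receptive field branch, and a single shared {C1, C2, 1, K, K} convolution for both branches; all channel numbers and sizes are example values.

```python
import torch
import torch.nn as nn

class SmallBigUnit(nn.Module):
    """Illustrative sketch of a feature extraction sublayer (assumptions noted above).

    Input/output layout: (batch, channels, time, height, width).
    """
    def __init__(self, in_ch, out_ch, pool_t=3, pool_k=3, conv_k=3):
        super().__init__()
        # Small receptive field core branch: 2D convolution only, kernel {C1, C2, 1, K, K}.
        self.conv2d = nn.Conv3d(in_ch, out_ch,
                                kernel_size=(1, conv_k, conv_k),
                                padding=(0, conv_k // 2, conv_k // 2),
                                bias=False)
        # Large receptive field context branch: 3D pooling with kernel {t, K, K},
        # stride 1 and padding chosen so the output keeps the input size.
        self.pool3d = nn.MaxPool3d(kernel_size=(pool_t, pool_k, pool_k),
                                   stride=1,
                                   padding=(pool_t // 2, pool_k // 2, pool_k // 2))

    def forward(self, x):
        spatial = self.conv2d(x)                 # spatial feature information
        temporal = self.conv2d(self.pool3d(x))   # pooling, then the shared 2D convolution
        return spatial + temporal                # point-wise fusion of the two branches

unit = SmallBigUnit(3, 64)
out = unit(torch.randn(1, 3, 8, 56, 56))         # -> torch.Size([1, 64, 8, 56, 56])
```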
  • an embodiment of the present application provides a video classification device, the device including:
  • a to-be-classified video obtaining unit configured to obtain a to-be-classified video, where the to-be-classified video includes a plurality of video frames;
  • a classification unit, configured to input the video to be classified into a trained video classification model for processing and output the classification result of the video to be classified; wherein the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract the spatial feature information of the multiple video frames through two-dimensional convolution, extract the temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • A third aspect of the embodiments of the present application provides a video classification device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the video classification device is caused to implement the video classification method according to any one of the first aspect.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the video classification method according to any one of the first aspect is implemented.
  • Compared with the prior art, this application uses the classification model to extract the spatial feature information of multiple video frames in the video to be classified through two-dimensional convolution, extracts the temporal feature information of the multiple video frames through pooling, fuses the temporal feature information and the spatial feature information, and obtains the classification result through the fully connected layer.
  • Since the temporal feature information of the video to be classified can be obtained through pooling, compared with three-dimensional convolution, the two-dimensional convolution adopted in this application retains the temporal feature information while greatly reducing the computation of convolution parameters, which helps reduce the amount of calculation for video classification.
  • In addition, the method can be inserted into any two-dimensional convolutional network to classify videos, which helps improve the diversity and versatility of video classification methods.
  • FIG. 1 is a schematic diagram of a video classification application scenario provided by an embodiment of the present application
  • Fig. 2 is a schematic diagram of video classification using three-dimensional convolution in the prior art
  • FIG. 3 is a schematic diagram of the implementation process of a video classification method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the implementation of a video classification method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the implementation of a video classification provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another implementation of video classification provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • Fig. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • video classification technology is used to classify the collected surveillance video to determine whether there is an abnormality in the video content.
  • the video classification method described in this application is not sensitive to the speed of the action frame change, and can effectively model the actions of different durations.
  • The surveillance video can be classified through this modeling, which can help the user quickly find key surveillance information, or send abnormal reminders to the monitoring personnel in time, so that the monitoring personnel can deal with abnormalities in the surveillance video promptly.
  • the video classification technology can be used to classify a large number of videos into different scenes, different moods, and different types of videos, so that users can quickly find the videos they need.
  • In scenarios such as smart sports training or video-assisted refereeing, the videos include faster-moving sports, such as shooting, gymnastics, or speed skating, and slower-moving sports, such as yoga.
  • The video classification method described in this application is not sensitive to the speed or duration of the motion, so the actions in such sports videos can be classified.
  • the platform server receives the self-photographed video uploaded by terminal A, and classifies the uploaded video to obtain the category of the video uploaded by terminal A.
  • As the number of uploaded videos increases, the number of videos in the same category also increases.
  • When another terminal, such as terminal B, browses a video, the category of the video browsed by terminal B is obtained from the pre-classification result.
  • the platform can search for other videos in the same category and recommend them to terminal B according to the category of the video browsed by terminal B, so as to improve the user experience of browsing the video.
  • a three-dimensional convolution kernel including time information is selected, such as a 3*1*1 temporal convolution kernel, and the video to be classified is convolved.
  • The three-dimensional convolution kernel spans the width W and height H of the image as well as the time T.
  • the three-dimensional convolution kernel increases the parameter calculation of the time dimension, adds a large number of parameters, and increases the amount of calculation for video classification.
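  • To make that overhead concrete, the following sketch (a hypothetical illustration with example channel numbers, not taken from this application) compares the weight count of a temporal 3D convolution kernel with that of a purely spatial 2D kernel; pooling, by contrast, adds no learnable weights at all.

```python
# Hypothetical illustration of the parameter overhead of 3D vs. 2D convolution.
c_in, c_out = 64, 64          # input / output channels (example values)
k_spatial = 3                 # spatial kernel size K
k_temporal = 3                # temporal kernel size t

params_2d = c_in * c_out * 1 * k_spatial * k_spatial           # kernel {C1, C2, 1, K, K}
params_3d = c_in * c_out * k_temporal * k_spatial * k_spatial  # kernel {C1, C2, t, K, K}

print(params_2d)              # 36864
print(params_3d)              # 110592, i.e. t times more weights than the 2D kernel
```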
  • the video classification method includes:
  • In step S301, a video to be classified is obtained, and the video to be classified includes multiple video frames.
  • the video to be classified in the embodiment of the present application may be a video stored in a user terminal, a video collected by a monitoring device, or a video uploaded by a platform user received by a video entertainment platform.
  • the video is a video collected by a monitoring device
  • The video collected in real time can be divided into several sub video segments according to a preset time period, and the collected sub video segments can be classified to determine whether there is an abnormality in the sub video segments.
  • the video to be classified includes multiple video frames, and the multiple video frames are sequentially arranged in a time sequence. According to the video to be classified, the spatial information of the width W and the height H of each video frame can be determined. According to the time interval between video frames and the initial playback time, the playback time corresponding to each video frame can be determined.
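  • As a minimal sketch of how such a clip might be represented for the model (the tensor layout, frame count, and frame rate below are assumptions for illustration, not requirements of this application):

```python
import torch

# A clip of T video frames, each an H x W RGB image, arranged in time order.
T, C, H, W = 8, 3, 224, 224                     # example values only
clip = torch.randn(1, C, T, H, W)               # (batch, channels, time, height, width)

# Given a frame interval and an initial playback time, the timestamp of frame i is:
start_time, frame_interval = 0.0, 1.0 / 25      # e.g. 25 fps, assumed for illustration
timestamps = [start_time + i * frame_interval for i in range(T)]
```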
  • In step S302, the video to be classified is input into a trained video classification model for processing, and the classification result of the video to be classified is output; wherein the video classification model includes a feature extraction layer and a fully connected layer.
  • The feature extraction layer is used to extract the spatial feature information of the multiple video frames through two-dimensional convolution, extract the temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • the feature extraction layer may include a large receptive field context feature extraction branch and a small receptive field core feature extraction branch.
  • The large receptive field context feature extraction branch is used to extract temporal feature information, or more generally spatiotemporal information that includes the temporal feature information; the context feature referred to here is this temporal feature information.
  • the large receptive field can be obtained by cascading multiple feature extraction sub-layers, or by gradually increasing the size of the three-dimensional pooling core.
  • the small receptive field core feature extraction branch is used to extract the spatial feature information of the two-dimensional plane in each video frame in the video to be classified.
  • the feature extraction layer is also used to fuse the extracted temporal feature information and spatial feature information to obtain fused feature information. That is, through the dual-branch structure, the context information extracted by the context extraction branch of the large receptive field and the core features extracted by the core feature extraction branch of the small receptive field can be effectively obtained.
  • the feature extraction layer may include N feature extraction sublayers, where N is greater than or equal to 1.
  • the feature extraction layer may include one feature extraction sub-layer, and the fused feature information is output through one feature extraction sub-layer, and the fused feature information is fully connected through the fully connected layer to obtain the classification result.
  • The output information of the feature extraction sublayer of the previous level is used as the input information of the feature extraction sublayer of the next level.
  • the fusion feature information output by the i-th feature extraction sub-layer is used as the input information of the i+1-th feature extraction sub-layer.
  • the fused feature information output by the i-th feature extraction sublayer is fused with time feature information and spatial feature information, and the i+1th feature extraction sublayer can further extract feature information through pooling.
  • i is greater than or equal to 1 and less than N.
  • the fusion feature information refers to the feature information after the time feature information and the space feature information are fused.
  • The fusion processing may refer to the superposition of feature information. For example, the image corresponding to the temporal feature information and the image corresponding to the spatial feature information may be subjected to pixel-wise superposition.
  • The input information of the pooling process can be made consistent in size with the output information of the pooling process.
  • To achieve this, the input information can be padded, that is, the input feature image or video frame is padded in the time dimension and, where needed, also in the space dimension, so that when the pooling kernel pools the padded input information, the size of the output information obtained is consistent with the size of the unpadded input information.
  • For example, with stride s the output size of a pooling operation follows the usual relation out = floor((in + 2p - k)/s) + 1, where p is the per-side padding; for a pooling kernel of size k = 3 and stride 1, a total padding of 2 (p = 1 on each side) can be selected so that the output size equals the input size.
  • For example, a pooling kernel of size 3*3*3 means that the size of the pooling kernel in the two-dimensional plane where the image to be pooled is located is 3*3, where the unit can be pixels or another predetermined length unit, and its length in the time dimension is 3, where the unit may be video duration, for example a video duration of 3 seconds.
  • the number of video frames corresponding to the video duration can be determined by the video duration.
  • the definition of the three-dimensional pooling core may not be limited to this, and the size of the pooling core in the time dimension can also be determined directly by the number of video frames.
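  • A short check of this same-size pooling behavior is sketched below; the framework, sizes, and the max-pooling choice are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def pooled_size(n, k, p, s=1):
    # Standard pooling output-size relation: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

# Kernel 3, stride 1, one pixel/frame of padding on each side (total padding of 2).
assert pooled_size(8, k=3, p=1) == 8        # time dimension unchanged
assert pooled_size(56, k=3, p=1) == 56      # spatial dimensions unchanged

pool = nn.MaxPool3d(kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))
x = torch.randn(1, 64, 8, 56, 56)
assert pool(x).shape == x.shape             # output size matches input size
```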
  • the two-dimensional convolution refers to the convolution performed on the dimensions of the plane where the image of the video frame is located, that is, the two dimensions of width and height.
  • The size of the selected convolution kernel is its size in the two-dimensional space.
  • the spatial feature information can be extracted based on a predetermined fixed-size convolution kernel.
  • Existing neural network models can also be used, such as LeNet-based, AlexNet-based, ResNet-based, GoogLeNet-based, and VGGNet-based convolutional neural networks, to extract the spatial feature information. Therefore, in the process of extracting spatial feature information, there is no need to change the ability of the convolutional neural network to recognize video frames in order to obtain the spatial feature information included in the video frames of the video to be classified.
  • Since any two-dimensional convolutional network can be inserted into the video classification method described in this application, the effect of a three-dimensional convolutional network on collecting temporal feature information is achieved without optimization for specific hardware or deep learning platforms and without relying on a specific network design, so the versatility of the video classification method described in this application can be effectively improved.
  • The input information can be three-dimensionally pooled through the large receptive field context feature extraction branch to obtain pooled information, and then two-dimensional convolution processing can be performed on the pooled information through the large receptive field context feature extraction branch to obtain the temporal feature information.
  • the two-dimensional convolution is based on the convolution operation on the two-dimensional plane where the image of a single video frame is located, without increasing the feature information of the two-dimensional image.
  • the spatial feature information of each frame of the video to be classified is obtained, that is, the feature information of the width W and height H dimensions of each video frame is obtained.
  • The convolution kernel of the two-dimensional convolution can be expressed as {C1, C2, 1, K, K}, where C1 represents the number of channels of the input feature image and C2 represents the number of channels of the output feature image.
  • The position where the "1" is located indicates the time dimension of the convolution kernel, and "1" means that the convolution kernel is not extended in the time dimension, that is, each two-dimensional convolution only convolves the image of a single video frame.
  • K represents the size of the convolution kernel in the two-dimensional space where the video frame is located.
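  • The {C1, C2, 1, K, K} kernel can be read as an ordinary 2D convolution applied to every frame independently. The sketch below (an illustration assuming PyTorch, not part of the original disclosure) verifies that a 3D convolution with a time extent of 1 matches a frame-by-frame 2D convolution with the same weights.

```python
import torch
import torch.nn as nn

C1, C2, K, T = 3, 8, 3, 5   # example channel counts, kernel size, and frame count
conv3d = nn.Conv3d(C1, C2, kernel_size=(1, K, K), padding=(0, K // 2, K // 2), bias=False)
conv2d = nn.Conv2d(C1, C2, kernel_size=K, padding=K // 2, bias=False)
conv2d.weight.data.copy_(conv3d.weight.data.squeeze(2))   # same weights: {C1, C2, 1, K, K} -> {C1, C2, K, K}

x = torch.randn(1, C1, T, 32, 32)
out3d = conv3d(x)
# Apply the 2D convolution to each of the T frames separately and restack along time.
out2d = torch.stack([conv2d(x[:, :, t]) for t in range(T)], dim=2)
print(torch.allclose(out3d, out2d, atol=1e-5))             # True: no temporal mixing
```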
  • the time feature information is extracted through pooling, and the pooling processing may include pooling processing methods such as maximum pooling, average pooling, or global average pooling. For example, when the maximum pooling operation is selected, the pixels to be pooled can be selected according to the pooling kernel, and the pixel with the largest pixel value can be selected as the pixel value after pooling.
  • The three-dimensional pooling kernel can be expressed as {t, K, K}, where t represents the size of the pooling kernel in the time direction, and K represents the size of the pooling kernel in the two-dimensional space where the image is located.
  • When t takes different values, the number of video frames covered by a single pooling operation is also different.
  • The same video frame can serve as an object pooled by different pooling kernels.
  • When the K value in the pooling kernel is greater than 1, the pooling kernel also pools multiple pixels or regions in the two-dimensional space.
  • a pooling operation with padding can be used to fill the edges of the pooled image to ensure the consistency of the image size of the input information and output information before and after pooling.
  • convolution processing is performed on the output information of the pooling process.
  • The pooled output information fuses spatiotemporal information of size t*K*K over adjacent time and space; convolution is then performed on the pooled output information through the two-dimensional convolution to obtain the temporal feature information of the multiple video frames.
  • the convolution operation of the small receptive field core feature extraction branch and the large receptive field context feature extraction branch may use the same convolution parameter to perform the convolution operation in a manner of sharing parameters.
  • The sizes of any two of the three-dimensional pooling kernels may be different, or the sizes of the N three-dimensional pooling kernels may all be the same, or the sizes of some of the three-dimensional pooling kernels may be the same while others differ.
  • The three-dimensional pooling kernels used in the three-dimensional pooling of the large receptive field context feature extraction branch may adopt different sizes in the time dimension or different sizes in the space dimensions.
  • Adjusting the size of the pooling kernel used in the three-dimensional pooling may include adjusting the size of the three-dimensional pooling kernel in the time dimension (the time direction), or its size in the dimensions of the two-dimensional space where the video frame is located, or its size in both the time and space dimensions, to obtain three-dimensional pooling kernels of different sizes.
  • The corresponding spatiotemporal feature information is calculated accordingly, and the spatiotemporal feature information includes the temporal feature information.
  • The size of the pooling kernel can be increased gradually, which may include gradually increasing the size of the pooling kernel in the time dimension, or gradually increasing its size in the two-dimensional space where the video frame is located, or increasing both at the same time, to obtain the pooled feature image; in this way, the temporal feature information of different durations obtained from different pooling kernels is gradually merged to obtain finer-grained spatiotemporal feature information.
  • Since the images corresponding to the spatial feature information and the temporal feature information use the same convolution parameters for the convolution operations, the information represented by corresponding points in the image of the spatial feature information and the image of the temporal feature information is spatially consistent; that is, the spatial feature information and the temporal feature information have the same size, and a strategy of point-by-point addition in space can be adopted to obtain the fused feature information.
  • The fused feature information is obtained by fusing the spatial feature information and the temporal feature information: the spatial feature information captures the spatial features of the video frames through two-dimensional convolution, and the temporal feature information captures the temporal and spatial features of the images through pooling, so the fused feature information includes both the spatial features and the spatiotemporal features of the images in the video to be classified. The fused feature information is synthesized through a fully connected layer, and the video to be classified is classified according to the synthesized fused feature information to obtain a video classification result. For example, a fully connected calculation may be performed on the fused feature information according to preset weight coefficients of the fully connected layer, and the video classification result may be determined by comparing the calculation result with a preset classification standard.
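  • A minimal sketch of how the fused feature information might be turned into a classification result by the fully connected layer is given below; pooling the features to a vector before the linear layer, and the class count, are implementation assumptions for illustration rather than requirements stated here.

```python
import torch
import torch.nn as nn

num_classes = 10                                   # example value
fused = torch.randn(1, 64, 8, 56, 56)              # fused feature information (B, C, T, H, W)

# Collapse time and space, then classify with a fully connected layer.
head = nn.Sequential(nn.AdaptiveAvgPool3d(1),      # global average pooling over T, H, W
                     nn.Flatten(),                 # (B, C)
                     nn.Linear(64, num_classes))   # fully connected classification
logits = head(fused)
category = logits.argmax(dim=1)                    # predicted class index
```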
  • The video classification model may include two or more feature extraction layers, through which two or more spatiotemporal feature images can be extracted; the video to be classified can itself be regarded as a kind of spatiotemporal feature image.
  • the feature extraction layer includes two feature extraction sublayers.
  • The feature extraction sublayer may be referred to as a SmallBig unit for short.
  • The feature extraction layer in the video classification model includes two feature extraction sublayers, namely SmallBig unit 1 and SmallBig unit 2; the fused feature information extracted by the previous feature extraction sublayer, SmallBig unit 1, can be used as the input of the next-level feature extraction sublayer, SmallBig unit 2.
  • the fully connected layer performs video classification and outputs the category of the video.
  • The video to be classified is input to the first-level feature extraction sublayer SmallBig unit 1, and a first two-dimensional convolution operation is performed on the multiple video frames included therein to obtain the spatial feature information included in the multiple video frames.
  • A first pooling operation is performed on the video frames of the video to be classified in the time dimension, that is, the multiple video frames included in the video to be classified are pooled using a three-dimensional pooling kernel with a predetermined duration parameter.
  • The convolution parameters of the first convolution operation are then used to perform a second two-dimensional convolution operation on the pooled image to obtain the temporal feature information.
  • The spatial feature information is fused with the temporal feature information to obtain the fused feature information.
  • the corresponding pixels of the image corresponding to the spatial feature information and the temporal feature information are pixel-added to obtain the fusion feature information including the spatial feature and the temporal feature
  • the fusion feature information may include multiple frames of images.
  • the fusion feature information is input to the second-level feature extraction sublayer SmallBig unit 2, and the image of each channel in the fusion feature information is subjected to the third convolution operation to further obtain the spatial features in the fusion feature information of the SmallBig unit 1.
  • a second pooling operation is performed on the fusion feature information in the time dimension according to the time sequence of the channels, and the pooling information obtained by the second pooling operation Perform the fourth convolution operation to further extract the time feature information of multiple images in the fusion feature information of the SmallBig unit 1.
  • the fourth convolution operation and the third convolution operation use the same convolution parameters.
  • Fig. 6 is a schematic diagram of implementing video classification through three feature extraction sublayers provided by an embodiment of the application.
  • a third-level feature extraction sublayer SmallBig unit 3 is added.
  • The fused feature information output by the first-level feature extraction sublayer SmallBig unit 1 is processed by the second-level feature extraction sublayer SmallBig unit 2, and the fused feature information output by the second-level feature extraction sublayer SmallBig unit 2 is obtained through its fusion processing.
  • the third-level feature extraction sub-layer SmallBig unit 3 separately processes the fusion feature information output by the second-level feature extraction sub-layer SmallBig unit 2 through two-dimensional convolution and pooling, and further extracts temporal feature information and space The feature information is fused to obtain the fused feature information output by the SmallBig unit 3 of the third-level feature extraction sublayer.
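  • The cascade of Fig. 5 and Fig. 6 could be sketched as below, reusing the illustrative SmallBigUnit class from the earlier sketch; the channel widths and class count are example values, not values given in this application.

```python
import torch.nn as nn

# Output of each unit becomes the input of the next; the classification head follows.
# SmallBigUnit and the channel widths below come from the earlier illustrative sketch.
backbone = nn.Sequential(
    SmallBigUnit(3, 64),     # SmallBig unit 1: video frames -> fused feature information
    SmallBigUnit(64, 64),    # SmallBig unit 2: refines the fused feature information
    SmallBigUnit(64, 64),    # SmallBig unit 3 (the Fig. 6 variant)
)
classifier = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 10))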
  • the feature extraction layer is also used to superimpose the to-be-classified video with the fusion feature information output by the feature extraction layer to form a residual connection to update the fusion feature information.
  • The fused data, which includes the temporal feature information and the spatial feature information calculated by the third-level feature extraction sublayer SmallBig unit 3, is superimposed with the video to be classified; that is, the temporal feature information and spatial feature information extracted by the third-level feature extraction sublayer are merged with the input to form the residual connection structure. In this way, the newly added parameters will not affect the parameters of the original pre-trained image network during training, which helps preserve the pre-training effect of the image network, and the introduction of the residual helps speed up convergence and improve the training efficiency of the video classification model.
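  • One possible way to express this residual connection is sketched below (again an illustration only; the 1x1x1 projection used when the channel counts differ is an assumption, not something stated in this application).

```python
import torch
import torch.nn as nn

class ResidualSmallBig(nn.Module):
    """Adds the block input back onto the fused output, forming a residual connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unit = SmallBigUnit(in_ch, out_ch)   # illustrative unit from the earlier sketch
        # 1x1x1 projection so the input can be added when channel counts differ (assumption).
        self.proj = nn.Identity() if in_ch == out_ch else nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.proj(x) + self.unit(x)        # residual: input superimposed on fused features
```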
  • The convolution kernel used by the first-level feature extraction subunit is the first convolution kernel, and its pooling uses the first pooling kernel.
  • The second-level feature extraction subunit uses the second convolution kernel for convolution and the second pooling kernel for pooling.
  • The third-level feature extraction subunit uses the third convolution kernel for convolution and the third pooling kernel for pooling.
  • The first convolution kernel and the third convolution kernel are smaller in size than the second convolution kernel used in the second convolution operation.
  • For example, the sizes of the first convolution kernel and the third convolution kernel are 1*1*1, and the size of the second convolution kernel is 1*3*3.
  • The first pooling kernel and the second pooling kernel may be smaller than the third pooling kernel used in the third pooling operation.
  • For example, the sizes of the first pooling kernel and the second pooling kernel are 3*3*3, and the size of the third pooling kernel is T*3*3.
  • T can be the video duration or the number of video frames corresponding to the video duration; similarly, t may be given as a duration or as a number of video frames.
  • In this way, the temporal characteristics of the video frames over the entire video length can be extracted.
  • The output fused feature information therefore has a global temporal receptive field.
  • two spatially local receptive fields have been added, so that the spatial receptive field of the overall module has also been increased.
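  • The kernel sizes described above could be summarized by the illustrative configuration below; the dictionary layout is purely for readability, and the temporal extent T of the third pooling kernel follows the statement that it spans the entire video length.

```python
# Illustrative summary of the kernel sizes described above; values are as stated in the text.
T = 8  # temporal length of the input, e.g. number of input frames (example value)
config = [
    {"level": 1, "conv_kernel": (1, 1, 1), "pool_kernel": (3, 3, 3)},
    {"level": 2, "conv_kernel": (1, 3, 3), "pool_kernel": (3, 3, 3)},
    {"level": 3, "conv_kernel": (1, 1, 1), "pool_kernel": (T, 3, 3)},  # global temporal pooling
]
```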
  • the video classification system described in this application can be trained using optimization algorithms such as stochastic gradient descent (SGD), and the data set can be mainstream video task data.
  • The video classification method described in this application can provide higher accuracy, faster convergence, and better robustness, and is comparable to the current state of the art.
  • For example, video classification and recognition with only 8 frames of input is better than the 32-frame Nonlocal-R50 (non-local R50 network), and it uses 4.9 times fewer floating-point operations (GFLOPs) than the 128-frame Nonlocal-R50 while achieving the same accuracy.
  • The performance of the 8-frame input of the video classification method described in this application is also better than the current state-of-the-art SlowFast-R50 network (a combined fast-and-slow R50 network) with 36-frame input.
  • the present application also provides a method for training a video classification model.
  • The method includes: obtaining a sample video in a sample video set and a sample classification result of the sample video, the sample video including a plurality of video frames; extracting the spatial feature information in the sample video through two-dimensional convolution; extracting the temporal feature information in the sample video through pooling; fusing the spatial feature information and the temporal feature information to obtain fused feature information, and performing fully connected processing on the fused feature information to obtain a model classification result; and comparing the model classification result with the sample classification result, correcting the parameters of the two-dimensional convolution, and returning to the step of extracting the spatial feature information in the sample video through two-dimensional convolution, until the difference between the model classification result and the sample classification result meets a preset condition, thereby obtaining the trained video classification model.
  • The structure of the video classification model is consistent with the neural network model adopted by the video classification method shown in FIG. 3, and will not be repeated here.
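  • A compact sketch of this training procedure is given below; it assumes PyTorch, cross-entropy loss, and the illustrative model pieces from the earlier sketches, and apart from SGD (which is named above) none of these specific choices are mandated by this application.

```python
import torch
import torch.nn as nn

model = nn.Sequential(backbone, classifier)               # illustrative model from the earlier sketches
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(sample_clip, sample_label):
    """One update: compare model and sample classification results, correct the 2D conv parameters."""
    optimizer.zero_grad()
    model_result = model(sample_clip)                      # model classification result
    loss = criterion(model_result, sample_label)           # difference from the sample classification result
    loss.backward()
    optimizer.step()                                       # corrects the two-dimensional convolution parameters
    return loss.item()

# Repeat over the sample video set until the loss meets the preset condition.
loss = train_step(torch.randn(2, 3, 8, 56, 56), torch.tensor([1, 0]))
```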
  • FIG. 7 is a schematic diagram of a video classification device provided by an embodiment of the application, and the video classification device includes:
  • the video to be classified acquisition unit 701 is configured to acquire a video to be classified, where the video to be classified includes a plurality of video frames;
  • the classification unit 702 is configured to input the video to be classified into a trained video classification model for processing, and output the classification result of the video to be classified; wherein, the video classification model includes a feature extraction layer and a fully connected layer, so The feature extraction layer is used for extracting the spatial feature information of the multiple video frames through two-dimensional convolution, and extracting the temporal feature information of the multiple video frames through pooling, and fusing the spatial feature information and temporal feature information Output fusion feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  • the video classification device described in FIG. 7 corresponds to the video classification method shown in FIG. 3. With the video classification device, the video classification method described in any of the above embodiments can be executed.
  • Fig. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application.
  • the video classification device 8 of this embodiment includes a processor 80, a memory 81, and a computer program 82 stored in the memory 81 and running on the processor 80, such as a video classification program.
  • the processor 80 executes the computer program 82, the steps in the foregoing embodiments of the video classification method are implemented.
  • the processor 80 executes the computer program 82, the function of each module/unit in the foregoing device embodiments is realized.
  • The computer program 82 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete this application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the video classification device 8.
  • the video classification device 8 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the video classification device may include, but is not limited to, a processor 80 and a memory 81.
  • FIG. 8 is only an example of the video classification device 8 and does not constitute a limitation on the video classification device 8. It may include more or fewer components than shown in the figure, or combine certain components, or use different components.
  • For example, the video classification device may also include input and output devices, network access devices, buses, and so on.
  • The so-called processor 80 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 81 may be an internal storage unit of the video classification device 8, for example, a hard disk or a memory of the video classification device 8.
  • The memory 81 may also be an external storage device of the video classification device 8, such as a plug-in hard disk equipped on the video classification device 8, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the memory 81 may also include both an internal storage unit of the video classification device 8 and an external storage device.
  • the memory 81 is used to store the computer program and other programs and data required by the video classification device.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • the disclosed device/terminal device and method may be implemented in other ways.
  • the device/terminal device embodiments described above are only illustrative.
  • The division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • All or part of the processes in the methods of the above embodiments of this application can also be completed by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium. When the program is executed by the processor, it can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signal, telecommunications signal, and software distribution media, etc.
  • The content contained in the computer-readable medium can be appropriately added or deleted according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

Abstract

A video classification method and apparatus, and a device, and a computer readable storage medium. The video classification method comprises: obtaining a video to be classified, said video comprising multiple video frames (S301); and inputting said video to a trained video classification model for processing, and outputting a classification result of said video, wherein the video classification model comprises a feature extraction layer and a fully connected layer, the feature extraction layer is used for extracting spatial feature information by means of a two-dimensional convolution, extracting temporal feature information by means of pooling, and fusing the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used for performing full connection processing on the fused feature information to obtain the classification result (S302). According to the method, with respect to calculation of a three-dimensional convolution kernel, feature information of a temporal dimension of said video is obtained by means of pooling, and the used two-dimensional convolution can greatly reduce the calculation of convolution parameters, thereby facilitating reduction of the computational complexity of video classification.

Description

Video classification method, apparatus, device and computer-readable storage medium

Technical Field

This application belongs to the field of image processing, and in particular relates to a video classification method, apparatus, device, and computer-readable storage medium.

Background

To facilitate image management, image content can be recognized and classified through deep learning. In recent years, as convolutional neural networks have made major breakthroughs in image classification tasks, the accuracy of image classification by two-dimensional convolutional neural networks has even exceeded that of human classification.

While two-dimensional convolutional neural networks are used to classify images accurately, they can also be applied to the classification of videos composed of images. Since video data has one more time dimension than static pictures, a three-dimensional convolution kernel that includes the time dimension is usually used to extract features in time and space simultaneously in order to capture the temporal information in the video. However, convolution with a three-dimensional kernel introduces extra parameters compared with two-dimensional convolution, resulting in an increased amount of calculation.
Technical Problem

In view of this, the embodiments of the present application provide a video classification method, apparatus, device, and computer-readable storage medium, to solve the problem in the prior art that performing video classification through convolution with a three-dimensional convolution kernel adds extra parameters compared with two-dimensional convolution, which increases the amount of calculation.
Technical Solutions

To solve the above technical problem, the technical solutions adopted in the embodiments of this application are as follows:

A first aspect of the embodiments of the present application provides a video classification method, the method including:

acquiring a video to be classified, where the video to be classified includes a plurality of video frames;

inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified, where the video classification model includes a feature extraction layer and a fully connected layer; the feature extraction layer is used to extract spatial feature information of the multiple video frames through two-dimensional convolution, extract temporal feature information of the multiple video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information; and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.

With reference to the first aspect, in a first possible implementation of the first aspect, the feature extraction layer includes N feature extraction sublayers, N ≥ 1; the input information of the first of the N feature extraction sublayers is the multiple video frames, the output information of the previous feature extraction sublayer is the input information of the next feature extraction sublayer, and the output information of the Nth feature extraction sublayer is the fused feature information output by the feature extraction layer; each of the N feature extraction sublayers includes a large receptive field context feature extraction branch and a small receptive field core feature extraction branch, and the processing of the input information by each of the N feature extraction sublayers includes:

pooling the input information through the large receptive field context feature extraction branch to extract temporal feature information of the input information;

performing two-dimensional convolution on the input information through the small receptive field core feature extraction branch to extract spatial feature information of the input information;

fusing the temporal feature information extracted by the large receptive field context feature extraction branch and the spatial feature information extracted by the small receptive field core feature extraction branch to obtain the output information.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, pooling the input information through the large receptive field context feature extraction branch to extract the temporal feature information of the input information includes:

performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain pooled information;

performing two-dimensional convolution on the pooled information through the large receptive field context feature extraction branch to obtain the temporal feature information.

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing three-dimensional pooling on the input information through the large receptive field context feature extraction branch to obtain the pooled information includes:

pooling the input information with a three-dimensional pooling kernel {t, K, K} in the large receptive field context feature extraction branch to obtain the pooled information, where t is the size of the kernel in the time direction and is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel defines the size of the pooling pixels selected in a single pooling calculation.

With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, among the N three-dimensional pooling kernels included in the feature extraction layer, the N three-dimensional pooling kernels are all the same size, or the N three-dimensional pooling kernels are all of different sizes, or some of the N three-dimensional pooling kernels are the same size; the three-dimensional pooling kernel defines the size of the pooling pixels selected in a single pooling calculation.

With reference to the third or fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the N three-dimensional pooling kernels being of completely different sizes includes:

gradually increasing the size of the three-dimensional pooling kernel as feature information is extracted.

With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, gradually increasing the size of the three-dimensional pooling kernel includes:

gradually increasing the size of the three-dimensional pooling kernel in the time direction;

or gradually increasing the size of the three-dimensional pooling kernel in the dimensions of the two-dimensional space where the video frame is located;

or gradually increasing both the size of the three-dimensional pooling kernel in the time direction and its size in the dimensions of the two-dimensional space where the video frame is located.
With reference to the second possible implementation of the first aspect, in a seventh possible implementation of the first aspect, the convolution parameters of the two-dimensional convolution in the large receptive field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution in the small receptive field core feature extraction branch.

With reference to the first aspect, in an eighth possible implementation of the first aspect, the image size corresponding to the input information of the pooling process is consistent with the image size corresponding to the output information of the pooling process.

With reference to the eighth possible implementation of the first aspect, in a ninth possible implementation of the first aspect, the input feature image or video frame is padded in the time dimension or the space dimension, so that after the pooling kernel pools the padded input information, the image size corresponding to the obtained output information is consistent with the image size corresponding to the input information.

With reference to the first aspect, in a tenth possible implementation of the first aspect, fusing the spatial feature information and the temporal feature information to output the fused feature information includes:

superimposing the image of the spatial feature information and the image of the temporal feature information to generate the fused feature information.
According to a second aspect, an embodiment of the present application provides a video classification apparatus, the apparatus including:
a to-be-classified video obtaining unit, configured to obtain a video to be classified, where the video to be classified includes a plurality of video frames; and
a classification unit, configured to input the video to be classified into a trained video classification model for processing and output a classification result of the video to be classified, where the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is configured to extract spatial feature information of the plurality of video frames through two-dimensional convolution, extract temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is configured to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
A third aspect of the embodiments of the present application provides a video classification device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, causes the video classification device to implement the video classification method according to any one of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the video classification method according to any one of the first aspect.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: the classification model extracts the spatial feature information of the plurality of video frames in the video to be classified through two-dimensional convolution, extracts the temporal feature information of the plurality of video frames through pooling, fuses the temporal feature information and the spatial feature information, and obtains the classification result through the fully connected layer. Since the temporal feature information of the video to be classified is obtained through pooling, the two-dimensional convolution calculation adopted in the present application retains the temporal feature information while greatly reducing the number of convolution parameters compared with three-dimensional convolution kernel calculation, which helps reduce the amount of computation for video classification. Moreover, the embodiments of the present application can be inserted into any two-dimensional convolutional network to classify videos, which helps improve the diversity and generality of video classification methods.
Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings required in the description of the embodiments or exemplary technologies. Obviously, the accompanying drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of a video classification application scenario provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of video classification using three-dimensional convolution in the prior art;
FIG. 3 is a schematic flowchart of the implementation of a video classification method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the implementation of a video classification method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an implementation of video classification provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of another implementation of video classification provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a video classification apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application.
Embodiments of the Present Invention
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth to provide a thorough understanding of the embodiments of the present application. However, it should be clear to a person skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary details do not obscure the description of the present application.
To illustrate the technical solutions described in the present application, specific embodiments are described below.
With the rise of video data, video classification technology is needed in more and more scenarios. Classifying and managing videos by the video classification method described in the embodiments of the present application can effectively improve the convenience of video use.
For example, in the field of intelligent surveillance, video classification technology is used to classify collected surveillance videos and determine whether the video content is abnormal. The video classification method described in the present application is insensitive to how fast the action changes between frames and can effectively model actions of different durations. Classifying surveillance videos through this modeling can help users quickly locate key surveillance information, or send abnormality alerts to monitoring personnel in time, so that the monitoring personnel can handle abnormalities in the surveillance videos promptly.
For example, when a large number of videos are stored on a device, video classification technology can classify them into videos of different scenes, different moods, different styles, and so on, so that users can quickly find the videos they need.
For example, in intelligent sports training or video-assisted refereeing, there are sports videos with faster actions, such as basketball shooting, gymnastics, or speed skating, as well as sports videos with slower actions, such as yoga. Because the video classification method described in the present application is insensitive to the speed and duration of motion, it can be used to classify the actions in such sports videos.
For another example, as shown in FIG. 1, in a video entertainment platform, the platform server receives a self-shot video uploaded by terminal A and classifies the uploaded video to obtain the category of the video uploaded by terminal A. As the number of uploaded videos grows, the number of videos in each category also grows. When another terminal, such as terminal B, browses videos, the category of the video browsed by terminal B is obtained from the pre-classification results. The platform can then search for other videos of the same category according to the category of the video browsed by terminal B and recommend them to terminal B, improving the user's video browsing experience.
However, in the currently common video classification algorithms, as shown in FIG. 2, a three-dimensional convolution kernel containing temporal information, such as a 3*1*1 temporal convolution kernel, is selected to perform the convolution operation on the video to be classified. The three-dimensional convolution kernel covers the width W and height H of the image as well as the duration T. During the convolution calculation, in addition to the computation of the spatial parameters, such as those in the width W and height H dimensions of the image shown in FIG. 2, parameters in the time dimension must also be computed. Compared with a conventional two-dimensional convolution kernel, the three-dimensional convolution kernel adds parameter computation in the time dimension, introduces a large number of additional parameters, and increases the amount of computation for video classification.
To reduce the amount of computation in video classification, an embodiment of the present application provides a video classification method. As shown in FIG. 3, the video classification method includes:
In step S301, a video to be classified is obtained, where the video to be classified includes a plurality of video frames.
The video to be classified in the embodiments of the present application may be a video stored in a user terminal, a video collected by a monitoring device, or a video uploaded by a platform user and received by a video entertainment platform. When the video is a video collected by a monitoring device, the video collected in real time may be divided into several sub video segments according to a preset time period, and the collected sub video segments are classified to determine whether an abnormality exists in the sub video segments.
The video to be classified includes a plurality of video frames, and the plurality of video frames are arranged in chronological order. From the video to be classified, the spatial information of each video frame, namely its width W and height H, can be determined. From the time interval between video frames and the initial playback time, the playback time corresponding to each video frame can be determined.
In step S302, the video to be classified is input into a trained video classification model for processing, and a classification result of the video to be classified is output, where the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract the spatial feature information of the plurality of video frames through two-dimensional convolution, extract the temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
The feature extraction layer may include a large receptive field context feature extraction branch and a small receptive field core feature extraction branch. The large receptive field context feature extraction branch is used to extract temporal feature information, or spatio-temporal feature information that includes temporal feature information; the context features here are the temporal feature information. The large receptive field can be obtained by cascading multiple feature extraction sub-layers, or by gradually increasing the size of the three-dimensional pooling kernel. The small receptive field core feature extraction branch is used to extract the two-dimensional spatial feature information of each video frame in the video to be classified. The feature extraction layer is also used to fuse the extracted temporal feature information and spatial feature information to obtain the fused feature information. That is, through the dual-branch structure, both the context information extracted by the large receptive field context feature extraction branch and the core features extracted by the small receptive field core feature extraction branch can be obtained effectively.
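Purely by way of illustration (and not as part of the disclosed embodiments), the dual-branch structure described above can be sketched in PyTorch-style code; the class name SmallBigBlock, the tensor layout (batch, channels, T, H, W), and the default kernel sizes are assumptions made for this sketch:

```python
import torch
import torch.nn as nn

class SmallBigBlock(nn.Module):
    """Illustrative sketch of one dual-branch feature extraction sub-layer.

    Tensors are assumed to have shape (batch, channels, T, H, W). The small
    receptive field branch applies a per-frame 2D convolution (kernel 1*K*K);
    the large receptive field branch first pools over a t*K*K spatio-temporal
    neighborhood and then reuses the SAME convolution weights.
    """
    def __init__(self, c_in, c_out, k=3, t=3):
        super().__init__()
        # One convolution shared by both branches (parameter sharing).
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=(1, k, k),
                              padding=(0, k // 2, k // 2), bias=False)
        # 3D pooling kernel {t, K, K}; the padding keeps the output the same
        # size as the input so the two branches can be added point by point.
        self.pool = nn.MaxPool3d(kernel_size=(t, k, k), stride=1,
                                 padding=(t // 2, k // 2, k // 2))

    def forward(self, x):
        spatial = self.conv(x)               # small receptive field branch
        temporal = self.conv(self.pool(x))   # large receptive field branch
        return spatial + temporal            # point-wise fusion

clip = torch.randn(2, 64, 8, 56, 56)         # 2 clips, 64 channels, 8 frames
fused = SmallBigBlock(64, 64)(clip)          # same T, H, W as the input
```

Stacking several such blocks, with the fused output of one block feeding the next, yields the cascade of feature extraction sub-layers described below.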
In a possible implementation, in the video classification model, the feature extraction layer may include N feature extraction sub-layers, where N is greater than or equal to 1.
For example, the feature extraction layer may include one feature extraction sub-layer, the fused feature information is output by this feature extraction sub-layer, and the fully connected layer performs fully connected processing on the fused feature information to obtain the classification result.
When N is greater than or equal to 2, the output information of the previous feature extraction sub-layer (the sub-layer of the previous level) serves as the input information of the next feature extraction sub-layer (the sub-layer of the next level). For example, the fused feature information output by the i-th feature extraction sub-layer serves as the input information of the (i+1)-th feature extraction sub-layer. The fused feature information output by the i-th feature extraction sub-layer combines temporal feature information and spatial feature information, and the (i+1)-th feature extraction sub-layer can further extract feature information from it through pooling, where i is greater than or equal to 1 and less than N.
The fused feature information refers to feature information obtained by fusing the temporal feature information and the spatial feature information. The fusion processing may be a superposition of feature information; for example, the image corresponding to the temporal feature information and the image corresponding to the spatial feature information may be superimposed pixel by pixel.
To make the image sizes corresponding to the temporal feature information and the spatial feature information consistent at fusion time, when the input information is pooled, the image size corresponding to the input information of the pooling processing can be kept the same as the image size corresponding to the output information of the pooling processing.
In one implementation, the input information may be padded, that is, the input feature image or video frame is padded in the time dimension, and optionally also in the spatial dimensions, so that after the pooling kernel pools the padded input information, the size of the obtained output information is the same as the size of the unpadded input information.
For example, once the size of the input information is n, the size of the pooling kernel is f, the stride is s, the padding size is p, and the size of the output information is o, the required padding can be calculated from the formula:
o = floor((n + p − f) / s) + 1, that is, p = (o − 1) * s + f − n, with p denoting the total amount of padding added along the dimension concerned,
to determine the amount of padding that needs to be filled in.
For example, for a pooling operation with a 3*3*3 pooling kernel and a stride of 1, in order for the output information to have the same size as the input information, a padding size of 2 can be chosen, that is, one element of padding on each side of each dimension.
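As a quick, illustrative check of this relation (a sketch assuming p denotes the total padding along one dimension; the helper name required_padding is not from the original disclosure):

```python
def required_padding(n, f, s):
    """Total padding p such that the pooled output size o equals the input
    size n, using o = (n + p - f) // s + 1, i.e. p = (o - 1) * s + f - n."""
    o = n  # we want the output size to match the input size
    return (o - 1) * s + f - n

# For a 3*3*3 pooling kernel with stride 1, every dimension needs p = 2
# in total (one element on each side), independent of the input size:
print(required_padding(8, 3, 1))    # -> 2  (time dimension, e.g. 8 frames)
print(required_padding(56, 3, 1))   # -> 2  (spatial dimension, e.g. 56 pixels)
```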
Here, a pooling kernel size of 3*3*3 means that the pooling kernel has a size of 3*3 in the two-dimensional plane of the pooled image, where the unit may be pixels or another predetermined length unit, and a length of 3 in the time dimension, where the unit may be video duration, for example 3 seconds of video; the number of video frames corresponding to that duration can be determined from the video duration. Of course, the definition of the three-dimensional pooling kernel is not limited to this; the size of the pooling kernel in the time dimension may also be specified directly by the number of video frames.
The two-dimensional convolution refers to convolution performed over the dimensions of the plane in which the image of a video frame lies, namely the width and height dimensions. The selected convolution kernel is a convolution kernel in two-dimensional space.
When spatial feature information is extracted by two-dimensional convolution, the extraction can be completed based on a predetermined convolution kernel of fixed size. Of course, an existing neural network model may also be selected, for example a convolutional neural network with the LeNet, AlexNet, ResNet, GoogLeNet, or VGGNet architecture, to extract the spatial feature information. Therefore, in the process of extracting the spatial feature information, there is no need to change the frame recognition capability of the convolutional neural network itself in order to obtain the spatial feature information of the video frames in the video to be classified, that is, the feature information of the video frames in the width W and height H dimensions.
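For illustration only, one common way to apply an off-the-shelf 2D backbone frame by frame is to fold the time dimension into the batch dimension; the use of torchvision's ResNet-50 below is an assumption made for the example, not a requirement of the method:

```python
import torch
import torchvision

# Fold the time dimension into the batch dimension so that an unmodified
# 2D backbone processes every frame of the clip independently.
backbone = torchvision.models.resnet50()        # any 2D CNN could be used here
clip = torch.randn(2, 3, 8, 224, 224)           # (batch, C, T, H, W)
b, c, t, h, w = clip.shape
frames = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
features = backbone(frames)                     # per-frame spatial features
features = features.reshape(b, t, -1)           # (batch, T, feature_dim)
```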
Since any two-dimensional convolutional network can be inserted into the video classification method described in the present application, the effect of a three-dimensional convolutional network in capturing temporal feature information can be achieved without requiring specialized hardware or deep learning platform optimizations, and without relying on a specific network design, which effectively improves the generality of the video classification method described in the present application.
Moreover, compared with currently used plug-and-play video recognition modules, such as the temporal shift module (TSM) and the non-local neural network (nonlocal) video recognition module, this helps reduce the amount of computation in the classification process while ensuring the accuracy of the classification results.
In a possible implementation, the input information may first be subjected to three-dimensional pooling processing by the large receptive field context feature extraction branch to obtain pooling information, and the pooling information may then be subjected to two-dimensional convolution processing by the large receptive field context feature extraction branch to obtain the temporal feature information.
For example, in the schematic structural diagram of the video classification method shown in FIG. 4, the two-dimensional convolution performs the convolution operation on the two-dimensional plane in which the image of a single video frame lies, and obtains the spatial feature information of each frame of the video to be classified, that is, the feature information in the width W and height H dimensions of each video frame, without increasing the complexity of extracting feature information from the two-dimensional image.
In one implementation, the convolution kernel of the two-dimensional convolution can be expressed as {C1, C2, 1, K, K}, where C1 denotes the number of channels of the input feature image, C2 denotes the number of channels of the output feature image, the "1" marks the time dimension of the convolution kernel and means that the kernel does not extend in the time dimension, that is, each two-dimensional convolution operates only on the image of a single video frame, and K denotes the size of the convolution kernel in the two-dimensional space of the video frame.
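In a PyTorch-style sketch (an assumption made for illustration), such a kernel can be realized as a 3D convolution whose temporal extent is 1, so it acts as a per-frame 2D convolution on a (batch, C, T, H, W) tensor:

```python
import torch
import torch.nn as nn

C1, C2, K = 64, 128, 3
# {C1, C2, 1, K, K}: temporal extent 1, so each frame is convolved separately.
conv2d_on_video = nn.Conv3d(C1, C2, kernel_size=(1, K, K),
                            padding=(0, K // 2, K // 2), bias=False)

x = torch.randn(2, C1, 8, 56, 56)   # (batch, C1, T, H, W)
y = conv2d_on_video(x)              # (2, C2, 8, 56, 56): only W and H are mixed
```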
The temporal feature information is extracted through pooling, and the pooling processing may include max pooling, average pooling, global average pooling, or other pooling methods. For example, when max pooling is selected, the pixels to be pooled are selected according to the pooling kernel, and the pixel with the largest value is taken as the pooled value.
In one implementation, the three-dimensional pooling kernel can be expressed as {t, K, K}, where t denotes the size of the pooling kernel in the time direction and K denotes the size of the pooling kernel in the two-dimensional space of the image. In particular, t = 3 or t = T (the video length, or the number of video frames or images corresponding to the video duration) may be set. Since the pooling operation does not require convolution calculations and only needs numerical comparisons, the amount of computation it requires is very small.
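A corresponding illustrative sketch of the {t, K, K} pooling kernel, again assuming a PyTorch-style API and same-size padding (shown here for t = 3; for t = T the kernel would span the whole clip in the time direction):

```python
import torch
import torch.nn as nn

t, K = 3, 3
# {t, K, K} pooling kernel: only value comparisons, no learnable parameters.
pool = nn.MaxPool3d(kernel_size=(t, K, K), stride=1,
                    padding=(t // 2, K // 2, K // 2))

x = torch.randn(2, 64, 8, 56, 56)   # (batch, C, T, H, W)
y = pool(x)                         # same shape as x thanks to the padding
```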
Different choices of the time-direction size parameter t correspond to different numbers of video frames involved in the pooling process. Depending on the setting of the pooling stride, the same video frame may be pooled by different pooling kernels. When the K value of the pooling kernel is greater than 1, the pooling kernel also pools multiple pixels or regions in the two-dimensional space. To facilitate the subsequent fusion, a pooling operation with padding can be used to pad the edges of the pooled image, ensuring that the image sizes of the input information and the output information are consistent before and after pooling.
After the pooling processing, convolution processing is performed on the output information of the pooling processing. The pooled output information combines spatio-temporal information of size t*K*K from neighboring positions in time and space; a convolution operation is then performed on the pooled output information by means of two-dimensional convolution to obtain the temporal feature information of the plurality of video frames.
In one implementation, the convolution operations of the small receptive field core feature extraction branch and of the large receptive field context feature extraction branch can share parameters, that is, use the same convolution parameters. As a result, extracting the temporal feature information does not require introducing new convolution parameters for computing features in the time dimension, so the temporal feature information can be obtained without increasing the number of computed parameters, reducing the amount of computation of the video classification model.
Among the N three-dimensional pooling kernels included in the N feature extraction sub-layers, any two three-dimensional pooling kernels may differ in size, or the N three-dimensional pooling kernels may all be of the same size, or some of the three-dimensional pooling kernels may be of the same size while others differ.
In a possible implementation, the three-dimensional pooling kernels used in the three-dimensional pooling processing of the large receptive field context feature extraction branch may have different sizes in the time dimension or different sizes in the spatial dimensions.
For example, adjusting the size of the pooling kernel used for the three-dimensional pooling may include adjusting the size of the three-dimensional pooling kernel in the time dimension (the time direction), or its size in the two-dimensional space of the video frame, or its size in both the time and spatial dimensions, to obtain three-dimensional pooling kernels of different sizes; through these pooling kernels of different sizes, the corresponding spatio-temporal feature information, which includes the temporal feature information, is calculated.
In a possible implementation, the size of the pooling kernel may be increased step by step, including gradually increasing the size of the pooling kernel in the time dimension, or gradually increasing its size in the two-dimensional space of the video frame, or increasing both at the same time, to obtain the pooled feature images, so that the temporal feature information of different durations obtained by pooling with the different pooling kernels is progressively fused into finer-grained spatio-temporal feature information.
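A minimal illustrative sketch of such progressive enlargement in the time direction follows; the concrete sizes 3, 5, 7 are assumptions chosen for the example, not values from the disclosure:

```python
import torch.nn as nn

# Pooling kernels that grow with depth in the time direction only
# (3 -> 5 -> 7 frames); odd sizes keep the same-size padding simple.
pools = nn.ModuleList([
    nn.MaxPool3d(kernel_size=(t, 3, 3), stride=1, padding=(t // 2, 1, 1))
    for t in (3, 5, 7)
])
```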
When the temporal feature information is extracted, as shown in FIG. 4, since the images corresponding to the spatial feature information and to the temporal feature information are obtained with the same convolution parameters, the information represented by corresponding points of the two feature maps is spatially consistent, that is, the spatial feature information and the temporal feature information have the same size, and a point-by-point addition strategy in space can be used to obtain the fused feature information.
The fused feature information is obtained by fusing the spatial feature information and the temporal feature information; the spatial feature information captures the spatial features of the video frames through two-dimensional convolution, and the temporal feature information captures the spatio-temporal features of the images through pooling, so the fused feature information contains both the spatial features and the spatio-temporal features of the images in the video to be classified. The fused feature information is synthesized by the fully connected layer, and the video to be classified is classified according to the synthesized fused feature information to obtain the video classification result. For example, a fully connected computation may be performed on the fused feature information according to preset fully connected layer weight coefficients, and the computation result may be compared with preset classification criteria to determine the video classification result.
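An illustrative sketch of the point-wise fusion and the fully connected classification head follows; the global average pooling step, the tensor shapes, and the class count of 400 are assumptions made for the example:

```python
import torch
import torch.nn as nn

num_classes = 400                               # illustrative number of categories
spatial_feat = torch.randn(2, 256, 8, 7, 7)     # output of the 2D convolution branch
temporal_feat = torch.randn(2, 256, 8, 7, 7)    # output of the pooling branch

fused = spatial_feat + temporal_feat            # point-by-point fusion (same shape)
pooled = fused.mean(dim=(2, 3, 4))              # average over T, H, W -> (2, 256)
fc = nn.Linear(256, num_classes)                # fully connected classification layer
pred = fc(pooled).argmax(dim=1)                 # predicted category for each clip
```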
Since the video classification process does not require additional convolution parameter calculations in the time dimension, and only a simple pooling operation is needed to effectively obtain the spatio-temporal feature information of the video to be classified, the number of parameters to compute is reduced and the computational complexity of video classification is lowered.
In a possible implementation of the present application, the video classification model may include two or more feature extraction layers, through which two or more spatio-temporal feature images can be extracted (the video to be classified being one kind of spatio-temporal feature image). For example, in the schematic diagram of video classification shown in FIG. 5, the feature extraction layer includes two feature extraction sub-layers; in the embodiments of the present application, a feature extraction sub-layer may be referred to as a SmallBig unit for short. As shown in FIG. 5, the feature extraction layer of the video classification model includes two feature extraction sub-layers, SmallBig unit 1 and SmallBig unit 2; the fused feature information extracted by the earlier sub-layer, SmallBig unit 1, serves as the input of the next-level sub-layer, SmallBig unit 2, and based on the fused feature information obtained by the second-level sub-layer, SmallBig unit 2, the fully connected layer classifies the video and outputs the category to which the video belongs.
Specifically, as shown in FIG. 5, the video to be classified is input into the first-level feature extraction sub-layer, SmallBig unit 1, and a first convolution operation of two-dimensional convolution is performed on the plurality of video frames it contains to obtain the spatial feature information of the plurality of video frames. A first pooling operation in the time dimension is then performed on the video frames of the video to be classified, that is, the plurality of video frames are pooled using a three-dimensional pooling kernel with a predetermined duration parameter. The pooled image is further convolved with a second two-dimensional convolution operation whose convolution parameters are shared with those of the first convolution operation, to obtain the temporal feature information. The spatial feature information and the temporal feature information are then fused to obtain the fused feature information: since the image sizes corresponding to the temporal feature information and the spatial feature information are consistent, the corresponding pixels of the images of the spatial feature information and the temporal feature information are added to obtain the fused feature information, which contains both spatial and spatio-temporal features and may comprise multiple frames of images.
The fused feature information is input into the second-level feature extraction sub-layer, SmallBig unit 2. A third convolution operation is applied to the image of each channel of the fused feature information to further obtain the spatial feature information in the fused feature information of SmallBig unit 1. For the images of the multiple channels in the fused feature information output by SmallBig unit 1, a second pooling operation is performed on the fused feature information in the time dimension according to the chronological order of the channels, and a fourth convolution operation is performed on the pooling information obtained by the second pooling operation to further extract the temporal feature information of the multiple images in the fused feature information of SmallBig unit 1, where the fourth convolution operation uses the same convolution parameters as the third convolution operation.
Of course, the number of SmallBig feature extraction sub-layers is not limited to this and may also be three or more. FIG. 6 is a schematic diagram, provided by an embodiment of the present application, of video classification implemented with three feature extraction sub-layers; on the basis of FIG. 5, a third-level feature extraction sub-layer, SmallBig unit 3, is added. The fused feature information output by the first-level sub-layer, SmallBig unit 1, is processed by the second-level convolution operation and the second-level pooling operation respectively, and the second-level sub-layer, SmallBig unit 2, fuses the resulting temporal feature information and spatial feature information to obtain the fused feature information output by SmallBig unit 2. The third-level sub-layer, SmallBig unit 3, processes the fused feature information output by SmallBig unit 2 through two-dimensional convolution and pooling respectively, further extracts temporal feature information and spatial feature information, and fuses them to obtain the fused feature information output by SmallBig unit 3.
In a possible implementation, the feature extraction layer is also used to superimpose the video to be classified onto the fused feature information output by the feature extraction layer, forming a residual connection to update the fused feature information.
For example, in the video classification model shown in FIG. 6, in the third-level feature extraction sub-layer, SmallBig unit 3, the data being fused includes not only the temporal feature information and the spatial feature information computed by SmallBig unit 3 but also the superimposed video to be classified; fusing the video to be classified with the temporal and spatial feature information extracted by the third-level sub-layer forms a residual connection structure, so that during training the newly added parameters do not affect the parameters of the original pre-trained image network, which helps preserve the pre-training effect of the image network, and introducing the residual helps speed up convergence and improve the training efficiency of the video classification model.
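Expressed as a minimal sketch (assuming the block input and the fused output have matching shapes):

```python
# Residual connection around a feature extraction unit: the input is added to
# the fused output, so the newly added branch refines rather than replaces it.
def forward_with_residual(block, x):
    return x + block(x)   # requires block(x) and x to have the same shape
```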
As shown in FIG. 6, the convolution kernel used by the first-level feature extraction sub-unit is the first convolution kernel and its pooling uses the first pooling kernel; the convolution kernel used by the second-level feature extraction sub-unit is the second convolution kernel and its pooling uses the second pooling kernel; the convolution kernel used by the third-level feature extraction sub-unit is the third convolution kernel and its pooling uses the third pooling kernel.
In a possible implementation, the first convolution kernel used by the two-dimensional convolution and the third convolution kernel used by the third convolution operation are smaller than the second convolution kernel used by the second convolution operation. In one implementation, as shown in FIG. 6, the first convolution kernel and the third convolution kernel have a size of 1*1*1, and the second convolution kernel has a size of 1*3*3. The first and third convolution kernels fuse information across the multiple channels and across space-time; the second convolution kernel is used to extract the spatio-temporal features.
In a possible implementation, the first pooling kernel and the second pooling kernel may be smaller than the third pooling kernel used by the third pooling operation. In one implementation, as shown in FIG. 6, the first pooling kernel and the second pooling kernel have a size of 3*3*3, and the third pooling kernel has a size of 3*3*T, where T may be the video duration or the number of video frames corresponding to the video duration; when T is the video duration, t is a duration, and when T is the number of video frames, t is a number of frames. The first and second pooling kernels capture the pooled value, for example the maximum, over the 9-pixel spatial neighborhood within each of three adjacent frames; the third pooling kernel extracts the temporal features of the video frames over the entire video length. By gradually enlarging the temporal receptive field in the time dimension and combining it with the spatial features learned by convolution, the output fused feature information acquires a global temporal receptive field. In addition, SmallBig unit 1 and SmallBig unit 3 each add a local spatial receptive field, so the spatial receptive field of the module as a whole is also enlarged.
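The three cascaded sub-layers with these kernel sizes can be sketched as follows. This is an illustrative PyTorch-style reading of FIG. 6; the class names, the channel widths, and the realization of the 3*3*T pooling as a per-frame 3*3 spatial max followed by a max over the whole clip are assumptions of the sketch, not the disclosed implementation:

```python
import torch
import torch.nn as nn

class SmallBigUnit(nn.Module):
    """One sub-layer: a convolution shared between the small branch (raw input)
    and the big branch (pooled input), fused by point-wise addition."""
    def __init__(self, c_in, c_out, conv_k=(1, 1, 1), pool_k=(3, 3, 3)):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, conv_k,
                              padding=tuple(k // 2 for k in conv_k), bias=False)
        self.pool = nn.MaxPool3d(pool_k, stride=1,
                                 padding=tuple(k // 2 for k in pool_k))

    def forward(self, x):
        return self.conv(x) + self.conv(self.pool(x))

class SmallBigBottleneck(nn.Module):
    """Three cascaded units: 1*1*1 conv + 3*3*3 pool, 1*3*3 conv + 3*3*3 pool,
    1*1*1 conv + {T,3,3} pool, with a residual connection around the block."""
    def __init__(self, c, c_mid):
        super().__init__()
        self.unit1 = SmallBigUnit(c, c_mid, (1, 1, 1), (3, 3, 3))
        self.unit2 = SmallBigUnit(c_mid, c_mid, (1, 3, 3), (3, 3, 3))
        self.conv3 = nn.Conv3d(c_mid, c, (1, 1, 1), bias=False)
        self.pool3_spatial = nn.MaxPool3d((1, 3, 3), stride=1, padding=(0, 1, 1))

    def forward(self, x):
        y = self.unit2(self.unit1(x))
        # {T,3,3} pooling realized as a 3*3 spatial max per frame followed by a
        # max over the whole clip length, broadcast back over T.
        glob = self.pool3_spatial(y).max(dim=2, keepdim=True).values.expand_as(y)
        y = self.conv3(y) + self.conv3(glob)
        return x + y   # residual connection with the block input

# Example: an 8-frame clip with 64 channels at 56x56 resolution.
clip = torch.randn(2, 64, 8, 56, 56)
out = SmallBigBottleneck(c=64, c_mid=16)(clip)   # same shape as the input
```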
In practical applications, the video classification system described in the present application can be trained with optimization algorithms such as stochastic gradient descent (SGD), and mainstream video task datasets can be used. Experimental results of training on such datasets show that, with this network structure, the video classification method described in the present application provides higher accuracy, faster convergence, and better robustness. Compared with the current state-of-the-art networks, our video classification with only 8 input frames outperforms Nonlocal-R50 (the non-local R50 network) with 32 input frames at higher accuracy, and matches the accuracy of Nonlocal-R50 with 128 input frames while using 4.9 times fewer floating-point operations (GFLOPs). In addition, at the same GFLOPs, the 8-frame input of the video classification method described in the present application outperforms the current state-of-the-art SlowFast-R50 network with 36-frame input. These results show that the video classification model for video classification described in the present application is an accurate and efficient video classification model.
In addition, the present application also provides a video classification model training method, the method including: obtaining sample videos in a sample video set and the sample classification results of the sample videos, each sample video including a plurality of video frames; extracting the spatial feature information of the sample video through two-dimensional convolution; extracting the temporal feature information of the sample video through pooling; fusing the spatial feature information and the temporal feature information to obtain fused feature information, and performing fully connected processing on the fused feature information to obtain a model classification result; and correcting the parameters of the two-dimensional convolution according to the model classification result and the sample classification result, and returning to the step of extracting the spatial feature information of the sample video through two-dimensional convolution, until the model classification result and the sample classification result satisfy a preset condition, thereby obtaining the trained video classification model.
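An illustrative training sketch of this procedure (stochastic gradient descent with a cross-entropy loss; the function name and hyper-parameters are assumptions, and model and loader stand for any model with the structure above and any (clip, label) data loader):

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, epochs=10, lr=0.01):
    """Minimal training sketch: stochastic gradient descent with a
    cross-entropy loss, repeated until the preset stopping condition is met."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clips, labels in loader:           # clips: (batch, C, T, H, W)
            logits = model(clips)              # model classification result
            loss = criterion(logits, labels)   # compare with the sample labels
            optimizer.zero_grad()
            loss.backward()                    # gradients for the conv parameters
            optimizer.step()                   # correct the parameters
```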
The structure of the video classification model is the same as that of the neural network model used in the video classification method shown in FIG. 3 and is not repeated here.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
FIG. 7 is a schematic diagram of a video classification apparatus provided by an embodiment of the present application. The video classification apparatus includes:
a to-be-classified video obtaining unit 701, configured to obtain a video to be classified, where the video to be classified includes a plurality of video frames; and
a classification unit 702, configured to input the video to be classified into a trained video classification model for processing and output a classification result of the video to be classified, where the video classification model includes a feature extraction layer and a fully connected layer, the feature extraction layer is configured to extract the spatial feature information of the plurality of video frames through two-dimensional convolution, extract the temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is configured to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
The video classification apparatus described in FIG. 7 corresponds to the video classification method shown in FIG. 3, and the video classification apparatus can execute the video classification method described in any of the foregoing embodiments.
FIG. 8 is a schematic diagram of a video classification device provided by an embodiment of the present application. As shown in FIG. 8, the video classification device 8 of this embodiment includes a processor 80, a memory 81, and a computer program 82, such as a video classification program, stored in the memory 81 and executable on the processor 80. When the processor 80 executes the computer program 82, the steps of the foregoing embodiments of the video classification method are implemented; alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the foregoing apparatus embodiments are implemented.
Exemplarily, the computer program 82 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the video classification device 8.
The video classification device 8 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The video classification device may include, but is not limited to, the processor 80 and the memory 81. A person skilled in the art will understand that FIG. 8 is merely an example of the video classification device 8 and does not constitute a limitation on it; the device may include more or fewer components than shown, combine certain components, or use different components; for example, the video classification device may also include input and output devices, network access devices, buses, and so on.
The so-called processor 80 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 81 may be an internal storage unit of the video classification device 8, for example a hard disk or memory of the video classification device 8. The memory 81 may also be an external storage device of the video classification device 8, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the video classification device 8. Further, the memory 81 may include both an internal storage unit and an external storage device of the video classification device 8. The memory 81 is used to store the computer program and other programs and data required by the video classification device. The memory 81 may also be used to temporarily store data that has been output or is to be output.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the division of the functional units and modules above is only an example; in practical applications, the above functions may be allocated to different functional units and modules as required, that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from one another and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the foregoing system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
A person of ordinary skill in the art will realize that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
在本申请所提供的实施例中,应该理解到,所揭露的装置/终端设备和方法,可以通过其它的方式实现。例如,以上所描述的装置/终端设备实施例仅仅是示意性的,例如,所述 模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。In the embodiments provided in this application, it should be understood that the disclosed device/terminal device and method may be implemented in other ways. For example, the device/terminal device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation, such as multiple units. Or components can be combined or integrated into another system, or some features can be omitted or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application may also be completed by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately added or deleted according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (14)

  1. A video classification method, characterized in that the method comprises:
    acquiring a video to be classified, the video to be classified comprising a plurality of video frames;
    inputting the video to be classified into a trained video classification model for processing, and outputting a classification result of the video to be classified; wherein the video classification model comprises a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract spatial feature information of the plurality of video frames through two-dimensional convolution, extract temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  2. The method according to claim 1, characterized in that the feature extraction layer comprises N feature extraction sub-layers, N ≥ 1; the input information of the first of the N feature extraction sub-layers is the plurality of video frames, the output information of a preceding feature extraction sub-layer is the input information of the following feature extraction sub-layer, and the output information of the N-th feature extraction sub-layer is the fused feature information output by the feature extraction layer; each of the N feature extraction sub-layers comprises a large-receptive-field context feature extraction branch and a small-receptive-field core feature extraction branch, and the processing of input information by each of the N feature extraction sub-layers comprises:
    performing pooling processing on the input information through the large-receptive-field context feature extraction branch to extract temporal feature information of the input information;
    performing two-dimensional convolution processing on the input information through the small-receptive-field core feature extraction branch to extract spatial feature information of the input information;
    fusing the temporal feature information extracted by the large-receptive-field context feature extraction branch and the spatial feature information extracted by the small-receptive-field core feature extraction branch to obtain output information.
  3. The method according to claim 2, characterized in that performing pooling processing on the input information through the large-receptive-field context feature extraction branch to extract the temporal feature information of the input information comprises:
    performing three-dimensional pooling processing on the input information through the large-receptive-field context feature extraction branch to obtain pooling information;
    performing two-dimensional convolution processing on the pooling information through the large-receptive-field context feature extraction branch to obtain the temporal feature information.
  4. The method according to claim 3, characterized in that performing three-dimensional pooling processing on the input information through the large-receptive-field context feature extraction branch to obtain the pooling information comprises:
    performing pooling processing on the input information through a three-dimensional pooling kernel {t, K, K} in the large-receptive-field context feature extraction branch to obtain the pooling information, where t is the size of the kernel in the time direction, t is less than or equal to the video duration, K is the size of the pooling kernel in the two-dimensional space where the image is located, and the three-dimensional pooling kernel is the size of the pooling pixels selected in a single pooling calculation.
  5. The method according to claim 4, characterized in that, among the N three-dimensional pooling kernels included in the feature extraction layer, the N three-dimensional pooling kernels are all of the same size, or all of different sizes, or some of the N three-dimensional pooling kernels are of the same size, wherein a three-dimensional pooling kernel is the size of the pooling pixels selected in a single pooling calculation.
  6. The method according to claim 4 or 5, characterized in that the N three-dimensional pooling kernels being all of different sizes comprises:
    gradually increasing the size of the three-dimensional pooling kernel in the order in which feature information is extracted.
  7. The method according to claim 6, characterized in that gradually increasing the size of the three-dimensional pooling kernel comprises:
    gradually increasing the size of the three-dimensional pooling kernel in the time direction;
    or gradually increasing the size of the three-dimensional pooling kernel in the dimensions of the two-dimensional space where the video frames are located;
    or gradually increasing both the size of the three-dimensional pooling kernel in the time direction and its size in the dimensions of the two-dimensional space where the video frames are located.
  8. The method according to claim 3, characterized in that the convolution parameters of the two-dimensional convolution processing in the large-receptive-field context feature extraction branch are the same as the convolution parameters of the two-dimensional convolution processing in the small-receptive-field core feature extraction branch.
  9. The method according to claim 3, characterized in that the size of the image corresponding to the input information of the pooling processing is the same as the size of the image corresponding to the output information of the pooling processing.
  10. The method according to claim 9, characterized in that the input feature image or video frame is padded in the time dimension or the spatial dimension, so that after the pooling kernel performs pooling processing on the padded input information, the size of the image corresponding to the obtained output information is the same as the size of the image corresponding to the input information.
  11. The method according to claim 1, characterized in that fusing the spatial feature information and the temporal feature information to output the fused feature information comprises:
    superimposing the image of the spatial feature information and the image of the temporal feature information to generate the fused feature information.
  12. A video classification apparatus, characterized in that the apparatus comprises:
    a to-be-classified video acquisition unit, configured to acquire a video to be classified, the video to be classified comprising a plurality of video frames;
    a classification unit, configured to input the video to be classified acquired by the to-be-classified video acquisition unit into a trained video classification model for processing, and to output a classification result of the video to be classified; wherein the video classification model comprises a feature extraction layer and a fully connected layer, the feature extraction layer is used to extract spatial feature information of the plurality of video frames through two-dimensional convolution, extract temporal feature information of the plurality of video frames through pooling, and fuse the spatial feature information and the temporal feature information to output fused feature information, and the fully connected layer is used to perform fully connected processing on the fused feature information output by the feature extraction layer to obtain the classification result.
  13. A video classification device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the video classification device implements the steps of the method according to any one of claims 1 to 11.
  14. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 11 are implemented.
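
For ease of understanding only, the following is a minimal sketch, in PyTorch, of one possible implementation of the feature extraction sub-layer described in claims 2 to 4, 8 and 11: a small-receptive-field core branch that applies a two-dimensional convolution to each video frame, and a large-receptive-field context branch that applies a three-dimensional pooling kernel {t, K, K} followed by the same two-dimensional convolution, the two outputs being fused by superposition. The class name, the choice of average pooling, the channel sizes, and all hyperparameters below are assumptions introduced for illustration and are not specified by the claims.

```python
import torch.nn as nn


class FeatureExtractionSubLayer(nn.Module):
    """Hypothetical sketch of one feature extraction sub-layer: a
    small-receptive-field core branch (2D convolution per frame) fused by
    superposition with a large-receptive-field context branch (3D pooling
    followed by the same 2D convolution)."""

    def __init__(self, in_channels, out_channels, pool_t=3, pool_k=3):
        super().__init__()
        # 2D convolution whose parameters are shared by both branches (claim 8).
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Three-dimensional pooling kernel {t, K, K}; average pooling is an
        # assumption, and the padding keeps the pooled output the same size as
        # the input (claims 4, 9 and 10; pool_t and pool_k assumed odd).
        self.pool3d = nn.AvgPool3d(
            kernel_size=(pool_t, pool_k, pool_k),
            stride=1,
            padding=(pool_t // 2, pool_k // 2, pool_k // 2),
        )

    def _conv_per_frame(self, x):
        # x: (batch, channels, time, height, width); apply the 2D convolution
        # to every frame independently, then restore the 5D layout.
        n, c, t, h, w = x.shape
        y = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        y = self.conv2d(y)
        return y.reshape(n, t, -1, h, w).permute(0, 2, 1, 3, 4)

    def forward(self, x):
        # Small-receptive-field core branch: spatial feature information.
        spatial = self._conv_per_frame(x)
        # Large-receptive-field context branch: temporal feature information
        # via 3D pooling followed by the shared 2D convolution (claim 3).
        temporal = self._conv_per_frame(self.pool3d(x))
        # Fusion by superimposing the two feature maps (claim 11).
        return spatial + temporal
```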
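
Building on the sub-layer sketch above, the following sketch shows, again only as an illustration under stated assumptions, how N such sub-layers might be stacked with gradually increasing three-dimensional pooling kernels (claims 5 to 7) before a fully connected layer produces the classification result (claim 1). The global average pooling before the fully connected layer, the specific kernel sizes and channel counts, and the number of classes are assumptions, not features recited in the claims.

```python
import torch
import torch.nn as nn


class VideoClassifier(nn.Module):
    """Hypothetical sketch of the overall model: stacked feature extraction
    sub-layers with gradually growing 3D pooling kernels, followed by a
    fully connected classification layer."""

    def __init__(self, num_classes,
                 channels=(3, 16, 32, 64),  # assumed channel widths
                 pool_ts=(1, 3, 5),         # time-direction kernel grows with depth
                 pool_ks=(3, 3, 5)):        # spatial kernel grows with depth
        super().__init__()
        self.features = nn.Sequential(*[
            FeatureExtractionSubLayer(channels[i], channels[i + 1],
                                      pool_t=pool_ts[i], pool_k=pool_ks[i])
            for i in range(len(channels) - 1)
        ])
        self.fc = nn.Linear(channels[-1], num_classes)

    def forward(self, frames):
        # frames: (batch, channels, time, height, width).
        fused = self.features(frames)
        # Global average over time and space (an assumption) before the
        # fully connected layer that yields the classification result.
        pooled = fused.mean(dim=(2, 3, 4))
        return self.fc(pooled)


# Hypothetical usage: two 8-frame RGB clips at 112x112 resolution.
clips = torch.randn(2, 3, 8, 112, 112)
logits = VideoClassifier(num_classes=10)(clips)
print(logits.shape)  # torch.Size([2, 10])
```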
PCT/CN2020/134995 2020-06-11 2020-12-09 Video classification method and apparatus, and device, and computer readable storage medium WO2021248859A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010531316.9A CN111859023A (en) 2020-06-11 2020-06-11 Video classification method, device, equipment and computer readable storage medium
CN202010531316.9 2020-06-11

Publications (2)

Publication Number Publication Date
WO2021248859A1 true WO2021248859A1 (en) 2021-12-16
WO2021248859A9 WO2021248859A9 (en) 2022-02-10

Family

ID=72986143

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134995 WO2021248859A1 (en) 2020-06-11 2020-12-09 Video classification method and apparatus, and device, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN111859023A (en)
WO (1) WO2021248859A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130539A (en) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 Classification model training method, data classification device and computer equipment
CN116824641A (en) * 2023-08-29 2023-09-29 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium
WO2024001139A1 (en) * 2022-06-30 2024-01-04 海信集团控股股份有限公司 Video classification method and apparatus and electronic device
CN114677704B (en) * 2022-02-23 2024-03-26 西北大学 Behavior recognition method based on three-dimensional convolution and space-time feature multi-level fusion

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859023A (en) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 Video classification method, device, equipment and computer readable storage medium
CN112580696A (en) * 2020-12-03 2021-03-30 星宏传媒有限公司 Advertisement label classification method, system and equipment based on video understanding
CN112597824A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN112926472A (en) * 2021-03-05 2021-06-08 深圳先进技术研究院 Video classification method, device and equipment
CN113536898B (en) * 2021-05-31 2023-08-29 大连民族大学 Comprehensive feature capturing type time convolution network, video motion segmentation method, computer system and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN109670446A (en) * 2018-12-20 2019-04-23 泉州装备制造研究所 Anomaly detection method based on linear dynamic system and depth network
US20190188379A1 (en) * 2017-12-18 2019-06-20 Paypal, Inc. Spatial and temporal convolution networks for system calls based process monitoring
CN110766096A (en) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 Video classification method and device and electronic equipment
CN110781830A (en) * 2019-10-28 2020-02-11 西安电子科技大学 SAR sequence image classification method based on space-time joint convolution
CN111859023A (en) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 Video classification method, device, equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN107292247A (en) * 2017-06-05 2017-10-24 浙江理工大学 A kind of Human bodys' response method and device based on residual error network
CN107463949B (en) * 2017-07-14 2020-02-21 北京协同创新研究院 Video action classification processing method and device
CN108304926B (en) * 2018-01-08 2020-12-29 中国科学院计算技术研究所 Pooling computing device and method suitable for neural network
CN110032926B (en) * 2019-02-22 2021-05-11 哈尔滨工业大学(深圳) Video classification method and device based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190188379A1 (en) * 2017-12-18 2019-06-20 Paypal, Inc. Spatial and temporal convolution networks for system calls based process monitoring
CN109670446A (en) * 2018-12-20 2019-04-23 泉州装备制造研究所 Anomaly detection method based on linear dynamic system and depth network
CN109635790A (en) * 2019-01-28 2019-04-16 杭州电子科技大学 A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution
CN110781830A (en) * 2019-10-28 2020-02-11 西安电子科技大学 SAR sequence image classification method based on space-time joint convolution
CN110766096A (en) * 2019-10-31 2020-02-07 北京金山云网络技术有限公司 Video classification method and device and electronic equipment
CN111859023A (en) * 2020-06-11 2020-10-30 中国科学院深圳先进技术研究院 Video classification method, device, equipment and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677704B (en) * 2022-02-23 2024-03-26 西北大学 Behavior recognition method based on three-dimensional convolution and space-time feature multi-level fusion
CN115130539A (en) * 2022-04-21 2022-09-30 腾讯科技(深圳)有限公司 Classification model training method, data classification device and computer equipment
WO2024001139A1 (en) * 2022-06-30 2024-01-04 海信集团控股股份有限公司 Video classification method and apparatus and electronic device
CN116824641A (en) * 2023-08-29 2023-09-29 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium
CN116824641B (en) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium

Also Published As

Publication number Publication date
CN111859023A (en) 2020-10-30
WO2021248859A9 (en) 2022-02-10

Similar Documents

Publication Publication Date Title
WO2021248859A1 (en) Video classification method and apparatus, and device, and computer readable storage medium
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
WO2020199693A1 (en) Large-pose face recognition method and apparatus, and device
EP4156017A1 (en) Action recognition method and apparatus, and device and storage medium
WO2020177513A1 (en) Image processing method, device and apparatus, and storage medium
WO2022121485A1 (en) Image multi-tag classification method and apparatus, computer device, and storage medium
CN112070044B (en) Video object classification method and device
CN111368672A (en) Construction method and device for genetic disease facial recognition model
WO2021018251A1 (en) Image classification method and device
CN106803054B (en) Faceform's matrix training method and device
CN113537254B (en) Image feature extraction method and device, electronic equipment and readable storage medium
CN110222718A (en) The method and device of image procossing
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN111177460B (en) Method and device for extracting key frame
CN111401267B (en) Video pedestrian re-identification method and system based on self-learning local feature characterization
CN113159200B (en) Object analysis method, device and storage medium
CN113033448B (en) Remote sensing image cloud-removing residual error neural network system, method and equipment based on multi-scale convolution and attention and storage medium
WO2021073311A1 (en) Image recognition method and apparatus, computer-readable storage medium and chip
WO2024041108A1 (en) Image correction model training method and apparatus, image correction method and apparatus, and computer device
WO2023142550A1 (en) Abnormal event detection method and apparatus, computer device, storage medium, computer program, and computer program product
CN115205613A (en) Image identification method and device, electronic equipment and storage medium
CN109389089B (en) Artificial intelligence algorithm-based multi-person behavior identification method and device
CN112183359A (en) Violent content detection method, device and equipment in video
CN112949571A (en) Method for identifying age, and training method and device of age identification model
CN112183299B (en) Pedestrian attribute prediction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20940364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20940364

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.07.2023)