WO2021139307A1 - Video content recognition method, device, storage medium, and computer equipment - Google Patents

Video content recognition method, device, storage medium, and computer equipment

Info

Publication number
WO2021139307A1
Authority
WO
WIPO (PCT)
Prior art keywords: image, sub, feature, features, video
Prior art date
Application number
PCT/CN2020/122152
Other languages
English (en)
French (fr)
Inventor
李岩
纪彬
史欣田
康斌
Original Assignee
Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority to JP2022519175A priority Critical patent/JP7286013B2/ja
Priority to EP20911536.9A priority patent/EP3998549A4/en
Priority to KR1020227006378A priority patent/KR20220038475A/ko
Publication of WO2021139307A1 publication Critical patent/WO2021139307A1/zh
Priority to US17/674,688 priority patent/US11983926B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Definitions

  • This application relates to the field of computer technology, and in particular to a video content recognition method, device, storage medium, and computer equipment.
  • Artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robotics, intelligent medical care, and intelligent customer service. It is believed that, with the development of technology, artificial intelligence will be applied in more fields and deliver increasingly important value.
  • A method for identifying video content, executed by a computer device, includes:
  • dividing the image feature into multiple image sub-features based on the multiple channels of the image feature, where the multiple image sub-features are arranged in a preset order, and each image sub-feature includes the feature of each video frame on the corresponding channel;
  • a video content recognition device including:
  • An obtaining module configured to obtain a video frame set from a target video and extract image features corresponding to the video frame set, wherein the video frame set includes at least two video frames;
  • the dividing module is configured to divide the image feature into multiple image sub-features based on the multiple channels of the image feature, where the multiple image sub-features are arranged in a preset order, and each image sub-feature includes the feature of each video frame on the corresponding channel;
  • a determining module configured to determine the image sub-feature to be processed from the plurality of image sub-features based on the preset sequence
  • the fusion module is used to fuse the convolution processing result of the current image sub-feature to be processed with the previous image sub-feature, and perform convolution processing on the fused image feature to obtain the convolved image feature corresponding to each image sub-feature to be processed;
  • the stitching module is used for stitching multiple convolutional image features based on the multiple channels of the convolved image features to obtain the stitched image features;
  • the content determining module is configured to determine the video content corresponding to the target video based on the features of the spliced image.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions, where, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the steps of the above video content recognition method.
  • A computer device includes a memory and one or more processors, where the memory stores computer-readable instructions, and when executing the computer-readable instructions, the one or more processors perform the steps of the above video content recognition method.
  • FIG. 1 is a schematic diagram of a scene of a video content recognition system provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a video content recognition method provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a video content recognition method provided by another embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a hybrid convolution model provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of predicting video content corresponding to a target video provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a model structure of a multiple information fusion model provided by an embodiment of the present application.
  • FIG. 7 is a logical schematic diagram of a multiple information fusion sub-model provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of image feature splitting provided by an embodiment of the present application.
  • FIG. 9 is a logical schematic diagram of a multiple information fusion sub-model provided by another embodiment of the present application.
  • FIG. 10 is a logical schematic diagram of a multiple information fusion sub-model provided by another embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a video content recognition device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Computer execution referred to herein includes operations by a computer processing unit on electronic signals that represent data in a structured form. These operations convert the data or maintain it at locations in the computer's memory system, which reconfigures or otherwise changes the operation of the computer in a manner well known to those skilled in the art. The data structures in which the data is maintained are physical locations in memory that have specific properties defined by the data format.
  • Those skilled in the art will understand that the various steps and operations described below can also be implemented in hardware.
  • The term "module" used herein can be regarded as a software object executed on the operating system. The different components, modules, engines, and services described herein can be regarded as implementation objects on the computing system. The devices and methods described herein can be implemented in software, and of course can also be implemented in hardware, all of which fall within the protection scope of the present application.
  • the embodiment of the application provides a method for identifying video content.
  • The execution subject of the method for identifying the video content may be the video content identifying device provided in the embodiment of the application, or a computer device integrated with the video content identifying device, where the video content identifying device can be implemented in hardware or software.
  • the computer device can be a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer and other devices.
  • Computer equipment includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud composed of multiple servers.
  • Figure 1 is a schematic diagram of an application scenario of a video content recognition method provided by an embodiment of this application.
  • The computer device can obtain a video frame set from a target video and extract the image feature corresponding to the video frame set, where the video frame set includes at least two video frames. Based on the multiple channels of the image feature, the image feature is divided into multiple image sub-features, the multiple image sub-features are arranged in a preset order, and each image sub-feature includes the feature of each video frame on the corresponding channel. Based on the preset order, the image sub-feature to be processed is determined from the multiple image sub-features; the convolution processing result of the current image sub-feature to be processed is fused with the previous image sub-feature, and the fused image feature is convolved to obtain the convolved image feature corresponding to each image sub-feature to be processed. Based on the multiple channels of the convolved image features, the convolved image features are spliced to obtain the spliced image feature, and based on the spliced image feature, the video content corresponding to the target video is determined.
  • the video content recognition method provided by the embodiment of the present application relates to the computer vision direction in the field of artificial intelligence.
  • the embodiments of this application may use video behavior recognition technology to extract image features corresponding to multiple video frames in the target video, divide the image features into multiple image sub-features, and then perform multiple convolution processing on the multiple image sub-features , And multiple fusion processing to increase the receptive field of image features in the time dimension, and then predict the video content corresponding to the target video.
  • Artificial intelligence is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • artificial intelligence software technology mainly includes computer vision technology, machine learning/deep learning and other directions.
  • Computer vision technology (Computer Vision, CV) is a science that studies how to make machines "see". More specifically, it refers to machine vision that uses computers to replace human eyes to identify and measure targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for inspection.
  • Computer vision technology usually includes image processing, image recognition and other technologies, as well as common facial recognition, human posture recognition and other biometric recognition technologies.
  • FIG. 2 is a schematic flowchart of a video content recognition method provided by an embodiment of the application. The method may be executed by a computer device, which is specifically described by the following embodiments:
  • S201 Obtain a video frame set from the target video, and extract image features corresponding to the video frame set.
  • When identifying the video content corresponding to a certain video, it is necessary to analyze the complete information in the video over a period of time in order to determine the video content expressed by the video more accurately. For example, if a person in the video is swimming, analyzing only a single video frame can determine only that the video content is a person swimming; analyzing multiple video frames taken from a period of the video can determine more detailed content information, such as the swimmer's stroke. Therefore, when identifying the video content corresponding to a video, multiple video frames in the video need to be acquired.
  • In practice, video A whose video content needs to be identified can be determined as the target video. Because the video content corresponding to the target video needs to be determined comprehensively based on the information in a period of the target video, at least two video frames are obtained from video A, and a video frame set is constructed from the obtained video frames.
  • In order to ensure that the multiple video frames obtained from the target video preserve as completely as possible the information of a period of time in the target video, the target video can be divided, and multiple video frames are obtained by sampling each divided target sub-video.
  • the step of "obtaining a video frame set from the target video, and extracting image features corresponding to the video frame set" may include:
  • the feature of the video frame set is extracted to obtain the image feature corresponding to the video frame set.
  • In practice, video A whose video content needs to be identified can be determined as the target video, and video A can be divided into multiple target sub-videos, where each target sub-video is a video segment taken from video A. Then a video frame is obtained from each target sub-video, that is, each video frame corresponds to one target sub-video, and a video frame set is constructed from the obtained video frames. Then, using a feature extraction method such as a convolution operation, feature extraction is performed on the video frame set to obtain the image feature corresponding to the video frame set, where the image feature includes the feature corresponding to each video frame.
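  • As a rough illustration of the feature extraction step above (not the specific network used in this application), the sampled frames can be passed through an ordinary two-dimensional convolutional backbone to obtain an image feature of dimension [T, C, H, W]; the layer sizes and channel counts in the following sketch are assumptions chosen only to match the [T, C, H, W] layout used in this description.

```python
# Hypothetical sketch: extract an image feature of shape [T, C, H, W] for the
# T sampled frames with a small 2-D convolutional stem (sizes are illustrative).
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
)

frames = torch.randn(8, 3, 224, 224)  # T=8 sampled RGB frames
with torch.no_grad():
    X = stem(frames)                  # image feature X: [T=8, C=256, H=56, W=56]
print(X.shape)
```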
  • In an actual application, it may only be necessary to identify the video content of a certain segment of the target video. For example, when the target video is movie A, it may only be necessary to identify the video content corresponding to the 20th-to-25th-minute segment of movie A. In this case, the target video segment that requires video recognition can be determined from the target video, the target video segment is divided into multiple target sub-videos, and the subsequent steps are then performed.
  • For example, the target sub-video may be randomly sampled to obtain the video frame corresponding to the target sub-video; alternatively, the first video frame in the target sub-video may be used as the video frame corresponding to the target sub-video; or, according to the video duration of the target sub-video, a video frame located at a certain moment in the middle of the target sub-video may be used as the video frame corresponding to the target sub-video, and so on. That is, it is sufficient to ensure that different video frames come from different target sub-videos.
  • In order to obtain a fixed-length sequence of video frames from a target video of variable duration, the target video may be divided according to a preset number of images, that is, the number of video frames to be acquired.
  • the step of "dividing the target video into multiple target sub-videos" may include:
  • the target video is divided into multiple target sub-videos.
  • For example, the target video may be divided equally into multiple target sub-videos of the same duration according to the preset number of images. As another example, the sub-video duration of the target sub-videos to be obtained may be determined first, and the target video divided according to that sub-video duration; in this case, more video frames are acquired for a target video with a longer duration, fewer video frames are acquired for a target video with a shorter duration, and so on.
  • In practice, video A whose video content needs to be identified can be determined as the target video. If the video duration of video A is 24 s and the preset number of images is 8, video A can be divided evenly into 8 target sub-videos, each with a sub-video duration of 3 s, and random sampling is performed on each target sub-video to obtain a video frame set. The video frame set includes the 8 sampled video frames. Feature extraction can then be performed on the video frame set to obtain the image feature corresponding to the video frame set.
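  • As shown in the sketch below, the segment-and-sample strategy above can be illustrated with a few lines of code; the function name, the frame rate, and the use of random offsets within each segment are illustrative assumptions rather than the exact procedure of this application.

```python
import random

def sparse_sample_indices(num_total_frames: int, num_segments: int = 8):
    """Divide a video of num_total_frames frames into num_segments equal segments
    and randomly pick one frame index from each segment (sparse sampling)."""
    seg_len = num_total_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))
        indices.append(random.randrange(start, end))
    return indices

# Example: a 24 s video at 25 fps (600 frames) split into 8 target sub-videos of
# 3 s each; one randomly sampled frame index per sub-video forms the video frame set.
print(sparse_sample_indices(600, 8))
```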
  • S202 Based on the multiple channels of the image feature, divide the image feature into multiple image sub-features.
  • In deep learning, the number of channels of a feature corresponds to the number of convolution kernels in the convolutional layer. For example, if the input image feature has 3 channels and the number of convolution kernels is 10, then after the input image feature is convolved with the 10 convolution kernels, the output image feature has 10 channels; the number of channels of the output image feature is the same as the number of convolution kernels.
  • In practice, X can be used to represent the image feature, and [T, C, H, W] can be used to represent its feature dimensions, where T represents the time dimension, that is, there are T video frames in total in the video frame set; C represents the number of channels; and H and W represent the spatial dimensions of the feature. If the image feature is divided into 4 image sub-features, the feature dimension corresponding to each image sub-feature becomes [T, C/4, H, W].
  • the multiple image sub-features are arranged in a preset order, and each image sub-feature includes the feature of each video frame on the corresponding channel.
  • a video frame set can be extracted from the target video.
  • the video frame set includes 8 video frames, and based on multiple convolution operations, the image feature X corresponding to the video frame set can be obtained.
  • the image feature X includes features corresponding to 8 video frames, and the image feature X corresponds to 256 channels arranged according to channel 1 to channel 256.
  • If the 256 channels are divided into 4 groups of 64 channels, image sub-feature X1 is the image sub-feature corresponding to channel 1 to channel 64, image sub-feature X2 corresponds to channel 65 to channel 128, image sub-feature X3 corresponds to channel 129 to channel 192, and image sub-feature X4 corresponds to channel 193 to channel 256. Each image sub-feature includes features corresponding to the 8 video frames.
  • the number of image sub-features that need to be acquired can be adjusted according to actual conditions, and the embodiment of the present application does not limit the number of image sub-features.
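  • A minimal sketch of the channel-wise split described above, assuming a PyTorch tensor with the [T, C, H, W] layout and 4 channel groups:

```python
import torch

X = torch.randn(8, 256, 56, 56)  # image feature X with layout [T, C, H, W]
X1, X2, X3, X4 = torch.chunk(X, chunks=4, dim=1)
# Each sub-feature has shape [8, 64, 56, 56], i.e. [T, C/4, H, W]:
# X1 covers channels 1-64, X2 channels 65-128, X3 channels 129-192, X4 channels 193-256.
print(X1.shape, X2.shape, X3.shape, X4.shape)
```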
  • S203 Based on a preset sequence, determine the image sub-feature to be processed from the multiple image sub-features.
  • These convolved image features include features corresponding to all video frames in the video frame set, and the feature corresponding to each video frame also incorporates the features of the video frames adjacent to it; that is, compared with the original image sub-features to be processed, the convolved image features have a larger receptive field and richer features.
  • the image sub-feature X1, image sub-feature X2, image sub-feature X3, and image sub-feature X4 are determined as the image sub-features to be processed.
  • the sub-features of the image to be processed can be adjusted according to actual application requirements. For example, when the preset sequence is different, the sub-features of the image to be processed determined from multiple image sub-features will also be different.
  • A single convolution operation can only increase the receptive field by a limited amount.
  • For example, if the initial features include the feature of image 1, the feature of image 2, and the feature of image 3, and the kernel size of the one-dimensional convolution is 3, then after one convolution operation on the initial features, the processed features can be obtained. The processed features still include features corresponding to the 3 images, but the feature of image 2 in the processed features now also fuses the features of image 1 and image 3. That is, the receptive field of the processed features in the time dimension becomes larger, but a single convolution can only fuse the features of the two adjacent images.
  • A video frame therefore needs to go through a large number of local convolution operations before it can establish a connection with a distant video frame. Whether the information of the current video frame is transmitted to a distant video frame, or a distant video frame feeds a signal back to the current video frame, a long signal transmission process is required; the effective information is easily weakened during this transmission, so an effective temporal link cannot be established between two distant video frames.
  • Therefore, a feature fusion mechanism can be used: the feature whose receptive field has already been enlarged is fused into the image sub-feature to be processed that currently requires convolution, so that the receptive field of the current image sub-feature to be processed is already increased before its convolution; convolution processing is then used to increase the receptive field again, the feature with the enlarged receptive field is fused into the next image sub-feature to be processed, and so on.
  • S204: Fuse the convolution processing result of the current image sub-feature to be processed with the previous image sub-feature, and perform convolution processing on the fused image feature to obtain the convolved image feature corresponding to each image sub-feature to be processed. Obtaining the convolved image features can include the following:
  • For example, from the image sub-features X1, X2, X3, and X4 arranged in order, the image sub-features to be processed have been determined as: image sub-feature X2, image sub-feature X3, and image sub-feature X4.
  • First, the image sub-feature X2 can be determined as the initial image sub-feature to be processed, and the image sub-feature X2 can be convolved to obtain the convolved image feature corresponding to the image sub-feature X2.
  • Then, the image sub-feature X3 can be determined as the current image sub-feature to be processed. Using a connection similar to a residual connection, the convolved image feature corresponding to the image sub-feature X2 and the image sub-feature X3 are additively fused to obtain the fused image feature corresponding to the image sub-feature X3, and the fused image feature corresponding to the image sub-feature X3 is then convolved to obtain the convolved image feature corresponding to the image sub-feature X3.
  • Next, the image sub-feature X4 can be determined as the current image sub-feature to be processed. Using a connection similar to a residual connection, the convolved image feature corresponding to the image sub-feature X3 and the image sub-feature X4 are additively fused to obtain the fused image feature corresponding to the image sub-feature X4, and the fused image feature corresponding to the image sub-feature X4 is then convolved to obtain the convolved image feature corresponding to the image sub-feature X4. At this point, all the image sub-features to be processed have been convolved and the convolved image feature corresponding to each image sub-feature to be processed has been obtained, which indicates that the loop can end.
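  • The loop described above can be sketched roughly as follows. Here `conv_block` is only a placeholder for the per-group hybrid convolution described later (a plain spatial convolution is used so the sketch runs), and the shapes follow the earlier example of 8 frames and 256 channels; none of this is the exact implementation of this application.

```python
import torch
import torch.nn as nn

T, C, H, W = 8, 256, 56, 56
X = torch.randn(T, C, H, W)
X1, X2, X3, X4 = torch.chunk(X, 4, dim=1)  # each sub-feature is [T, C/4, H, W]

# Placeholder for the grouped hybrid convolution (a plain per-frame 2-D convolution
# is used here only so the sketch runs; the actual sub-model is described later).
conv_blocks = nn.ModuleList(nn.Conv2d(C // 4, C // 4, 3, padding=1) for _ in range(3))

Y2 = conv_blocks[0](X2)        # convolved image feature for X2
Y3 = conv_blocks[1](X3 + Y2)   # residual-like additive fusion, then convolution
Y4 = conv_blocks[2](X4 + Y3)   # residual-like additive fusion, then convolution

# Splicing along the channel dimension; X1 is kept unprocessed.
X_out = torch.cat([X1, Y2, Y3, Y4], dim=1)  # back to [T, C, H, W]
print(X_out.shape)
```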
  • each image sub-feature includes features corresponding to the T video frames.
  • the fourth video frame (video frame 4) is taken as an example for illustration, as shown in Figure 10.
  • the image sub-feature X1, the image sub-feature X2, the image sub-feature X3, and the image sub-feature X4 all include the feature corresponding to the video frame 4.
  • After the image sub-feature X2 is convolved, the feature of video frame 4 in the convolved image feature corresponding to X2 has been fused with the features of video frame 3 and video frame 5; that is, the receptive field of the convolved image feature corresponding to X2 has been increased once.
  • After the convolved image feature corresponding to X2 and the image sub-feature X3 are additively fused and the fused image feature corresponding to X3 is convolved, the feature of video frame 4 in the convolved image feature corresponding to X3 has been fused with the features of video frame 2, video frame 3, video frame 5, and video frame 6; that is, the receptive field of the convolved image feature corresponding to X3 has been increased twice.
  • Similarly, after the convolved image feature corresponding to X3 and the image sub-feature X4 are additively fused and the fused image feature corresponding to X4 is convolved, the feature of video frame 4 in the convolved image feature corresponding to X4 has been fused with the features of video frame 1, video frame 2, video frame 3, video frame 5, video frame 6, and video frame 7; that is, the receptive field has been increased three times, so this feature can effectively establish a connection with distant video frames.
  • a hybrid convolution model may be used to perform convolution processing on the features to achieve the purpose of increasing the receptive field.
  • the step of "convolution processing on the initial image sub-feature to be processed to obtain the image feature after convolution" may include:
  • convolution processing is performed on the initial image sub-features to be processed to obtain convolutional image features.
  • The initial hybrid convolution model can be a (2+1)D convolution model, which includes two parts, namely a one-dimensional convolution sub-model and a two-dimensional convolution sub-model. For example, the initial hybrid convolution model may include a one-dimensional convolution sub-model in the time dimension, whose convolution kernel size is 3, and a two-dimensional convolution sub-model in the spatial dimension, whose convolution kernel size is 3x3.
  • In practice, an initial hybrid convolution model can be determined. The initial hybrid convolution model includes a one-dimensional convolution sub-model in the time dimension, whose convolution kernel size is 3, and a two-dimensional convolution sub-model in the spatial dimension, whose convolution kernel size is 3x3. Since the image feature has been divided into multiple image sub-features based on its multiple channels, the initial hybrid convolution model correspondingly needs to be divided into multiple hybrid convolution models based on the multiple channels; that is, grouped convolution is applied to the initial hybrid convolution model to obtain multiple hybrid convolution models. Because grouping does not change the size of the convolution kernels, as shown in Figure 4, each hybrid convolution model still includes a one-dimensional convolution sub-model in the time dimension with a kernel size of 3 and a two-dimensional convolution sub-model in the spatial dimension with a kernel size of 3x3.
  • For the one-dimensional convolution sub-model in the initial hybrid convolution model, the convolution kernel size is 3 and the parameter size is C x C x 3; for the two-dimensional convolution sub-model in the initial hybrid convolution model, the convolution kernel size is 3x3 and the parameter size is C x C x 3 x 3. Since grouping does not change the size of the convolution kernel, the kernel size of the one-dimensional convolution sub-model in each hybrid convolution model is still 3; however, because each hybrid convolution model operates on an image sub-feature with C/4 channels, its parameter size is (C/4) x (C/4) x 3. Likewise, the kernel size of the two-dimensional convolution sub-model in each hybrid convolution model is still 3x3, but because it operates on an image sub-feature with C/4 channels, its parameter size is (C/4) x (C/4) x 3 x 3. After the divided hybrid convolution models are obtained, a hybrid convolution model can be used to perform convolution processing on the initial image sub-feature to be processed to obtain the convolved image feature.
  • the one-dimensional convolution sub-model and the two-dimensional convolution sub-model can be used to perform convolution processing on the features respectively.
  • the step of "convolution processing on the initial image sub-features to be processed based on the hybrid convolution model to obtain the image features after convolution" may include:
  • convolution processing is performed on the temporally convolved image features in the spatial dimension to obtain the convolved image features.
  • For example, if the initial image sub-feature to be processed is image sub-feature X2, with feature dimension [T, C/4, H, W], the feature dimension can first be reorganized so that the time dimension T is the dimension along which the one-dimensional convolution operates. The one-dimensional convolution sub-model with kernel size 3 is then used to process the time dimension T of the image sub-feature X2 to obtain the temporally convolved image feature, where the parameter size of the convolution operator is (C/4) x (C/4) x 3; in this process, the spatial information of the image sub-feature X2 is ignored. The image sub-feature X2 contains the feature information of T frames in total, and using a kernel of size 3 to convolve in the time dimension is equivalent to fusing, for video frame t, the information of the adjacent video frame t-1 and video frame t+1.
  • Then the feature dimension of the temporally convolved image feature is reorganized so that the spatial dimensions (H, W) are the dimensions along which the two-dimensional convolution operates, and the two-dimensional convolution sub-model with kernel size 3x3 is used to process the spatial dimensions (H, W) of the temporally convolved image feature to obtain the convolved image feature, where the parameter size of the convolution operator is (C/4) x (C/4) x 3 x 3. In this process, the temporal information of the temporally convolved image feature is ignored: it can be regarded as containing H x W pixel features, and each pixel in the spatial dimension is fused with the pixels in its adjacent 3x3 spatial area. Finally, the feature dimension can be restored to [T, C/4, H, W], and the convolved image feature is obtained.
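  • A hedged sketch of this temporal-then-spatial convolution for one image sub-feature of dimension [T, C/4, H, W] follows; the exact reshaping order is an assumption, since only the general scheme (1-D convolution over T, then 2-D convolution over H and W) is described above.

```python
import torch
import torch.nn as nn

T, Cg, H, W = 8, 64, 56, 56          # Cg stands for C/4 channels per group
x = torch.randn(T, Cg, H, W)         # one image sub-feature, e.g. X2

temporal_conv = nn.Conv1d(Cg, Cg, kernel_size=3, padding=1)  # ~ (C/4) x (C/4) x 3 weights
spatial_conv = nn.Conv2d(Cg, Cg, kernel_size=3, padding=1)   # ~ (C/4) x (C/4) x 3 x 3 weights

# 1) Temporal 1-D convolution: each spatial position is treated independently,
#    so frame t is fused with its neighbours t-1 and t+1.
xt = x.permute(2, 3, 1, 0).reshape(H * W, Cg, T)   # [H*W, C/4, T]
xt = temporal_conv(xt)
xt = xt.reshape(H, W, Cg, T).permute(3, 2, 0, 1)   # back to [T, C/4, H, W]

# 2) Spatial 2-D convolution: each frame is treated independently,
#    so every pixel is fused with its 3x3 spatial neighbourhood.
y = spatial_conv(xt)
print(y.shape)                                      # [T, C/4, H, W]
```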
  • For the one-dimensional convolution sub-model in the initial hybrid convolution model, the parameter amount for one convolution operation is C x C x 3, whereas for the one-dimensional convolution sub-model in a hybrid convolution model, the parameter amount for one convolution operation is (C/4) x (C/4) x 3. Therefore, the sum of the parameter amounts for the three convolution operations in the embodiment of this application is 3 x (C/4) x (C/4) x 3. For example, with C = 256 channels, the original one-dimensional convolution requires 256 x 256 x 3 = 196,608 parameters, while the three grouped one-dimensional convolutions together require 3 x 64 x 64 x 3 = 36,864 parameters. Compared with directly applying the initial hybrid convolution model, the amount of parameters is reduced, yet features over a longer time range can be integrated, and the temporal information of the video can be considered and judged more completely.
  • the size of the convolution kernel can be adjusted according to actual application conditions.
  • In practice, the sizes of the convolution kernels corresponding to the multiple image sub-features to be processed can also be made different; that is, convolution kernels of different sizes can be used to convolve different image sub-features to be processed, so as to comprehensively consider the modeling capabilities on different time scales.
  • S205 Based on the multiple channels of the convolved image features, stitch the multiple convolutional image features to obtain the stitched image features.
  • multiple convolutional image features can be spliced together according to the channel, and the spliced image feature can be obtained.
  • In addition, the original image sub-feature that needs to be retained can also be determined from the multiple image sub-features, so that the finally acquired spliced image feature retains the original, unprocessed feature. Specifically, the step of "splicing multiple convolved image features based on the multiple channels of the convolved image features to obtain the spliced image feature" may include:
  • the multiple convolution image features and the original image sub-features are spliced to obtain the spliced image feature.
  • For example, from the image sub-features X1, X2, X3, and X4 arranged in sequence, the image sub-feature X1 can be determined as the original image sub-feature that needs to be retained.
  • The convolved image feature corresponding to the image sub-feature X2, the convolved image feature corresponding to the image sub-feature X3, the convolved image feature corresponding to the image sub-feature X4, and the image sub-feature X1 itself are then spliced along the channel dimension to obtain the spliced image feature X0.
  • At this time, the receptive field of each feature to be spliced is different: the image sub-feature X1 undergoes no convolution processing, so its receptive field does not increase; the image sub-feature X2 undergoes convolution processing once, and its receptive field increases once; the image sub-feature X3 undergoes convolution processing twice, and its receptive field increases twice; the image sub-feature X4 undergoes convolution processing three times, and its receptive field increases three times.
  • In practice, a multiple information fusion model can be used to complete the steps of obtaining the spliced image feature from the image feature. The multiple information fusion model includes a multiple information fusion sub-model and two two-dimensional convolutional layers with a convolution kernel size of 1 × 1. The multiple information fusion sub-model implements the steps described above: based on the multiple channels of the image feature, the image feature is divided into multiple image sub-features; based on a preset order, the image sub-feature to be processed is determined from the multiple image sub-features; the convolution processing result of the current image sub-feature to be processed is fused with the previous image sub-feature, and the fused image feature is convolved to obtain the convolved image feature corresponding to each image sub-feature to be processed; and based on the multiple channels of the convolved image features, the multiple convolved image features are spliced to obtain the spliced image feature.
  • multiple Temporal Aggregation (Multiple Temporal Aggregation, MTA) modules may also be stacked to achieve stronger and more stable long-term information modeling capabilities.
  • the embodiment of the present application may also include a training process for multiple information fusion modules.
  • During training, the corresponding image feature can be expressed as X', and its feature dimension is [N, T', C', H', W'], where N represents the batch size in a training batch, T' represents the time dimension, that is, there are T' video frames in total in the video frame set, C' represents the number of channels, and H' and W' represent the spatial dimensions of the feature.
  • the fusion module is trained to obtain multiple information fusion modules. Among them, the entire training process is end-to-end, and the training of multiple information fusion modules is carried out together with the learning of video spatio-temporal features.
  • S206 Determine video content corresponding to the target video based on the features of the spliced image.
  • The purpose of the embodiments of this application is to identify the video content corresponding to the target video. Therefore, after the spliced image feature is obtained, it can be processed further to predict the prediction score corresponding to each video frame in the video frame set; a time-averaging strategy is then used to average the prediction scores over the multiple video frames to obtain the final prediction for the entire target video.
  • the step of "determining the video content corresponding to the target video based on the features of the spliced image" may include:
  • the video content corresponding to the target video is determined.
  • In practice, the content prediction probability corresponding to each video frame in the video frame set can be predicted based on the spliced image feature. From the content prediction probability corresponding to a video frame, the likelihood that the video frame describes each candidate video content can be known. The time-averaging strategy is then used to fuse the content prediction probabilities corresponding to the multiple video frames, and the video content prediction probability corresponding to the target video is obtained. Based on the video content prediction probability, a histogram can be constructed, and the video content with the highest probability, for example "backstroke", can be determined as the video content corresponding to the target video.
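  • A minimal sketch of the per-frame prediction and time-averaging strategy described above, assuming a simple global-average-pooling plus linear classification head (the head and the number of classes are illustrative assumptions, not the specific classifier of this application):

```python
import torch
import torch.nn as nn

T, C, H, W, num_classes = 8, 256, 56, 56, 400
X0 = torch.randn(T, C, H, W)                     # spliced image feature

head = nn.Linear(C, num_classes)                 # hypothetical classification head
frame_feats = X0.mean(dim=(2, 3))                # global average pooling -> [T, C]
frame_probs = head(frame_feats).softmax(dim=-1)  # content prediction probability per frame

video_probs = frame_probs.mean(dim=0)            # time-averaging strategy over the T frames
predicted = video_probs.argmax().item()          # index of the most likely video content
print(predicted, video_probs[predicted].item())
```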
  • Since the video content recognition method of the embodiments of the present application can obtain spliced image features that incorporate long-range temporal features, it can serve as a basic video understanding technology: the spliced image features that integrate long-range features can be used for subsequent tasks such as re-selection and personalized recommendation.
  • Since the video content recognition method of the embodiment of the present application can identify the video content of the target video, it can also be applied to specific video application scenarios, for example, to scenarios of reviewing and filtering videos involving categories such as politics, violence, and pornography.
  • In summary, the embodiment of the present application can obtain a video frame set from the target video and extract the image feature corresponding to the video frame set, where the video frame set includes at least two video frames. Based on the multiple channels of the image feature, the image feature is divided into multiple image sub-features, the multiple image sub-features are arranged in a preset order, and each image sub-feature includes the feature of each video frame on the corresponding channel. Based on the preset order, the image sub-feature to be processed is determined from the multiple image sub-features; the convolution processing result of the current image sub-feature to be processed is fused with the previous image sub-feature, and the fused image feature is convolved to obtain the convolved image feature corresponding to each image sub-feature to be processed. Based on the multiple channels of the convolved image features, the multiple convolved image features are spliced to obtain the spliced image feature, and based on the spliced image feature, the video content corresponding to the target video is determined.
  • This solution can split an initial hybrid convolution model into multiple hybrid convolution models and, at the same time, add residual-like connections between adjacent hybrid convolution models, so that the multiple hybrid convolution models form a hierarchical structure.
  • In this way, the video feature is processed through multiple convolutions to increase the receptive field in the time dimension, and the video feature of each frame can effectively establish a connection with distant video frames.
  • Moreover, this method adds neither additional parameters nor complex calculations, so the efficiency of video content recognition can be improved.
  • the specific process of the video content identification method of the embodiment of the present application may be as follows:
  • the network device obtains T video frames from the target video.
  • the network device can use sparse sampling to divide the target video into T target sub-videos on average. Then, randomly sample from each target sub-video to obtain the video frame corresponding to each target sub-video, so that the target video with variable duration is transformed into a fixed-length video frame sequence.
  • the network device extracts image features X corresponding to the T video frames.
  • a network device can use feature extraction methods such as several convolutions to extract the image feature X corresponding to the T video frames, and the image feature X includes feature information corresponding to each video frame.
  • [T, C, H, W] can be used to represent the feature dimension, T represents the time dimension, that is, there are T video frames in total; C represents the number of channels; H and W represent the spatial dimensions of the feature.
  • the network device splits the image feature X into image sub feature X1, image sub feature X2, image sub feature X3, and image sub feature X4.
  • In practice, the network device can divide the image feature X into 4 image sub-features according to the multiple channels of the image feature X: image sub-feature X1, image sub-feature X2, image sub-feature X3, and image sub-feature X4, where the feature dimension corresponding to each image sub-feature becomes [T, C/4, H, W].
  • an initial hybrid convolution module may be determined.
  • the initial hybrid convolution model includes a one-dimensional convolution sub-model in the time dimension and a two-dimensional convolution sub-model in the space dimension. Since image features have been divided into multiple image sub-features based on multiple channels, correspondingly, the initial hybrid convolution model also needs to be divided into multiple hybrid convolution models based on multiple channels.
  • For the one-dimensional convolution sub-model in the initial hybrid convolution model, the convolution kernel size is 3 and the parameter size is C x C x 3; for the two-dimensional convolution sub-model in the initial hybrid convolution model, the convolution kernel size is 3x3 and the parameter size is C x C x 3 x 3. Since grouping does not change the size of the convolution kernel, the kernel size of the one-dimensional convolution sub-model in each hybrid convolution model is still 3, but because each hybrid convolution model operates on an image sub-feature with C/4 channels, its parameter size is (C/4) x (C/4) x 3; likewise, the kernel size of the two-dimensional convolution sub-model is still 3x3, and its parameter size is (C/4) x (C/4) x 3 x 3.
  • the network device performs convolution processing on the image sub-feature X2 to obtain a convolved image feature corresponding to the image sub-feature X2.
  • For example, the feature dimension of the image sub-feature X2 is [T, C/4, H, W]. The network device can first reorganize the feature dimension so that the time dimension T is the dimension along which the one-dimensional convolution operates, and then use the one-dimensional convolution sub-model with kernel size 3 to process the time dimension T of the image sub-feature X2 to obtain the temporally convolved image feature, where the parameter size of the convolution operator is (C/4) x (C/4) x 3. The feature dimension of the temporally convolved image feature is then reorganized so that the spatial dimensions (H, W) are the dimensions along which the two-dimensional convolution operates, and the two-dimensional convolution sub-model with kernel size 3x3 is used to process the spatial dimensions (H, W) to obtain the convolved image feature, where the parameter size of the convolution operator is (C/4) x (C/4) x 3 x 3. Finally, the feature dimension can be restored to [T, C/4, H, W], and the convolved image feature corresponding to the image sub-feature X2 is obtained.
  • the network device performs additive fusion on the convolved image feature corresponding to the image sub-feature X2 and the image sub-feature X3 to obtain the fused image feature corresponding to the image sub-feature X3.
  • the network device performs convolution processing on the fused image feature corresponding to the image sub-feature X3 to obtain the convoluted image feature corresponding to the image sub-feature X3.
  • the network device performs additive fusion on the convolved image feature corresponding to the image sub-feature X3 and the image sub-feature X4 to obtain the fused image feature corresponding to the image sub-feature X4.
  • the network device performs convolution processing on the fused image feature corresponding to the image sub-feature X4 to obtain the convoluted image feature corresponding to the image sub-feature X4.
  • the network device splices the multiple convolutional image features and the image sub-feature X1 based on the multiple channels of the convolved image feature to obtain the spliced image feature.
  • In practice, according to the multiple channels of the convolved image features, the network device can splice the convolved image feature corresponding to the image sub-feature X2, the convolved image feature corresponding to the image sub-feature X3, the convolved image feature corresponding to the image sub-feature X4, and the image sub-feature X1 to obtain the spliced image feature X0. Stacked multiple information fusion modules can then be applied to continue processing the features, so as to achieve stronger and more stable long-term information modeling capabilities.
  • the network device determines the video content corresponding to the target video based on the spliced image characteristics.
  • the network device can predict the content prediction probability corresponding to T video frames based on the image characteristics after splicing. Then, the time average strategy is used to fuse the content prediction probabilities corresponding to the T video frames, and the video content prediction probability corresponding to the target video is obtained. Then, based on the predicted probability of the video content, a histogram can be constructed accordingly, and the video content with the highest probability can be determined as the video content corresponding to the target video.
  • In summary, the embodiment of the present application can obtain T video frames from the target video through a network device, extract the image feature X corresponding to the T video frames, and split the image feature X, based on its multiple channels, into image sub-feature X1, image sub-feature X2, image sub-feature X3, and image sub-feature X4. Convolution processing is performed on the image sub-feature X2 to obtain the convolved image feature corresponding to X2; the convolved image feature corresponding to X2 and the image sub-feature X3 are additively fused to obtain the fused image feature corresponding to X3, which is then convolved to obtain the convolved image feature corresponding to X3; the convolved image feature corresponding to X3 and the image sub-feature X4 are additively fused to obtain the fused image feature corresponding to X4, which is then convolved to obtain the convolved image feature corresponding to X4. Based on the multiple channels of the convolved image features, the multiple convolved image features and the image sub-feature X1 are spliced to obtain the spliced image feature, and based on the spliced image feature, the video content corresponding to the target video is determined.
  • This solution can split an initial hybrid convolution model into multiple hybrid convolution models and, at the same time, add residual-like connections between adjacent hybrid convolution models, so that the multiple hybrid convolution models form a hierarchical structure.
  • In this way, the video feature is processed through multiple convolutions to increase the receptive field in the time dimension, and the video feature of each frame can effectively establish a connection with distant video frames.
  • Moreover, this method adds neither additional parameters nor complex calculations, so the efficiency of video content recognition can be improved.
  • the embodiments of the present application may also provide a video content recognition device.
  • the video content recognition device may be specifically integrated in a computer device.
  • The computer device may include a server, a terminal, etc., where the terminal may include: a mobile phone, a tablet computer, a notebook computer, or a personal computer (PC, Personal Computer), etc.
  • the video content recognition device may include an acquisition module 111, a division module 112, a determination module 113, a fusion module 114, a splicing module 115, and a content determination module 116, as follows:
  • the obtaining module 111 is configured to obtain a video frame set from a target video and extract image features corresponding to the video frame set, where the video frame set includes at least two video frames;
  • the dividing module 112 is configured to divide the image feature into multiple image sub-features based on the multiple channels of the image feature, where the multiple image sub-features are arranged in a preset order, and each image sub-feature includes the feature of each video frame on the corresponding channel;
  • the determining module 113 is configured to determine the image sub-feature to be processed from the plurality of image sub-features based on the preset sequence;
  • the fusion module 114 is used to fuse the convolution processing result of the current image sub-feature to be processed with the previous image sub-feature, and perform convolution processing on the fused image feature to obtain the convolved image feature corresponding to each image sub-feature to be processed;
  • the stitching module 115 is configured to stitch multiple convolutional image features based on the multiple channels of the convolved image features to obtain the stitched image features;
  • the content determination module 116 is configured to determine the video content corresponding to the target video based on the spliced image characteristics.
  • the fusion module 114 may include a first determination sub-module, a convolution sub-module, a second determination sub-module, a fusion sub-module, an update sub-module, and a return sub-module, as follows:
  • the first determining sub-module is configured to determine the initial image sub-feature to be processed from the plurality of image sub-features to be processed based on the preset sequence;
  • the convolution sub-module is used to perform convolution processing on the initial image sub-features to be processed to obtain the image features after convolution;
  • the second determining sub-module is configured to determine the current image sub-feature to be processed from the plurality of image sub-features to be processed based on the preset order and the initial image sub-feature to be processed;
  • the fusion sub-module is used to fuse the current image sub-feature to be processed and the convolved image feature to obtain the fused image feature;
  • An update sub-module configured to update the merged image feature to the initial image sub-feature to be processed
  • the return sub-module is used to return to perform the steps of performing convolution processing on the initial image sub-feature to be processed to obtain the convolutional image feature, until the convolutional image feature corresponding to each image sub-feature to be processed is obtained.
  • the splicing module 115 may be specifically used for:
  • the multiple convolution image features and the original image sub-features are spliced to obtain the spliced image feature.
  • the acquisition module 111 may include a third determination sub-module, a division sub-module, a construction sub-module, and an extraction sub-module, as follows:
  • the third determining sub-module is used to determine the target video
  • the extraction sub-module is used to extract features of the video frame set to obtain image features corresponding to the video frame set.
  • the division sub-module may be specifically used for:
  • the target video is divided into multiple target sub-videos.
  • the convolution submodule may include a fourth determination submodule, a model division submodule, and a convolution processing submodule, as follows:
  • the fourth determining sub-module is used to determine the initial hybrid convolution model
  • the convolution processing sub-module is configured to perform convolution processing on the initial image sub-features to be processed based on the hybrid convolution model to obtain the image features after convolution.
  • the convolution processing sub-module may be specifically used for:
  • convolution processing is performed on the temporally convolved image features in the spatial dimension to obtain the convolved image features.
  • the content determination module 116 may be specifically configured to:
  • the video content corresponding to the target video is determined.
  • each of the above units can be implemented as an independent entity, or can be combined arbitrarily, and implemented as the same or several entities.
  • for the specific implementation of each of the above units, refer to the previous method embodiments; details are not repeated here.
  • As can be seen from the above, in this embodiment of the present application, the obtaining module 111 can obtain a video frame set from the target video and extract the image features corresponding to the video frame set, where the video frame set includes at least two video frames; the division module 112 divides the image features into multiple image sub-features based on the multiple channels of the image features, the multiple image sub-features are arranged in a preset order, and each image sub-feature includes the feature of each video frame on the corresponding channel; the determination module 113 determines the image sub-features to be processed from the multiple image sub-features based on the preset order; the fusion module 114 fuses the current image sub-feature to be processed with the convolution processing result of the previous image sub-feature and performs convolution processing on the fused image feature, to obtain the convolved image feature corresponding to each image sub-feature to be processed; the splicing module 115 splices the multiple convolved image features based on the multiple channels of the convolved image features, to obtain the spliced image feature; and the content determination module 116 determines the video content corresponding to the target video based on the spliced image feature.
  • This solution splits an initial hybrid convolution model into multiple hybrid convolution models and, at the same time, adds a residual-style connection between every two hybrid convolution models, so that the multiple hybrid convolution models form a hierarchical structure.
  • The video features are then processed through multiple convolutions, which increases the receptive field in the time dimension, so that the video feature of each frame can effectively establish a connection with distant video frames.
  • At the same time, this method adds neither additional parameters nor complex computation, so the efficiency of video content recognition can be improved. A minimal code sketch of this hierarchical structure is given below.
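As a concrete illustration of the hierarchy just described, the following is a minimal PyTorch sketch (PyTorch, the tensor shapes, and all names such as `SubFeatureFusion` are assumptions for illustration only; the patent does not prescribe a specific implementation). It splits a feature of shape [T, C, H, W] into four sub-features along the channel dimension, keeps the first sub-feature untouched, convolves the second, fuses each later sub-feature with the previous convolution result before convolving it, and finally concatenates the four outputs back along the channel dimension.

```python
import torch
import torch.nn as nn

class SubFeatureFusion(nn.Module):
    """Hypothetical sketch of the channel-split hierarchical fusion described above."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        sub_c = channels // groups
        # One (2+1)D hybrid convolution per processed sub-feature (groups - 1 of them):
        # a temporal 1D convolution (kernel 3) followed by a spatial 2D convolution (3x3).
        self.temporal = nn.ModuleList(
            [nn.Conv1d(sub_c, sub_c, kernel_size=3, padding=1) for _ in range(groups - 1)])
        self.spatial = nn.ModuleList(
            [nn.Conv2d(sub_c, sub_c, kernel_size=3, padding=1) for _ in range(groups - 1)])

    def _hybrid_conv(self, x, i):
        # x: [T, C/g, H, W] -> temporal conv over T, then spatial conv over (H, W).
        t, c, h, w = x.shape
        x = x.permute(2, 3, 1, 0).reshape(h * w, c, t)   # [H*W, C/g, T]
        x = self.temporal[i](x)
        x = x.reshape(h, w, c, t).permute(3, 2, 0, 1)    # back to [T, C/g, H, W]
        return self.spatial[i](x)

    def forward(self, x):
        # x: [T, C, H, W]; split into `groups` sub-features along the channel dimension.
        subs = torch.chunk(x, self.groups, dim=1)
        outputs = [subs[0]]                              # first sub-feature is kept as-is
        prev = None
        for i, sub in enumerate(subs[1:]):
            fused = sub if prev is None else sub + prev  # residual-style fusion with previous result
            prev = self._hybrid_conv(fused, i)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)                 # spliced feature, again [T, C, H, W]

# Usage sketch: feature = torch.randn(8, 256, 14, 14); out = SubFeatureFusion(256)(feature)
```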
  • the embodiments of the present application also provide a computer device, which can integrate any video content recognition apparatus provided in the embodiments of the present application.
  • FIG. 12 shows a schematic structural diagram of a computer device involved in an embodiment of the present application, specifically:
  • the computer device may include a processor 121 with one or more processing cores, a memory 122 with one or more computer-readable storage media, a power supply 123, an input unit 124, and other components.
  • Those skilled in the art can understand that the computer device structure shown in FIG. 12 does not constitute a limitation on the computer device; the computer device may include more or fewer components than those shown in the figure, combine certain components, or use a different arrangement of components. Specifically:
  • the processor 121 is the control center of the computer device. It uses various interfaces and lines to connect the various parts of the entire computer device, and performs the various functions of the computer device and processes data by running or executing the computer-readable instructions and/or modules stored in the memory 122 and calling the data stored in the memory 122, so as to monitor the computer device as a whole.
  • Optionally, the processor 121 may include one or more processing cores; preferably, the processor 121 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may alternatively not be integrated into the processor 121.
  • the memory 122 may be used to store computer-readable instructions and modules.
  • the processor 121 executes various functional applications and data processing by running the computer-readable instructions and modules stored in the memory 122.
  • the memory 122 may mainly include a storage area for computer-readable instructions and a storage area for data, where the storage area for computer-readable instructions may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), etc.;
  • the data storage area can store data created according to the use of the computer equipment, etc.
  • the memory 122 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • the memory 122 may further include a memory controller to provide the processor 121 with access to the memory 122.
  • the computer device also includes a power supply 123 for supplying power to various components.
  • the power supply 123 may be logically connected to the processor 121 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
  • the power supply 123 may further include one or more DC or AC power supplies, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other arbitrary components.
  • the computer device may further include an input unit 124, which can be used to receive inputted digital or character information and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • the computer device may also include a display unit, etc., which will not be repeated here.
  • Specifically, in this embodiment, the processor 121 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 122 according to the following computer-readable instructions, and the processor 121 runs the application programs stored in the memory 122 to realize various functions, as follows:
  • obtain a video frame set from the target video, and extract the image features corresponding to the video frame set, where the video frame set includes at least two video frames; based on the multiple channels of the image features, divide the image features into multiple image sub-features, where the multiple image sub-features are arranged in a preset order and each image sub-feature includes the feature of each video frame on the corresponding channel; based on the preset order, determine the image sub-features to be processed from the multiple image sub-features; fuse the current image sub-feature to be processed with the convolution processing result of the previous image sub-feature, and perform convolution processing on the fused image feature, to obtain the convolved image feature corresponding to each image sub-feature to be processed; based on the multiple channels of the convolved image features, splice the multiple convolved image features to obtain the spliced image feature; and determine the video content corresponding to the target video based on the spliced image feature.
  • As can be seen from the above, this embodiment of the present application can obtain a video frame set from the target video and extract the image features corresponding to the video frame set, where the video frame set includes at least two video frames; divide the image features into multiple image sub-features based on the multiple channels of the image features, where the multiple image sub-features are arranged in a preset order and each image sub-feature includes the feature of each video frame on the corresponding channel; determine the image sub-features to be processed from the multiple image sub-features based on the preset order; fuse the current image sub-feature to be processed with the convolution processing result of the previous image sub-feature and perform convolution processing on the fused image feature, to obtain the convolved image feature corresponding to each image sub-feature to be processed; splice the multiple convolved image features based on the multiple channels of the convolved image features, to obtain the spliced image feature; and determine the video content corresponding to the target video based on the spliced image feature.
  • This solution splits an initial hybrid convolution model into multiple hybrid convolution models and, at the same time, adds a residual-style connection between every two hybrid convolution models, so that the multiple hybrid convolution models form a hierarchical structure.
  • The video features are then processed through multiple convolutions, which increases the receptive field in the time dimension, so that the video feature of each frame can effectively establish a connection with distant video frames.
  • At the same time, this method adds neither additional parameters nor complex computation, so the efficiency of video content recognition can be improved.
  • To this end, an embodiment of the present application provides a computer device in which multiple computer-readable instructions are stored; the computer-readable instructions can be loaded by a processor to perform the steps in any video content recognition method provided in the embodiments of the present application.
  • the computer-readable instruction may perform the following steps:
  • obtain a video frame set from the target video, and extract the image features corresponding to the video frame set, where the video frame set includes at least two video frames; based on the multiple channels of the image features, divide the image features into multiple image sub-features, where the multiple image sub-features are arranged in a preset order and each image sub-feature includes the feature of each video frame on the corresponding channel; based on the preset order, determine the image sub-features to be processed from the multiple image sub-features; fuse the current image sub-feature to be processed with the convolution processing result of the previous image sub-feature, and perform convolution processing on the fused image feature, to obtain the convolved image feature corresponding to each image sub-feature to be processed; based on the multiple channels of the convolved image features, splice the multiple convolved image features to obtain the spliced image feature; and determine the video content corresponding to the target video based on the spliced image feature.
  • the storage medium may include a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc, and so on.
  • Since the instructions stored in the storage medium can perform the steps in any video content recognition method provided in the embodiments of the present application, they can achieve the beneficial effects that can be achieved by any video content recognition method provided in the embodiments of the present application. For details, refer to the previous embodiments, which are not repeated here.
  • In one embodiment, a computer-readable storage medium is provided, storing computer-readable instructions. When the computer-readable instructions are executed by a processor, the processor performs the steps of the data processing method in the blockchain network described above. The steps of the data processing method in the blockchain network here may be the steps in the data processing methods in the blockchain network of each of the foregoing embodiments.
  • In one embodiment, a computer program product or computer-readable instructions are provided; the computer program product or computer-readable instructions include computer-readable instructions, and the computer-readable instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer-readable instructions from the computer-readable storage medium, and the processor executes the computer-readable instructions, so that the computer device performs the steps in the foregoing method embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

一种视频内容识别方法,包括:从目标视频中获取视频帧集,并提取视频帧集对应的图像特征,基于图像特征的多个通道,将图像特征划分为多个图像子特征,基于预设顺序,从多个图像子特征中确定待处理图像子特征,将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征,基于卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征,基于拼接后图像特征,确定目标视频对应的视频内容。

Description

视频内容识别方法、装置、存储介质、以及计算机设备
本申请要求于2020年01月08日提交中国专利局,申请号为202010016375.2,申请名称为“一种视频内容识别方法、装置、存储介质、以及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,具体涉及一种视频内容识别方法、装置、存储介质、以及计算机设备。
背景技术
随着人工智能技术研究和进步,人工智能技术在多个领域展开研究和应用,例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服等,相信随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。
其中,随着以视频为载体的信息传播方式越来越流行,各种视频相关的应用也得到了极大的发展,因此,对于视频的相关技术提出了更高的要求,作为视频处理技术中的基础任务,识别视频中的内容得到了越来越多的关注。然而,目前,相关技术是利用大量的卷积操作,建立当前视频帧与远距离视频帧之间的联系,进而识别视频内容,这种视频内容识别方法效率较低。
发明内容
一种视频内容识别方法,由计算机设备执行,包括:
从目标视频中获取视频帧集,并提取所述视频帧集对应的图像特征,其中,所述视频帧集包括至少两个视频帧;
基于所述图像特征的多个通道,将所述图像特征划分为多个图像子特征,所述多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视 频帧在相应通道上的特征;
基于所述预设顺序,从所述多个图像子特征中确定待处理图像子特征;
将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征;
基于所述卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征;及
基于所述拼接后图像特征,确定所述目标视频对应的视频内容。
一种视频内容识别装置,包括:
获取模块,用于从目标视频中获取视频帧集,并提取所述视频帧集对应的图像特征,其中,所述视频帧集包括至少两个视频帧;
划分模块,用于基于所述图像特征的多个通道,将所述图像特征划分为多个图像子特征,所述多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征;
确定模块,用于基于所述预设顺序,从所述多个图像子特征中确定待处理图像子特征;
融合模块,用于将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征;
拼接模块,用于基于所述卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征;及
内容确定模块,用于基于所述拼接后图像特征,确定所述目标视频对应的视频内容。
一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行上述视频内容识别方法的步骤。
一种计算机设备,包括存储器和一个或多个处理器,存储器中储存有计算机可读指令,计算机可读指令被处理器执行时,使得一个或多个处理器执行上述视频内容识别方法的步骤。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的视频内容识别系统的场景示意图;
图2是本申请一个实施例提供的视频内容识别方法的流程图;
图3是本申请另一个实施例提供的视频内容识别方法的流程图;
图4是本申请一个实施例提供的混合卷积模型的结构示意图;
图5是本申请一个实施例提供的预测目标视频对应视频内容的流程图;
图6是本申请一个实施例提供的多次信息融合模型的模型结构示意图;
图7是本申请一个实施例提供的多次信息融合子模型的逻辑示意图;
图8是本申请实施例提供的图像特征拆分示意图;
图9是本申请另一个实施例提供的多次信息融合子模型的逻辑示意图;
图10是本申请又一个实施例提供的多次信息融合子模型的逻辑示意图;
图11是本申请一个实施例提供的视频内容识别装置的结构示意图;
图12是本申请一个实施例提供的计算机设备的结构示意图。
具体实施方式
请参照图式,其中相同的组件符号代表相同的组件,本申请的原理是以实施在一适当的运算环境中来举例说明。以下的说明是基于所例示的本申请具体实施例,其不应被视为限制本申请未在此详述的其它具体实施例。
在以下的说明中,本申请的具体实施例将参考由一部或多部计算机所执行的步骤及符号来说明,除非另有述明。因此,这些步骤及操作将有数次提到由计算机执行,本文所指的计算机执行包括了由代表了以一结构化型式中的数据的电子信号的计算机处理单元的操作。此操作转换该数据或将其维持在该计算机的内存系统中的位置处,其可重新配置或另外以本领域测试人员 所熟知的方式来改变该计算机的运作。该数据所维持的数据结构为该内存的实体位置,其具有由该数据格式所定义的特定特性。但是,本申请原理以上述文字来说明,其并不代表为一种限制,本领域测试人员将可了解到以下所述的多种步骤及操作亦可实施在硬件当中。
本文所使用的术语“模块”可看作为在该运算系统上执行的软件对象。本文所述的不同组件、模块、引擎及服务可看作为在该运算系统上的实施对象。而本文所述的装置及方法可以以软件的方式进行实施,当然也可在硬件上进行实施,均在本申请保护范围之内。
本申请中的术语“第一”、“第二”和“第三”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或模块的过程、方法、系统、产品或设备没有限定于已列出的步骤或模块,而是某些实施例还包括没有列出的步骤或模块,或某些实施例还包括对于这些过程、方法、产品或设备固有的其它步骤或模块。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
本申请实施例提供一种视频内容识别方法,该视频内容识别方法的执行主体可以是本申请实施例提供的视频内容识别装置,或者集成了该视频内容识别装置的计算机设备,其中该视频内容识别装置可以采用硬件或者软件的方式实现。其中,计算机设备可以是智能手机、平板电脑、掌上电脑、笔记本电脑、或者台式电脑等设备。计算机设备包括但不限于计算机、网络主机、单个网络服务器、多个网络服务器集或者多个服务器构成的云。
请参阅图1,图1为本申请实施例提供的视频内容识别方法的应用场景示意图,以视频内容识别装置集成在计算机设备中为例,计算机设备可以从目标视频中获取视频帧集,并提取视频帧集对应的图像特征,其中,视频帧集包括至少两个视频帧,基于图像特征的多个通道,将图像特征划分为多个图 像子特征,多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征,基于预设顺序,从多个图像子特征中确定待处理图像子特征,将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征,基于卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征,基于拼接后图像特征,确定目标视频对应的视频内容。
本申请实施例提供的视频内容识别方法涉及人工智能领域中的计算机视觉方向。本申请实施例可以利用视频行为识别技术,提取目标视频中多个视频帧对应的图像特征,并将该图像特征划分为多个图像子特征,然后对多个图像子特征进行多次卷积处理、以及多次融合处理,以增大图像特征在时间维度的感受野,进而预测出目标视频对应的视频内容。
其中，人工智能（Artificial Intelligence，AI）是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说，人工智能是计算机科学的一个综合技术，它企图了解智能的实质，并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法，使机器具有感知、推理与决策的功能。人工智能技术是一门综合学科，涉及领域广泛，既有硬件层面的技术也有软件层面的技术。其中，人工智能软件技术主要包括计算机视觉技术、机器学习/深度学习等方向。
其中，计算机视觉技术（Computer Vision，CV）是一门研究如何使机器“看”的科学，更进一步的说，就是指通过计算机代替人眼对目标进行识别、测量等的机器视觉，并进一步进行图像处理，使图像经过计算机处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科，计算机视觉研究相关的理论和技术，试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别等技术，还包括常见的人脸识别、人体姿态识别等生物特征识别技术。
请参阅图2,图2为本申请实施例提供的视频内容识别方法的流程示意图, 该方法可由计算机设备执行,具体通过如下实施例进行说明:
S201、从目标视频中获取视频帧集,并提取视频帧集对应的图像特征。
其中,在识别某个视频对应的视频内容时,需要对视频中一段时间内的完整信息进行分析,才能够更为准确地判断视频所表达的视频内容。比如,若视频中的人物正在游泳,若仅对视频中单张视频帧进行分析,则只能确定出该视频的视频内容为人物游泳;若从视频中的一段视频时间内,取多张视频帧进行分析,则可以确定出视频中游泳者的泳姿等更为详细的内容信息。因此,在识别视频对应的视频内容时,需要获取视频中的多个视频帧。
在实际应用中,比如,可以将需要识别视频内容的视频A确定为目标视频,由于需要根据目标视频中一段视频时间里的信息,综合判断该目标视频对应的视频内容,因此,可以从视频A中获取至少两个视频帧,并根据获取到的多个视频帧构建视频帧集。
在一实施例中,为了保证从目标视频中获取到的多个视频帧,能够较为完整的还原该目标视频中一段视频时间里的信息,因此,可以将目标视频进行划分,并对每个划分后的目标子视频进行采样,得到多个视频帧。具体地,步骤“从目标视频中获取视频帧集,并提取所述视频帧集对应的图像特征”,可以包括:
确定目标视频;
将所述目标视频划分为多个目标子视频;
从每个目标子视频中获取一个视频帧,并基于多个视频帧构建视频帧集;
提取所述视频帧集的特征,得到所述视频帧集对应的图像特征。
在实际应用中,比如,可以将需要识别视频内容的视频A确定为目标视频,并将视频A划分为多个目标子视频,其中,每个目标子视频都是来源于视频A的一个视频片段。然后,从每个目标子视频中都获取一个视频帧,也即每个视频帧都对应一个目标子视频,并根据获取到的多个视频帧构建视频帧集。然后利用卷积操作等特征提取方法,对该视频帧集进行特征提取,并提取得到该视频帧集对应的图像特征,其中,该图像特征中包括每个视频帧对应的特征。
在一实施例中,由于在实际应用过程中,可能仅需要识别目标视频中某 个视频片段的视频内容,比如,当目标视频为电影A时,可能仅需要识别电影A中第20分~第25分的视频片段对应的视频内容,此时,可以从目标视频中确定需要进行视频识别的目标视频片段,并将该目标视频片段划分为多个目标子视频,然后进行后续步骤。
在一实施例中,从目标子视频中获取一个视频帧的方法可以有多种,比如,可以通过对目标子视频进行随机采样,得到目标子视频对应的视频帧;又比如,还可以将该目标子视频中第一个视频帧,作为目标子视频对应的视频帧;又比如,还可以根据目标子视频的视频时长,将位于整个目标子视频中间某时刻的一个视频帧,作为目标子视频对应的视频帧,等等。也即只要保证不同的视频帧来自于不同的目标子视频即可。
在一实施例中,为了从时长不定的目标视频中,获取到固定长度的视频帧序列,可以根据需要获取到的视频帧的预设图像数量,对目标视频进行划分。具体地,步骤“将所述目标视频划分为多个目标子视频”,可以包括:
确定预设图像数量;
基于所述预设图像数量、以及所述目标视频的视频时长,确定每个目标子视频对应的子视频时长;
基于所述子视频时长,将所述目标视频划分为多个目标子视频。
在实际应用中，比如，为了从时长不定的目标视频中，获取到固定长度的视频帧序列，因此，可以首先确定需要获取到的视频帧序列的长度，也即确定需要获取到的视频帧的预设图像数量T。若目标视频的视频时长为m分钟，此时，可以确定需要获取到的每个目标子视频对应的子视频时长为 m/T 分钟，然后，可以将整个目标视频按照子视频时长，平均划分为T个目标子视频。
在一实施例中,将目标视频划分为多个目标子视频的视频划分方法可以有多种,比如,可以如上所述,根据预设图像数量,将目标视频平均划分为多个时长相同的目标子视频;又比如,还可以首先确定需要获取的目标子视频对应的子视频时长,并根据该子视频时长对目标视频进行划分,此时,视 频时长较长的目标视频可以获取到较多个视频帧,而视频时长较短的目标视频可以获取到较少个视频帧,等等。
在实际应用中,比如,可以将需要识别视频内容的视频A确定为目标视频,此时,视频A的视频时长为24s,预设图像数量为8,则可以将视频A平均划分为子视频时长为3s的8个目标子视频,并对每个目标子视频进行随机采样,得到视频帧集,该视频帧集中包括采样得到的8个视频帧。然后可以对视频帧集进行特征提取,得到该视频帧集对应的图像特征。
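A minimal sketch of the segment-based sampling described above (plain Python; the helper name `sample_frame_indices`, the 25 fps assumption, and the random choice within each segment are illustrative — the patent also allows taking, for example, the first or the middle frame of each segment):

```python
import random

def sample_frame_indices(num_frames_total: int, num_segments: int = 8):
    """Split the video into `num_segments` equal segments and pick one frame index per segment."""
    seg_len = num_frames_total / num_segments
    indices = []
    for k in range(num_segments):
        start, end = int(k * seg_len), int((k + 1) * seg_len)
        indices.append(random.randrange(start, max(end, start + 1)))
    return indices

# Example: a 24 s video at an assumed 25 fps has 600 frames; this returns 8 indices, one per 3 s segment.
print(sample_frame_indices(600, 8))
```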
S202、基于图像特征的多个通道,将图像特征划分为多个图像子特征。
其中,深度学习中特征对应的通道的数量可以表征卷积层中卷积核的数量,比如,输入图像特征包括3个通道,卷积核的数量为10,则利用10个卷积核对输入图像特征进行卷积处理后,可以得到输出图像特征,其中,该输出图像特征包括10个通道,此时输出图像特征中通道的数量与卷积核的数量相同。
在实际应用中，比如，可以利用X表示图像特征，利用[T,C,H,W]表示特征维度大小，其中，T代表时间维度，也即视频帧集中共有T个视频帧；C代表通道数；H和W代表特征的空间维度。若将图像特征划分为4个图像子特征，则每个图像子特征对应的特征维度变为[T,C/4,H,W]。
在一实施例中,多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征。比如,如图8所示,可以从目标视频中提取出视频帧集,该视频帧集中包括8个视频帧,并基于多次卷积操作,获取到该视频帧集对应的图像特征X,该图像特征X中包括8个视频帧对应的特征,并且该图像特征X对应着按照通道1~通道256进行排列的256个通道。那么可以确定需要获取的图像子特征的特征数量为4,然后将图像特征X对应的通道1~通道256平均分为4个部分:通道1~通道64、通道65~通道128、通道129~通道192、以及通道193~通道256,并根据划分结果,得到4个图像子特征:图像子特征X1、图像子特征X2、图像子特征X3、以及图像子特征X4。其中,这4个图像子特征按照预设顺序进行排列,图像子特征X1为通道1~通道64对 应的图像子特征、图像子特征X2为通道65~通道128对应的图像子特征、图像子特征X3为通道129~通道192对应的图像子特征、图像子特征X4为通道193~通道256对应的图像子特征。并且,每个图像子特征中都包括8个视频帧对应的特征。其中,需要获取的图像子特征的数量可以根据实际情况进行调整,本申请实施例不对图像子特征的数量进行限制。
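For the 256-channel example above, the channel-wise split can be expressed directly as a chunk along the channel dimension (PyTorch and the 14x14 spatial size are assumptions for illustration; any tensor library with an equivalent operation would do):

```python
import torch

# Image feature X for 8 video frames with 256 channels.
X = torch.randn(8, 256, 14, 14)            # [T, C, H, W]
X1, X2, X3, X4 = torch.chunk(X, 4, dim=1)  # channels 1-64, 65-128, 129-192, 193-256
print(X1.shape)                            # torch.Size([8, 64, 14, 14]) -- each sub-feature keeps all 8 frames
```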
S203、基于预设顺序,从多个图像子特征中确定待处理图像子特征。
其中,由于经过卷积处理后的特征,可以增大感受野,也即可以融合更长时间范围的特征,因此,需要从多个图像子特征中,选取出一部分图像子特征作为待处理图像子特征,这些待处理图像子特征需要进行卷积处理,并得到卷积后图像特征。其中,这些卷积后图像特征中都包括视频帧集中所有视频帧对应的特征,且每个视频帧对应的特征中还融合了与相应视频帧相邻的视频帧的特征,也即,卷积后图像特征相比于原始的待处理图像特征而言,增大了感受野,且丰富了特征。
在实际应用中,比如,如图8所示,获取到按顺序排列的图像子特征X1、图像子特征X2、图像子特征X3、以及图像子特征X4之后,可以根据预设顺序,将图像子特征X2、图像子特征X3、以及图像子特征X4,确定为待处理图像子特征。其中,待处理图像子特征可以根据实际应用的需要进行调整,比如,当预设顺序不同时,从多个图像子特征中确定出的待处理图像子特征也会不同。
S204、将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征。
其中,一次卷积处理只能起到增大有限倍数感受野的效果,比如,若初始特征中包括按顺序排列的图像1的特征、图像2的特征、以及图像3的特征,且一维卷积中卷积核的尺寸为3,则初始特征经过该卷积核的卷积处理后,可以得到处理后特征,该处理后特征中包括3张图像对应的特征,但是针对处理后特征里图像2对应的特征而言,此时的特征中还融合了图像1的特征和图像3的特征,相对于初始特征而言,处理后特征在时间维度的感受野变大,但是也仅能达到融合相邻两张图像的特征的效果。
因此,如果需要利用传统的方法融合长时间范围内的信息,则需要使用深度神经网络,堆叠多个卷积。但是这种方法会存在优化问题,在深度神经网络中,一个视频帧需要经过大量的局部卷积操作,才可以建立与远距离视频帧之间的联系,因此,无论是将当前视频帧的信息传递到远距离视频帧,还是远距离视频帧将信号反馈给当前视频帧,都需要经历长距离的信号传递过程,而有效的信息在信息传递过程中很容易被削弱,并导致在远距离的两个视频帧之间无法建立有效的时间联系。
因此,可以利用一种特征融合的机制,将已经增大感受野的特征融合至当前需要进行卷积处理的待处理图像子特征中,使得在卷积处理之前,就已经增加了当前待处理图像子特征的感受野,然后再利用卷积处理使得特征的感受野再一次增加,并将再一次增加感受野的特征融合至下一次需要进行卷积处理的待处理图像子特征中,这样循环下去,可以使得特征对应的时间维度的感受野连续增加,最后达到融合更长时间范围特征的目的。
在一实施例中,步骤“将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征”,可以包括:
基于所述预设顺序,从多个待处理图像子特征中,确定初始待处理图像子特征;
对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征;
基于所述预设顺序、以及所述初始待处理图像子特征,从所述多个待处理图像子特征中,确定当前待处理图像子特征;
将所述当前待处理图像子特征与所述卷积后图像特征进行融合,得到融合后图像特征;
将所述融合后图像特征更新为初始待处理图像子特征;
返回执行对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征的步骤,直至得到每个待处理图像子特征对应的卷积后图像特征。
在实际应用中，比如，如图9所示，已经从按顺序排列的图像子特征X1、图像子特征X2、图像子特征X3、以及图像子特征X4中，确定出多个待处理图像子特征：图像子特征X2、图像子特征X3、以及图像子特征X4。可以根据预设顺序，将图像子特征X2确定为初始待处理图像子特征，并对图像子特征X2进行卷积处理，得到图像子特征X2对应的卷积后图像特征。对图像子特征X2处理完毕后，可以将图像子特征X3确定为当前待处理图像子特征，并利用与残差连接类似的连接方式，将图像子特征X2对应的卷积后图像特征、以及图像子特征X3进行加法融合，得到图像子特征X3对应的融合后图像特征，然后对图像子特征X3对应的融合后图像特征进行卷积处理，得到图像子特征X3对应的卷积后图像特征。对图像子特征X3处理完毕后，可以将图像子特征X4确定为当前待处理图像子特征，并利用与残差连接类似的连接方式，将图像子特征X3对应的卷积后图像特征、以及图像子特征X4进行加法融合，得到图像子特征X4对应的融合后图像特征，然后对图像子特征X4对应的融合后图像特征进行卷积处理，得到图像子特征X4对应的卷积后图像特征。此时，所有的待处理图像子特征都已经进行卷积处理，并得到每个待处理图像子特征对应的卷积后图像特征，说明循环的步骤可以结束。
其中，若视频帧集中包括T个视频帧，则每个图像子特征中都包括T个视频帧所对应的特征，此处以第4个视频帧（视频帧4）为例进行说明，如图10所示，也即图像子特征X1、图像子特征X2、图像子特征X3、以及图像子特征X4中都包括视频帧4所对应的特征。对图像子特征X2进行卷积处理后，图像子特征X2对应的卷积后图像特征中，视频帧4的特征会融合视频帧3、以及视频帧5的特征，此时图像子特征X2对应卷积后图像特征的感受野增大了一次。将图像子特征X2对应的卷积后图像特征以及图像子特征X3进行加法融合，并对图像子特征X3对应的融合后图像特征进行卷积处理后，图像子特征X3对应的卷积后图像特征中，视频帧4的特征会融合视频帧2、视频帧3、视频帧5、以及视频帧6的特征，此时图像子特征X3对应卷积后图像特征的感受野增大了两次。将图像子特征X3对应的卷积后图像特征以及图像子特征X4进行加法融合，并对图像子特征X4对应的融合后图像特征进行卷积处理后，图像子特征X4对应的卷积后图像特征中，视频帧4的特征会融合视频帧1、视频帧2、视频帧3、视频帧5、视频帧6、以及视频帧7的特征，此时图像子特征X4对应卷积后图像特征的感受野增大了三次，则该特征可以有效地与远距离的视频帧建立联系。
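The growth of the temporal receptive field described above (3, then 5, then 7 frames after one, two, and three kernel-3 convolutions) can be checked numerically with a small toy example. This is purely an illustrative check with a one-channel signal and random weights, not part of the patented method:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Three stacked temporal convolutions with kernel size 3 (padding keeps the length at 8 frames).
convs = nn.Sequential(*[nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False) for _ in range(3)])

frames = torch.zeros(1, 1, 8, requires_grad=True)    # 8 frames, one scalar feature each
out = convs(frames)
out[0, 0, 3].backward()                               # look at the output for frame 4 (index 3)
influencing = (frames.grad[0, 0] != 0).nonzero().flatten()
print(influencing)                                    # indices 0..6, i.e. frames 1..7: receptive field 3 -> 5 -> 7
```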
在一实施例中,可以利用混合卷积模型,对特征进行卷积处理,以达到增大感受野的目的。具体地,步骤“对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征”,可以包括:
确定初始混合卷积模型;
基于所述图像特征的多个通道,将所述初始混合卷积模型划分为多个混合卷积模型;
基于所述混合卷积模型,对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征。
其中,初始混合卷积模型可以为(2+1)D卷积模型,该(2+1)D卷积模型中可以包括两部分,分别为一维卷积子模型、以及二维卷积子模型。比如,初始混合卷积模型中可以包括时间维度上的一维卷积子模型,该一维卷积子模型的卷积核尺寸为3、以及空间维度上的二维卷积子模型,该二维卷积子模型的卷积核尺寸为3x3。利用(2+1)D卷积模型进行卷积处理,既能够实现对于时间特征的建模,又能够避免高昂的计算。
在实际应用中,比如,可以确定初始混合卷积模型,该初始混合卷积模型包括时间维度上的一维卷积子模型,一维卷积子模型的卷积核尺寸为3、以及空间维度上的二维卷积子模型,二维卷积子模型的卷积核尺寸为3x3。由于图像特征已经根据多个通道,划分为了多个图像子特征,因此,相应地,初始混合卷积模型也需要根据多个通道,划分为多个混合卷积模型,也即对初始混合卷积模型进行卷积分组,得到多个混合卷积模型。其中,由于卷积分组后卷积核尺寸不发生变化,因此,如图4所示,该混合卷积模型包括时间维度上的一维卷积子模型,一维卷积子模型的卷积核尺寸为3、以及空间维度上的二维卷积子模型,二维卷积子模型的卷积核尺寸为3x3。
其中，初始混合卷积模型中的一维卷积子模型，卷积核尺寸为3，该初始混合卷积模型针对通道数为C的图像特征时，参数量大小为CxCx3；初始混合卷积模型中的二维卷积子模型，卷积核尺寸为3x3，该初始混合卷积模型针对通道数为C的图像特征时，参数量大小为CxCx3x3。由于卷积分组不改变卷积核的尺寸，因此，混合卷积模型中的一维卷积子模型，卷积核尺寸依然为3，但是，由于混合卷积模型针对的是通道数为C/4的图像子特征，因此，参数量大小为(C/4)x(C/4)x3；混合卷积模型中的二维卷积子模型，卷积核尺寸依然为3x3，但是，由于混合卷积模型针对的是通道数为C/4的图像子特征，因此，参数量大小为(C/4)x(C/4)x3x3。
获取到划分后的混合卷积模型后,可以利用该混合卷积模型,对初始待处理图像子特征进行卷积处理,得到卷积后图像特征。
在一实施例中,获取到混合卷积模型后,就可以利用一维卷积子模型、以及二维卷积子模型,分别对特征进行卷积处理。具体地,步骤“基于所述混合卷积模型,对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征”,可以包括:
基于所述一维卷积子模型,在时间维度上对所述初始待处理图像子特征进行卷积处理,得到时间卷积后图像特征;
基于所述二维卷积子模型,在空间维度上对所述时间卷积后图像特征进行卷积处理,得到卷积后图像特征。
在实际应用中，比如，初始待处理图像子特征为图像子特征X2，且特征维度大小为[T,C/4,H,W]。可以将特征维度从[T,C/4,H,W]重组为[HxW,C/4,T]，然后利用卷积核尺寸为3的一维卷积子模型，处理图像子特征X2的时间维度T，得到时间卷积后图像特征，其中，卷积算子的参数量为(C/4)x(C/4)x3。这一过程中，图像子特征X2的空间信息被忽略，可以看作图像子特征X2总共包含T帧的特征信息，且每一帧的特征维度为(C/4)xHxW。其中，在时间维度上利用尺寸为3的卷积核进行卷积处理，相当于针对视频帧t，与和自己相邻的视频帧t-1、以及视频帧t+1进行信息融合。
然后，时间卷积后图像特征的特征维度从[HxW,C/4,T]重组为[T,C/4,H,W]，并利用卷积核尺寸为3x3的二维卷积子模型，处理时间卷积后图像特征的空间维度(H,W)，得到卷积后图像特征，其中，卷积算子的参数量为(C/4)x(C/4)x3x3。在这一过程中，时间卷积后图像特征的时间信息被忽略，可以看作时间卷积后图像特征包括HxW个像素点的特征，且每个像素点特征的维度是(C/4)xT。在这一过程中，空间维度上的每个像素点，都与相邻3x3空间区域内的像素点进行空间特征融合。最后，可以将特征维度恢复为[T,C/4,H,W]，并得到卷积后图像特征。
其中，利用初始混合卷积模型中的一维卷积子模型，进行一次卷积操作的参数量为CxCx3，但是利用混合卷积模型中的一维卷积子模型，进行一次卷积操作的参数量为(C/4)x(C/4)x3，因此，本申请实施例中进行三次卷积操作的参数量总和为3x(C/4)x(C/4)x3，与直接应用初始混合卷积模型相比，参数量反而减少了，但是却能融和更长时间范围的特征，更加完整地对视频的时间信息进行考虑并作出判断。
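The reshaping and the parameter accounting above can be sketched and checked as follows (PyTorch, the concrete shapes, and the [H*W, C/4, T] layout used for the temporal convolution are assumptions consistent with the description, not a prescribed implementation):

```python
import torch
import torch.nn as nn

T, C, H, W = 8, 256, 14, 14
sub_c = C // 4                                      # channels of one image sub-feature

temporal = nn.Conv1d(sub_c, sub_c, kernel_size=3, padding=1, bias=False)
spatial = nn.Conv2d(sub_c, sub_c, kernel_size=3, padding=1, bias=False)

x = torch.randn(T, sub_c, H, W)                     # one sub-feature, e.g. X2
x = x.permute(2, 3, 1, 0).reshape(H * W, sub_c, T)  # [H*W, C/4, T]: spatial positions act as the batch
x = temporal(x)                                     # 1D convolution over the T (time) dimension
x = x.reshape(H, W, sub_c, T).permute(3, 2, 0, 1)   # back to [T, C/4, H, W]: frames act as the batch
x = spatial(x)                                      # 3x3 convolution over the (H, W) dimensions

# Weight counts match the text: (C/4)*(C/4)*3 and (C/4)*(C/4)*3*3.
print(temporal.weight.numel(), sub_c * sub_c * 3)       # 12288 12288
print(spatial.weight.numel(), sub_c * sub_c * 3 * 3)    # 36864 36864
```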
在一实施例中,比如,在进行卷积处理的过程中,可以根据实际应用情况,对卷积核的尺寸进行调整。又比如,在进行卷积处理的过程中,还可以使得多个待处理图像子特征对应的卷积核的尺寸不同,也即针对不同的待处理图像子特征,可以利用不同尺寸的卷积核进行卷积处理,以综合考虑不同 时间尺度上的建模能力。
S205、基于卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征。
在实际应用中,比如,获取到每个待处理图像子特征对应的卷积后图像特征之后,可以根据通道,将多个卷积后图像特征拼接起来,并得到拼接后图像特征。
在一实施例中,由于希望获取到更为准确的特征,因此,还可以从多个图像子特征中确定出需要保留的原始图像子特征,使得最终获取到的拼接后图像中能够保留未经处理的特征。具体地,步骤“基于所述卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征”,可以包括:
基于所述预设顺序,从所述多个图像子特征中确定保留的原始图像子特征;
基于所述卷积后图像特征的多个通道,对多个卷积后图像特征、以及所述原始图像子特征进行拼接,得到拼接后图像特征。
在实际应用中，比如，如图9所示，可以从按顺序排列的图像子特征X1、图像子特征X2、图像子特征X3、以及图像子特征X4中，将图像子特征X1确定为需要保留的原始图像子特征。并将获取到的图像子特征X2对应的卷积后图像特征、图像子特征X3对应的卷积后图像特征、图像子特征X4对应的卷积后图像特征、以及图像子特征X1进行拼接，得到拼接后图像特征X0。其中，进行拼接的每个特征的感受野都不相同，图像子特征X1由于没有经过卷积处理，因此感受野没有增加；图像子特征X2经过一次卷积处理，感受野增加了一次；图像子特征X3经过两次卷积处理，感受野增加了两次；图像子特征X4经过三次卷积处理，感受野增加了三次。
在一实施例中,可以利用多次信息融合模型完成根据图像特征获取到拼接后图像特征的步骤,其中,如图6所示,多次信息融合模型中包括多次信息融合子模型、以及两个卷积核尺寸为1×1的二维卷积层,多次信息融合子模型可以实现上述:将基于图像特征的多个通道,将图像特征划分为多个图像子特征,基于预设顺序,从多个图像子特征中确定待处理图像子特征,将当前 待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征,基于卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征的步骤。
也即,将图像特征输入至时间信息融合模块中,即可得到输出的拼接后图像特征。其中,如图5所示,本申请实施例还可以堆叠多个多次信息融合模块(Multiple Temporal Aggregation,MTA),以实现更强、更稳定的长时间信息建模能力。
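A minimal sketch of this block layout (a fusion sub-module sandwiched between two 1x1 two-dimensional convolution layers, with several blocks stacked) is given below. PyTorch, the class name, and the use of `nn.Identity()` as a stand-in fusion sub-module are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class MultipleTemporalAggregationBlock(nn.Module):
    """Hypothetical MTA block layout: 1x1 conv -> channel-split fusion sub-module -> 1x1 conv."""
    def __init__(self, channels: int, fusion_submodule: nn.Module):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.fusion = fusion_submodule           # e.g. the hierarchical fusion sketched earlier
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                        # x: [T, C, H, W]
        return self.conv_out(self.fusion(self.conv_in(x)))

# Stacking several such blocks, as in FIG. 5, for stronger and more stable long-range temporal modelling:
blocks = nn.Sequential(*[MultipleTemporalAggregationBlock(256, nn.Identity()) for _ in range(3)])
out = blocks(torch.rand(8, 256, 14, 14))
```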
在一实施例中,本申请实施例还可以包括对多次信息融合模块的训练过程,比如,作为训练样本的目标样本视频,所对应的图像特征可以表示为X’,其特征维度大小为[N,T’,C’,H’,W’],其中,N代表训练时一个训练批次中batch的尺寸大小,T’代表时间维度,也即视频帧集中共有T’个视频帧;C’代表通道数;H’和W’代表特征的空间维度。可以将图像特征X’输入至未经训练的多次信息融合模块中,预测得到目标样本视频的预测视频内容,并基于已知的目标样本视频的实际视频内容,对未经训练的多次信息融合模块进行训练,得到多次信息融合模块。其中,整个训练过程是端到端的,多次信息融合模块的训练和视频时空特征的学习一同进行。
S206、基于拼接后图像特征,确定目标视频对应的视频内容。
在实际应用中,比如,本申请实施例的目的是识别出目标视频对应的视频内容,因此,获取到拼接后图像特征后,可以继续对该拼接后图像特征进行处理,并预测得到视频帧集中每个视频帧对应的预测分数,然后利用时间平均策略对多个视频中的预测分数进行平均,并得到对整个目标视频的最终预测。
在一实施例中,具体地,步骤“基于所述拼接后图像特征,确定所述目标视频对应的视频内容”,可以包括:
基于所述拼接后图像特征,预测得到视频帧集中每个视频帧对应的内容预测概率;
对多个视频帧对应的内容预测概率进行融合,得到所述目标视频对应的视频内容预测概率;
基于所述视频内容预测概率,确定所述目标视频对应的视频内容。
在实际应用中,比如,如图5所示,可以根据拼接后图像特征,对视频帧集中每个视频帧对应的内容预测概率进行预测,其中,根据视频帧对应的内容预测概率可以得知,该视频帧中描述每种视频内容的可能性。然后利用时间平均策略对多个视频帧对应的内容预测概率进行融合,并得到目标视频对应的视频内容预测概率。然后,根据该视频内容预测概率,可以相应地构建柱状图,并将其中概率最大的视频内容,确定为目标视频对应的视频内容“仰泳”。
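The per-frame prediction and temporal averaging described above can be sketched as follows (the linear classifier head, the number of classes, the pooled feature dimension, and the softmax-then-average order are illustrative assumptions):

```python
import torch
import torch.nn as nn

num_classes, T, feat_dim = 400, 8, 2048
classifier = nn.Linear(feat_dim, num_classes)

frame_features = torch.randn(T, feat_dim)                  # pooled spliced feature, one row per video frame
frame_probs = classifier(frame_features).softmax(dim=1)    # content prediction probability for each frame
video_probs = frame_probs.mean(dim=0)                      # temporal-average fusion over the T frames
predicted = video_probs.argmax().item()                     # index of the most likely content, e.g. "backstroke"
```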
在实际应用中,由于本申请实施例的视频内容识别方法,可以获取到融合了长时间范围特征的拼接后图像特征,因此,可以作为一种基础视频理解技术,利用融合了长时间范围特征的拼接后图像特征,进行后续的排重、个性化推荐等工作。又由于本申请实施例的视频内容识别方法,还可以识别出目标视频的视频内容,因此,还可以应用于特定的视频应用场景,比如,可以应用在审核和过滤包括涉政、暴力、色情等类别视频的场景中。
由上可知,本申请实施例可以从目标视频中获取视频帧集,并提取视频帧集对应的图像特征,其中,视频帧集包括至少两个视频帧,基于图像特征的多个通道,将图像特征划分为多个图像子特征,多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征,基于预设顺序,从多个图像子特征中确定待处理图像子特征,将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征,基于卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征,基于拼接后图像特征,确定目标视频对应的视频内容。该方案可以通过将一个初始混合卷积模型拆分为多个混合卷积模型,同时在两两混合卷积模型之间加入残差连接形式的连接,使得多个混合卷积模型构成一种层次化的结构。视频特征就会通过多次卷积处理,增加时间维度上的感受野,且每一帧的视频特征都可以有效地与远距离的视频帧之间建立联系。同时,这种方法还不会增加额外的参数,也不会增加复杂的计算,从而能够提升视频内容识别的效率。
根据前面实施例所描述的方法,以下将以该视频内容识别装置具体集成在网络设备举例作进一步详细说明。
参考图3,本申请实施例的视频内容识别方法的具体流程可以如下:
S301、网络设备从目标视频中获取T个视频帧。
在实际应用中,比如,如图5所示,网络设备可以采用稀疏采样,将目标视频平均分为T个目标子视频。然后,从每个目标子视频中随机采样,得到每个目标子视频对应的视频帧,使得时长不定的目标视频,转变为了固定长度的视频帧序列。
S302、网络设备提取该T个视频帧对应的图像特征X。
在实际应用中,比如,网络设备可以利用若干次卷积等特征提取方式,提取该T个视频帧对应的图像特征X,图像特征X中包括每个视频帧对应的特征信息。其中,可以利用[T,C,H,W]表示特征维度大小,T代表时间维度,也即共有T个视频帧;C代表通道数;H和W代表特征的空间维度。
S303、网络设备基于图像特征X的多个通道,将图像特征X拆分为图像子特征X1、图像子特征X2、图像子特征X3、以及图像子特征X4。
在实际应用中，比如，如图7所示，网络设备可以根据图像特征X的多个通道，将图像特征X划分为4个图像子特征：图像子特征X1、图像子特征X2、图像子特征X3、以及图像子特征X4，其中，每个图像子特征对应的特征维度变为[T,C/4,H,W]。并且，可以确定初始混合卷积模块，该初始混合卷积模型包括时间维度上的一维卷积子模型，以及空间维度上的二维卷积子模型。由于图像特征已经根据多个通道，划分为了多个图像子特征，因此，相应地，初始混合卷积模型也需要根据多个通道，划分为多个混合卷积模型。
其中，初始混合卷积模型中的一维卷积子模型，卷积核尺寸为3，该初始混合卷积模型针对通道数为C的图像特征时，参数量大小为CxCx3；初始混合卷积模型中的二维卷积子模型，卷积核尺寸为3x3，该初始混合卷积模型针对通道数为C的图像特征时，参数量大小为CxCx3x3。由于卷积分组不改变卷积核的尺寸，因此，混合卷积模型中的一维卷积子模型，卷积核尺寸依然为3，但是，由于混合卷积模型针对的是通道数为C/4的图像子特征，因此，参数量大小为(C/4)x(C/4)x3；混合卷积模型中的二维卷积子模型，卷积核尺寸依然为3x3，但是，由于混合卷积模型针对的是通道数为C/4的图像子特征，因此，参数量大小为(C/4)x(C/4)x3x3。
S304、网络设备对图像子特征X2进行卷积处理,得到图像子特征X2对应的卷积后图像特征。
在实际应用中，比如，如图7所示，图像子特征X2的特征维度大小为[T,C/4,H,W]。网络设备可以将特征维度从[T,C/4,H,W]重组为[HxW,C/4,T]，然后利用卷积核尺寸为3的一维卷积子模型，处理图像子特征X2的时间维度T，得到时间卷积后图像特征，其中，卷积算子的参数量为(C/4)x(C/4)x3。然后，时间卷积后图像特征的特征维度从[HxW,C/4,T]重组为[T,C/4,H,W]，并利用卷积核尺寸为3x3的二维卷积子模型，处理时间卷积后图像特征的空间维度(H,W)，得到卷积后图像特征，其中，卷积算子的参数量为(C/4)x(C/4)x3x3。最后，可以将特征维度恢复为[T,C/4,H,W]，并得到图像子特征X2对应的卷积后图像特征。
S305、网络设备将图像子特征X2对应的卷积后图像特征、以及图像子特 征X3进行加法融合,得到图像子特征X3对应的融合后图像特征。
S306、网络设备对图像子特征X3对应的融合后图像特征进行卷积处理,得到图像子特征X3对应的卷积后图像特征。
S307、网络设备将图像子特征X3对应的卷积后图像特征、以及图像子特征X4进行加法融合,得到图像子特征X4对应的融合后图像特征。
S308、网络设备对图像子特征X4对应的融合后图像特征进行卷积处理,得到图像子特征X4对应的卷积后图像特征。
S309、网络设备基于卷积后图像特征的多个通道,对多个卷积后图像特征、以及图像子特征X1进行拼接,得到拼接后图像特征。
在实际应用中，比如，如图7所示，网络设备可以根据卷积后图像特征的多个通道，将图像子特征X2对应的卷积后图像特征、图像子特征X3对应的卷积后图像特征、图像子特征X4对应的卷积后图像特征、以及图像子特征X1进行拼接，得到拼接后图像特征X0。然后，应用堆叠的多个多次信息融合模块继续对特征进行处理，以实现更强、更为稳定的长时信息建模能力。
S310、网络设备基于拼接后图像特征,确定目标视频对应的视频内容。
在实际应用中,比如,网络设备可以根据拼接后图像特征,对T个视频帧对应的内容预测概率进行预测。然后利用时间平均策略对T个视频帧对应的内容预测概率进行融合,并得到目标视频对应的视频内容预测概率。然后,根据该视频内容预测概率,可以相应地构建柱状图,并将其中概率最大的视频内容,确定为目标视频对应的视频内容。
由上可知,本申请实施例可以通过网络设备从目标视频中获取T个视频帧,提取该T个视频帧对应的图像特征X,基于图像特征X的多个通道,将图像特征X拆分为图像子特征X1、图像子特征X2、图像子特征X3、以及图像子特征X4,对图像子特征X2进行卷积处理,得到图像子特征X2对应的卷积后图像特征,将图像子特征X2对应的卷积后图像特征、以及图像子特征X3进行加法融合,得到图像子特征X3对应的融合后图像特征,对图像子特征X3对应的融合后图像特征进行卷积处理,得到图像子特征X3对应的卷积后图像特征,将图像子特征X3对应的卷积后图像特征、以及图像子特征X4进行加法融合, 得到图像子特征X4对应的融合后图像特征,对图像子特征X4对应的融合后图像特征进行卷积处理,得到图像子特征X4对应的卷积后图像特征,基于卷积后图像特征的多个通道,对多个卷积后图像特征、以及图像子特征X1进行拼接,得到拼接后图像特征,基于拼接后图像特征,确定目标视频对应的视频内容。该方案可以通过将一个初始混合卷积模型拆分为多个混合卷积模型,同时在两两混合卷积模型之间加入残差连接形式的连接,使得多个混合卷积模型构成一种层次化的结构。视频特征就会通过多次卷积处理,增加时间维度上的感受野,且每一帧的视频特征都可以有效地与远距离的视频帧之间建立联系。同时,这种方法还不会增加额外的参数,也不会增加复杂的计算,从而能够提升视频内容识别的效率。
为了更好地实施以上方法，本申请实施例还可以提供一种视频内容识别装置，该视频内容识别装置具体可以集成在计算机设备中，该计算机设备可以包括服务器、终端等，其中，终端可以包括：手机、平板电脑、笔记本电脑或个人计算机（PC，Personal Computer）等。
例如,如图11所示,该视频内容识别装置可以包括获取模块111、划分模块112、确定模块113、融合模块114、拼接模块115和内容确定模块116,如下:
获取模块111,用于从目标视频中获取视频帧集,并提取所述视频帧集对应的图像特征,其中,所述视频帧集包括至少两个视频帧;
划分模块112,用于基于所述图像特征的多个通道,将所述图像特征划分为多个图像子特征,所述多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征;
确定模块113,用于基于所述预设顺序,从所述多个图像子特征中确定待处理图像子特征;
融合模块114,用于将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征;
拼接模块115,用于基于所述卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征;
内容确定模块116,用于基于所述拼接后图像特征,确定所述目标视频对 应的视频内容。
在一实施例中,所述融合模块114可以包括第一确定子模块、卷积子模块、第二确定子模块、融合子模块、更新子模块和返回子模块,如下:
第一确定子模块,用于基于所述预设顺序,从多个待处理图像子特征中,确定初始待处理图像子特征;
卷积子模块,用于对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征;
第二确定子模块,用于基于所述预设顺序、以及所述初始待处理图像子特征,从所述多个待处理图像子特征中,确定当前待处理图像子特征;
融合子模块,用于将所述当前待处理图像子特征与所述卷积后图像特征进行融合,得到融合后图像特征;
更新子模块,用于将所述融合后图像特征更新为初始待处理图像子特征;
返回子模块,用于返回执行对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征的步骤,直至得到每个待处理图像子特征对应的卷积后图像特征。
在一实施例中,所述拼接模块115可以具体用于:
基于所述预设顺序,从所述多个图像子特征中确定保留的原始图像子特征;
基于所述卷积后图像特征的多个通道,对多个卷积后图像特征、以及所述原始图像子特征进行拼接,得到拼接后图像特征。
在一实施例中,所述获取模块111可以包括第三确定子模块、划分子模块、构建子模块和提取子模块,如下:
第三确定子模块,用于确定目标视频;
划分子模块,用于将所述目标视频划分为多个目标子视频;
构建子模块,用于从每个目标子视频中获取一个视频帧,并基于多个视频帧构建视频帧集;
提取子模块,用于提取所述视频帧集的特征,得到所述视频帧集对应的图像特征。
在一实施例中,所述划分子模块可以具体用于:
确定预设图像数量;
基于所述预设图像数量、以及所述目标视频的视频时长,确定每个目标子视频对应的子视频时长;
基于所述子视频时长,将所述目标视频划分为多个目标子视频。
在一实施例中,所述卷积子模块可以包括第四确定子模块、模型划分子模块和卷积处理子模块,如下:
第四确定子模块,用于确定初始混合卷积模型;
模型划分子模块,用于基于所述图像特征的多个通道,将所述初始混合卷积模型划分为多个混合卷积模型;
卷积处理子模块,用于基于所述混合卷积模型,对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征。
在一实施例中,所述卷积处理子模块可以具体用于:
基于所述一维卷积子模型,在时间维度上对所述初始待处理图像子特征进行卷积处理,得到时间卷积后图像特征;
基于所述二维卷积子模型,在空间维度上对所述时间卷积后图像特征进行卷积处理,得到卷积后图像特征。
在一实施例中,所述内容确定模块116可以具体用于:
基于所述拼接后图像特征,预测得到视频帧集中每个视频帧对应的内容预测概率;
对多个视频帧对应的内容预测概率进行融合,得到所述目标视频对应的视频内容预测概率;
基于所述视频内容预测概率,确定所述目标视频对应的视频内容。
具体实施时,以上各个单元可以作为独立的实体来实现,也可以进行任意组合,作为同一或若干个实体来实现,以上各个单元的具体实施可参见前面的方法实施例,在此不再赘述。
由上可知,本申请实施例可以通过获取模块111从目标视频中获取视频帧集,并提取视频帧集对应的图像特征,其中,视频帧集包括至少两个视频帧,通过划分模块112基于图像特征的多个通道,将图像特征划分为多个图像子特征,多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视 频帧在相应通道上的特征,通过确定模块113基于预设顺序,从多个图像子特征中确定待处理图像子特征,通过融合模块114将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征,通过拼接模块115基于卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征,通过内容确定模块116基于拼接后图像特征,确定目标视频对应的视频内容。该方案可以通过将一个初始混合卷积模型拆分为多个混合卷积模型,同时在两两混合卷积模型之间加入残差连接形式的连接,使得多个混合卷积模型构成一种层次化的结构。视频特征就会通过多次卷积处理,增加时间维度上的感受野,且每一帧的视频特征都可以有效地与远距离的视频帧之间建立联系。同时,这种方法还不会增加额外的参数,也不会增加复杂的计算,从而能够提升视频内容识别的效率。
本申请实施例还提供一种计算机设备,该计算机设备可以集成本申请实施例所提供的任一种视频内容识别装置。
例如,如图12所示,其示出了本申请实施例所涉及的计算机设备的结构示意图,具体来讲:
该计算机设备可以包括一个或者一个以上处理核心的处理器121、一个或一个以上计算机可读存储介质的存储器122、电源123和输入单元124等部件。本领域技术人员可以理解,图12中示出的计算机设备结构并不构成对计算机设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
处理器121是该计算机设备的控制中心,利用各种接口和线路连接整个计算机设备的各个部分,通过运行或执行存储在存储器122内的计算机可读指令和/或模块,以及调用存储在存储器122内的数据,执行计算机设备的各种功能和处理数据,从而对计算机设备进行整体监控。可选的,处理器121可包括一个或多个处理核心;优选的,处理器121可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器121中。
存储器122可用于存储计算机可读指令以及模块,处理器121通过运行存储在存储器122的计算机可读指令以及模块,从而执行各种功能应用以及数据处理。存储器122可主要包括存储计算机可读指令区和存储数据区,其中,存储计算机可读指令区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据计算机设备的使用所创建的数据等。此外,存储器122可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器122还可以包括存储器控制器,以提供处理器121对存储器122的访问。
计算机设备还包括给各个部件供电的电源123,优选的,电源123可以通过电源管理系统与处理器121逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源123还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
该计算机设备还可包括输入单元124,该输入单元124可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。
尽管未示出,计算机设备还可以包括显示单元等,在此不再赘述。具体在本实施例中,计算机设备中的处理器121会按照如下的计算机可读指令,将一个或一个以上的应用程序的进程对应的可执行文件加载到存储器122中,并由处理器121来运行存储在存储器122中的应用程序,从而实现各种功能,如下:
从目标视频中获取视频帧集,并提取视频帧集对应的图像特征,其中,视频帧集包括至少两个视频帧,基于图像特征的多个通道,将图像特征划分为多个图像子特征,多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征,基于预设顺序,从多个图像子特征中确定待处理图像子特征,将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征,基于卷积后图像特征的多个通道, 对多个卷积后图像特征进行拼接,得到拼接后图像特征,基于拼接后图像特征,确定目标视频对应的视频内容。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
由上可知,本申请实施例可以从目标视频中获取视频帧集,并提取视频帧集对应的图像特征,其中,视频帧集包括至少两个视频帧,基于图像特征的多个通道,将图像特征划分为多个图像子特征,多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征,基于预设顺序,从多个图像子特征中确定待处理图像子特征,将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征,基于卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征,基于拼接后图像特征,确定目标视频对应的视频内容。该方案可以通过将一个初始混合卷积模型拆分为多个混合卷积模型,同时在两两混合卷积模型之间加入残差连接形式的连接,使得多个混合卷积模型构成一种层次化的结构。视频特征就会通过多次卷积处理,增加时间维度上的感受野,且每一帧的视频特征都可以有效地与远距离的视频帧之间建立联系。同时,这种方法还不会增加额外的参数,也不会增加复杂的计算,从而能够提升视频内容识别的效率。
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过计算机可读指令来完成,或通过计算机可读指令控制相关的硬件来完成,该计算机可读指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。
为此,本申请实施例提供一种计算机设备,其中存储有多条计算机可读指令,该计算机可读指令能够被处理器进行加载,以执行本申请实施例所提供的任一种视频内容识别方法中的步骤。例如,该计算机可读指令可以执行如下步骤:
从目标视频中获取视频帧集,并提取视频帧集对应的图像特征,其中,视频帧集包括至少两个视频帧,基于图像特征的多个通道,将图像特征划分为多个图像子特征,多个图像子特征按照预设顺序进行排列,且每个图像子 特征包括每个视频帧在相应通道上的特征,基于预设顺序,从多个图像子特征中确定待处理图像子特征,将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征,基于卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征,基于拼接后图像特征,确定目标视频对应的视频内容。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
其中，该存储介质可以包括：只读存储器(ROM，Read Only Memory)、随机存取记忆体(RAM，Random Access Memory)、磁盘或光盘等。
由于该存储介质中所存储的指令,可以执行本申请实施例所提供的任一种视频内容识别方法中的步骤,因此,可以实现本申请实施例所提供的任一种视频内容识别方法所能实现的有益效果,详见前面的实施例,在此不再赘述。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机程序计算机可读指令,计算机程序计算机可读指令被处理器执行时,使得处理器执行上述区块链网络中的数据处理方法的步骤。此处区块链网络中的数据处理方法的步骤可以是上述各个实施例的区块链网络中的数据处理方法中的步骤。
在一个实施例中,提供了一种计算机程序产品或计算机可读指令,该计算机程序产品或计算机可读指令包括计算机可读指令,该计算机可读指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机可读指令,处理器执行该计算机可读指令,使得该计算机设备执行上述各方法实施例中的步骤。
以上对本申请实施例所提供的一种视频内容识别方法、装置、存储介质、以及电子设备进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. 一种视频内容识别方法,由计算机设备执行,所述方法包括:
    从目标视频中获取视频帧集,并提取所述视频帧集对应的图像特征,其中,所述视频帧集包括至少两个视频帧;
    基于所述图像特征的多个通道,将所述图像特征划分为多个图像子特征,所述多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征;
    基于所述预设顺序,从所述多个图像子特征中确定待处理图像子特征;
    将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征;
    基于所述卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征;及
    基于所述拼接后图像特征,确定所述目标视频对应的视频内容。
  2. 根据权利要求1所述的视频内容识别方法,其特征在于,所述将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征包括:
    基于所述预设顺序,从多个待处理图像子特征中,确定初始待处理图像子特征;
    对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征;
    基于所述预设顺序、以及所述初始待处理图像子特征,从所述多个待处理图像子特征中,确定当前待处理图像子特征;
    将所述当前待处理图像子特征与所述卷积后图像特征进行融合,得到融合后图像特征;
    将所述融合后图像特征更新为初始待处理图像子特征;及
    返回执行所述对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征的步骤,直至得到每个待处理图像子特征对应的卷积后图像特征。
  3. 根据权利要求1所述的视频内容识别方法,其特征在于,所述基于所 述卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征包括:
    基于所述预设顺序,从所述多个图像子特征中确定保留的原始图像子特征;及
    基于所述卷积后图像特征的多个通道,对多个卷积后图像特征、以及所述原始图像子特征进行拼接,得到拼接后图像特征。
  4. 根据权利要求1所述的视频内容识别方法,其特征在于,所述从目标视频中获取视频帧集,并提取所述视频帧集对应的图像特征包括:
    确定目标视频;
    将所述目标视频划分为多个目标子视频;
    从每个目标子视频中获取一个视频帧,并基于多个视频帧构建视频帧集;及
    提取所述视频帧集的特征,得到所述视频帧集对应的图像特征。
  5. 根据权利要求4所述的视频内容识别方法,其特征在于,所述将所述目标视频划分为多个目标子视频包括:
    确定预设图像数量;
    基于所述预设图像数量、以及所述目标视频的视频时长,确定每个目标子视频对应的子视频时长;及
    基于所述子视频时长,将所述目标视频划分为多个目标子视频。
  6. 根据权利要求2所述的视频内容识别方法,其特征在于,所述对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征包括:
    确定初始混合卷积模型;
    基于所述图像特征的多个通道,将所述初始混合卷积模型划分为多个混合卷积模型;及
    基于所述混合卷积模型,对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征。
  7. 根据权利要求6所述的视频内容识别方法,其特征在于,所述混合卷积模型中包括一维卷积子模型、以及二维卷积子模型;
    所述基于所述混合卷积模型,对所述初始待处理图像子特征进行卷积处 理,得到卷积后图像特征包括:
    基于所述一维卷积子模型,在时间维度上对所述初始待处理图像子特征进行卷积处理,得到时间卷积后图像特征;及
    基于所述二维卷积子模型,在空间维度上对所述时间卷积后图像特征进行卷积处理,得到卷积后图像特征。
  8. 根据权利要求1所述的视频内容识别方法,其特征在于,所述基于所述拼接后图像特征,确定所述目标视频对应的视频内容包括:
    基于所述拼接后图像特征,预测得到视频帧集中每个视频帧对应的内容预测概率;
    对多个视频帧对应的内容预测概率进行融合,得到所述目标视频对应的视频内容预测概率;及
    基于所述视频内容预测概率,确定所述目标视频对应的视频内容。
  9. 一种视频内容识别装置,包括:
    获取模块,用于从目标视频中获取视频帧集,并提取所述视频帧集对应的图像特征,其中,所述视频帧集包括至少两个视频帧;
    划分模块,用于基于所述图像特征的多个通道,将所述图像特征划分为多个图像子特征,所述多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征;
    确定模块,用于基于所述预设顺序,从所述多个图像子特征中确定待处理图像子特征;
    融合模块,用于将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的卷积后图像特征;
    拼接模块,用于基于所述卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征;及
    内容确定模块,用于基于所述拼接后图像特征,确定所述目标视频对应的视频内容。
  10. 根据权利要求9所述的视频内容识别装置,其特征在于,所述融合模块包括第一确定子模块、卷积子模块、第二确定子模块、融合子模块、更新 子模块和返回子模块,其中:
    所述第一确定子模块,用于基于所述预设顺序,从多个待处理图像子特征中,确定初始待处理图像子特征;
    所述卷积子模块,用于对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征;
    所述第二确定子模块,用于基于所述预设顺序、以及所述初始待处理图像子特征,从所述多个待处理图像子特征中,确定当前待处理图像子特征;
    所述融合子模块,用于将所述当前待处理图像子特征与所述卷积后图像特征进行融合,得到融合后图像特征;
    所述更新子模块,用于将所述融合后图像特征更新为初始待处理图像子特征;
    所述返回子模块,用于返回执行对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征的步骤,直至得到每个待处理图像子特征对应的卷积后图像特征。
  11. 根据权利要求9所述的视频内容识别装置,其特征在于,所述拼接模块,具体用于基于所述预设顺序,从所述多个图像子特征中确定保留的原始图像子特征,基于所述卷积后图像特征的多个通道,对多个卷积后图像特征、以及所述原始图像子特征进行拼接,得到拼接后图像特征。
  12. 根据权利要求9所述的视频内容识别装置,其特征在于,所述获取模块包括第三确定子模块、划分子模块、构建子模块和提取子模块,其中:
    所述第三确定子模块,用于确定目标视频;
    所述划分子模块,用于将所述目标视频划分为多个目标子视频;
    所述构建子模块,用于从每个目标子视频中获取一个视频帧,并基于多个视频帧构建视频帧集;
    所述提取子模块,用于提取所述视频帧集的特征,得到所述视频帧集对应的图像特征。
  13. 根据权利要求12所述的视频内容识别装置,其特征在于,所述划分子模块,具体用于确定预设图像数量,基于所述预设图像数量、以及所述目标视频的视频时长,确定每个目标子视频对应的子视频时长,基于所述子视 频时长,将所述目标视频划分为多个目标子视频。
  14. 根据权利要求10所述的视频内容识别装置,其特征在于,所述卷积子模块包括第四确定子模块、模型划分子模块和卷积处理子模块,其中:
    所述第四确定子模块,用于确定初始混合卷积模型;
    所述模型划分子模块,用于基于所述图像特征的多个通道,将所述初始混合卷积模型划分为多个混合卷积模型;
    所述卷积处理子模块,用于基于所述混合卷积模型,对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征。
  15. 根据权利要求14所述的视频内容识别装置,其特征在于,所述卷积处理子模块,具体用于基于所述一维卷积子模型,在时间维度上对所述初始待处理图像子特征进行卷积处理,得到时间卷积后图像特征,基于所述二维卷积子模型,在空间维度上对所述时间卷积后图像特征进行卷积处理,得到卷积后图像特征。
  16. 根据权利要求9所述的视频内容识别装置,其特征在于,所述内容确定模块,具体用于基于所述拼接后图像特征,预测得到视频帧集中每个视频帧对应的内容预测概率,对多个视频帧对应的内容预测概率进行融合,得到所述目标视频对应的视频内容预测概率,基于所述视频内容预测概率,确定所述目标视频对应的视频内容。
  17. 一个或多个存储有计算机可读指令的非易失性计算机可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行以下步骤:
    从目标视频中获取视频帧集,并提取所述视频帧集对应的图像特征,其中,所述视频帧集包括至少两个视频帧;
    基于所述图像特征的多个通道,将所述图像特征划分为多个图像子特征,所述多个图像子特征按照预设顺序进行排列,且每个图像子特征包括每个视频帧在相应通道上的特征;
    基于所述预设顺序,从所述多个图像子特征中确定待处理图像子特征;
    将当前待处理图像子特征与上一个图像子特征的卷积处理结果进行融合,并对融合后图像特征进行卷积处理,得到每个待处理图像子特征对应的 卷积后图像特征;
    基于所述卷积后图像特征的多个通道,对多个卷积后图像特征进行拼接,得到拼接后图像特征;及
    基于所述拼接后图像特征,确定所述目标视频对应的视频内容。
  18. 根据权利要求17所述的计算机可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器还执行以下步骤:
    基于所述预设顺序,从多个待处理图像子特征中,确定初始待处理图像子特征;
    对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征;
    基于所述预设顺序、以及所述初始待处理图像子特征,从所述多个待处理图像子特征中,确定当前待处理图像子特征;
    将所述当前待处理图像子特征与所述卷积后图像特征进行融合,得到融合后图像特征;
    将所述融合后图像特征更新为初始待处理图像子特征;及
    返回执行所述对所述初始待处理图像子特征进行卷积处理,得到卷积后图像特征的步骤,直至得到每个待处理图像子特征对应的卷积后图像特征。
  19. 根据权利要求17所述的计算机可读存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器还执行以下步骤:
    基于所述预设顺序,从所述多个图像子特征中确定保留的原始图像子特征;及
    基于所述卷积后图像特征的多个通道,对多个卷积后图像特征、以及所述原始图像子特征进行拼接,得到拼接后图像特征。
  20. 一种计算机设备,包括存储器和一个或多个处理器,存储器中储存有计算机可读指令,计算机可读指令被处理器执行时,使得一个或多个处理器执行如权利要求1至8中任一项所述方法的步骤。
PCT/CN2020/122152 2020-01-08 2020-10-20 视频内容识别方法、装置、存储介质、以及计算机设备 WO2021139307A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2022519175A JP7286013B2 (ja) 2020-01-08 2020-10-20 ビデオコンテンツ認識方法、装置、プログラム及びコンピュータデバイス
EP20911536.9A EP3998549A4 (en) 2020-01-08 2020-10-20 METHOD AND APPARATUS FOR RECOGNIZING VIDEO CONTENT, STORAGE MEDIA, AND COMPUTER DEVICE
KR1020227006378A KR20220038475A (ko) 2020-01-08 2020-10-20 비디오 콘텐츠 인식 방법 및 장치, 저장 매체, 및 컴퓨터 디바이스
US17/674,688 US11983926B2 (en) 2020-01-08 2022-02-17 Video content recognition method and apparatus, storage medium, and computer device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010016375.2 2020-01-08
CN202010016375.2A CN111241985B (zh) 2020-01-08 2020-01-08 一种视频内容识别方法、装置、存储介质、以及电子设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/674,688 Continuation US11983926B2 (en) 2020-01-08 2022-02-17 Video content recognition method and apparatus, storage medium, and computer device

Publications (1)

Publication Number Publication Date
WO2021139307A1 true WO2021139307A1 (zh) 2021-07-15

Family

ID=70865796

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122152 WO2021139307A1 (zh) 2020-01-08 2020-10-20 视频内容识别方法、装置、存储介质、以及计算机设备

Country Status (6)

Country Link
US (1) US11983926B2 (zh)
EP (1) EP3998549A4 (zh)
JP (1) JP7286013B2 (zh)
KR (1) KR20220038475A (zh)
CN (1) CN111241985B (zh)
WO (1) WO2021139307A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241985B (zh) 2020-01-08 2022-09-09 腾讯科技(深圳)有限公司 一种视频内容识别方法、装置、存储介质、以及电子设备
CN111950424B (zh) * 2020-08-06 2023-04-07 腾讯科技(深圳)有限公司 一种视频数据处理方法、装置、计算机及可读存储介质
CN112100075B (zh) * 2020-09-24 2024-03-15 腾讯科技(深圳)有限公司 一种用户界面回放方法、装置、设备及存储介质
CN112464831B (zh) * 2020-12-01 2021-07-30 马上消费金融股份有限公司 视频分类方法、视频分类模型的训练方法及相关设备
CN113326767A (zh) * 2021-05-28 2021-08-31 北京百度网讯科技有限公司 视频识别模型训练方法、装置、设备以及存储介质
CN113971402A (zh) * 2021-10-22 2022-01-25 北京字节跳动网络技术有限公司 内容识别方法、装置、介质及电子设备
CN114360073A (zh) * 2022-01-04 2022-04-15 腾讯科技(深圳)有限公司 一种图像识别方法及相关装置
CN117278776A (zh) * 2023-04-23 2023-12-22 青岛尘元科技信息有限公司 多通道视频内容实时比对方法和装置、设备及存储介质
CN116704206B (zh) * 2023-06-12 2024-07-12 中电金信软件有限公司 图像处理方法、装置、计算机设备和存储介质
CN117690128B (zh) * 2024-02-04 2024-05-03 武汉互创联合科技有限公司 胚胎细胞多核目标检测系统、方法和计算机可读存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN108319905A (zh) * 2018-01-25 2018-07-24 南京邮电大学 一种基于长时程深度时空网络的行为识别方法
CN108388876A (zh) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 一种图像识别方法、装置以及相关设备
US20190244028A1 (en) * 2018-02-06 2019-08-08 Mitsubishi Electric Research Laboratories, Inc. System and Method for Detecting Objects in Video Sequences
CN110210311A (zh) * 2019-04-30 2019-09-06 杰创智能科技股份有限公司 一种基于通道特征融合稀疏表示的人脸识别方法
CN110348537A (zh) * 2019-07-18 2019-10-18 北京市商汤科技开发有限公司 图像处理方法及装置、电子设备和存储介质
CN111241985A (zh) * 2020-01-08 2020-06-05 腾讯科技(深圳)有限公司 一种视频内容识别方法、装置、存储介质、以及电子设备

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9432702B2 (en) * 2014-07-07 2016-08-30 TCL Research America Inc. System and method for video program recognition
KR102301232B1 (ko) * 2017-05-31 2021-09-10 삼성전자주식회사 다채널 특징맵 영상을 처리하는 방법 및 장치
KR102415509B1 (ko) * 2017-11-10 2022-07-01 삼성전자주식회사 얼굴 인증 방법 및 장치
CN108288035A (zh) * 2018-01-11 2018-07-17 华南理工大学 基于深度学习的多通道图像特征融合的人体动作识别方法
CN108520247B (zh) * 2018-04-16 2020-04-28 腾讯科技(深圳)有限公司 对图像中的对象节点的识别方法、装置、终端及可读介质
CN110557679B (zh) 2018-06-01 2021-11-19 中国移动通信有限公司研究院 一种视频内容识别方法、设备、介质和系统
CN108846355B (zh) * 2018-06-11 2020-04-28 腾讯科技(深圳)有限公司 图像处理方法、人脸识别方法、装置和计算机设备
CN110866526A (zh) * 2018-08-28 2020-03-06 北京三星通信技术研究有限公司 图像分割方法、电子设备及计算机可读存储介质
JP7391883B2 (ja) * 2018-09-13 2023-12-05 インテル コーポレイション 顔認識のための圧縮-拡張深さ方向畳み込みニューラルネットワーク
CN110210278A (zh) * 2018-11-21 2019-09-06 腾讯科技(深圳)有限公司 一种视频目标检测方法、装置及存储介质
CN109376696B (zh) * 2018-11-28 2020-10-23 北京达佳互联信息技术有限公司 视频动作分类的方法、装置、计算机设备和存储介质
CN111382833A (zh) * 2018-12-29 2020-07-07 佳能株式会社 多层神经网络模型的训练和应用方法、装置及存储介质
CN109829392A (zh) * 2019-01-11 2019-05-31 平安科技(深圳)有限公司 考场作弊识别方法、系统、计算机设备和存储介质
CN110020639B (zh) * 2019-04-18 2021-07-23 北京奇艺世纪科技有限公司 视频特征提取方法及相关设备
CN113992848A (zh) * 2019-04-22 2022-01-28 深圳市商汤科技有限公司 视频图像处理方法及装置
KR102420104B1 (ko) * 2019-05-16 2022-07-12 삼성전자주식회사 영상 처리 장치 및 그 동작방법
CN110210344A (zh) * 2019-05-20 2019-09-06 腾讯科技(深圳)有限公司 视频动作识别方法及装置、电子设备、存储介质
CN110287875B (zh) * 2019-06-25 2022-10-21 腾讯科技(深圳)有限公司 视频目标的检测方法、装置、电子设备和存储介质
CN112215332B (zh) * 2019-07-12 2024-05-14 华为技术有限公司 神经网络结构的搜索方法、图像处理方法和装置
CN110348420B (zh) * 2019-07-18 2022-03-18 腾讯科技(深圳)有限公司 手语识别方法、装置、计算机可读存储介质和计算机设备
CN110472531B (zh) * 2019-07-29 2023-09-01 腾讯科技(深圳)有限公司 视频处理方法、装置、电子设备及存储介质
CN112446834A (zh) * 2019-09-04 2021-03-05 华为技术有限公司 图像增强方法和装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110182469A1 (en) * 2010-01-28 2011-07-28 Nec Laboratories America, Inc. 3d convolutional neural networks for automatic human action recognition
CN108319905A (zh) * 2018-01-25 2018-07-24 南京邮电大学 一种基于长时程深度时空网络的行为识别方法
US20190244028A1 (en) * 2018-02-06 2019-08-08 Mitsubishi Electric Research Laboratories, Inc. System and Method for Detecting Objects in Video Sequences
CN108388876A (zh) * 2018-03-13 2018-08-10 腾讯科技(深圳)有限公司 一种图像识别方法、装置以及相关设备
CN110210311A (zh) * 2019-04-30 2019-09-06 杰创智能科技股份有限公司 一种基于通道特征融合稀疏表示的人脸识别方法
CN110348537A (zh) * 2019-07-18 2019-10-18 北京市商汤科技开发有限公司 图像处理方法及装置、电子设备和存储介质
CN111241985A (zh) * 2020-01-08 2020-06-05 腾讯科技(深圳)有限公司 一种视频内容识别方法、装置、存储介质、以及电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3998549A4

Also Published As

Publication number Publication date
US11983926B2 (en) 2024-05-14
KR20220038475A (ko) 2022-03-28
JP2022554068A (ja) 2022-12-28
EP3998549A1 (en) 2022-05-18
CN111241985A (zh) 2020-06-05
JP7286013B2 (ja) 2023-06-02
US20220172477A1 (en) 2022-06-02
CN111241985B (zh) 2022-09-09
EP3998549A4 (en) 2022-11-23

Similar Documents

Publication Publication Date Title
WO2021139307A1 (zh) 视频内容识别方法、装置、存储介质、以及计算机设备
US11967151B2 (en) Video classification method and apparatus, model training method and apparatus, device, and storage medium
EP4145353A1 (en) Neural network construction method and apparatus
US20220028031A1 (en) Image processing method and apparatus, device, and storage medium
CN111541943B (zh) 视频处理方法、视频操作方法、装置、存储介质和设备
US20220222796A1 (en) Image processing method and apparatus, server, and storage medium
CN112232164A (zh) 一种视频分类方法和装置
CN111292262B (zh) 图像处理方法、装置、电子设备以及存储介质
CN112215171B (zh) 目标检测方法、装置、设备及计算机可读存储介质
CN110991380A (zh) 人体属性识别方法、装置、电子设备以及存储介质
WO2021103731A1 (zh) 一种语义分割方法、模型训练方法及装置
CN113704531A (zh) 图像处理方法、装置、电子设备及计算机可读存储介质
US20220157046A1 (en) Image Classification Method And Apparatus
CN112527115A (zh) 用户形象生成方法、相关装置及计算机程序产品
CN111832592A (zh) Rgbd显著性检测方法以及相关装置
US20220207913A1 (en) Method and device for training multi-task recognition model and computer-readable storage medium
CN111242019A (zh) 视频内容的检测方法、装置、电子设备以及存储介质
CN113838134B (zh) 图像关键点检测方法、装置、终端和存储介质
WO2022001364A1 (zh) 一种提取数据特征的方法和相关装置
CN112101109B (zh) 人脸关键点检测模型训练方法、装置、电子设备和介质
CN110163049B (zh) 一种人脸属性预测方法、装置及存储介质
CN112862840B (zh) 图像分割方法、装置、设备及介质
CN113706390A (zh) 图像转换模型训练方法和图像转换方法、设备及介质
CN112950641A (zh) 图像处理方法及装置、计算机可读存储介质和电子设备
US20230368520A1 (en) Fast object detection in video via scale separation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911536

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20227006378

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2020911536

Country of ref document: EP

Effective date: 20220209

ENP Entry into the national phase

Ref document number: 2022519175

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE