WO2021114688A1 - Video processing method and device based on deep learning - Google Patents

Video processing method and device based on deep learning

Info

Publication number
WO2021114688A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
features
feature
video
deep learning
Prior art date
Application number
PCT/CN2020/105991
Other languages
English (en)
French (fr)
Inventor
孟凡宇 (Meng Fanyu)
Original Assignee
苏宁云计算有限公司 (Suning Cloud Computing Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏宁云计算有限公司 (Suning Cloud Computing Co., Ltd.)
Priority to CA3164081A1 (en)
Publication of WO2021114688A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Definitions

  • The present invention relates to the technical field of computer vision, and in particular to a video processing method and device based on deep learning.
  • Existing technical solutions usually proceed in the following steps: 1) extract frames from the video; 2) apply a deep learning method to each frame to extract its features; 3) train a classifier to extract labels. Such video processing methods are time-consuming and inaccurate.
  • Embodiments of the present invention provide a video processing method and device based on deep learning that effectively improve the accuracy of the video tagging algorithm, reduce the amount of feature processing, save a large amount of manual work, increase the processing speed of the algorithm, and reduce time consumption.
  • The technical solution is as follows:
  • A video processing method based on deep learning includes:
  • extracting image frames from the video, extracting the image features of each frame through deep learning, performing shot-splitting operations according to the similarity between frames, filtering out redundant information within shots, and obtaining at least one split shot;
  • extracting image frames from the video and performing a shot-splitting operation to obtain at least one split shot includes:
  • extracting the image frames from the video and extracting the images in the image frames;
  • extracting abstract image features from each frame by a deep learning method, where the abstract image features include features of a fully connected layer of a backbone network extracted through deep learning;
  • computing the inter-frame similarity of the image features of each frame according to the abstract image features;
  • filtering according to the inter-frame similarity of the image features of each frame, and extracting at least one split shot after the filtering.
  • extracting the multi-dimensional image features that carry global image information of the video in each split shot includes:
  • extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot.
  • extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot includes:
  • performing image retrieval with LBP; and/or, constructing features by computing and counting histograms of oriented gradients over local regions of the image with HOG; and/or, collecting local features through a convolutional neural network and performing spatial pooling.
  • performing feature fusion on the multi-dimensional image features of each split shot to obtain a feature fusion result includes:
  • performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result.
  • performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result includes:
  • obtaining the overall information of the video images by weighted summation of the information of different frames of the multi-dimensional image features; and/or,
  • LSTM uses a recurrent neural network to model temporal relationships and extracts features carrying temporal information in the video; and/or,
  • NetVLAD represents a global feature encoding with aggregated local features, and for each point on each feature map computes the sum of its residuals to the corresponding cluster centers; and/or,
  • DBoW aggregates the list of feature descriptors into a compact representation for feature complementation.
  • the method further includes:
  • performing label classification with a preset multi-label classifier according to the feature vector obtained from the feature fusion result.
  • the method further includes:
  • training one classifier for each label according to the feature vector obtained from the feature fusion result and its label classification result, so as to obtain the preset multi-label classifier.
  • A video processing device based on deep learning includes:
  • an extraction and shot-splitting module, configured to extract image frames from the video, extract image features through deep learning, perform shot segmentation according to the similarity between the image features of each frame, filter out redundant frames within shots, and obtain at least one split shot;
  • a feature extraction module, configured to extract, for each split shot, multi-dimensional image features that carry video image information;
  • a feature fusion module, configured to perform feature fusion on the multi-dimensional image features of each split shot to obtain a feature fusion result.
  • the extraction and shot-splitting module is configured to:
  • extract image frames from the video and extract the images in the image frames;
  • extract abstract image features from each frame by a deep learning method, where the abstract image features include features of a fully connected layer of a backbone network extracted through deep learning;
  • compute the inter-frame similarity of the image features of each frame according to the abstract image features;
  • filter according to the inter-frame similarity of the image features of each frame, and extract at least one split shot after the filtering.
  • the feature extraction module is configured to:
  • extract, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot.
  • extracting the multi-dimensional image features that carry global image information of the video in each split shot includes:
  • performing image retrieval with LBP; and/or, constructing features by computing and counting histograms of oriented gradients over local regions of the image with HOG; and/or, collecting local features through a convolutional neural network and performing spatial pooling.
  • the feature fusion module is configured to:
  • perform feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result.
  • performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result includes:
  • obtaining the overall information of the video images by weighted summation of the information of different frames of the multi-dimensional image features; and/or,
  • LSTM uses a recurrent neural network to model temporal relationships and extracts features carrying temporal information in the video; and/or,
  • NetVLAD represents a global feature encoding with aggregated local features, and for each point on each feature map computes the sum of its residuals to the corresponding cluster centers; and/or,
  • DBoW aggregates the list of feature descriptors into a compact representation for feature complementation.
  • the deep learning-based video processing device further includes a label classification module configured to perform label classification with a preset multi-label classifier according to the feature vector obtained from the feature fusion result.
  • the label classification module is further configured to train one classifier for each label according to the feature vector obtained from the feature fusion result and its label classification result, so as to obtain the preset multi-label classifier.
  • Feature fusion is performed on the extracted multi-dimensional image features, and the traditional LBP and HOG features are used to effectively complement the features extracted by deep learning, which improves the robustness of the features extracted by the algorithm and effectively improves the accuracy and recall of the video tagging algorithm.
  • FIG. 1 is a flowchart of a video processing method based on deep learning provided by an embodiment of the present invention
  • FIG. 2 is a flowchart of sub-steps of step 101 in Figure 1;
  • FIG. 3 is an overall flowchart of a video processing method based on deep learning provided by an embodiment of the present invention
  • FIG. 4 is a detailed flowchart of a video processing method based on deep learning provided by an embodiment of the present invention.
  • Fig. 5 is a schematic structural diagram of a video processing device based on deep learning provided by an embodiment of the present invention.
  • Embodiments of the present invention provide a video processing method and device based on deep learning, which remove the redundant information of similar frames by extracting image frames from the video and performing shot-splitting operations, and perform feature fusion on the extracted multi-dimensional image features. This effectively improves the accuracy of the video tagging algorithm, reduces the amount of feature processing, saves a large amount of manual work, increases the processing speed of the algorithm, and reduces time consumption.
  • Fig. 1 is a flowchart of a video processing method based on deep learning provided by an embodiment of the present invention.
  • Fig. 2 is a flowchart of sub-steps of step 101 in Fig. 1.
  • the video processing method based on deep learning provided by the embodiment of the present invention includes the following steps:
  • Extract image frames from the video, extract image features through deep learning, perform a shot segmentation operation according to the similarity between the image features of each frame, filter out redundant frames within shots, and obtain at least one split shot.
  • the above step 101 further includes the following sub-steps:
  • the abstract image features include features of a fully connected layer of a backbone network extracted through deep learning, for example, features extracted from the fully connected layer of a ResNet network.
  • Deep learning features are generally extracted by a feature extraction model trained on the ImageNet dataset.
  • the deep learning model is generally Inception V3; of course, other commonly used models, such as DenseNet, VGG, and ResNet, can also be used.
  • the similarity threshold is determined according to the computed similarities and the desired number of split shots.
  • step 101 may be implemented in other ways in addition to the manner described in the foregoing steps, and the embodiment of the present invention does not limit the specific manner.
  • At least one feature extraction method of LBP, HOG, and deep learning network is used to extract the multi-dimensional image features that carry the video global image information in each shot.
  • the following methods can be used:
  • LBP is used for image retrieval; and/or, HOG is used to calculate and count the gradient direction histogram of the local area of the image to form features; and/or, the local features are collected through a convolutional neural network, and spatial pooling is performed.
  • LBP mainly exploits its rotation invariance to mitigate the low retrieval recall caused, in image retrieval, by variations in the shooting angle of the scene in the image.
  • HOG constructs features by computing and counting histograms of oriented gradients over local regions of the image. In an image, the appearance and shape of a local target can be well described by the density distribution of gradients or edge directions. Since HOG operates on local grid cells of the image, it maintains good invariance to both geometric and photometric deformations of the image, and these two kinds of deformations only appear over larger spatial regions.
  • A convolutional neural network (CNN) collects local features through learned convolutions and performs spatial pooling; the successive application of convolutional layers aggregates low-level semantic information over a wide spatial extent and expands it into hierarchical features carrying higher-level information.
  • step 102 may also be implemented in other manners, and the embodiment of the present invention does not limit the specific manner.
  • Feature fusion is performed on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result.
  • the overall information of the video images is obtained by weighted summation of the information of different frames of the multi-dimensional image features, that is, by weighted averaging; and/or,
  • LSTM uses a recurrent neural network to model temporal relationships and extracts features carrying temporal information in the video; and/or,
  • NetVLAD represents a global feature encoding with aggregated local features. For each point on each feature map, the sum of its residuals to the corresponding cluster centers is computed, so the result V is a K*D matrix; that is, each feature map computes a residual with respect to every cluster center, but only the residual to the nearest cluster is retained. VLAD stores the distance between each feature point and its nearest cluster center and uses it as the new encoded feature, which makes the feature more robust and effectively reduces the feature dimension; and/or,
  • DBoW aggregates the list of feature descriptors into a compact representation for feature complementation.
  • the advantage of BoW aggregation over NetVLAD is that, given a fixed number of clusters, it aggregates the list of feature descriptors into a more compact representation; the disadvantage is that significantly more clusters are required to match the richness of the aggregated descriptors, which makes it complementary to NetVLAD.
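  • As a sketch only of the bag-of-words aggregation contrasted with NetVLAD above: descriptors are quantized against a fixed vocabulary and summarized as an occurrence histogram. Scikit-learn k-means stands in here for the vocabulary training, and the vocabulary size of 64 is a placeholder, not a value given by the embodiment.

```python
# Sketch of BoW-style aggregation: descriptors are assigned to a fixed vocabulary of
# cluster centers and summarized as a normalized occurrence histogram (K is illustrative).
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_list, n_clusters=64):
    """Fit the visual vocabulary (cluster centers) on a list of local descriptor arrays."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit(np.vstack(descriptor_list))

def bow_encode(vocabulary, descriptors):
    """Aggregate one shot's descriptors into a compact K-dimensional histogram."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)
```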
  • the deep learning-based video processing method provided by the embodiment of the present invention further includes the following steps: according to the feature vector obtained by the feature fusion result, a preset multi-label classifier is used to perform label classification.
  • the preset multi-label classifier here can adopt any possible multi-label classifier in the prior art, which is not particularly limited in the embodiment of the present invention.
  • Illustratively, a softmax classifier is used; the training input is the features to be fused, the label of each classifier is a single-label binary classification (0 or 1), and there are more than 4000 classifiers in total for multi-label classification.
  • a classifier is trained for each label, and a preset multi-label classifier with a better classification effect is obtained through training.
  • FIGS. 3 and 4 are the overall flowchart and the detailed flowchart of the deep learning-based video processing method provided by the embodiments of the present invention, and show a preferred implementation of the deep learning-based video processing procedure.
  • FIG. 5 is a schematic structural diagram of a deep learning-based video processing device provided by an embodiment of the present invention.
  • the deep learning-based video processing device 2 provided by an embodiment of the present invention includes an extraction and shot-splitting module 21, a feature extraction module 22, and a feature fusion module 23.
  • the extraction and shot-splitting module 21 is configured to extract image frames from the video, extract image features through deep learning, perform shot segmentation according to the similarity between the image features of each frame, filter out redundant frames within shots, and obtain at least one split shot;
  • the feature extraction module 22 is configured to extract, for each split shot, multi-dimensional image features that carry video image information; the feature fusion module 23 is configured to perform feature fusion on the multi-dimensional image features of each split shot to obtain the feature fusion result.
  • the extraction and shot-splitting module 21 is configured to: extract image frames from the video and extract the images in the image frames; extract abstract image features from each frame through deep learning, where the abstract image features include features of a fully connected layer of a backbone network extracted through deep learning; compute the inter-frame similarity of the image features of each frame according to the abstract image features; and filter according to the inter-frame similarity of the image features of each frame, and extract at least one split shot after the filtering.
  • the feature extraction module 22 is configured to extract, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot.
  • Further, extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot includes: performing image retrieval with LBP; and/or,
  • constructing features by computing and counting histograms of oriented gradients over local regions of the image with HOG; and/or, collecting local features through a convolutional neural network and performing spatial pooling.
  • the feature fusion module 23 is configured to perform feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result. Further, performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result includes: obtaining the overall information of the video images by weighted summation of the information of different frames of the multi-dimensional image features; and/or, modeling temporal relationships with an LSTM based on a recurrent neural network to extract features carrying temporal information in the video; and/or, representing a global feature encoding with aggregated local features using NetVLAD, where for each point on each feature map the sum of its residuals to the corresponding cluster centers is computed; and/or, aggregating the list of feature descriptors into a compact representation with DBoW for feature complementation.
  • the above deep learning-based video processing device further includes a label classification module 24, where the label classification module 24 is configured to perform label classification with a preset multi-label classifier according to the feature vector obtained from the feature fusion result.
  • the label classification module 24 is also configured to train one classifier for each label according to the feature vector obtained from the feature fusion result and its label classification result, so as to obtain the preset multi-label classifier.
  • when the deep learning-based video processing device triggers a video processing service, the division into the above functional modules is used only as an example for illustration; in actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the deep learning-based video processing device provided in the above-mentioned embodiment and the deep learning-based video processing method embodiment belong to the same concept. For the specific implementation process, please refer to the method embodiment, which will not be repeated here.
  • the video processing method and device based on deep learning provided by the embodiments of the present invention have the following beneficial effects compared with the prior art:
  • the program can be stored in a computer-readable storage medium.
  • the storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.
  • These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and
  • the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and thus
  • the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a video processing method and device based on deep learning, belonging to the technical field of computer vision. The method includes: first extracting image frames from a video, then extracting image frame features by a deep learning method, and then performing shot segmentation according to the similarity between the image features of each frame to filter out redundant information within shots and obtain at least one split shot; extracting, for each split shot, multi-dimensional image features that carry video image information; and performing feature fusion on the multi-dimensional image features of each split shot to obtain a feature fusion result. The video processing method and device based on deep learning provided by the present invention effectively improve the accuracy of the video tagging algorithm, reduce the amount of feature processing, save a large amount of manual work, increase the processing speed of the algorithm, and reduce time consumption.

Description

Video processing method and device based on deep learning
Technical Field
The present invention relates to the technical field of computer vision, and in particular to a video processing method and device based on deep learning.
Background Art
At present, existing technical solutions usually proceed in the following steps: 1) extract frames from the video; 2) use a deep learning method on each frame to extract its features; 3) train a classifier to extract labels. The above video processing methods suffer from large time consumption and poor accuracy.
Summary of the Invention
Although the prior art extracts features from each image frame, it neither makes effective use of the similarity within a video nor fuses deep features with traditional features to improve algorithm accuracy. To solve these problems of the prior art, embodiments of the present invention provide a video processing method and device based on deep learning, which effectively improve the accuracy of the video tagging algorithm, reduce the amount of feature processing, save a large amount of manual work, increase the processing speed of the algorithm, and reduce time consumption. The technical solution is as follows:
In one aspect, a video processing method based on deep learning is provided, the method including:
extracting image frames from a video, extracting the image features of each frame through deep learning, performing a shot-splitting operation according to the inter-frame similarity, filtering out redundant information within shots, and obtaining at least one split shot;
extracting, for each split shot, multi-dimensional image features that carry video image information;
performing feature fusion on the multi-dimensional image features of each split shot to obtain a feature fusion result.
Further, extracting image frames from the video and performing the shot-splitting operation to obtain at least one split shot includes:
extracting image frames from the video, and extracting the images in the image frames;
extracting abstract image features from each frame by a deep learning method, the abstract image features including features of a fully connected layer of a backbone network extracted through deep learning;
computing the inter-frame similarity of the image features of each frame according to the abstract image features;
filtering according to the inter-frame similarity of the image features of each frame, and extracting at least one split shot after the filtering.
Further, extracting, for each split shot, multi-dimensional image features that carry global image information of the video includes:
extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot.
Further, extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot includes:
performing image retrieval with LBP; and/or,
constructing features by computing and counting histograms of oriented gradients over local regions of the image with HOG; and/or,
collecting local features through a convolutional neural network and performing spatial pooling.
Further, performing feature fusion on the multi-dimensional image features of each split shot to obtain the feature fusion result includes:
performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result.
Further, performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result includes:
obtaining the overall information of the video images by weighted summation of the information of different frames of the multi-dimensional image features; and/or,
using an LSTM to model temporal relationships with a recurrent neural network and extract features carrying temporal information in the video; and/or,
using NetVLAD to represent a global feature encoding with aggregated local features, where for each point on each feature map the sum of its residuals to the corresponding cluster centers is computed; and/or,
using DBoW to aggregate the list of feature descriptors into a compact representation for feature complementation.
Further, the method also includes:
performing label classification with a preset multi-label classifier according to the feature vector obtained from the feature fusion result.
Further, the method also includes:
training one classifier for each label according to the feature vector obtained from the feature fusion result and its label classification result, so as to obtain the preset multi-label classifier.
In another aspect, a video processing device based on deep learning is provided, the device including:
an extraction and shot-splitting module, configured to extract image frames from the video, extract image features through deep learning, perform shot segmentation according to the similarity between the image features of each frame, filter out redundant frames within shots, and obtain at least one split shot;
a feature extraction module, configured to extract, for each split shot, multi-dimensional image features that carry video image information;
a feature fusion module, configured to perform feature fusion on the multi-dimensional image features of each split shot to obtain a feature fusion result.
Further, the extraction and shot-splitting module is configured to:
extract image frames from the video, and extract the images in the image frames;
extract abstract image features from each frame by a deep learning method, the abstract image features including features of a fully connected layer of a backbone network extracted through deep learning;
compute the inter-frame similarity of the image features of each frame according to the abstract image features;
filter according to the inter-frame similarity of the image features of each frame, and extract at least one split shot after the filtering.
Further, the feature extraction module is configured to:
extract, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot.
Further, extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot includes:
performing image retrieval with LBP; and/or,
constructing features by computing and counting histograms of oriented gradients over local regions of the image with HOG; and/or,
collecting local features through a convolutional neural network and performing spatial pooling.
Further, the feature fusion module is configured to:
perform feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result.
Further, performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result includes:
obtaining the overall information of the video images by weighted summation of the information of different frames of the multi-dimensional image features; and/or,
using an LSTM to model temporal relationships with a recurrent neural network and extract features carrying temporal information in the video; and/or,
using NetVLAD to represent a global feature encoding with aggregated local features, where for each point on each feature map the sum of its residuals to the corresponding cluster centers is computed; and/or,
using DBoW to aggregate the list of feature descriptors into a compact representation for feature complementation.
Further, the video processing device based on deep learning also includes a label classification module, the label classification module being configured to: perform label classification with a preset multi-label classifier according to the feature vector obtained from the feature fusion result.
Further, the label classification module is also configured to: train one classifier for each label according to the feature vector obtained from the feature fusion result and its label classification result, so as to obtain the preset multi-label classifier.
The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows:
By extracting image frames from the video and performing shot-splitting operations, the redundant information of similar frames is removed, which reduces the amount of feature processing, saves a large amount of manual work, increases the processing speed of the algorithm, and reduces time consumption; and by performing feature fusion on the multi-dimensional image features extracted through multiple feature extraction methods, the traditional LBP and HOG features effectively complement the features extracted by deep learning, which improves the robustness of the features extracted by the algorithm and effectively improves the accuracy and recall of the video tagging algorithm.
Brief Description of the Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and for a person of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of a video processing method based on deep learning provided by an embodiment of the present invention;
FIG. 2 is a flowchart of the sub-steps of step 101 in FIG. 1;
FIG. 3 is an overall flowchart of the video processing method based on deep learning provided by an embodiment of the present invention;
FIG. 4 is a detailed flowchart of the video processing method based on deep learning provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a video processing device based on deep learning provided by an embodiment of the present invention.
Detailed Description of the Embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention. In the description of the present invention, "multiple" means two or more, unless otherwise expressly and specifically defined.
In view of the following state of the prior art found by the inventor: although features are extracted from each image frame, the similarity within a video is not used effectively, and the deep features are not fused with traditional features to improve algorithm accuracy, the embodiments of the present invention provide a video processing method and device based on deep learning, which remove the redundant information of similar frames by extracting image frames from the video and performing shot-splitting operations, and perform feature fusion on the extracted multi-dimensional image features, thereby effectively improving the accuracy of the video tagging algorithm while reducing the amount of feature processing, saving a large amount of manual work, increasing the processing speed of the algorithm, and reducing time consumption.
The video processing method and device based on deep learning provided by the embodiments of the present invention are described in detail below with reference to specific embodiments and the drawings.
FIG. 1 is a flowchart of the video processing method based on deep learning provided by an embodiment of the present invention. FIG. 2 is a flowchart of the sub-steps of step 101 in FIG. 1. As shown in FIG. 1, the video processing method based on deep learning provided by the embodiment of the present invention includes the following steps:
101. Extract image frames from the video, extract image features through deep learning, perform a shot segmentation operation according to the similarity between the image features of each frame, filter out redundant frames within shots, and obtain at least one split shot.
Specifically, as shown in FIG. 2, the above step 101 further includes the following sub-steps:
1011. Extract image frames from the video, and extract the images in the image frames. Any code library, such as OpenCV, can be used here to extract the video frames.
1012. Extract abstract image features from each frame by a deep learning method; the abstract image features include features of a fully connected layer of a backbone network extracted through deep learning, for example, features extracted from the fully connected layer of a ResNet network. The deep learning features are generally extracted by a feature extraction model trained on the ImageNet dataset. The deep learning model is generally Inception V3; of course, other commonly used models, such as DenseNet, VGG, and ResNet, can also be used.
1013. Compute the inter-frame similarity of the image features of each frame according to the abstract image features. Illustratively, the Euclidean distance between features is computed, and whether each frame is similar and how similar it is are then determined.
1014. Filter according to the inter-frame similarity of the image features of each frame, and extract at least one split shot after the filtering. Illustratively, the computed similarities are grouped according to the desired number of split shots to determine the similarity threshold.
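Purely for illustration, the following is a minimal sketch of sub-steps 1011 to 1014, assuming OpenCV for frame extraction and a Keras Inception V3 backbone pretrained on ImageNet; the sampling stride, the Euclidean-distance threshold, and the input path are placeholder choices rather than details fixed by the embodiment.

```python
# Sketch of sub-steps 1011-1014 (assumptions: OpenCV for decoding, Inception V3
# global-average-pooled ImageNet features, Euclidean distance, a fixed threshold).
import cv2
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input

backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")  # 2048-d frame features

def extract_frames(video_path, stride=10):
    """1011: decode the video and keep every `stride`-th frame (stride is illustrative)."""
    cap, frames, idx = cv2.VideoCapture(video_path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

def frame_features(frames):
    """1012: abstract image features from the backbone for each sampled frame."""
    batch = np.stack([cv2.resize(f, (299, 299)) for f in frames]).astype("float32")
    return backbone.predict(preprocess_input(batch), verbose=0)

def split_shots(features, threshold=20.0):
    """1013-1014: Euclidean distance between consecutive frames; a new shot starts
    where the distance exceeds the threshold (the threshold value is hypothetical)."""
    shots, current = [], [0]
    for i in range(1, len(features)):
        if np.linalg.norm(features[i] - features[i - 1]) > threshold:
            shots.append(current)
            current = []
        current.append(i)
    shots.append(current)
    return shots  # each entry is a list of frame indices forming one split shot

frames = extract_frames("example.mp4")          # hypothetical input path
shots = split_shots(frame_features(frames))
```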
It is worth noting that, in addition to the manner described in the above steps, the process of step 101 can also be implemented in other manners, and the embodiment of the present invention does not limit the specific manner.
102. Extract, for each split shot, multi-dimensional image features that carry video image information.
Specifically, the multi-dimensional image features that carry global image information of the video in each split shot are extracted through at least one feature extraction method among LBP, HOG, and a deep learning network. Preferably, the following manners can be used:
performing image retrieval with LBP; and/or, constructing features by computing and counting histograms of oriented gradients over local regions of the image with HOG; and/or, collecting local features through a convolutional neural network and performing spatial pooling.
LBP mainly exploits its rotation invariance to mitigate the low retrieval recall caused, in image retrieval, by variations in the shooting angle of the scene in the image. HOG constructs features by computing and counting histograms of oriented gradients over local regions of the image; in an image, the appearance and shape of a local target can be well described by the density distribution of gradients or edge directions. Since HOG operates on local grid cells of the image, it maintains good invariance to both geometric and photometric deformations of the image, and these two kinds of deformations only appear over larger spatial regions. A convolutional neural network (CNN) collects local features through learned convolutions and performs spatial pooling; the successive application of convolutional layers aggregates low-level semantic information over a wide spatial extent and expands it into hierarchical features carrying higher-level information.
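Purely for illustration, the LBP and HOG branches described above could be computed per frame with scikit-image as follows; the radius, number of sampling points, cell and block sizes, and histogram bin count are illustrative parameter choices rather than values specified by the embodiment.

```python
# Sketch of the hand-crafted branch of step 102: LBP and HOG descriptors for one frame.
# Parameter values (P, R, cell/block sizes, bin count) are illustrative assumptions.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern, hog

def lbp_histogram(rgb_frame, P=8, R=1.0):
    """Rotation-robust uniform LBP codes, summarized as a normalized histogram."""
    gray = rgb2gray(rgb_frame)
    codes = local_binary_pattern(gray, P=P, R=R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist  # (P + 2)-dimensional descriptor

def hog_descriptor(rgb_frame):
    """Histogram of oriented gradients over local cells of the frame."""
    gray = rgb2gray(rgb_frame)
    return hog(gray, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2), block_norm="L2-Hys")

def handcrafted_features(rgb_frame):
    """Concatenate LBP and HOG so they can later be fused with the CNN features."""
    return np.concatenate([lbp_histogram(rgb_frame), hog_descriptor(rgb_frame)])
```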
It is worth noting that, in addition to the manner described in the above steps, the process of step 102 can also be implemented in other manners, and the embodiment of the present invention does not limit the specific manner.
103. Perform feature fusion on the multi-dimensional image features of each split shot to obtain a feature fusion result.
Feature fusion is performed on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result. Preferably, the following manners can be used:
obtaining the overall information of the video images by weighted summation of the information of different frames of the multi-dimensional image features, that is, by weighted averaging; and/or,
using an LSTM to model temporal relationships with a recurrent neural network and extract features carrying temporal information in the video; and/or,
using NetVLAD to represent a global feature encoding with aggregated local features: for each point on each feature map, the sum of its residuals to the corresponding cluster centers is computed, so the result V is a K*D matrix; that is, each feature map computes a residual with respect to every cluster center, but only the residual to the nearest cluster is retained. VLAD stores the distance between each feature point and its nearest cluster center and uses it as the new encoded feature, which makes the feature more robust and effectively reduces the feature dimension; and/or,
using DBoW to aggregate the list of feature descriptors into a compact representation for feature complementation. The advantage of BoW aggregation over NetVLAD is that, given a fixed number of clusters, it aggregates the list of feature descriptors into a more compact representation; the disadvantage is that noticeably more clusters are required to match the richness of the aggregated descriptors, which makes it complementary to NetVLAD.
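Purely as an illustrative sketch (not part of the claimed embodiment), the weighted-average branch and a simplified VLAD-style branch of step 103 could look as follows; the cluster count, the uniform weights, and the use of scikit-learn k-means to obtain cluster centers are assumptions of this sketch.

```python
# Sketch of two fusion branches of step 103 over per-frame features of one split shot.
# frame_feats has shape (N, D): N frames, D-dimensional features per frame.
import numpy as np
from sklearn.cluster import KMeans

def weighted_average_fusion(frame_feats, weights=None):
    """Weighted summation of different frames to obtain the overall shot descriptor."""
    n = len(frame_feats)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    return (w[:, None] * frame_feats).sum(axis=0)

def vlad_fusion(frame_feats, n_clusters=8):
    """Simplified VLAD-style aggregation: residuals to the nearest of K cluster
    centers, summed per cluster, flattened and L2-normalized (K is illustrative)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(frame_feats)
    centers = kmeans.cluster_centers_                     # (K, D)
    assign = kmeans.predict(frame_feats)                  # nearest center per frame
    V = np.zeros_like(centers)
    for feat, k in zip(frame_feats, assign):
        V[k] += feat - centers[k]                         # residual kept only for nearest cluster
    V = V.flatten()
    return V / (np.linalg.norm(V) + 1e-12)                # (K * D)-dimensional code

shot_feats = np.random.rand(30, 2048)                     # placeholder per-frame features
fused = np.concatenate([weighted_average_fusion(shot_feats), vlad_fusion(shot_feats)])
```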
In addition, preferably, the video processing method based on deep learning provided by the embodiment of the present invention further includes the following step: performing label classification with a preset multi-label classifier according to the feature vector obtained from the feature fusion result. The preset multi-label classifier here can be any possible multi-label classifier in the prior art, which is not particularly limited in the embodiment of the present invention. Illustratively, a softmax classifier is used; the training input is the features to be fused, the label of each classifier is a single-label binary classification, 0 or 1, and there are more than 4000 classifiers in total for multi-label classification.
Further preferably, one classifier is trained for each label according to the feature vector obtained from the feature fusion result and its label classification result, and a preset multi-label classifier with a better classification effect is obtained through the training.
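As a sketch only of the per-label training described above: one independent binary classifier is fit per label. Scikit-learn logistic regression stands in here for the softmax/binary classifiers mentioned in the embodiment, and the feature dimensions, label count, and decision threshold are placeholders.

```python
# Sketch of the multi-label stage: one independent binary classifier per label.
# Logistic regression is used as a stand-in; dimensions and label count are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_label_classifiers(fused_feats, label_matrix):
    """fused_feats: (num_videos, D) fused feature vectors.
    label_matrix: (num_videos, num_labels) with entries 0 or 1."""
    classifiers = []
    for j in range(label_matrix.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(fused_feats, label_matrix[:, j])
        classifiers.append(clf)
    return classifiers

def predict_labels(classifiers, fused_feat, threshold=0.5):
    """Return the indices of labels whose positive probability exceeds the threshold."""
    probs = np.array([clf.predict_proba(fused_feat[None, :])[0, 1] for clf in classifiers])
    return np.where(probs >= threshold)[0]
```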
FIGS. 3 and 4 are the overall flowchart and the detailed flowchart of the video processing method based on deep learning provided by the embodiments of the present invention, and show a preferred implementation of the deep learning-based video processing procedure.
An embodiment of the present invention also provides a video processing device based on deep learning. FIG. 5 is a schematic structural diagram of the video processing device based on deep learning provided by an embodiment of the present invention. As shown in FIG. 5, the video processing device 2 based on deep learning provided by the embodiment of the present invention includes an extraction and shot-splitting module 21, a feature extraction module 22, and a feature fusion module 23.
The extraction and shot-splitting module 21 is configured to extract image frames from the video, extract image features through deep learning, perform shot segmentation according to the similarity between the image features of each frame, filter out redundant frames within shots, and obtain at least one split shot;
the feature extraction module 22 is configured to extract, for each split shot, multi-dimensional image features that carry video image information; and the feature fusion module 23 is configured to perform feature fusion on the multi-dimensional image features of each split shot to obtain the feature fusion result.
Specifically, the extraction and shot-splitting module 21 is configured to: extract image frames from the video, and extract the images in the image frames; extract abstract image features from each frame through deep learning, the abstract image features including features of a fully connected layer of a backbone network extracted through deep learning; compute the inter-frame similarity of the image features of each frame according to the abstract image features; and filter according to the inter-frame similarity of the image features of each frame, and extract at least one split shot after the filtering.
The feature extraction module 22 is configured to: extract, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot. Further, extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot includes: performing image retrieval with LBP; and/or, constructing features by computing and counting histograms of oriented gradients over local regions of the image with HOG; and/or, collecting local features through a convolutional neural network and performing spatial pooling.
The feature fusion module 23 is configured to: perform feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result. Further, performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result includes: obtaining the overall information of the video images by weighted summation of the information of different frames of the multi-dimensional image features; and/or, using an LSTM to model temporal relationships with a recurrent neural network and extract features carrying temporal information in the video; and/or, using NetVLAD to represent a global feature encoding with aggregated local features, where for each point on each feature map the sum of its residuals to the corresponding cluster centers is computed; and/or, using DBoW to aggregate the list of feature descriptors into a compact representation for feature complementation.
In addition, preferably, the above video processing device based on deep learning further includes a label classification module 24, and the label classification module 24 is configured to: perform label classification with a preset multi-label classifier according to the feature vector obtained from the feature fusion result. The label classification module 24 is also configured to: train one classifier for each label according to the feature vector obtained from the feature fusion result and its label classification result, so as to obtain the preset multi-label classifier.
It should be noted that when the video processing device based on deep learning provided by the above embodiment triggers a video processing service, the division into the above functional modules is used only as an example for illustration; in actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the video processing device based on deep learning provided by the above embodiment and the embodiment of the video processing method based on deep learning belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
All of the above optional technical solutions can be combined in any manner to form optional embodiments of the present invention, which are not described one by one here.
In summary, compared with the prior art, the video processing method and device based on deep learning provided by the embodiments of the present invention have the following beneficial effects:
By extracting image frames from the video and performing shot-splitting operations, the redundant information of similar frames is removed, and feature fusion is performed on the multi-dimensional image features extracted through multiple feature extraction methods, which effectively improves the accuracy of the video tagging algorithm while reducing the amount of feature processing, saving a large amount of manual work, increasing the processing speed of the algorithm, and reducing time consumption.
A person of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments can be completed by hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disk, or the like.
The embodiments of the present application are described with reference to the flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these changes and variations.
The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

  1. A video processing method based on deep learning, characterized in that the method comprises:
    extracting image frames from a video, extracting the image features of each frame through deep learning, performing a shot-splitting operation according to the inter-frame similarity, filtering out redundant information within shots, and obtaining at least one split shot;
    extracting, for each split shot, multi-dimensional image features that carry video image information;
    performing feature fusion on the multi-dimensional image features of each split shot to obtain a feature fusion result.
  2. The method according to claim 1, characterized in that extracting image frames from the video and performing the shot-splitting operation to obtain at least one split shot comprises:
    extracting image frames from the video, and extracting the images in the image frames;
    extracting abstract image features from each frame through deep learning, the abstract image features comprising features of a fully connected layer of a backbone network extracted through deep learning;
    computing the inter-frame similarity of the image features of each frame according to the abstract image features;
    filtering according to the inter-frame similarity of the image features of each frame, and extracting at least one split shot after the filtering.
  3. The method according to claim 1, characterized in that extracting, for each split shot, multi-dimensional image features that carry video image information comprises:
    extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot.
  4. The method according to claim 3, characterized in that extracting, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot comprises:
    performing image retrieval with LBP; and/or,
    constructing features by computing and counting histograms of oriented gradients over local regions of the image with HOG; and/or,
    collecting local features through a convolutional neural network and performing spatial pooling.
  5. The method according to any one of claims 1 to 4, characterized in that performing feature fusion on the multi-dimensional image features of each split shot to obtain the feature fusion result comprises:
    performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result.
  6. The method according to claim 5, characterized in that performing feature fusion on the multi-dimensional image features of each split shot by at least one feature fusion method among weighted average, LSTM, NetVLAD, and DBoW to obtain the feature fusion result comprises:
    obtaining the overall information of the video images by weighted summation of the information of different frames of the multi-dimensional image features; and/or,
    using an LSTM to model temporal relationships with a recurrent neural network and extract features carrying temporal information in the video; and/or,
    using NetVLAD to represent a global feature encoding with aggregated local features, wherein for each point on each feature map the sum of its residuals to the corresponding cluster centers is computed; and/or,
    using DBoW to aggregate the list of feature descriptors into a compact representation for feature complementation.
  7. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    performing label classification with a preset multi-label classifier according to the feature vector obtained from the feature fusion result.
  8. The method according to claim 7, characterized in that the method further comprises:
    training one classifier for each label according to the feature vector obtained from the feature fusion result and its label classification result, so as to obtain the preset multi-label classifier.
  9. A video processing device based on deep learning, characterized in that the device comprises:
    an extraction and shot-splitting module, configured to extract image frames from the video, extract image features through deep learning, perform shot segmentation according to the similarity between the image features of each frame, filter out redundant frames within shots, and obtain at least one split shot;
    a feature extraction module, configured to extract, for each split shot, multi-dimensional image features that carry video image information;
    a feature fusion module, configured to perform feature fusion on the multi-dimensional image features of each split shot to obtain a feature fusion result.
  10. The device according to claim 9, characterized in that the feature extraction module is configured to: extract, through at least one feature extraction method among LBP, HOG, and a deep learning network, the multi-dimensional image features that carry global image information of the video in each split shot.
PCT/CN2020/105991 2019-12-10 2020-07-30 Video processing method and device based on deep learning WO2021114688A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3164081A CA3164081A1 (en) 2019-12-10 2020-07-30 Video processing method and device based on deep learning

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911261511.8 2019-12-10
CN201911261511.8A CN111126197B (zh) 2019-12-10 2019-12-10 Video processing method and device based on deep learning

Publications (1)

Publication Number Publication Date
WO2021114688A1 true WO2021114688A1 (zh) 2021-06-17

Family

ID=70498238

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/105991 WO2021114688A1 (zh) 2019-12-10 2020-07-30 Video processing method and device based on deep learning

Country Status (3)

Country Link
CN (1) CN111126197B (zh)
CA (1) CA3164081A1 (zh)
WO (1) WO2021114688A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792600A (zh) * 2021-08-10 2021-12-14 武汉光庭信息技术股份有限公司 一种基于深度学习的视频抽帧方法和系统
CN114077681A (zh) * 2022-01-19 2022-02-22 腾讯科技(深圳)有限公司 一种图像数据处理方法、装置、计算机设备及存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126197B (zh) * 2019-12-10 2023-08-25 苏宁云计算有限公司 基于深度学习的视频处理方法及装置
CN111601162B (zh) * 2020-06-08 2022-08-02 北京世纪好未来教育科技有限公司 视频切分方法、装置和计算机存储介质
CN112784056B (zh) * 2020-12-31 2021-11-23 北京视连通科技有限公司 一种基于视频智能识别及智能语义搜索的短视频生成方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050195331A1 (en) * 2004-03-05 2005-09-08 Kddi R&D Laboratories, Inc. Classification apparatus for sport videos and method thereof
CN101650728A (zh) * 2009-08-26 2010-02-17 北京邮电大学 视频高层特征检索系统及其实现
CN106446015A (zh) * 2016-08-29 2017-02-22 北京工业大学 一种基于用户行为偏好的视频内容访问预测与推荐方法
CN108038414A (zh) * 2017-11-02 2018-05-15 平安科技(深圳)有限公司 基于循环神经网络的人物性格分析方法、装置及存储介质
CN111126197A (zh) * 2019-12-10 2020-05-08 苏宁云计算有限公司 基于深度学习的视频处理方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103716586A (zh) * 2013-12-12 2014-04-09 中国科学院深圳先进技术研究院 一种基于三维空间场景的监控视频融合系统和方法
CN104363385B (zh) * 2014-10-29 2017-05-10 复旦大学 一种图像融合的基于行的硬件实现方法
CN109325141B (zh) * 2018-07-26 2020-11-10 北京市商汤科技开发有限公司 图像检索方法及装置、电子设备和存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050195331A1 (en) * 2004-03-05 2005-09-08 Kddi R&D Laboratories, Inc. Classification apparatus for sport videos and method thereof
CN101650728A (zh) * 2009-08-26 2010-02-17 北京邮电大学 视频高层特征检索系统及其实现
CN106446015A (zh) * 2016-08-29 2017-02-22 北京工业大学 一种基于用户行为偏好的视频内容访问预测与推荐方法
CN108038414A (zh) * 2017-11-02 2018-05-15 平安科技(深圳)有限公司 基于循环神经网络的人物性格分析方法、装置及存储介质
CN111126197A (zh) * 2019-12-10 2020-05-08 苏宁云计算有限公司 基于深度学习的视频处理方法及装置

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792600A (zh) * 2021-08-10 2021-12-14 武汉光庭信息技术股份有限公司 一种基于深度学习的视频抽帧方法和系统
CN113792600B (zh) * 2021-08-10 2023-07-18 武汉光庭信息技术股份有限公司 一种基于深度学习的视频抽帧方法和系统
CN114077681A (zh) * 2022-01-19 2022-02-22 腾讯科技(深圳)有限公司 一种图像数据处理方法、装置、计算机设备及存储介质
CN114077681B (zh) * 2022-01-19 2022-04-12 腾讯科技(深圳)有限公司 一种图像数据处理方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN111126197B (zh) 2023-08-25
CN111126197A (zh) 2020-05-08
CA3164081A1 (en) 2021-06-17

Similar Documents

Publication Publication Date Title
WO2021114688A1 (zh) 基于深度学习的视频处理方法及装置
Li et al. A free lunch for unsupervised domain adaptive object detection without source data
Wang et al. Semi-supervised video object segmentation with super-trajectories
Caelles et al. One-shot video object segmentation
CN110263659B (zh) 一种基于三元组损失和轻量级网络的指静脉识别方法及系统
CN106599836B (zh) 多人脸跟踪方法及跟踪系统
Wang et al. Robust deep co-saliency detection with group semantic
Shi et al. Multiscale multitask deep NetVLAD for crowd counting
CN103886619B (zh) 一种融合多尺度超像素的目标跟踪方法
Ren et al. A novel squeeze YOLO-based real-time people counting approach
CN111079539B (zh) 一种基于异常追踪的视频异常行为检测方法
Zhang et al. Coarse-to-fine object detection in unmanned aerial vehicle imagery using lightweight convolutional neural network and deep motion saliency
CN113221770B (zh) 基于多特征混合学习的跨域行人重识别方法及系统
CN110009662B (zh) 人脸跟踪的方法、装置、电子设备及计算机可读存储介质
CN107563299A (zh) 一种利用ReCNN融合上下文信息的行人检测方法
CN111428664A (zh) 一种基于人工智能深度学习技术的计算机视觉的实时多人姿态估计方法
CN115115698A (zh) 设备的位姿估计方法及相关设备
Yang et al. A method of pedestrians counting based on deep learning
CN109002808B (zh) 一种人体行为识别方法及系统
CN112446417B (zh) 基于多层超像素分割的纺锤形果实图像分割方法及系统
İmamoğlu et al. Saliency detection by forward and backward cues in deep-CNN
Panigrahi et al. DSM-IDM-YOLO: Depth-wise separable module and inception depth-wise module based YOLO for pedestrian detection
CN108664968A (zh) 一种基于文本选取模型的无监督文本定位方法
CN110796650A (zh) 图像质量的评估方法及装置、电子设备、存储介质
Anoopa et al. Advanced video anomaly detection using 2D CNN and stacked LSTM with deep active learning-based model: 10.48129/kjs. splml. 19159

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20899709

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3164081

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20899709

Country of ref document: EP

Kind code of ref document: A1