WO2020019397A1 - A method and system for video depth analysis - Google Patents

A method and system for video depth analysis

Info

Publication number
WO2020019397A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
time
features
depth analysis
layer
Prior art date
Application number
PCT/CN2018/102901
Other languages
English (en)
French (fr)
Inventor
肖东晋
张立群
Original Assignee
阿依瓦(北京)技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿依瓦(北京)技术有限公司 filed Critical 阿依瓦(北京)技术有限公司
Publication of WO2020019397A1 publication Critical patent/WO2020019397A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/254 - Analysis of motion involving subtraction of images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Definitions

  • the invention belongs to the field of artificial intelligence and computer vision, and particularly relates to a method and system for video depth analysis.
  • embodiments of the present invention use a new video depth analysis system and method.
  • the system and method can significantly improve the accuracy and stability of motion video analysis and recognition, while providing convenience for hardware implementation.
  • An embodiment of the present invention provides a video depth analysis system, including:
  • a feature network that extracts features from the original video and adds a time stamp to the features
  • a transfer module that generates a spatiotemporal feature tensor based on the time-stamped features, the spatiotemporal feature tensor including a time dimension and a spatial feature dimension.
  • the video depth analysis system further includes a decision module, and the decision module processes the spatiotemporal feature tensor to generate a final decision.
  • the decision module is a decision neural network, and the decision neural network includes:
  • one or more convolutional layers, and a pooling layer and a non-linear layer corresponding to each convolutional layer; and
  • a fully-connected network that receives the analysis results of the convolutional layers, pooling layers, and non-linear layers and makes a discriminative decision.
  • the feature network includes multiple cascaded processing layers, where each later processing layer further processes the result of the previous processing layer to extract higher-level features. Each processing layer includes:
  • a convolution layer which is used to extract spatial local features in a single frame image or time-adjacent multi-frame images to be processed through a convolution operation
  • a non-linear excitation layer for processing the input convolution calculation results with a non-linear function
  • the pooling layer is used to implement downsampling on a single frame image or time-adjacent multiple frames to be processed.
  • the feature network includes a time stamping module for time stamping the spatio-temporal features output by the processing layers, and the time stamp is one or more of an absolute time, a relative time, and a frame number.
  • the feature network is a plurality of feature networks working in parallel, and simultaneously processes single-frame or partial multi-frame video data corresponding to multiple local times.
  • the transfer module includes:
  • a time-domain difference module that uses a difference operation to emphasize the differences between features at different times;
  • a feature-domain dimensionality reduction module that reduces the dimensionality of the features; and
  • a spatio-temporal feature tensor generation module that integrates the time-stamped features into a spatio-temporal feature tensor.
  • Another embodiment of the present invention provides a video depth analysis method, including:
  • Features are extracted from the original video data through a feature network, and the extracted features are time-stamped;
  • generating a spatio-temporal feature tensor from the set of time-stamped features, the spatio-temporal feature tensor including a temporal dimension and a spatial feature dimension; and
  • the decision module processes the spatiotemporal feature tensor to generate the final decision.
  • extracting features from the original video data through a feature network includes:
  • extracting, through a convolution operation, spatial local features from the single frame image or temporally adjacent multi-frame images to be processed; processing the input convolution results with a non-linear function; and performing down-sampling on the single frame image or temporally adjacent multi-frame images to be processed.
  • the time stamp is one or more of an absolute time, a relative time, and a frame number.
  • before generating the spatio-temporal feature tensor from the set of time-stamped features, the method further includes: using a difference operation to emphasize the differences between features at different times; and/or reducing the dimensionality of the features.
  • the dimensions of the features are reduced by linear or non-linear dimensionality reduction.
  • the spatio-temporal feature tensor is a two-dimensional image, and the two dimensions are a feature domain and a time domain, respectively.
  • each column of the two-dimensional image corresponds to one time stamp, and each element of that column corresponds to a feature value after dimensionality reduction.
  • processing a spatio-temporal feature tensor through a decision module to generate a final decision includes extracting specific geometric features from the spatio-temporal tensor.
  • a video depth analysis system is provided in another embodiment of the present invention, including:
  • a data storage for storing original video data, and a processing unit configured to execute the video depth analysis method described above.
  • FIG. 1 shows a schematic block diagram of a video depth analysis system according to an embodiment of the present invention.
  • FIG. 2 shows a structural block diagram of a feature network according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a transfer module 300 according to an embodiment of the present invention.
  • FIG. 4 shows a flowchart of a video depth analysis method according to an embodiment of the present invention.
  • An object of the present invention is to provide a new video depth analysis method and system.
  • This method differs from traditional methods in that its core consists of two cascaded neural networks, namely a feature network and a decision module, together with a transfer module.
  • the feature network first processes the original video data, and the result obtained is input to the transfer module as the spatio-temporal information features of the original video.
  • the transfer module integrates the features to obtain the spatiotemporal feature tensor, and then inputs it to the decision neural network for further processing, and finally obtains the analysis and recognition results of the motion video.
  • FIG. 1 shows a schematic block diagram of a video depth analysis system according to an embodiment of the present invention.
  • the video depth analysis system may include: a data storage 100, a feature network 101, a transfer module 102, and a decision module 103.
  • the data memory 100 is used for storing original video data.
  • the original video data uses frames as the minimum storage unit.
  • the stored video data is accessed in chronological order and distributed to one or more feature networks 101 for processing. When the data is accessed, each feature network may be given either a single frame or several temporally adjacent frames.
  • the data memory 100 may be a volatile memory or a non-volatile memory.
  • Non-volatile memory may include ROM, PROM, EPROM, EEPROM, flash ROM, FRAM, MRAM, RRAM, PCRAM, and so on.
  • Volatile memory may include RAM, SRAM, DRAM, SDRAM, DDR-SDRAM, and so on.
  • the feature network 101 is used for spatio-temporal feature extraction, and processes raw video data.
  • the feature network 101 uses one or more convolutional layers and corresponding pooling and non-linear layers to analyze single-frame video data, or locally adjacent multi-frame video data.
  • Feature networks are used to extract features from raw video data.
  • the features here have time localization characteristics, corresponding to each frame or several frames of video data adjacent in time. All extracted features will be time-stamped for subsequent temporal fusion operations.
  • the video depth analysis system may include one or more feature networks 101 at the same time.
  • FIG. 2 shows a structural block diagram of a feature network according to an embodiment of the present invention.
  • the feature network may include multiple cascaded processing layers 210-1, 210-2, ..., 210 -N.
  • the processing layers 210-1, 210-2, ..., 210-N have similar structures.
  • the first processing layer 210-1 may include a convolution layer 201, a non-linear layer 202, and a pooling layer 203.
  • the second processing layer 210-2 may include a convolution layer 204, a non-linear layer 205, and a pooling layer 206.
  • the Nth processing layer 210-N may include a convolution layer, a non-linear layer, and a pooling layer.
  • the convolution layer, non-linear layer, and pooling layer in each processing layer are basically similar to the convolution layer, non-linear layer, and pooling layer in other processing layers.
  • the latter processing layer is used to further process the results processed by the previous processing layer to extract higher-level features.
  • the convolution layer 201 is used to extract a spatial local feature in a single frame image or a temporally adjacent multiple frame image to be processed through a convolution operation.
  • the setting of the convolution kernel is similar to the standard convolutional neural network.
  • the coefficient value of the convolution kernel is obtained through training.
  • the non-linear excitation layer 202 is used to simulate the behavior of biological neurons, processing the input convolution results with a non-linear function.
  • Non-linear functions come in many forms and are similar to existing convolutional neural networks. The parameters of the non-linear function are obtained through training.
  • the pooling layer 203 is configured to perform downsampling on a single frame image to be processed or a temporally adjacent multiple frame image.
  • the purpose of downsampling is to expand the spatial scale of features and achieve global feature extraction.
  • the basic structure and operation of the convolution layer 204, the non-linear layer 205, and the pooling layer 206 are similar to those of the convolution layer 201, the non-linear layer 202, and the pooling layer 203.
  • they further process the results produced by the convolution layer 201, the non-linear layer 202, and the pooling layer 203 to extract higher-level features. As in a standard convolutional neural network, multiple convolutional layers, together with the corresponding non-linear excitation layers and pooling layers, can be provided in the feature network.
  • the feature network also includes a time stamping module 207, which is used to time stamp the spatiotemporal features output by the processing layers.
  • the purpose of the time stamp is to provide a time reference for subsequent time-domain fusion operations.
  • the time stamp here can take many forms, including the index of the number of image frames and the actual sampling time.
  • the space-time features marked with time are placed in the space-time feature information set.
  • Feature networks use one or more convolutional layers, together with the corresponding pooling and non-linear layers, to analyze single-frame video data or locally adjacent multi-frame video data and to extract the spatial feature information of the video data corresponding to each local time.
  • the processing result of each frame or locally adjacent multi-frame video data by the feature network is marked with appropriate time stamps to form a video space feature vector with time stamps.
  • the time stamp here may be the absolute time of feature extraction, a relative time whose zero point is the moment the first frame of video data has been processed, the frame number of the video data, or another form of time stamp.
  • the feature network in the video depth analysis system may be single or multiple. If there are multiple feature networks in the system, these feature networks can be configured to work in parallel and simultaneously process single-frame or partial multi-frame video data corresponding to multiple local times to improve processing efficiency.
  • the transfer module 102 is configured to convert the time-stamped video features into the multi-dimensional tensor form required by the subsequent decision operations, forming a spatio-temporal feature tensor.
  • the spatio-temporal feature tensor can be a two-dimensional matrix, in which case the two dimensions represent the spatial feature dimension and the time dimension, respectively.
  • the spatio-temporal feature tensor can also be a three-dimensional data cube. At this time, the three dimensions represent the spatial feature dimension, the time dimension, and other feature dimensions, such as color features.
  • the spatio-temporal feature tensor may also be a data tensor of other dimensions.
  • Each point of this tensor corresponds to an information feature value at a specific moment and a specific position in the original video. Because it is a multi-dimensional tensor, it can be processed directly by a deep network to extract spatio-temporal features of different levels and different semantics.
  • the transfer module 102 may further perform time-domain difference and feature-domain dimensionality reduction operations as needed to improve system performance.
  • FIG. 3 is a schematic structural diagram of a transfer module 300 according to an embodiment of the present invention.
  • the transfer module 300 may include a time-domain difference module 301, a feature-domain dimensionality reduction module 302, and a spatio-temporal feature tensor generation module 303.
  • the time-domain difference module 301 uses a difference operation to emphasize the differences between the spatio-temporal features at different times, which facilitates the subsequent decision. The difference can take many forms, including first-order, second-order, and higher-order differences. Whatever the form, the objects of the difference operation are the spatio-temporal features output by the feature network; differencing spatio-temporal features is fundamentally different from differencing the frames of the original images. The difference results are added to the spatio-temporal feature information set as new spatio-temporal features.
  • the feature domain dimensionality reduction module 302 is used to reduce the information redundancy of the feature domain and improve the information density of the feature domain. There are several methods for dimension reduction, such as linear and non-linear dimension reduction.
  • Linear dimensionality reduction selects a linear operator that is independent of the spatio-temporal features.
  • the input dimension of the operator equals the spatio-temporal feature dimension and its output dimension is smaller; applying the operator to each spatio-temporal feature yields a new, lower-dimensional spatio-temporal feature, which achieves the dimensionality reduction.
  • a common linear dimensionality reduction is random projection, i.e., a linear operator with random coefficients.
  • Non-linear dimensionality reduction selects a linear operator that depends on the spatio-temporal features themselves.
  • again, the input dimension of the operator equals the spatio-temporal feature dimension, and its output dimension is smaller than the input dimension.
  • the operator is applied to each spatio-temporal feature to obtain a new, lower-dimensional feature.
  • a common example is principal component analysis (PCA).
  • the linear operator used here is designed to correspond to the more significant components of the spatio-temporal features, thereby achieving the dimensionality reduction.
  • the spatio-temporal feature tensor generation module 303 is configured to integrate the feature information with time stamps into a two-dimensional image, that is, a two-dimensional spatio-temporal feature map.
  • the two dimensions are a feature domain and a time domain, respectively.
  • Each column of the image corresponds to a time stamp, and each element of the column corresponds to a feature value after dimensionality reduction.
  • the decision module 103 directly processes the spatiotemporal feature tensor to generate a final decision.
  • the decision module 103 uses one or more convolutional layers, and corresponding pooling layers and non-linear layers to analyze the spatiotemporal feature tensor, and inputs the results to the fully connected network for discriminative decision.
  • the decision module is used for temporal domain feature fusion and high-level semantic feature extraction.
  • the decision module 103 analyzes and infers the spatiotemporal feature tensor. Through training under supervised conditions, the ability to determine the type of behavior presented in the video is formed.
  • the decision module looks for specific spatio-temporal patterns present in the original video data. Such a spatio-temporal pattern manifests itself in a particular form across multiple original video frames, and these manifestations are interrelated. Therefore, for each such spatio-temporal pattern there is a specific geometric feature corresponding to it in the spatio-temporal feature tensor.
  • the role of the decision-making module is to extract this specific geometric feature from the spatiotemporal feature tensor, and to complete the analysis and discrimination task of the video image.
  • FIG. 4 shows a flowchart of a video depth analysis method according to an embodiment of the present invention.
  • step 410 features are extracted from the original video data through a feature network, and the extracted features are time-stamped.
  • the feature network is a convolutional neural network that uses one or more convolutional layers, together with the corresponding pooling and non-linear layers, to analyze single-frame video data or locally adjacent multi-frame video data and to extract the spatial feature information of the video data corresponding to each local time.
  • extracting features from the original video data through the feature network includes: extracting, through a convolution operation, spatial local features from the single frame image or temporally adjacent multi-frame images; processing the input convolution results with a non-linear function; and performing down-sampling on the single frame image or temporally adjacent multiple frames to be processed.
  • In step 420, a spatio-temporal feature tensor is generated from the set of time-stamped features.
  • the spatio-temporal feature tensor can be a two-dimensional matrix, in which case the two dimensions represent the spatial feature dimension and the time dimension, respectively.
  • the spatio-temporal feature tensor can also be a three-dimensional data cube. At this time, the three dimensions represent the spatial feature dimension, the time dimension, and other feature dimensions, such as color features.
  • the spatio-temporal feature tensor may also be a data tensor of other dimensionality. Each point of this tensor corresponds to an information feature value at a specific moment and a specific position in the original video. Because it is a multi-dimensional tensor, it can be processed directly by a deep network to extract spatio-temporal features of different levels and different semantics.
  • Before generating the spatio-temporal feature tensor from the set of time-stamped features, the method may further include: using a difference operation to emphasize the differences between features at different times; and/or reducing the dimensionality of the features, by linear or non-linear dimensionality reduction.
  • Differences can take many forms, including first, second, and higher orders. Regardless of the form of the difference, the object of the difference here is the spatiotemporal features output by the feature network.
  • the difference between the spatiotemporal features is essentially different from the difference between the frames of the original image.
  • the difference result will be added as a new spatio-temporal feature to the spatio-temporal feature information set.
  • step 430 the spatio-temporal feature tensor is processed by the decision module to generate a final decision.
  • the decision module looks for specific spatio-temporal patterns present in the original video data. Such a spatio-temporal pattern manifests itself in a particular form across multiple original video frames, and these manifestations are interrelated. Therefore, for each such spatio-temporal pattern there is a specific geometric feature corresponding to it in the spatio-temporal feature tensor.
  • the role of the decision-making module is to extract this specific geometric feature from the spatiotemporal feature tensor, and to complete the analysis and discrimination task of the video image.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video depth analysis system, comprising: a feature network that extracts features from an original video and adds time stamps to the features; and a transfer module that generates a spatio-temporal feature tensor from the time-stamped features, the spatio-temporal feature tensor including a time dimension and a spatial feature dimension.

Description

A Method and System for Video Depth Analysis
Technical Field
The invention belongs to the field of artificial intelligence and computer vision, and in particular relates to a method and system for video depth analysis.
Background Art
In recent years, deep learning methods represented by convolutional neural networks have achieved remarkable results in image analysis and object recognition. Depth analysis of single-frame images can already obtain information about specific targets from an image with a high success rate and good stability, including whether the target is present, where it is located, and how its state changes. This has provided a solid technical basis for the large-scale deployment of applications such as vehicle recognition and face recognition.
Compared with object recognition on single-frame images, joint analysis techniques for motion video composed of multiple frames are still immature. It has been recognized that the key to motion video analysis lies in how to exploit, in a coordinated way, information from different points along the time axis. How exactly to fuse information effectively along the time axis, however, remains a practical difficulty. Although methods such as 3D convolution and multi-frame decision have appeared, they are either confined to local analysis and cannot obtain global features, or they can only attend to key points on the time axis and cannot form complete, continuous sampling. Their performance therefore still falls well short of practical requirements.
Therefore, a new method and system for video analysis is needed in the art to at least partially solve the problems existing in the prior art.
Summary of the Invention
To solve the above problems, embodiments of the present invention use a new video depth analysis system and method. The system and method can significantly improve the accuracy and stability of motion video analysis and recognition, while also being convenient to implement in hardware.
An embodiment of the present invention provides a video depth analysis system, comprising:
a feature network that extracts features from an original video and adds time stamps to the features; and
a transfer module that generates a spatio-temporal feature tensor from the time-stamped features, the spatio-temporal feature tensor including a time dimension and a spatial feature dimension.
In one embodiment of the invention, the video depth analysis system further comprises a decision module, which processes the spatio-temporal feature tensor to generate a final decision.
In one embodiment of the invention, the decision module is a decision neural network comprising:
one or more convolutional layers;
a pooling layer and a non-linear layer corresponding to each convolutional layer; and
a fully-connected network that receives the analysis results of the convolutional, pooling, and non-linear layers and makes a discriminative decision.
In one embodiment of the invention, the feature network comprises a plurality of cascaded processing layers, each later processing layer further processing the result of the previous processing layer to extract higher-level features, and each processing layer comprising:
a convolution layer for extracting, through a convolution operation, spatial local features from the single frame image or temporally adjacent multi-frame images to be processed;
a non-linear excitation layer for processing the input convolution results with a non-linear function; and
a pooling layer for down-sampling the single frame image or temporally adjacent multi-frame images to be processed.
In one embodiment of the invention, the feature network comprises a time-stamping module for time-stamping the spatio-temporal features output by the processing layers, the time stamp being one or more of an absolute time, a relative time, and a frame number.
In one embodiment of the invention, the feature network is a plurality of feature networks working in parallel, simultaneously processing single-frame or local multi-frame video data corresponding to multiple local times.
In one embodiment of the invention, the transfer module comprises:
a time-domain difference module that uses a difference operation to emphasize the differences between features at different times;
a feature-domain dimensionality reduction module that reduces the dimensionality of the features; and
a spatio-temporal feature tensor generation module that integrates the time-stamped features into a spatio-temporal feature tensor.
Another embodiment of the present invention provides a video depth analysis method, comprising:
extracting features from original video data through a feature network, and time-stamping the extracted features;
generating a spatio-temporal feature tensor from the set of time-stamped features, the spatio-temporal feature tensor including a time dimension and a spatial feature dimension; and
processing the spatio-temporal feature tensor through a decision module to generate a final decision.
In another embodiment of the invention, extracting features from the original video data through the feature network comprises:
extracting, through a convolution operation, spatial local features from the single frame image or temporally adjacent multi-frame images to be processed;
processing the input convolution results with a non-linear function; and
down-sampling the single frame image or temporally adjacent multi-frame images to be processed.
In another embodiment of the invention, the time stamp is one or more of an absolute time, a relative time, and a frame number.
In another embodiment of the invention, before generating the spatio-temporal feature tensor from the set of time-stamped features, the method further comprises:
using a difference operation to emphasize the differences between features at different times; and/or
reducing the dimensionality of the features.
In another embodiment of the invention, the dimensionality of the features is reduced by linear or non-linear dimensionality reduction.
In another embodiment of the invention, the spatio-temporal feature tensor is a two-dimensional image whose two dimensions are the feature domain and the time domain, respectively; each column of the two-dimensional image corresponds to one time stamp, and each element of that column corresponds to a feature value after dimensionality reduction.
In another embodiment of the invention, processing the spatio-temporal feature tensor through the decision module to generate a final decision comprises extracting specific geometric features from the spatio-temporal tensor.
Yet another embodiment of the present invention provides a video depth analysis system, comprising:
a data storage for storing original video data; and
a processing unit configured to execute the above video depth analysis method.
Brief Description of the Drawings
To further clarify the above and other advantages and features of the embodiments of the present invention, a more specific description of the embodiments is presented with reference to the accompanying drawings. It will be appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, identical or corresponding parts are denoted by identical or similar reference signs for clarity.
FIG. 1 shows a schematic block diagram of a video depth analysis system according to an embodiment of the present invention.
FIG. 2 shows a structural block diagram of a feature network according to an embodiment of the present invention.
FIG. 3 shows a schematic structural diagram of a transfer module 300 according to an embodiment of the present invention.
FIG. 4 shows a flowchart of a video depth analysis method according to an embodiment of the present invention.
Detailed Description
In the following description, the invention is described with reference to various embodiments. Those skilled in the art will recognize, however, that the embodiments may be practiced without one or more of the specific details, or with other alternative and/or additional methods, materials, or components. In other instances, well-known structures, materials, or operations are not shown or described in detail so as not to obscure aspects of the embodiments of the invention. Similarly, for purposes of explanation, specific quantities, materials, and configurations are set forth in order to provide a thorough understanding of the embodiments of the invention. The invention may, however, be practiced without these specific details. Furthermore, it should be understood that the embodiments shown in the drawings are illustrative representations and are not necessarily drawn to scale.
In this specification, reference to "one embodiment" or "the embodiment" means that a particular feature, structure, or characteristic described in connection with that embodiment is included in at least one embodiment of the invention. The phrase "in one embodiment" appearing in various places in this specification does not necessarily always refer to the same embodiment.
An object of the present invention is to provide a new video depth analysis method and system. The method differs from conventional methods in that its core consists of two cascaded neural networks, namely a feature network and a decision module, together with a transfer module. The feature network first processes the original video data, and the result is fed into the transfer module as the spatio-temporal information features of the original video. The transfer module integrates the features into a spatio-temporal feature tensor, which is then fed into the decision neural network for further processing, finally yielding the analysis and recognition result for the motion video.
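The overall data flow can be summarized in a short sketch. The following Python pseudocode is purely illustrative; the function and variable names are assumptions made for this example (they are not terms defined in the patent), and each stage is assumed to be available as a callable.

```python
# Illustrative sketch of the three-stage flow described above; all names are
# assumptions made for this example, not terms defined in the patent.
def analyze_video(frames, feature_net, transfer, decision_net):
    """frames: iterable of (time_stamp, frame_array) pairs in chronological order."""
    stamped_features = []
    for time_stamp, frame in frames:
        feature = feature_net(frame)                    # local spatio-temporal feature
        stamped_features.append((time_stamp, feature))  # attach the time stamp
    feature_tensor = transfer(stamped_features)         # build the spatio-temporal feature tensor
    return decision_net(feature_tensor)                 # final analysis / recognition result
```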
FIG. 1 shows a schematic block diagram of a video depth analysis system according to an embodiment of the present invention. Referring to FIG. 1, the video depth analysis system may include a data storage 100, a feature network 101, a transfer module 102, and a decision module 103.
The data storage 100 stores the original video data. The original video data uses the frame as the minimum storage unit. The stored video data is accessed in chronological order and distributed to one or more feature networks 101 for processing. When the data is accessed, each feature network may be given either a single frame or several temporally adjacent frames. In specific embodiments of the invention, the data storage 100 may be a volatile memory or a non-volatile memory. Non-volatile memory may include ROM, PROM, EPROM, EEPROM, flash ROM, FRAM, MRAM, RRAM, PCRAM, and so on. Volatile memory may include RAM, SRAM, DRAM, SDRAM, DDR-SDRAM, and so on.
The feature network 101 performs spatio-temporal feature extraction and processes the raw motion video data. The feature network 101 uses one or more convolutional layers, together with the corresponding pooling and non-linear layers, to analyze single-frame video data or locally adjacent multi-frame video data. The feature network extracts features from the original video data. The features here are time-localized, corresponding to each frame or to several temporally adjacent frames of video data. All extracted features are time-stamped for the subsequent temporal fusion operations.
The video depth analysis system may include one or more feature networks 101 at the same time.
FIG. 2 shows a structural block diagram of a feature network according to an embodiment of the present invention. As shown in FIG. 2, the feature network may include multiple cascaded processing layers 210-1, 210-2, ..., 210-N. The processing layers 210-1, 210-2, ..., 210-N have similar structures. The first processing layer 210-1 may include a convolution layer 201, a non-linear layer 202, and a pooling layer 203. The second processing layer 210-2 may include a convolution layer 204, a non-linear layer 205, and a pooling layer 206. The Nth processing layer 210-N may likewise include a convolution layer, a non-linear layer, and a pooling layer. The convolution layer, non-linear layer, and pooling layer in each processing layer are essentially similar to those in the other processing layers. Each later processing layer further processes the result of the previous processing layer to extract higher-level features.
The structure, function, and operation of the convolution, non-linear, and pooling layers are described below. The convolution layer 201 extracts, through a convolution operation, spatial local features from the single frame image or temporally adjacent multi-frame images to be processed. The convolution kernels are configured as in a standard convolutional neural network, and the kernel coefficients are obtained through training.
The non-linear excitation layer 202 simulates the behavior of biological neurons and processes the input convolution results with a non-linear function. The non-linear function can take many forms, similar to those in existing convolutional neural networks; its parameters are obtained through training.
The pooling layer 203 down-samples the single frame image or temporally adjacent multi-frame images to be processed. The purpose of down-sampling is to enlarge the spatial scale of the features and achieve global feature extraction. Down-sampling can be done in many ways, similar to a standard convolutional neural network.
The basic structure and operation of the convolution layer 204, the non-linear layer 205, and the pooling layer 206 are similar to those of the convolution layer 201, the non-linear layer 202, and the pooling layer 203; they further process the results produced by the convolution layer 201, the non-linear layer 202, and the pooling layer 203 to extract higher-level features. As in a standard convolutional neural network, multiple convolutional layers, together with the corresponding non-linear excitation layers and pooling layers, can be provided in the feature network.
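As a concrete illustration, one processing layer (convolution, non-linear excitation, pooling) and a small cascade of such layers might be sketched as follows in PyTorch. The channel counts, kernel sizes, and the choice of ReLU and max pooling are assumptions for this example only; the patent leaves these open.

```python
import torch.nn as nn

def processing_layer(in_channels, out_channels):
    # One cascaded processing layer: convolution -> non-linear excitation -> pooling.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),  # spatial local features
        nn.ReLU(),                                                       # non-linear excitation
        nn.MaxPool2d(2),                                                 # down-sampling
    )

# A feature network is several such layers in cascade (N = 3 in this sketch).
# Unlike a standard classification CNN, there is no fully-connected decision
# head here; the network only produces features.
feature_net = nn.Sequential(
    processing_layer(3, 16),   # processing layer 210-1
    processing_layer(16, 32),  # processing layer 210-2
    processing_layer(32, 64),  # processing layer 210-N
)
```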
The feature network also includes a time-stamping module 207 for time-stamping the spatio-temporal features output by the above processing layers. The purpose of the time stamp is to provide a time reference for the subsequent time-domain fusion operations. The time stamp here can take many forms, including the index number of the image frame and the actual sampling time.
The time-stamped spatio-temporal features are placed into the spatio-temporal feature information set.
It should be emphasized that, unlike existing convolutional neural networks, the feature network contains no fully-connected layers for classification decisions. The role of the feature network is to extract local-time video features and time-stamp them; no decision is made here.
The feature network uses one or more convolutional layers, together with the corresponding pooling and non-linear layers, to analyze single-frame video data or locally adjacent multi-frame video data and to extract from it the spatial feature information of the video data corresponding to each local time. The result of processing each frame, or each group of locally adjacent frames, is given an appropriate time stamp, forming a time-stamped video spatial feature vector. The time stamp here may be the absolute time of feature extraction, a relative time whose zero point is the moment the first frame of video data has been processed, the frame number of the video data, or another form of time stamp.
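A minimal sketch of the time-stamping step is shown below; the dictionary layout and parameter names are assumptions made for this example, but the three stamp forms (absolute time, relative time, frame number) are the ones named above.

```python
# Attach a time stamp to the feature extracted from one frame (or one group of
# temporally adjacent frames).  Illustrative only; field names are assumptions.
def stamp_feature(feature, frame_index, capture_time, first_frame_time):
    return {
        "feature": feature,
        "frame_index": frame_index,                        # frame-number stamp
        "absolute_time": capture_time,                     # absolute-time stamp
        "relative_time": capture_time - first_frame_time,  # relative-time stamp
    }
```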
In embodiments of the invention, the video depth analysis system may contain a single feature network or multiple feature networks. If multiple feature networks exist in the system, they can be configured to work in parallel, simultaneously processing single-frame or local multi-frame video data corresponding to multiple local times, so as to improve processing efficiency.
Returning to FIG. 1, the transfer module 102 converts the time-stamped video features into the multi-dimensional tensor form required by the subsequent decision operations, forming a spatio-temporal feature tensor. The spatio-temporal feature tensor may be a two-dimensional matrix, in which case the two dimensions represent the spatial feature dimension and the time dimension, respectively. It may also be a three-dimensional data cube, in which case the three dimensions represent the spatial feature dimension, the time dimension, and another feature dimension, such as a color feature. The spatio-temporal feature tensor may also be a data tensor of other dimensionality. Each point of this tensor corresponds to an information feature value at a specific moment and a specific position in the original video. Because it is a multi-dimensional tensor, it can be processed directly by a deep network to extract spatio-temporal features of different levels and different semantics.
In one embodiment of the invention, the transfer module 102 may additionally perform time-domain difference and feature-domain dimensionality reduction operations as needed, to improve system performance.
FIG. 3 shows a schematic structural diagram of a transfer module 300 according to an embodiment of the present invention. As shown in FIG. 3, the transfer module 300 may include a time-domain difference module 301, a feature-domain dimensionality reduction module 302, and a spatio-temporal feature tensor generation module 303.
The time-domain difference module 301 uses a difference operation to emphasize the differences between the spatio-temporal features at different times, so as to facilitate the subsequent decision. The difference can take many forms, including first-order, second-order, and higher-order differences. Whatever the form, the objects of the difference operation are the spatio-temporal features output by the feature network; differencing spatio-temporal features is fundamentally different from differencing the frames of the original images. The difference results are added to the spatio-temporal feature information set as new spatio-temporal features.
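A first-order version of the time-domain difference could look like the sketch below, where each feature is assumed to be a NumPy vector; higher-order differences would apply the same operation repeatedly. This is an illustration under those assumptions, not the patent's prescribed implementation.

```python
# First-order time-domain difference over time-stamped feature vectors.
# Each difference is appended to the feature set as a new spatio-temporal feature.
def temporal_difference(stamped_features):
    """stamped_features: list of (time_stamp, feature_vector) in time order."""
    differences = []
    for (_, f_prev), (t_curr, f_curr) in zip(stamped_features, stamped_features[1:]):
        differences.append((t_curr, f_curr - f_prev))  # features are differenced, not raw frames
    return stamped_features + differences
```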
The feature-domain dimensionality reduction module 302 reduces the information redundancy of the feature domain and increases its information density. Dimensionality reduction can be done in several ways, for example linear or non-linear dimensionality reduction.
Linear dimensionality reduction selects a linear operator that is independent of the spatio-temporal features. The operator's input dimension equals the spatio-temporal feature dimension and its output dimension is smaller; applying the operator to each spatio-temporal feature yields a new, lower-dimensional spatio-temporal feature, thereby achieving dimensionality reduction. A common linear reduction is random projection, i.e., a linear operator with random coefficients.
Non-linear dimensionality reduction selects a linear operator that depends on the spatio-temporal features. Again, the operator's input dimension equals the spatio-temporal feature dimension and its output dimension is smaller; applying the operator to each spatio-temporal feature yields a new, lower-dimensional spatio-temporal feature, thereby achieving dimensionality reduction. A common example is principal component analysis (PCA), where the linear operator is designed to correspond to the more significant components of the spatio-temporal features.
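The two reduction styles mentioned above can be sketched as follows with NumPy; the target dimensionality and the Gaussian random coefficients are assumptions made for this example. Here `features` is a (time steps x feature dimension) matrix.

```python
import numpy as np

def random_projection(features, out_dim, seed=0):
    # Data-independent reduction: a linear operator with random coefficients.
    rng = np.random.default_rng(seed)
    operator = rng.standard_normal((features.shape[1], out_dim)) / np.sqrt(out_dim)
    return features @ operator

def pca_reduction(features, out_dim):
    # Data-dependent reduction: project onto the most significant components.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T
```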
It is worth pointing out that neither the time-domain difference nor the feature-domain dimensionality reduction is mandatory. Depending on the complexity of the video images being processed, these two modules can be loaded into the system to improve performance, or they can be omitted, in which case the input of the transfer module is passed directly to the spatio-temporal feature tensor generation module 303.
In one embodiment of the invention, the spatio-temporal feature tensor generation module 303 integrates the time-stamped feature information into a two-dimensional image, i.e., a two-dimensional spatio-temporal feature map. For example, the two dimensions are the feature domain and the time domain, respectively; each column of the image corresponds to one time stamp, and each element of that column corresponds to a feature value after dimensionality reduction.
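Assembling the two-dimensional spatio-temporal feature map can then be sketched as below, with one column per time stamp and one row per reduced feature dimension; the list-of-pairs input format is an assumption carried over from the earlier sketches.

```python
import numpy as np

def build_feature_map(stamped_features):
    """stamped_features: list of (time_stamp, reduced_feature_vector)."""
    stamped_features = sorted(stamped_features, key=lambda item: item[0])  # chronological order
    columns = [feature for _, feature in stamped_features]
    return np.stack(columns, axis=1)  # shape: (feature_dim, number_of_time_stamps)
```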
Returning to FIG. 1, the decision module 103 directly processes the spatio-temporal feature tensor to generate the final decision. The decision module 103 uses one or more convolutional layers, together with the corresponding pooling and non-linear layers, to analyze the spatio-temporal feature tensor, and feeds the result into a fully-connected network for a discriminative decision. The decision module performs time-domain feature fusion and high-level semantic feature extraction. The decision module 103 analyzes and infers on the spatio-temporal feature tensor; through training under supervised conditions, it acquires the ability to judge the type of behavior presented in the video.
The decision module looks for specific spatio-temporal patterns present in the original video data. Such a spatio-temporal pattern manifests itself in a particular form across multiple original video frames, and these manifestations are interrelated. Therefore, for each such spatio-temporal pattern there is a specific geometric feature corresponding to it in the spatio-temporal feature tensor. The role of the decision module is to extract this specific geometric feature from the spatio-temporal feature tensor and thereby complete the analysis and discrimination task for the video images.
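A decision network of the kind described here, operating on the two-dimensional spatio-temporal feature map treated as a one-channel image, might be sketched as follows in PyTorch. The layer sizes, the adaptive pooling step, and the number of output classes are assumptions for this example; training would be performed under supervision, for instance with a cross-entropy loss over the behavior classes.

```python
import torch.nn as nn

# Convolutional analysis of the spatio-temporal feature tensor followed by a
# fully-connected head that makes the discriminative decision.
decision_net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size summary regardless of map size
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 64),     # fully-connected layers for the final decision
    nn.ReLU(),
    nn.Linear(64, 10),             # e.g. 10 behaviour classes (an assumed number)
)
```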
FIG. 4 shows a flowchart of a video depth analysis method according to an embodiment of the present invention.
First, in step 410, features are extracted from the original video data through a feature network, and the extracted features are time-stamped.
In one embodiment of the invention, the feature network is a convolutional neural network that uses one or more convolutional layers, together with the corresponding pooling and non-linear layers, to analyze single-frame video data or locally adjacent multi-frame video data and to extract the spatial feature information of the video data corresponding to each local time. Specifically, extracting features from the original video data through the feature network includes: extracting, through a convolution operation, spatial local features from the single frame image or temporally adjacent multi-frame images to be processed; processing the input convolution results with a non-linear function; and down-sampling the single frame image or temporally adjacent multi-frame images to be processed.
In step 420, a spatio-temporal feature tensor is generated from the set of time-stamped features.
The spatio-temporal feature tensor may be a two-dimensional matrix, in which case the two dimensions represent the spatial feature dimension and the time dimension, respectively. It may also be a three-dimensional data cube, in which case the three dimensions represent the spatial feature dimension, the time dimension, and another feature dimension, such as a color feature. The spatio-temporal feature tensor may also be a data tensor of other dimensionality. Each point of this tensor corresponds to an information feature value at a specific moment and a specific position in the original video. Because it is a multi-dimensional tensor, it can be processed directly by a deep network to extract spatio-temporal features of different levels and different semantics.
Before the spatio-temporal feature tensor is generated from the set of time-stamped features, the method may further include: using a difference operation to emphasize the differences between features at different times; and/or reducing the dimensionality of the features, by linear or non-linear dimensionality reduction.
The difference can take many forms, including first-order, second-order, and higher-order differences. Whatever the form, the objects of the difference operation are the spatio-temporal features output by the feature network; differencing spatio-temporal features is fundamentally different from differencing the frames of the original images. The difference results are added to the spatio-temporal feature information set as new spatio-temporal features.
In step 430, the spatio-temporal feature tensor is processed by the decision module to generate the final decision.
In one embodiment of the invention, the decision module looks for specific spatio-temporal patterns present in the original video data. Such a spatio-temporal pattern manifests itself in a particular form across multiple original video frames, and these manifestations are interrelated. Therefore, for each such spatio-temporal pattern there is a specific geometric feature corresponding to it in the spatio-temporal feature tensor. The role of the decision module is to extract this specific geometric feature from the spatio-temporal feature tensor and thereby complete the analysis and discrimination task for the video images.
Although embodiments of the invention have been described above, it should be understood that they are presented by way of example only and not of limitation. It will be apparent to those skilled in the relevant art that various combinations, modifications, and changes can be made without departing from the spirit and scope of the invention. Therefore, the breadth and scope of the invention disclosed herein should not be limited by the exemplary embodiments disclosed above, but should be defined only in accordance with the appended claims and their equivalents.

Claims (12)

  1. A video depth analysis system, comprising:
    a feature network that extracts features from an original video and adds time stamps to the features; and
    a transfer module that generates a spatio-temporal feature tensor from the time-stamped features, the spatio-temporal feature tensor including a time dimension and a spatial feature dimension.
  2. The video depth analysis system of claim 1, further comprising a decision module that processes the spatio-temporal feature tensor to generate a final decision.
  3. The video depth analysis system of claim 2, wherein the decision module is a decision neural network comprising:
    one or more convolutional layers;
    a pooling layer and a non-linear layer corresponding to each convolutional layer; and
    a fully-connected network that receives the analysis results of the convolutional layers, pooling layers, and non-linear layers and makes a discriminative decision.
  4. The video depth analysis system of claim 1, wherein the feature network comprises a plurality of cascaded processing layers, each later processing layer further processing the result of the previous processing layer to extract higher-level features, and each processing layer comprising:
    a convolution layer for extracting, through a convolution operation, spatial local features from the single frame image or temporally adjacent multi-frame images to be processed;
    a non-linear excitation layer for processing the input convolution results with a non-linear function; and
    a pooling layer for down-sampling the single frame image or temporally adjacent multi-frame images to be processed.
  5. The video depth analysis system of claim 4, wherein the feature network comprises a time-stamping module for time-stamping the spatio-temporal features output by the processing layers, the time stamp being one or more of an absolute time, a relative time, and a frame number.
  6. The video depth analysis system of claim 1, wherein the feature network is a plurality of feature networks working in parallel, simultaneously processing single-frame or local multi-frame video data corresponding to multiple local times.
  7. The video depth analysis system of claim 1, wherein the transfer module comprises:
    a time-domain difference module that uses a difference operation to emphasize the differences between features at different times;
    a feature-domain dimensionality reduction module that reduces the dimensionality of the features; and
    a spatio-temporal feature tensor generation module that integrates the time-stamped features into a spatio-temporal feature tensor.
  8. A video depth analysis method, comprising:
    extracting features from original video data through a feature network, and time-stamping the extracted features;
    generating a spatio-temporal feature tensor from the set of time-stamped features, the spatio-temporal feature tensor including a time dimension and a spatial feature dimension; and
    processing the spatio-temporal feature tensor through a decision module to generate a final decision.
  9. The video depth analysis method of claim 8, wherein extracting features from the original video data through the feature network comprises:
    extracting, through a convolution operation, spatial local features from the single frame image or temporally adjacent multi-frame images to be processed;
    processing the input convolution results with a non-linear function; and
    down-sampling the single frame image or temporally adjacent multi-frame images to be processed.
  10. The video depth analysis method of claim 8, wherein the time stamp is one or more of an absolute time, a relative time, and a frame number.
  11. The video depth analysis method of claim 8, further comprising, before generating the spatio-temporal feature tensor from the set of time-stamped features:
    using a difference operation to emphasize the differences between features at different times; and/or
    reducing the dimensionality of the features.
  12. The video depth analysis method of claim 11, wherein the dimensionality of the features is reduced by linear or non-linear dimensionality reduction.
PCT/CN2018/102901 2018-07-27 2018-08-29 A method and system for video depth analysis WO2020019397A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810839597.7 2018-07-27
CN201810839597.7A CN108961317A (zh) 2018-07-27 A method and system for video depth analysis

Publications (1)

Publication Number Publication Date
WO2020019397A1 true WO2020019397A1 (zh) 2020-01-30

Family

ID=64465062

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102901 WO2020019397A1 (zh) 2018-07-27 2018-08-29 A method and system for video depth analysis

Country Status (2)

Country Link
CN (1) CN108961317A (zh)
WO (1) WO2020019397A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241673B (zh) * 2019-07-19 2022-11-22 浙江商汤科技开发有限公司 Video processing method and apparatus, electronic device, and storage medium
US20230124075A1 (en) * 2021-10-15 2023-04-20 Habib Hajimolahoseini Methods, systems, and media for computer vision using 2d convolution of 4d video data tensors

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326939A * 2016-08-31 2017-01-11 深圳市诺比邻科技有限公司 Parameter optimization method and system for a convolutional neural network
CN107909041A * 2017-11-21 2018-04-13 清华大学 Video recognition method based on a spatio-temporal pyramid network
CN108229527A * 2017-06-29 2018-06-29 北京市商汤科技开发有限公司 Training and video analysis method and apparatus, electronic device, storage medium, and program
CN108304806A * 2018-02-02 2018-07-20 华南理工大学 Gesture recognition method based on logarithmic path integral features and a convolutional neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506740B * 2017-09-04 2020-03-17 北京航空航天大学 Human behavior recognition method based on a three-dimensional convolutional neural network and a transfer learning model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326939A * 2016-08-31 2017-01-11 深圳市诺比邻科技有限公司 Parameter optimization method and system for a convolutional neural network
CN108229527A * 2017-06-29 2018-06-29 北京市商汤科技开发有限公司 Training and video analysis method and apparatus, electronic device, storage medium, and program
CN107909041A * 2017-11-21 2018-04-13 清华大学 Video recognition method based on a spatio-temporal pyramid network
CN108304806A * 2018-02-02 2018-07-20 华南理工大学 Gesture recognition method based on logarithmic path integral features and a convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, XUDONG ET AL.: "Review of Object Detection Based on Convolutional Neural Networks", APPLICATION RESEARCH OF COMPUTERS, vol. 34, no. 10, 31 October 2017 (2017-10-31), pages 2881, ISSN: 1001-3695 *

Also Published As

Publication number Publication date
CN108961317A (zh) 2018-12-07

Similar Documents

Publication Publication Date Title
Huang et al. An lstm approach to temporal 3d object detection in lidar point clouds
Chen et al. Research on image inpainting algorithm of improved GAN based on two-discriminations networks
Ge et al. An attention mechanism based convolutional LSTM network for video action recognition
CN107067011B (zh) 一种基于深度学习的车辆颜色识别方法与装置
Yang et al. Multi-view CNN feature aggregation with ELM auto-encoder for 3D shape recognition
CN107154023A (zh) 基于生成对抗网络和亚像素卷积的人脸超分辨率重建方法
CN108710906B (zh) 基于轻量级网络LightPointNet的实时点云模型分类方法
CN110728219A (zh) 基于多列多尺度图卷积神经网络的3d人脸生成方法
CN113870335A (zh) 一种基于多尺度特征融合的单目深度估计方法
CN114418030A (zh) 图像分类方法、图像分类模型的训练方法及装置
WO2020019397A1 (zh) 一种视频深度分析的方法与系统
Naeem et al. T-VLAD: Temporal vector of locally aggregated descriptor for multiview human action recognition
CN103268484A (zh) 用于高精度人脸识别的分类器设计方法
Zou et al. A new approach for small sample face recognition with pose variation by fusing Gabor encoding features and deep features
CN116311186A (zh) 一种基于改进Transformer模型的植物叶片病变识别方法
Bulat et al. Matrix and tensor decompositions for training binary neural networks
Tan et al. Performance comparison of three types of autoencoder neural networks
Li et al. Image super-resolution reconstruction based on multi-scale dual-attention
CN117972138A (zh) 预训练模型的训练方法、装置和计算机设备
Gao et al. Adaptive random down-sampling data augmentation and area attention pooling for low resolution face recognition
Li et al. HoloParser: Holistic visual parsing for real-time semantic segmentation in autonomous driving
Xu et al. JCa2Co: A joint cascade convolution coding network based on fuzzy regional characteristics for infrared and visible image fusion
CN117315241A (zh) 一种基于transformer结构的场景图像语义分割方法
CN112686830A (zh) 基于图像分解的单一深度图的超分辨率方法
CN116453025A (zh) 一种缺帧环境下融合时空信息的排球比赛群体行为识别方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18928007

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18928007

Country of ref document: EP

Kind code of ref document: A1