WO2020113886A1 - Behavior feature extraction method, system and device based on spatio-temporal and frequency-domain mixed learning - Google Patents

Behavior feature extraction method, system and device based on spatio-temporal and frequency-domain mixed learning

Info

Publication number
WO2020113886A1
Authority
WO
WIPO (PCT)
Prior art keywords
domain
feature map
local
behavior
spatio
Prior art date
Application number
PCT/CN2019/083357
Other languages
English (en)
French (fr)
Inventor
胡古月
崔波
余山
Original Assignee
中国科学院自动化研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院自动化研究所
Publication of WO2020113886A1 publication Critical patent/WO2020113886A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Definitions

  • the invention belongs to the field of behavior recognition, and in particular relates to a method, system and device for extracting behavior features based on mixed learning in the spatio-temporal frequency domain.
  • Behavior recognition has a wide range of applications in the fields of intelligent monitoring, human-computer interaction, and automatic driving.
  • Behavior recognition includes behavior classification and behavior detection; specifically, behavior videos based on RGB, depth, and skeleton information collected with dedicated acquisition equipment are classified, localized, and detected.
  • Skeleton-based behavior recognition has attracted extensive interest in academia and industry in recent years because of its small computational overhead, concise representation, and robustness to changes in environment and appearance. Specifically, skeleton behavior recognition realizes behavior recognition based on a video sequence composed of the 2D or 3D coordinates of the joint points of the target object in the environment.
  • Existing skeleton behavior recognition methods mainly use a local network with only a local affinity field stacked in the space-time domain to hierarchically extract the space-time features of the behavior sequence, and then recognize and detect the behavior.
  • Behaviors such as clapping, tooth-brushing, and handshaking are rich in inherently discriminative frequency features, yet existing methods are limited to mining spatio-temporal patterns and ignore the frequency-domain patterns inherent in the behavior. Moreover, the hierarchical stacking of local networks in the spatio-temporal domain means that semantic information can only be extracted at the upper layers while detail information is mainly extracted at the bottom layers; detail information and semantic information cannot be extracted and fused at the same time, which is not conducive to mining effective behavior features, so skeleton behavior recognition accuracy is low and cannot meet requirements.
  • the present invention provides a behavior feature extraction method based on spatio-temporal frequency domain mixed learning, including:
  • Step S1 Obtain a skeleton-based video behavior sequence as the original video behavior sequence, perform a spatio-temporal domain adaptive transformation to obtain a first spatio-temporal domain behavior feature map;
  • Step S2 Send the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection and then inversely transform back to the spatio-temporal domain, and add the first spatio-temporal domain behavior feature map in a residual manner to obtain a second spatio-temporal domain behavior feature map;
  • Step S3 Perform local and non-local reasoning on the second spatio-temporal domain behavior feature map simultaneously, and add the first spatio-temporal domain behavior feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map;
  • Step S4 Perform high-level local reasoning on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
  • Step S5 Globally pool the fourth time-space domain behavior feature map to obtain a behavior feature vector.
  • step S1 "time-space adaptive transformation"
  • the steps are:
  • Step S11 A convolutional network with a kernel size of 1 or a fully connected network is used to adaptively augment the coordinate system of the original video behavior sequence under K oblique coordinate systems, obtaining augmented video behavior sequences under K coordinate systems, where K is a hyperparameter.
  • Step S12 A convolutional network with a kernel size of 1 or a fully connected network is used to transform the number of joints and the joint arrangement order of the skeletons in the augmented video behavior sequence, obtaining the feature map of the augmented, optimized video behavior sequence containing structural information, which is the first spatio-temporal domain behavior feature map.
  • step S2 "send the first time-space domain behavior feature map into the frequency domain for frequency selection and then inversely transform back to the time-space domain, and add the first time-space domain behavior feature map in a residual manner. ", the steps are:
  • Step S21 the two-dimensional discrete Fourier transform is used to transform the feature map of each channel into the frequency domain, including the sine frequency domain feature map and the cosine frequency domain feature map;
  • the two-dimensional discrete fast Fourier transform can be used to realize the feature map transformation.
  • step S22 the sine frequency domain feature map and the cosine frequency domain feature map are respectively passed through an attention network to learn sine component attention weights and cosine component attention weights;
  • the attention network includes a channel average layer, two fully connected layers, a softmax function and a channel replication layer.
  • step S23 the learned sine component attention weight and sine frequency domain feature map are used for point multiplication, and the cosine component attention weight and cosine frequency domain feature map are point multiplied to obtain the frequency-selected sine and cosine frequency domain feature map.
  • Step S24 Transform the sine and cosine frequency domain feature maps into the space-time domain using a two-dimensional discrete Fourier inverse transform, and add the first space-time domain behavior feature map in a residual manner to obtain a second space-time domain behavior feature map;
  • the inverse transformation of the feature map can be implemented using the two-dimensional inverse fast Fourier transform.
  • step S3 "synchronously performing local and non-local reasoning on the second spatiotemporal domain behavior feature map"
  • the steps are:
  • Step S31 construct a neural network sub-module y_i with a local affinity field and a neural network sub-module y′_i with a non-local affinity field:
  • y_i = (1/Z_i(X)) Σ_{j∈δ_i} A(x_i, x_j) g(x_j)
  • y′_i = (1/Z_i(X)) Σ_{j∈Ω} A(x_i, x_j) g(x_j)
  • x_i represents the feature vector of the spatio-temporal domain feature map of the current layer of the network
  • y_i and y′_i represent the feature vectors of the spatio-temporal domain feature maps with local and non-local affinity fields of the next layer of the network, respectively
  • A(x_i, x_j) is the binary transformation matrix that computes the affinity between positions i and j
  • g(·) is the unary transformation function that computes the feature embedding of x_j, implemented by a convolutional layer with a kernel size of 1 or 1×1
  • Z_i(X) is the normalization factor, Ω enumerates all feature positions, and δ_i is the local neighbourhood.
  • the features extracted by the local and non-local affinity-field neural network sub-modules are superimposed with a learned weight to obtain a feature map; the feature map is batch-normalized to reduce feature drift, a nonlinear unit is introduced, and down-sampling then reduces the resolution of the feature map
  • Step S32 The M1 local and non-local affinity-field neural network sub-modules are used to compute the affinity between position i and its neighbours in the local neighbourhood δ_i and the affinities between i and all possible positions in Ω, where M1 is a natural number greater than or equal to 1;
  • Step S33 The feature map inferred by the M1 local and non-local affinity-field neural network sub-modules is added to the first spatio-temporal domain feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map.
  • step S4 "high-level local reasoning on the third spatio-temporal behavioral feature map"
  • the method is:
  • M2 constructed local affinity-field neural network sub-modules are used to compute the affinity between position i of the third spatio-temporal domain behavior feature map group and its neighbours in the local neighbourhood δ_i, where M2 is a natural number greater than or equal to 1;
  • the feature map after inference is the fourth time-space domain behavior feature map.
  • a method for extracting behavioral features based on spatio-temporal frequency-domain mixed learning includes:
  • The behavior sequences containing position and velocity are each processed with steps S1 to S5 according to any one of claims 1 to 5, obtaining a feature vector corresponding to velocity and a feature vector corresponding to position;
  • the feature vectors are spliced to obtain a spliced feature vector; the extracted behavior feature vectors are the velocity feature vector, the position feature vector, and the spliced feature vector.
  • a behavior feature extraction system based on spatio-temporal frequency-domain mixed learning, which includes a video sequence acquisition module, an adaptive transformation module, a frequency selection module, a local and non-local synchronous inference module, a high-level local inference module, a global pooling module, a stitching module, and an output module.
  • the video sequence acquisition module is configured to acquire a skeleton-based video behavior sequence as the original video behavior sequence;
  • the adaptive transformation module is configured to extract the first spatio-temporal domain behavior feature map in the spatio-temporal domain through augmented optimization;
  • the frequency selection module is configured to send the first spatio-temporal domain behavior feature map to the frequency-domain attention network for frequency selection, transform the obtained frequency-domain behavior feature map back to the spatio-temporal domain, and add it to the first spatio-temporal domain behavior feature map in a residual manner to obtain the second spatio-temporal domain behavior feature map;
  • the local and non-local synchronous inference module is configured to synchronously perform local and non-local inference on the second spatio-temporal domain behavior feature map and add the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain the third spatio-temporal domain behavior feature map;
  • the high-level local inference module is configured to perform high-level local inference on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
  • the global pooling module is configured to globally pool the fourth spatio-temporal domain behavior feature map group to obtain the corresponding behavior feature vectors;
  • the stitching module is configured to stitch the multi-channel features to obtain the corresponding stitched feature vector;
  • the output module is configured to output the extracted behavior feature vectors.
  • a storage device in which a plurality of programs are stored, and the programs are suitable to be loaded and executed by a processor to implement the above-described behavior feature extraction method based on spatio-temporal frequency domain mixed learning.
  • a processing device including a processor adapted to execute each program, and a storage device adapted to store multiple programs; the programs are adapted to be loaded and executed by the processor to implement the above behavior feature extraction method based on spatio-temporal frequency-domain mixed learning.
  • the present invention breaks through the limitation of using only deep networks to mine the spatio-temporal patterns of behavior skeleton sequences, fully exploits the discriminative frequency patterns inherent in the behavior, and uses an attention mechanism to allocate attention over frequency-domain features of the frequency-domain feature maps; through end-to-end learning, the network eventually learns to adaptively select effective frequency patterns.
  • the network module proposed by the invention, with synchronous local and non-local affinity fields, can extract and merge local details and global semantics synchronously at every layer; compared with traditional local networks, it can effectively reduce the number of layers and parameters of the network.
  • the adaptive transformation network proposed by the present invention can, through learning, transform a skeleton originally expressed in a single rectangular coordinate system into multiple oblique coordinate systems to obtain a richer representation; at the same time, the skeleton transformation network can also re-learn the optimal number of joints and joint arrangement order, so that, compared with the previous unstructured representation, more structured features can be learned, thereby improving the accuracy of feature extraction.
  • FIG. 1 is a schematic flowchart of a behavior feature extraction method based on mixed learning in the spatiotemporal frequency domain of the present invention
  • FIG. 2 is a schematic diagram of an overall framework of an embodiment of a method for behavior feature extraction based on mixed learning in the spatiotemporal frequency domain of the present invention
  • FIG. 3 is a schematic diagram of a frequency domain attention network structure of an embodiment of a behavior feature extraction method based on spatio-temporal frequency domain mixed learning of the present invention
  • FIG. 4 is a schematic diagram of a two-dimensional spatio-temporal non-local network plug-in according to an embodiment of a behavioral feature extraction method based on spatio-temporal frequency domain mixed learning of the present invention
  • FIG. 5 is a schematic diagram of a local network module of an embodiment of the behavior feature extraction method based on mixed learning in the spatio-temporal frequency domain of the present invention;
  • FIG. 6 is a schematic diagram of the local and non-local synchronous module of an embodiment of the behavior feature extraction method based on mixed learning in the spatio-temporal frequency domain of the present invention;
  • FIG. 7 is a schematic diagram of the affinity fields of the local and non-local synchronous module of an embodiment of the behavior feature extraction method based on mixed learning in the spatio-temporal frequency domain of the present invention.
  • Existing behavior recognition methods mainly stack, in the spatio-temporal domain, local networks with only local affinity fields to hierarchically extract the spatio-temporal features of the behavior sequence and then recognize and detect the behavior; they are limited to mining spatio-temporal patterns and ignore the frequency-domain patterns inherent in the behavior.
  • With such hierarchical stacking, semantic information can only be extracted at the upper layers, and detail information is mainly extracted at the bottom layers.
  • Detail information and semantic information cannot be fused synchronously, which is not conducive to mining effective behavioral features.
  • The technical scheme of the present invention adopts an attention mechanism to adaptively select effective frequency patterns in the frequency domain, and adopts a network with both local and non-local affinity fields in the spatio-temporal domain to perform spatio-temporal reasoning, so that the modules at every layer of the network can synchronously mine local details and non-local semantic information, effectively improving the accuracy of skeleton behavior feature extraction.
  • a method for extracting behavioral features based on mixed learning in the spatiotemporal frequency domain of the present invention includes:
  • Step S1 Obtain a skeleton-based video behavior sequence as the original video behavior sequence, perform a spatio-temporal domain adaptive transformation to obtain a first spatio-temporal domain behavior feature map;
  • Step S2 Send the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection and then inversely transform back to the spatio-temporal domain, and add the first spatio-temporal domain behavior feature map in a residual manner to obtain a second spatio-temporal domain behavior feature map;
  • Step S3 Perform local and non-local reasoning on the second spatio-temporal domain behavior feature map simultaneously, and add the first spatio-temporal domain behavior feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map;
  • Step S4 Perform high-level local reasoning on the third spatiotemporal domain behavior feature map to obtain a fourth spatiotemporal domain behavior feature map;
  • Step S5 Globally pool the fourth spatio-temporal domain behavior feature map to obtain a behavior feature vector.
  • An embodiment of a method for extracting behavioral features based on spatio-temporal frequency domain mixed learning includes steps S1 to S5. Each step is described in detail as follows:
  • Step S1 Obtain a skeleton-based video behavior sequence as the original video behavior sequence, perform a spatio-temporal domain adaptive transformation, and obtain a first spatio-temporal domain behavior feature map.
  • Step S11 Denote the original video behavior sequence as X, with dimensions C0*T0*N0, where C0 is the number of channels, T0 is the time dimension, and N0 is the number of spatial joint points;
  • a convolutional network with a kernel size of 1 or a fully connected network is used to adaptively augment the coordinate system of the original video behavior sequence under K oblique coordinate systems, obtaining augmented video behavior sequences under K coordinate systems, where K is a hyperparameter;
  • Step S12 A multi-layer fully connected network is used to transform the number of joints and the joint arrangement order of the skeletons in the augmented video behavior sequence, obtaining the feature map of the augmented, optimized video behavior sequence containing structural information, which is the first spatio-temporal domain behavior feature map X′, with dimensions C′*T′*N′, where C′ is the number of channels, T′ is the time dimension, and N′ is the number of spatial joint points.
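As an illustration only, a minimal PyTorch-style sketch of steps S11-S12 follows. The class name, layer sizes, default values of K, C′ and N′, and the use of 1×1 convolutions on permuted axes are assumptions made for this sketch, not the patent's concrete implementation:

```python
import torch
import torch.nn as nn

class AdaptiveTransform(nn.Module):
    """Sketch of step S1: coordinate-system augmentation plus joint re-arrangement."""
    def __init__(self, c0=3, n0=25, k=4, c_out=64, n_out=32):
        super().__init__()
        self.coord_aug = nn.Conv2d(c0, k * c0, kernel_size=1)        # K oblique coordinate systems
        self.joint_transform = nn.Conv2d(n0, n_out, kernel_size=1)   # re-learn joint number/order
        self.channel_mix = nn.Conv2d(k * c0, c_out, kernel_size=1)

    def forward(self, x):                          # x: (B, C0, T0, N0)
        x = self.coord_aug(x)                      # (B, K*C0, T0, N0)
        x = x.permute(0, 3, 2, 1)                  # put the joint axis in the channel position
        x = self.joint_transform(x)                # (B, N', T0, K*C0)
        x = x.permute(0, 3, 2, 1)                  # back to (B, K*C0, T0, N')
        return self.channel_mix(x)                 # (B, C', T0, N') — first feature map X'
```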
  • Step S2 Send the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection and then inversely transform back to the spatio-temporal domain, and add the first spatio-temporal domain behavior feature map in a residual manner to obtain a second spatio-temporal domain behavior feature map.
  • Step S21 The two-dimensional discrete Fourier transform (2D-DFT, 2D Discrete Fourier Transform) is used to transform the feature map of each channel into the frequency domain, denoted Y, as shown in equation (1):
  • Y(c, u, v) = Σ_{t=0}^{T-1} Σ_{n=0}^{N-1} X′(c, t, n)·e^{-i2π(ut/T + vn/N)}    (1)
  • where c, u, v denote the channel, temporal-frequency dimension and spatial-frequency dimension of the frequency-domain feature map; c, t, n denote the channel, time dimension and space dimension of the spatio-temporal domain feature map;
  • T is the total number of points in the time dimension of the first spatio-temporal domain feature map;
  • N is the total number of points in the spatial dimension of the frequency-domain feature map.
  • the two-dimensional discrete fast Fourier transform (2D-FFT, 2D-Fast Fourier Transformation) can be used to realize the feature map transformation.
  • The resulting frequency-domain feature map Y contains two components: a sine frequency-domain feature map F_sin and a cosine frequency-domain feature map F_cos.
  • Step S22 A frequency-domain attention network is constructed, as shown in FIG. 3, including a channel average layer, two fully connected layers, a softmax function and a channel replication layer.
  • The sine frequency-domain feature map F_sin and the cosine frequency-domain feature map F_cos are respectively passed through the attention network to learn the sine-component attention weight M_sin and the cosine-component attention weight M_cos.
  • Step S23 The learned sine attention weight M_sin is point-multiplied with the sine frequency-domain feature map F_sin, and the cosine-component attention weight M_cos is point-multiplied with the cosine frequency-domain feature map F_cos, selecting the discriminative frequency components, denoted F′_i, as shown in equation (2):
  • F′_i = F_i ⊙ M_i,  i ∈ {sin, cos}    (2)
  • Step S24 The two-dimensional inverse discrete Fourier transform (2D-IDFT, 2D Inverse Discrete Fourier Transform) is used to transform the sine and cosine frequency-domain feature maps back to the spatio-temporal domain, obtaining the spatio-temporal domain feature map X″, as shown in equation (3):
  • X″ = X′ + idft2(F′_sin + F′_cos),  X″ ∈ R^{C″×T″×N″}    (3)
  • where C″, T″ and N″ are the number of channels, the total number of points in the time dimension, and the total number of points in the spatial dimension of the spatio-temporal domain feature map X″.
  • For computational efficiency, the two-dimensional inverse fast Fourier transform (2D-IFFT, 2D Inverse Fast Fourier Transform) can be used to realize the inverse transformation of the feature map.
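For illustration only, a rough PyTorch-style sketch of the frequency-selection step is given below. The use of torch.fft, the layout of the small attention network, and the reduction size are assumptions about one way to realize the described components, not the patent's implementation:

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Sketch of steps S21-S24: 2D-DFT, sine/cosine attention, inverse transform, residual add."""
    def __init__(self, reduction=4):
        super().__init__()
        # two fully connected layers acting per frequency bin (a simplification)
        self.fc = nn.Sequential(nn.Linear(1, reduction), nn.ReLU(), nn.Linear(reduction, 1))

    def _attend(self, f):                          # f: (B, C, T, N) one frequency component map
        a = f.mean(dim=1, keepdim=True)            # channel average layer -> (B, 1, T, N)
        a = self.fc(a.unsqueeze(-1)).squeeze(-1)   # two fully connected layers
        a = torch.softmax(a.flatten(2), dim=-1).view_as(a)   # softmax over all frequency bins
        return a.expand_as(f)                      # channel replication layer

    def forward(self, x):                          # x: (B, C, T, N) = X'
        y = torch.fft.fft2(x)                      # 2D-DFT over the (T, N) axes, cf. eq. (1)
        f_cos, f_sin = y.real, y.imag              # cosine / sine frequency-domain feature maps
        f_cos = f_cos * self._attend(f_cos)        # eq. (2): point-wise product with attention weights
        f_sin = f_sin * self._attend(f_sin)
        x_sel = torch.fft.ifft2(torch.complex(f_cos, f_sin)).real   # inverse transform, cf. eq. (3)
        return x + x_sel                           # residual addition with X'
```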
  • Step S3 Perform local and non-local reasoning on the second spatio-temporal domain behavior feature map synchronously, and add the first spatio-temporal domain behavior feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map.
  • Step S31 Construct a neural network sub-module y_i with a local affinity field and a neural network sub-module y′_i with a non-local affinity field, as shown in equations (4) and (5):
  • y_i = (1/Z_i(X)) Σ_{j∈δ_i} A(x_i, x_j) g(x_j)    (4)
  • y′_i = (1/Z_i(X)) Σ_{j∈Ω} A(x_i, x_j) g(x_j)    (5)
  • where x_i represents the feature vector of the spatio-temporal domain feature map of the current layer of the network;
  • y_i and y′_i represent the feature vectors of the spatio-temporal domain feature maps with local and non-local affinity fields of the next layer of the network, respectively;
  • A(x_i, x_j) is the binary transformation matrix that computes the affinity between positions i and j;
  • g(·) is the unary transformation function that computes the feature embedding of x_j, implemented by a convolutional layer with a kernel size of 1 or 1×1;
  • Z_i(X) is the normalization factor, Ω enumerates all feature positions, and δ_i is the local neighbourhood.
  • The features extracted by the local and non-local affinity-field neural network sub-modules are superimposed with a learned weight, as shown in equation (6):
  • O = w·o_non-local + o_local    (6)
  • where O is the superimposed feature map;
  • o_non-local and o_local are the outputs of the local and non-local affinity-field neural network sub-modules of the same layer;
  • w is a linear transformation function, implemented by a convolutional layer with a kernel size of 1 or 1×1, used to measure the importance of the non-local component relative to the local component.
  • The obtained feature map is batch-normalized to reduce feature drift, a nonlinear unit is introduced, and down-sampling then reduces the resolution of the feature map.
  • Step S32 The M1 local and non-local affinity-field neural network sub-modules are used to compute the affinity between position i and its neighbours in the local neighbourhood δ_i and the affinities between i and all possible positions in Ω, where M1 is a natural number greater than or equal to 1.
  • Step S33 The feature map inferred by the M1 local and non-local affinity-field neural network sub-modules is added to the first spatio-temporal domain feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map.
  • The local network module is shown in Figure 5. It contains three plug-ins: a temporal local plug-in (tLocal), a spatial local plug-in (sLocal) and a spatio-temporal local plug-in (stLocal).
  • The convolution kernel sizes of the three plug-ins are k×1, 1×k and k×k, respectively.
  • The non-local network likewise contains three plug-ins, namely a temporal non-local plug-in (tNon-Local), a spatial non-local plug-in (sNon-Local) and a spatio-temporal non-local plug-in (stNon-Local); the concrete implementation of the two-dimensional spatio-temporal non-local plug-in (stNon-Local) is shown in Figure 4, in which
  • ψ, g and w are distinct convolutional layers with 1×1 kernels: ψ performs the affinity computation, g performs the linear transformation, and w measures the relative importance of the non-local component. The one-dimensional temporal non-local plug-in (tNon-Local) and the one-dimensional spatial non-local plug-in (sNon-Local) can be implemented in a similar way.
  • Combining the three plug-ins of the local network module with the three plug-ins of the non-local network module yields the local and non-local synchronous module (SLnL) shown in Figure 6; its corresponding affinity fields are shown in Figure 7.
  • After M1 local and non-local synchronous spatio-temporal network modules perform spatio-temporal domain reasoning, the affinity fields of the local sub-modules keep growing, the resolution of the feature maps keeps decreasing, and the semantic information has been well extracted.
  • Only the local spatio-temporal network module then needs to be used to mine high-level spatio-temporal pattern features.
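A loose PyTorch-style sketch of one synchronous local/non-local block is given below. The class names, the shared embedding for the affinity, the channel widths, and the choice of k×k stLocal convolution and max-pooling are assumptions layered on top of the text, not the patent's exact design:

```python
import torch
import torch.nn as nn

class NonLocal2D(nn.Module):
    """Sketch of a spatio-temporal non-local plug-in: affinities over all T*N positions."""
    def __init__(self, c):
        super().__init__()
        self.psi = nn.Conv2d(c, c, 1)              # affinity embedding (psi in Fig. 4)
        self.g = nn.Conv2d(c, c, 1)                # unary transformation g
        self.w = nn.Conv2d(c, c, 1)                # weighs the non-local component, cf. eq. (6)

    def forward(self, x):                           # x: (B, C, T, N)
        b, c, t, n = x.shape
        q = self.psi(x).flatten(2)                  # (B, C, T*N)
        aff = torch.softmax(q.transpose(1, 2) @ q, dim=-1)   # (B, TN, TN) normalized affinities
        v = self.g(x).flatten(2)                    # (B, C, TN)
        o = (v @ aff.transpose(1, 2)).view(b, c, t, n)
        return self.w(o)

class SLnLBlock(nn.Module):
    """Sketch of a synchronous local/non-local (SLnL) block: O = w*o_nonlocal + o_local."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.local = nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2)   # stLocal plug-in
        self.non_local = nn.Sequential(nn.Conv2d(c_in, c_out, 1), NonLocal2D(c_out))
        self.bn = nn.BatchNorm2d(c_out)             # batch normalization reduces feature drift
        self.act = nn.ReLU()                        # nonlinear unit
        self.pool = nn.MaxPool2d(2)                 # down-sampling reduces resolution

    def forward(self, x):
        o = self.local(x) + self.non_local(x)       # cf. eq. (6); w lives inside NonLocal2D
        return self.pool(self.act(self.bn(o)))
```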
  • Step S4 Perform high-level local reasoning on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map.
  • the method is as follows:
  • M2 constructed local affinity-field neural network sub-modules are used to compute the affinity between position i of the third spatio-temporal domain behavior feature map and its neighbours in the local neighbourhood δ_i, where M2 is a natural number greater than or equal to 1;
  • the feature map after this inference is the fourth spatio-temporal domain behavior feature map.
  • In total, M1 local and non-local synchronous spatio-temporal network modules and M2 local affinity-field neural sub-modules are used; C×T×N is a dimension notation indicating that the network input is a three-dimensional tensor composed of the channel dimension C, the time dimension T and the space dimension N;
  • C×TN and TN×TN denote two-dimensional matrices with dimensions C×TN and TN×TN, and the values of C, T and N differ across the individual sub-modules.
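As a toy illustration only (the number of blocks, channel widths and kernel size are arbitrary assumptions), the high-level local reasoning stage could simply stack M2 purely local convolutional blocks:

```python
import torch.nn as nn

def local_stack(c_in, c_out, m2=3, k=3):
    """M2 purely local affinity-field blocks for high-level spatio-temporal reasoning (step S4)."""
    layers, c = [], c_in
    for _ in range(m2):
        layers += [nn.Conv2d(c, c_out, kernel_size=k, padding=k // 2),  # local k x k affinity field
                   nn.BatchNorm2d(c_out),
                   nn.ReLU()]
        c = c_out
    return nn.Sequential(*layers)
```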
  • Step S5 Globally pool the fourth spatio-temporal domain behavior feature map to obtain a feature vector f_p.
  • The original skeleton-based video behavior sequence is differentiated along the time dimension to obtain velocity information, and a behavior sequence containing position and velocity is constructed.
  • The position and velocity behavior-sequence channels are each processed with steps S1 to S5 of any one of claims 1 to 5, obtaining the feature vector f_p corresponding to velocity and the feature vector f_v corresponding to position.
  • The feature vectors are spliced to obtain a spliced feature vector f_c;
  • the extracted behavior feature vectors are the velocity feature vector f_p, the position feature vector f_v and the spliced feature vector f_c.
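Purely as an illustration (the tensor layout and the pooled feature size are assumptions, not specified by the patent), the position/velocity two-stream construction and the feature splicing could look like:

```python
import torch
import torch.nn.functional as F

def build_streams(positions):
    """positions: (B, C0, T0, N0) joint coordinates; returns the position and velocity streams."""
    velocity = positions[:, :, 1:, :] - positions[:, :, :-1, :]   # temporal difference
    velocity = F.pad(velocity, (0, 0, 1, 0))                      # pad one frame to keep T0
    return positions, velocity

# each stream goes through the same S1-S5 pipeline (e.g. the module sketches above);
# the two pooled vectors are then spliced into the combined feature vector f_c
f_p, f_v = torch.randn(2, 256), torch.randn(2, 256)               # stand-ins for the pooled features
f_c = torch.cat([f_p, f_v], dim=-1)                               # spliced feature vector (B, 512)
```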
  • The feature vectors f_p, f_v and f_c are passed through the velocity, position and spliced-feature branches of the virtual multi-task network to obtain the prediction probabilities p_p, p_v and p_c of the behavior belonging to each category.
  • During training, the prediction losses L_p, L_v and L_c of the three branches are calculated from the prediction probabilities and the true behavior category.
  • The cross-entropy loss function is used for the calculation, as shown in equation (7):
  • L_i = -Σ_{k=1}^{N_C} b_k·log p_{i,k},  i ∈ {p, v, c}    (7)
  • where b is the one-hot label of the true behavior category and N_C is the total number of behavior categories.
  • The total loss of the multi-task network is shown in equation (8):
  • L = λ_p·L_p + λ_v·L_v + λ_c·L_c    (8)
  • where λ_p, λ_v and λ_c are three hyperparameters that control the weight of each information channel. The total loss is used to optimize the entire network until the optimum is reached.
  • The classification result is obtained based on the prediction probability p_c of the spliced channel; that is, the category with the largest prediction probability in p_c is directly taken as the behavior classification result output for the video behavior.
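For concreteness only (the branch weights below are placeholder values, and the branch heads themselves are assumed rather than specified), a sketch of the multi-task training objective and the spliced-channel inference:

```python
import torch.nn.functional as F

def multitask_loss(logits_p, logits_v, logits_c, labels, lam_p=0.3, lam_v=0.3, lam_c=1.0):
    """Weighted sum of the three branch cross-entropy losses, cf. eqs. (7)-(8)."""
    loss_p = F.cross_entropy(logits_p, labels)     # velocity branch loss L_p
    loss_v = F.cross_entropy(logits_v, labels)     # position branch loss L_v
    loss_c = F.cross_entropy(logits_c, labels)     # spliced branch loss L_c
    return lam_p * loss_p + lam_v * loss_v + lam_c * loss_c

def predict(logits_c):
    """At test time only the spliced channel is used: take the arg-max class of p_c."""
    return logits_c.argmax(dim=-1)
```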
  • a behavior feature extraction system based on mixed learning in the spatio-temporal frequency domain according to the third embodiment of the present invention includes a video sequence acquisition module, an adaptive transformation module, a frequency selection module, a local and non-local synchronous inference module, a high-level local inference module, a global pooling module, a stitching module, a multi-task network module and an output module;
  • the video sequence acquisition module is configured to acquire a skeleton-based video behavior sequence as the original video behavior sequence
  • the adaptive transformation module is configured to extract the first spatio-temporal behavior characteristic map in the spatio-temporal domain through augmented optimization
  • the frequency selection module is configured to send the first spatio-temporal domain behavior feature map to the frequency-domain attention network for frequency selection, transform the obtained frequency-domain behavior feature map back to the spatio-temporal domain, and add it to the first spatio-temporal domain behavior feature map in a residual manner to obtain the second spatio-temporal domain behavior feature map;
  • the local and non-local synchronous inference module is configured to synchronously perform local and non-local inference on the second spatio-temporal domain behavior feature map and add the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain the third spatio-temporal domain behavior feature map;
  • the high-level local inference module is configured to perform high-level local inference on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
  • the global pooling module is configured to globally pool the fourth spatio-temporal domain behavior feature map group to obtain the corresponding behavior feature vectors;
  • the stitching module is configured to stitch the multi-channel features to obtain the corresponding stitched feature vector;
  • the output module is configured to output the extracted behavior feature vectors.
  • The behavior feature extraction system based on mixed learning in the spatio-temporal and frequency domains provided in the above embodiment is illustrated only with the division into the above functional modules.
  • In practical applications, the above functions can be allocated to different functional modules as required;
  • that is, the modules or steps in the embodiments of the present invention can be further decomposed or combined.
  • For example, the modules of the above embodiment can be merged into one module, or further split into multiple sub-modules, to complete all or part of the functions described above.
  • The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the individual modules or steps and are not to be regarded as improper limitations of the present invention.
  • a storage device of a fourth example of the present invention wherein a plurality of programs are stored, the programs are suitable to be loaded and executed by a processor to implement the above-described behavior feature extraction method based on spatio-temporal frequency domain mixed learning.
  • a processing device of a fifth example of the present invention includes a processor and a storage device; the processor is adapted to execute each program; the storage device is adapted to store multiple programs; the programs are adapted to be loaded and executed by the processor to implement the above-described behavior feature extraction method based on spatio-temporal frequency-domain mixed learning.
  • the term "spatio-temporal frequency domain" refers to the "spatio-temporal domain" together with the "frequency domain";
  • the spatio-temporal domain is a coordinate system that describes the relationship of a mathematical function or physical signal to pure time, pure space, or space-time;
  • the frequency domain is a coordinate system used to describe the frequency characteristics of a signal.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

A behavior feature extraction method, system and device based on spatio-temporal and frequency-domain mixed learning. The method includes: obtaining a skeleton-based video behavior sequence and extracting a spatio-temporal domain behavior feature map through a transformation network; feeding it into a frequency-domain attention network for frequency selection, inversely transforming it back to the spatio-temporal domain, and adding it to the spatio-temporal domain behavior feature map; performing local and non-local reasoning synchronously, followed by high-level local reasoning; and globally pooling the resulting spatio-temporal domain behavior feature map to obtain the behavior feature vector of the video behavior sequence.

Description

Behavior feature extraction method, system and device based on spatio-temporal and frequency-domain mixed learning
Technical Field
The present invention belongs to the field of behavior recognition, and in particular relates to a behavior feature extraction method, system and device based on spatio-temporal and frequency-domain mixed learning.
Background Art
Behavior recognition has a wide range of applications in fields such as intelligent surveillance, human-computer interaction and autonomous driving. Behavior recognition includes behavior classification and behavior detection; specifically, behavior videos based on RGB, depth, skeleton and other information, collected with dedicated acquisition equipment, are classified, localized and detected. Skeleton-based behavior recognition has attracted extensive interest in academia and industry in recent years because of its small computational overhead, concise representation, and robustness to changes in environment and appearance. Specifically, skeleton behavior recognition realizes the recognition of behaviors from a video sequence composed of the 2D or 3D coordinates of the joint points of the target object in the environment.
Existing skeleton behavior recognition methods mainly stack, in the spatio-temporal domain, local networks with only local affinity fields to hierarchically extract the spatio-temporal features of the behavior sequence and then recognize and detect the behavior. Behaviors such as clapping, tooth-brushing and handshaking are rich in inherently discriminative frequency features, while existing methods are limited to mining spatio-temporal patterns and ignore the frequency-domain patterns inherent in the behavior; moreover, the previous hierarchical stacking of local networks in the spatio-temporal domain means that semantic information can only be extracted at upper layers while detail information is mainly extracted at bottom layers, so detail information and semantic information cannot be extracted and fused synchronously, which is not conducive to mining effective behavior features and results in low skeleton behavior recognition accuracy that cannot meet requirements.
Summary of the Invention
In order to solve the above problem in the prior art, namely the problem of low behavior feature extraction accuracy, the present invention provides a behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning, comprising:
Step S1, obtaining a skeleton-based video behavior sequence as the original video behavior sequence, and performing a spatio-temporal domain adaptive transformation to obtain a first spatio-temporal domain behavior feature map;
Step S2, sending the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection, inversely transforming it back to the spatio-temporal domain, and adding it to the first spatio-temporal domain behavior feature map in a residual manner to obtain a second spatio-temporal domain behavior feature map;
Step S3, synchronously performing local and non-local reasoning on the second spatio-temporal domain behavior feature map, and adding the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map;
Step S4, performing high-level local reasoning on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
Step S5, globally pooling the fourth spatio-temporal domain behavior feature map to obtain a behavior feature vector.
In some preferred embodiments, the "spatio-temporal domain adaptive transformation" in step S1 comprises the following steps:
Step S11, using a convolutional network with a kernel size of 1 or a fully connected network to adaptively augment the coordinate system of the original video behavior sequence under K oblique coordinate systems, obtaining augmented video behavior sequences under K coordinate systems, where K is a hyperparameter.
Step S12, using a convolutional network with a kernel size of 1 or a fully connected network to transform the number of joints and the joint arrangement order of the skeletons in the augmented video behavior sequence, obtaining the feature map of the augmented, optimized video behavior sequence containing structural information, which is the first spatio-temporal domain behavior feature map.
In some preferred embodiments, "sending the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection, inversely transforming it back to the spatio-temporal domain, and adding it to the first spatio-temporal domain behavior feature map in a residual manner" in step S2 comprises the following steps:
Step S21, using the two-dimensional discrete Fourier transform to transform the feature map of each channel into the frequency domain, comprising a sine frequency-domain feature map and a cosine frequency-domain feature map;
for computational efficiency, the two-dimensional discrete fast Fourier transform can be used to realize the feature map transformation.
Step S22, passing the sine frequency-domain feature map and the cosine frequency-domain feature map respectively through an attention network to learn a sine-component attention weight and a cosine-component attention weight;
the attention network includes a channel average layer, two fully connected layers, a softmax function and a channel replication layer.
Step S23, point-multiplying the learned sine-component attention weight with the sine frequency-domain feature map and the cosine-component attention weight with the cosine frequency-domain feature map to obtain the frequency-selected sine and cosine frequency-domain feature maps.
Step S24, using the two-dimensional inverse discrete Fourier transform to transform the sine and cosine frequency-domain feature maps back to the spatio-temporal domain and adding the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain the second spatio-temporal domain behavior feature map;
for computational efficiency, the two-dimensional inverse discrete fast Fourier transform can be used to realize the inverse transformation of the feature map.
In some preferred embodiments, "synchronously performing local and non-local reasoning on the second spatio-temporal domain behavior feature map" in step S3 comprises the following steps:
Step S31, constructing a neural network sub-module y_i with a local affinity field and a neural network sub-module y′_i with a non-local affinity field:
y_i = (1/Z_i(X)) Σ_{j∈δ_i} A(x_i, x_j) g(x_j)
y′_i = (1/Z_i(X)) Σ_{j∈Ω} A(x_i, x_j) g(x_j)
where x_i represents the feature vector of the spatio-temporal domain feature map of the current layer of the network; y_i and y′_i represent the feature vectors of the spatio-temporal domain feature maps with local and non-local affinity fields of the next layer of the network, respectively; A(x_i, x_j) is the binary transformation matrix that computes the affinity between positions i and j; g(·) is the unary transformation function that computes the feature embedding of x_j, implemented by a convolutional layer with a kernel size of 1 or 1×1; Z_i(X) is the normalization factor, Ω enumerates all feature positions, and δ_i is the local neighbourhood.
The features extracted by the local and non-local affinity-field neural network sub-modules are superimposed with a learned weight to obtain a feature map; the feature map is batch-normalized to reduce feature drift, a nonlinear unit is introduced, and down-sampling then reduces the resolution of the feature map;
Step S32, using M1 of the local and non-local affinity-field neural network sub-modules to compute the affinity between position i and its neighbours in the local neighbourhood δ_i and the affinities between i and all possible positions in Ω, where M1 is a natural number greater than or equal to 1;
Step S33, adding the feature map inferred by the M1 local and non-local affinity-field neural network sub-modules to the first spatio-temporal domain feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map.
In some preferred embodiments, the "high-level local reasoning on the third spatio-temporal domain behavior feature map" in step S4 is performed as follows:
using M2 constructed local affinity-field neural network sub-modules to compute the affinity between position i of the third spatio-temporal domain behavior feature map group and its neighbours in the local neighbourhood δ_i, where M2 is a natural number greater than or equal to 1; the feature map after this inference is the fourth spatio-temporal domain behavior feature map.
In another aspect, the present invention provides a behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning, comprising:
differentiating the original skeleton-based video behavior sequence along the time dimension to obtain velocity information, and constructing a behavior sequence containing position and velocity;
processing the behavior sequences containing position and velocity respectively with steps S1 to S5 according to any one of claims 1 to 5, obtaining a feature vector corresponding to velocity and a feature vector corresponding to position;
splicing the feature vectors to obtain a spliced feature vector, the extracted behavior feature vectors being the velocity feature vector, the position feature vector and the spliced feature vector.
In a third aspect, the present invention provides a behavior feature extraction system based on spatio-temporal and frequency-domain mixed learning, including a video sequence acquisition module, an adaptive transformation module, a frequency selection module, a local and non-local synchronous inference module, a high-level local inference module, a global pooling module, a stitching module and an output module;
the video sequence acquisition module is configured to acquire a skeleton-based video behavior sequence as the original video behavior sequence;
the adaptive transformation module is configured to extract the first spatio-temporal domain behavior feature map in the spatio-temporal domain through augmented optimization;
the frequency selection module is configured to send the first spatio-temporal domain behavior feature map into the frequency-domain attention network for frequency selection, transform the obtained frequency-domain behavior feature map back to the spatio-temporal domain, and add it to the first spatio-temporal domain behavior feature map in a residual manner to obtain the second spatio-temporal domain behavior feature map;
the local and non-local synchronous inference module is configured to synchronously perform local and non-local inference on the second spatio-temporal domain behavior feature map and add the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain the third spatio-temporal domain behavior feature map;
the high-level local inference module is configured to perform high-level local inference on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
the global pooling module is configured to globally pool the fourth spatio-temporal domain behavior feature map group to obtain the corresponding behavior feature vectors;
the stitching module is configured to stitch the multi-channel features to obtain the corresponding stitched feature vector;
the output module is configured to output the extracted behavior feature vectors.
In a fourth aspect, the present invention provides a storage device in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning.
In a fifth aspect, the present invention provides a processing device including a processor adapted to execute each program, and a storage device adapted to store a plurality of programs, the programs being adapted to be loaded and executed by the processor to implement the above behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning.
Beneficial effects of the present invention:
(1) The present invention breaks through the previous limitation of using deep networks only to mine the spatio-temporal patterns of behavior skeleton sequences, fully exploits the discriminative frequency patterns inherent in the behavior, and uses an attention mechanism to allocate attention over frequency-domain features of the frequency-domain feature maps; through end-to-end learning, the network eventually learns to adaptively select effective frequency patterns.
(2) Whereas previous local networks could only extract detail information and semantic information asynchronously, in low and high network layers respectively, the network module proposed by the present invention, with synchronous local and non-local affinity fields, can synchronously extract and fuse local details and global semantics at every layer; compared with traditional local networks, it can effectively reduce the number of layers and parameters of the network.
(3) In the adaptive transformation network proposed by the present invention, the coordinate transformation network can, through learning, transform a skeleton originally expressed in a single rectangular coordinate system into multiple oblique coordinate systems to obtain a richer representation; at the same time, the skeleton transformation network can also re-learn the optimal number of joints and joint arrangement order, so that, compared with the previous unstructured representation, more structured features can be learned, thereby improving the accuracy of feature extraction.
Brief Description of the Drawings
Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a schematic flowchart of the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention;
FIG. 2 is a schematic diagram of the overall framework of an embodiment of the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention;
FIG. 3 is a schematic diagram of the frequency-domain attention network structure of an embodiment of the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention;
FIG. 4 is a schematic diagram of the two-dimensional spatio-temporal non-local network plug-in of an embodiment of the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention;
FIG. 5 is a schematic diagram of the local network module of an embodiment of the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention;
FIG. 6 is a schematic diagram of the local and non-local synchronous module of an embodiment of the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention;
FIG. 7 is a schematic diagram of the affinity fields of the local and non-local synchronous module of an embodiment of the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention.
Detailed Description of the Embodiments
The present application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the relevant invention and do not limit the invention. It should also be noted that, for convenience of description, only the parts related to the relevant invention are shown in the drawings.
It should be noted that, where no conflict arises, the embodiments in the present application and the features in the embodiments can be combined with one another. The present application will be described in detail below with reference to the drawings and in conjunction with embodiments.
Existing behavior recognition methods mainly stack, in the spatio-temporal domain, local networks with only local affinity fields to hierarchically extract the spatio-temporal features of the behavior sequence and then recognize and detect the behavior; they are limited to mining spatio-temporal patterns and ignore the frequency-domain patterns inherent in the behavior, and, owing to the hierarchical stacking of local networks in the spatio-temporal domain, semantic information can only be extracted at upper layers while detail information is mainly extracted at bottom layers, so detail information and semantic information cannot be fused synchronously, which is not conducive to mining effective behavior features. The technical scheme of the present invention adopts an attention mechanism to adaptively select effective frequency patterns in the frequency domain, and adopts a network with both local and non-local affinity fields in the spatio-temporal domain to perform spatio-temporal reasoning, so that the modules at every layer of the network can synchronously mine local details and non-local semantic information, thereby effectively improving the accuracy of skeleton behavior feature extraction.
A behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention comprises:
Step S1, obtaining a skeleton-based video behavior sequence as the original video behavior sequence, and performing a spatio-temporal domain adaptive transformation to obtain a first spatio-temporal domain behavior feature map;
Step S2, sending the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection, inversely transforming it back to the spatio-temporal domain, and adding it to the first spatio-temporal domain behavior feature map in a residual manner to obtain a second spatio-temporal domain behavior feature map;
Step S3, synchronously performing local and non-local reasoning on the second spatio-temporal domain behavior feature map, and adding the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map;
Step S4, performing high-level local reasoning on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
Step S5, globally pooling the fourth spatio-temporal domain behavior feature map to obtain a behavior feature vector.
In order to describe the behavior recognition method based on spatio-temporal and frequency-domain mixed learning of the present invention more clearly, the steps of one embodiment of the method of the present invention are described in detail below with reference to FIG. 1 to FIG. 7.
A behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning according to one embodiment of the present invention comprises steps S1 to S5, each of which is described in detail as follows:
Step S1, obtaining a skeleton-based video behavior sequence as the original video behavior sequence, and performing a spatio-temporal domain adaptive transformation to obtain a first spatio-temporal domain behavior feature map.
Step S11, denote the original video behavior sequence as X, with dimensions C0*T0*N0, where C0 is the number of channels, T0 is the time dimension, and N0 is the number of spatial joint points;
a convolutional network with a kernel size of 1 or a fully connected network is used to adaptively augment the coordinate system of the original video behavior sequence under K oblique coordinate systems, obtaining augmented video behavior sequences under K coordinate systems, where K is a hyperparameter;
Step S12, a multi-layer fully connected network is used to transform the number of joints and the joint arrangement order of the skeletons in the augmented video behavior sequence, obtaining the feature map of the augmented, optimized video behavior sequence containing structural information, which is the first spatio-temporal domain behavior feature map X′, with dimensions C′*T′*N′, where C′ is the number of channels, T′ is the time dimension, and N′ is the number of spatial joint points.
Step S2, sending the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection, inversely transforming it back to the spatio-temporal domain, and adding it to the first spatio-temporal domain behavior feature map in a residual manner to obtain a second spatio-temporal domain behavior feature map.
Step S21, the two-dimensional discrete Fourier transform (2D-DFT, 2D Discrete Fourier Transform) is used to transform the feature map of each channel into the frequency domain, denoted Y, as shown in equation (1):
Y(c, u, v) = Σ_{t=0}^{T-1} Σ_{n=0}^{N-1} X′(c, t, n)·e^{-i2π(ut/T + vn/N)}    (1)
where c, u, v denote the channel, temporal-frequency dimension and spatial-frequency dimension of the frequency-domain feature map; c, t, n denote the channel, time dimension and space dimension of the spatio-temporal domain feature map; T is the total number of points in the time dimension of the first spatio-temporal domain feature map; N is the total number of points in the spatial dimension of the frequency-domain feature map.
For computational efficiency, the two-dimensional discrete fast Fourier transform (2D-FFT, 2D Fast Fourier Transform) can be used to realize the feature map transformation.
The resulting frequency-domain feature map Y contains two components: a sine frequency-domain feature map F_sin and a cosine frequency-domain feature map F_cos.
Step S22, a frequency-domain attention network is constructed, as shown in FIG. 3, including a channel average layer, two fully connected layers, a softmax function and a channel replication layer.
The sine frequency-domain feature map F_sin and the cosine frequency-domain feature map F_cos are respectively passed through the attention network to learn the sine-component attention weight M_sin and the cosine-component attention weight M_cos.
Step S23, the learned sine attention weight M_sin is point-multiplied with the sine frequency-domain feature map F_sin, and the cosine-component attention weight M_cos is point-multiplied with the cosine frequency-domain feature map F_cos, selecting the discriminative frequency components, denoted F′_i, as shown in equation (2):
F′_i = F_i ⊙ M_i,  i ∈ {sin, cos}    (2)
Step S24, the two-dimensional inverse discrete Fourier transform (2D-IDFT, 2D Inverse Discrete Fourier Transform) is used to transform the sine and cosine frequency-domain feature maps back to the spatio-temporal domain, obtaining the spatio-temporal domain feature map X″, as shown in equation (3):
X″ = X′ + idft2(F′_sin + F′_cos),  X″ ∈ R^{C″×T″×N″}    (3)
where C″, T″ and N″ are respectively the number of channels, the total number of points in the time dimension and the total number of points in the spatial dimension of the spatio-temporal domain feature map X″.
For computational efficiency, the two-dimensional inverse discrete fast Fourier transform (2D-IFFT, 2D Inverse Fast Fourier Transform) can be used to realize the inverse transformation of the feature map.
X″ is added to the first spatio-temporal domain behavior feature map in a residual manner to obtain the second spatio-temporal domain behavior feature map.
Step S3, synchronously performing local and non-local reasoning on the second spatio-temporal domain behavior feature map, and adding the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map.
Step S31, constructing a neural network sub-module y_i with a local affinity field and a neural network sub-module y′_i with a non-local affinity field, as shown in equations (4) and (5):
y_i = (1/Z_i(X)) Σ_{j∈δ_i} A(x_i, x_j) g(x_j)    (4)
y′_i = (1/Z_i(X)) Σ_{j∈Ω} A(x_i, x_j) g(x_j)    (5)
where x_i represents the feature vector of the spatio-temporal domain feature map of the current layer of the network; y_i and y′_i represent the feature vectors of the spatio-temporal domain feature maps with local and non-local affinity fields of the next layer of the network, respectively; A(x_i, x_j) is the binary transformation matrix that computes the affinity between positions i and j; g(·) is the unary transformation function that computes the feature embedding of x_j, implemented by a convolutional layer with a kernel size of 1 or 1×1; Z_i(X) is the normalization factor, Ω enumerates all feature positions, and δ_i is the local neighbourhood.
The features extracted by the local and non-local affinity-field neural network sub-modules are superimposed with a learned weight, as shown in equation (6):
O = w·o_non-local + o_local    (6)
where O is the superimposed feature map; o_non-local and o_local are the outputs of the local and non-local affinity-field neural network sub-modules of the same layer; w is a linear transformation function, implemented by a convolutional layer with a kernel size of 1 or 1×1, used to measure the importance of the non-local component relative to the local component.
The resulting feature map is batch-normalized to reduce feature drift, a nonlinear unit is introduced, and down-sampling then reduces the resolution of the feature map.
Step S32, using M1 of the local and non-local affinity-field neural network sub-modules to compute the affinity between position i and its neighbours in the local neighbourhood δ_i and the affinities between i and all possible positions in Ω, where M1 is a natural number greater than or equal to 1.
Step S33, adding the feature map inferred by the M1 local and non-local affinity-field neural network sub-modules to the first spatio-temporal domain feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map.
The local network prototype of this embodiment consists of three convolutional neural networks, with affinity matrix A(x_i, x_j) = 1 and g(x_i) being a linear transformation function. The local network module, shown in FIG. 5, contains three plug-ins: a temporal local plug-in (tLocal), a spatial local plug-in (sLocal) and a spatio-temporal local plug-in (stLocal), whose convolution kernel sizes are k×1, 1×k and k×k, respectively. Similarly, the non-local network also contains three plug-ins, namely a temporal non-local plug-in (tNon-Local), a spatial non-local plug-in (sNon-Local) and a spatio-temporal non-local plug-in (stNon-Local); the concrete implementation of the two-dimensional spatio-temporal non-local plug-in (stNon-Local) is shown in FIG. 4, in which ψ, g and w are distinct convolutional layers with 1×1 kernels: ψ performs the affinity computation, g performs the linear transformation, and w measures the relative importance of the non-local component. The one-dimensional temporal non-local plug-in (tNon-Local) and the one-dimensional spatial non-local plug-in (sNon-Local) can be implemented in a similar way. Combining the three plug-ins of the local network module with the three plug-ins of the non-local network module yields the local and non-local synchronous module (SLnL) shown in FIG. 6; its corresponding affinity fields are shown in FIG. 7.
After spatio-temporal domain reasoning by M1 local and non-local synchronous spatio-temporal network modules, the affinity fields of the local sub-modules keep growing, the resolution of the feature maps keeps decreasing, and the semantic information has been well extracted. Next, only the local spatio-temporal network module needs to be used to mine high-level spatio-temporal pattern features.
Step S4, performing high-level local reasoning on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map, as follows:
M2 constructed local affinity-field neural sub-modules are used to compute the affinity between position i of the third spatio-temporal domain behavior feature map and its neighbours in the local neighbourhood δ_i, where M2 is a natural number greater than or equal to 1; the feature map after this inference is the fourth spatio-temporal domain behavior feature map.
In total, M1 local and non-local synchronous spatio-temporal network modules and M2 local affinity-field neural sub-modules are used; C×T×N is a dimension notation indicating that the network input is a three-dimensional tensor composed of the channel dimension C, the time dimension T and the space dimension N; C×TN and TN×TN denote two-dimensional matrices with dimensions C×TN and TN×TN, and the values of C, T and N differ across the individual sub-modules.
Step S5, globally pooling the fourth spatio-temporal domain behavior feature map to obtain a feature vector f_p.
A behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning according to a second embodiment of the present invention comprises:
differentiating the original skeleton-based video behavior sequence along the time dimension to obtain velocity information, and constructing a behavior sequence containing position and velocity;
processing the position and velocity behavior-sequence channels respectively with steps S1 to S5 according to any one of claims 1 to 5, obtaining the feature vector f_p corresponding to velocity and the feature vector f_v corresponding to position;
splicing the feature vectors to obtain a spliced feature vector f_c, the extracted behavior feature vectors being the velocity feature vector f_p, the position feature vector f_v and the spliced feature vector f_c.
To further illustrate the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the present invention, the invention is further described below in connection with the application of the feature vectors to behavior classification:
The feature vectors f_p, f_v and f_c are passed through the velocity, position and spliced-feature branches of the virtual multi-task network to obtain the prediction probabilities p_p, p_v and p_c of the behavior belonging to each category. During the training stage, the prediction losses L_p, L_v and L_c of the three branches are calculated from the prediction probabilities and the true behavior categories. In this embodiment the cross-entropy loss function is used for the calculation, as shown in equation (7):
L_i = -Σ_{k=1}^{N_C} b_k·log p_{i,k},  i ∈ {p, v, c}    (7)
where b is the true one-hot category label of the behavior and N_C is the total number of behavior categories.
The total loss of the multi-task network is shown in equation (8):
L = λ_p·L_p + λ_v·L_v + λ_c·L_c    (8)
where λ_p, λ_v and λ_c are three hyperparameters that control the weight of each information channel. The total loss is used to optimize the entire network until the optimum is reached.
In the test (application) stage, the classification result is obtained only from the prediction probability p_c of the spliced channel; that is, the category with the largest prediction probability in p_c is directly taken as the behavior classification result output for the video behavior.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working process and related explanation of steps S1 to S5 of the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning of the second embodiment described above may refer to the corresponding process of the method steps of the aforementioned first embodiment, which will not be repeated here.
A behavior feature extraction system based on spatio-temporal and frequency-domain mixed learning according to a third embodiment of the present invention includes a video sequence acquisition module, an adaptive transformation module, a frequency selection module, a local and non-local synchronous inference module, a high-level local inference module, a global pooling module, a stitching module, a multi-task network module and an output module;
the video sequence acquisition module is configured to acquire a skeleton-based video behavior sequence as the original video behavior sequence;
the adaptive transformation module is configured to extract the first spatio-temporal domain behavior feature map in the spatio-temporal domain through augmented optimization;
the frequency selection module is configured to send the first spatio-temporal domain behavior feature map into the frequency-domain attention network for frequency selection, transform the obtained frequency-domain behavior feature map back to the spatio-temporal domain, and add it to the first spatio-temporal domain behavior feature map in a residual manner to obtain the second spatio-temporal domain behavior feature map;
the local and non-local synchronous inference module is configured to synchronously perform local and non-local inference on the second spatio-temporal domain behavior feature map and add the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain the third spatio-temporal domain behavior feature map;
the high-level local inference module is configured to perform high-level local inference on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
the global pooling module is configured to globally pool the fourth spatio-temporal domain behavior feature map group to obtain the corresponding behavior feature vectors;
the stitching module is configured to stitch the multi-channel features to obtain the corresponding stitched feature vector;
the output module is configured to output the extracted behavior feature vectors.
It should be noted that the behavior feature extraction system based on spatio-temporal and frequency-domain mixed learning provided in the above embodiment is illustrated only with the division into the above functional modules; in practical applications, the above functions can be allocated to different functional modules as required, that is, the modules or steps in the embodiments of the present invention can be further decomposed or combined; for example, the modules of the above embodiment can be merged into one module or further split into multiple sub-modules to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the individual modules or steps and are not to be regarded as improper limitations of the present invention.
A storage device according to a fourth example of the present invention stores a plurality of programs, the programs being adapted to be loaded and executed by a processor to implement the above behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning.
A processing device according to a fifth example of the present invention includes a processor and a storage device; the processor is adapted to execute each program; the storage device is adapted to store a plurality of programs; the programs are adapted to be loaded and executed by the processor to implement the above behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes and related explanations of the storage device and processing device described above may refer to the corresponding processes in the aforementioned method embodiments, which will not be repeated here.
Those skilled in the art should be aware that the modules and method steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software or a combination of the two; programs corresponding to software modules and method steps can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field. In order to clearly illustrate the interchangeability of electronic hardware and software, the composition and steps of each example have been described generally in terms of function in the above description. Whether these functions are executed in electronic hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The term "spatio-temporal frequency domain" refers to the "spatio-temporal domain" together with the "frequency domain"; the "spatio-temporal domain" is a coordinate system that describes the relationship of a mathematical function or physical signal to pure time, pure space or space-time, and the "frequency domain" is a coordinate system used to describe the frequency characteristics of a signal.
The terms "first", "second", etc. are used to distinguish similar objects and are not used to describe or indicate a specific order or sequence.
The term "comprising" or any other similar term is intended to cover a non-exclusive inclusion, so that a process, method, article or device/apparatus that includes a list of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device/apparatus.
So far, the technical solution of the present invention has been described with reference to the preferred embodiments shown in the drawings; however, those skilled in the art will readily understand that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will fall within the scope of protection of the present invention.

Claims (9)

  1. A behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning, characterized by comprising:
    Step S1, obtaining a skeleton-based video behavior sequence as the original video behavior sequence, and performing a spatio-temporal domain adaptive transformation to obtain a first spatio-temporal domain behavior feature map;
    Step S2, sending the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection, inversely transforming it back to the spatio-temporal domain, and adding it to the first spatio-temporal domain behavior feature map in a residual manner to obtain a second spatio-temporal domain behavior feature map;
    Step S3, synchronously performing local and non-local reasoning on the second spatio-temporal domain behavior feature map, and adding the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map;
    Step S4, performing high-level local reasoning on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
    Step S5, globally pooling the fourth spatio-temporal domain behavior feature map to obtain a behavior feature vector.
  2. The behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning according to claim 1, characterized in that the "spatio-temporal domain adaptive transformation" in step S1 comprises the following steps:
    Step S11, using a convolutional network or a fully connected network to adaptively augment the coordinate system of the original video behavior sequence under K oblique coordinate systems, obtaining augmented video behavior sequences under K coordinate systems, where K is a hyperparameter;
    Step S12, using a multi-layer fully connected network to transform the number of joints and the joint arrangement order of the skeletons in the augmented video behavior sequence, obtaining the feature map of the augmented, optimized video behavior sequence containing structural information, which is the first spatio-temporal domain behavior feature map.
  3. The behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning according to claim 1, characterized in that "sending the first spatio-temporal domain behavior feature map into the frequency domain for frequency selection, inversely transforming it back to the spatio-temporal domain, and adding it to the first spatio-temporal domain behavior feature map in a residual manner" in step S2 is performed as follows:
    Step S21, using the two-dimensional discrete Fourier transform to transform the feature map of each channel into the frequency domain, comprising a sine frequency-domain feature map and a cosine frequency-domain feature map;
    Step S22, passing the sine frequency-domain feature map and the cosine frequency-domain feature map respectively through an attention network to learn a sine-component attention weight and a cosine-component attention weight;
    the attention network including a channel average layer, two fully connected layers, a softmax function and a channel replication layer;
    Step S23, point-multiplying the learned sine-component attention weight with the sine frequency-domain feature map and the cosine-component attention weight with the cosine frequency-domain feature map to obtain the frequency-selected sine and cosine frequency-domain feature maps;
    Step S24, using the two-dimensional inverse discrete Fourier transform to transform the sine and cosine frequency-domain feature maps back to the spatio-temporal domain and adding the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain the second spatio-temporal domain behavior feature map.
  4. The feature extraction and recognition method based on spatio-temporal and frequency-domain mixed learning according to claim 1, characterized in that "synchronously performing local and non-local reasoning on the second spatio-temporal domain behavior feature map" in step S3 comprises the following steps:
    Step S31, constructing a neural network sub-module y_i with a local affinity field and a neural network sub-module y′_i with a non-local affinity field:
    y_i = (1/Z_i(X)) Σ_{j∈δ_i} A(x_i, x_j) g(x_j)
    y′_i = (1/Z_i(X)) Σ_{j∈Ω} A(x_i, x_j) g(x_j)
    where x_i represents the feature vector of the spatio-temporal domain feature map of the current layer of the network; y_i and y′_i represent the feature vectors of the spatio-temporal domain feature maps with local and non-local affinity fields of the next layer of the network, respectively; A(x_i, x_j) is the binary transformation matrix that computes the affinity between positions i and j; g(·) is the unary transformation function that computes the feature embedding of x_j, implemented by a convolutional layer with a kernel size of 1 or 1×1; Z_i(X) is the normalization factor, Ω enumerates all feature positions, and δ_i is the local neighbourhood;
    superimposing, with a learned weight, the features extracted by the local and non-local affinity-field neural network sub-modules to obtain a feature map, batch-normalizing the feature map to reduce feature drift, introducing a nonlinear unit, and then down-sampling to reduce the resolution of the feature map;
    Step S32, using M1 of the local and non-local affinity-field neural network sub-modules to compute the affinity between position i and its neighbours in the local neighbourhood δ_i and the affinities between i and all possible positions in Ω, where M1 is a natural number greater than or equal to 1;
    Step S33, adding the feature map inferred by the M1 local and non-local affinity-field neural network sub-modules to the first spatio-temporal domain feature map in a residual manner to obtain a third spatio-temporal domain behavior feature map.
  5. The behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning according to claim 4, characterized in that the "high-level local reasoning on the third spatio-temporal domain behavior feature map" in step S4 is performed as follows:
    using M2 constructed local affinity-field neural sub-modules to compute the affinity between position i of the third spatio-temporal domain behavior feature map group and its neighbours in the local neighbourhood δ_i, where M2 is a natural number greater than or equal to 1; the feature map after this inference is the fourth spatio-temporal domain behavior feature map.
  6. A behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning, characterized by comprising:
    differentiating the original skeleton-based video behavior sequence along the time dimension to obtain velocity information, and constructing a behavior sequence containing position and velocity;
    processing the position and velocity behavior-sequence channels respectively with steps S1 to S5 according to any one of claims 1 to 5, obtaining a feature vector corresponding to velocity and a feature vector corresponding to position;
    splicing the feature vectors to obtain a spliced feature vector, the extracted behavior feature vectors being the velocity feature vector, the position feature vector and the spliced feature vector.
  7. A behavior feature extraction system based on spatio-temporal and frequency-domain mixed learning, characterized by including a video sequence acquisition module, an adaptive transformation module, a frequency selection module, a local and non-local synchronous inference module, a high-level local inference module, a global pooling module, a stitching module and an output module;
    the video sequence acquisition module being configured to acquire a skeleton-based video behavior sequence as the original video behavior sequence;
    the adaptive transformation module being configured to extract the first spatio-temporal domain behavior feature map in the spatio-temporal domain through augmented optimization;
    the frequency selection module being configured to send the first spatio-temporal domain behavior feature map into the frequency-domain attention network for frequency selection, transform the obtained frequency-domain behavior feature map back to the spatio-temporal domain and add it to the first spatio-temporal domain behavior feature map to obtain the second spatio-temporal domain behavior feature map;
    the local and non-local synchronous inference module being configured to synchronously perform local and non-local inference on the second spatio-temporal domain behavior feature map and add the result to the first spatio-temporal domain behavior feature map in a residual manner to obtain the third spatio-temporal domain behavior feature map;
    the high-level local inference module being configured to perform high-level local inference on the third spatio-temporal domain behavior feature map to obtain a fourth spatio-temporal domain behavior feature map;
    the global pooling module being configured to globally pool the fourth spatio-temporal domain behavior feature map group to obtain the corresponding behavior feature vectors;
    the stitching module being configured to stitch the multi-channel features to obtain the corresponding stitched feature vector;
    the output module being configured to output the extracted behavior feature vectors.
  8. A storage device in which a plurality of programs are stored, characterized in that the programs are adapted to be loaded and executed by a processor to implement the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning according to any one of claims 1 to 6.
  9. A processing device, including
    a processor adapted to execute each program; and
    a storage device adapted to store a plurality of programs;
    characterized in that the programs are adapted to be loaded and executed by the processor to implement:
    the behavior feature extraction method based on spatio-temporal and frequency-domain mixed learning according to any one of claims 1 to 6.
PCT/CN2019/083357 2018-12-07 2019-04-19 Behavior feature extraction method, system and device based on spatio-temporal and frequency-domain mixed learning WO2020113886A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811494799.9 2018-12-07
CN201811494799.9A CN109711277B (zh) 2018-12-07 2018-12-07 Behavior feature extraction method, system and device based on spatio-temporal and frequency-domain mixed learning

Publications (1)

Publication Number Publication Date
WO2020113886A1 true WO2020113886A1 (zh) 2020-06-11

Family

ID=66254092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/083357 WO2020113886A1 (zh) 2018-12-07 2019-04-19 基于时空频域混合学习的行为特征提取方法、系统、装置

Country Status (2)

Country Link
CN (1) CN109711277B (zh)
WO (1) WO2020113886A1 (zh)


Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378208B (zh) * 2019-06-11 2021-07-13 杭州电子科技大学 一种基于深度残差网络的行为识别方法
CN110222653B (zh) * 2019-06-11 2020-06-16 中国矿业大学(北京) 一种基于图卷积神经网络的骨架数据行为识别方法
CN110287836B (zh) * 2019-06-14 2021-10-15 北京迈格威科技有限公司 图像分类方法、装置、计算机设备和存储介质
CN110516599A (zh) * 2019-08-27 2019-11-29 中国科学院自动化研究所 基于渐进式关系学习的群体行为识别模型及其训练方法
US11468680B2 (en) * 2019-08-27 2022-10-11 Nec Corporation Shuffle, attend, and adapt: video domain adaptation by clip order prediction and clip attention alignment
CN110826462A (zh) * 2019-10-31 2020-02-21 上海海事大学 一种非局部双流卷积神经网络模型的人体行为识别方法
CN115100740B (zh) * 2022-06-15 2024-04-05 东莞理工学院 一种人体动作识别和意图理解方法、终端设备及存储介质
CN117576467B (zh) * 2023-11-22 2024-04-26 安徽大学 一种融合频率域和空间域信息的农作物病害图像识别方法


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120058824A1 (en) * 2010-09-07 2012-03-08 Microsoft Corporation Scalable real-time motion recognition
US20160042227A1 (en) * 2014-08-06 2016-02-11 BAE Systems Information and Electronic Systems Integraton Inc. System and method for determining view invariant spatial-temporal descriptors for motion detection and analysis
US10509957B2 (en) * 2016-02-05 2019-12-17 University Of Central Florida Research Foundation, Inc. System and method for human pose estimation in unconstrained video
CN106056135B (zh) * 2016-05-20 2019-04-12 北京九艺同兴科技有限公司 一种基于压缩感知的人体动作分类方法
CN107330362B (zh) * 2017-05-25 2020-10-09 北京大学 一种基于时空注意力的视频分类方法
CN107680119A (zh) * 2017-09-05 2018-02-09 燕山大学 一种基于时空上下文融合多特征及尺度滤波的跟踪算法
CN108022254B (zh) * 2017-11-09 2022-02-15 华南理工大学 一种基于征点辅助的时空上下文目标跟踪方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292247A (zh) * 2017-06-05 2017-10-24 浙江理工大学 一种基于残差网络的人体行为识别方法及装置
CN108021889A (zh) * 2017-12-05 2018-05-11 重庆邮电大学 一种基于姿态外形和运动信息的双通道红外行为识别方法
CN108921087A (zh) * 2018-06-29 2018-11-30 国家计算机网络与信息安全管理中心 视频理解方法

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260774B (zh) * 2020-01-20 2023-06-23 北京百度网讯科技有限公司 生成3d关节点回归模型的方法和装置
CN111260774A (zh) * 2020-01-20 2020-06-09 北京百度网讯科技有限公司 生成3d关节点回归模型的方法和装置
CN111815604A (zh) * 2020-07-08 2020-10-23 讯飞智元信息科技有限公司 高炉风口监测方法、装置、电子设备和存储介质
CN112653899B (zh) * 2020-12-18 2022-07-12 北京工业大学 一种基于联合注意力ResNeSt的复杂场景下网络直播视频特征提取方法
CN112653899A (zh) * 2020-12-18 2021-04-13 北京工业大学 一种基于联合注意力ResNeSt的复杂场景下网络直播视频特征提取方法
CN113269218A (zh) * 2020-12-30 2021-08-17 威创集团股份有限公司 基于改进的vlad算法的视频分类方法
CN114913565B (zh) * 2021-01-28 2023-11-17 腾讯科技(深圳)有限公司 人脸图像检测方法、模型训练方法、装置及存储介质
CN114913565A (zh) * 2021-01-28 2022-08-16 腾讯科技(深圳)有限公司 人脸图像检测方法、模型训练方法、装置及存储介质
CN113516028A (zh) * 2021-04-28 2021-10-19 南通大学 一种基于混合注意力机制的人体异常行为识别方法及系统
CN113516028B (zh) * 2021-04-28 2024-01-19 南通大学 一种基于混合注意力机制的人体异常行为识别方法及系统
CN113468954A (zh) * 2021-05-20 2021-10-01 西安电子科技大学 基于多通道下局部区域特征的人脸伪造检测方法
CN113468954B (zh) * 2021-05-20 2023-04-18 西安电子科技大学 基于多通道下局部区域特征的人脸伪造检测方法
CN113177528B (zh) * 2021-05-27 2024-05-03 南京昊烽信息科技有限公司 基于多任务学习策略训练网络模型的车牌识别方法及系统
CN113177528A (zh) * 2021-05-27 2021-07-27 南京昊烽信息科技有限公司 基于多任务学习策略训练网络模型的车牌识别方法及系统
CN113408448A (zh) * 2021-06-25 2021-09-17 之江实验室 一种三维时空对象局部特征提取和对象识别的方法与装置
CN114039871B (zh) * 2021-10-25 2022-11-29 中山大学 一种蜂窝流量预测的方法、系统、装置及介质
CN114039871A (zh) * 2021-10-25 2022-02-11 中山大学 一种蜂窝流量预测的方法、系统、装置及介质
CN114004859A (zh) * 2021-11-26 2022-02-01 山东大学 基于多视图融合网络的超声心动左心房图分割方法及系统
CN115375980B (zh) * 2022-06-30 2023-05-09 杭州电子科技大学 基于区块链的数字图像的存证系统及其存证方法
CN115375980A (zh) * 2022-06-30 2022-11-22 杭州电子科技大学 基于区块链的数字图像的存证系统及其存证方法
CN117176270A (zh) * 2023-09-05 2023-12-05 浙江畅能数智科技有限公司 一种带信号监测功能的室分天线
CN117176270B (zh) * 2023-09-05 2024-03-19 浙江畅能数智科技有限公司 一种带信号监测功能的室分天线及其监测方法

Also Published As

Publication number Publication date
CN109711277A (zh) 2019-05-03
CN109711277B (zh) 2020-10-27

Similar Documents

Publication Publication Date Title
WO2020113886A1 (zh) 基于时空频域混合学习的行为特征提取方法、系统、装置
CN109614981A (zh) 基于斯皮尔曼等级相关的卷积神经网络的电力系统智能故障检测方法及系统
CN109902583B (zh) 一种基于双向独立循环神经网络的骨架手势识别方法
CN113095254B (zh) 一种人体部位关键点的定位方法及系统
CN112733656A (zh) 基于多流空间注意力图卷积sru网络的骨架动作识别方法
Chen et al. Pointgpt: Auto-regressively generative pre-training from point clouds
CN110390294A (zh) 一种基于双向长短期记忆神经网络的目标跟踪方法
Dong et al. Robotic grasp detection based on transformer
Li Hierarchical Edge Aware Learning for 3D Point Cloud
Shi et al. Object Detection Based on Swin Deformable Transformer-BiPAFPN-YOLOX
Wu et al. Cloud robot: semantic map building for intelligent service task
CN117218351A (zh) 基于局部和全局上下文感知的三维点云语义分割方法
Nikulchev et al. Identification of structural model for chaotic systems
Fu et al. SAGN: Semantic adaptive graph network for skeleton-based human action recognition
Zhang et al. Two-stage domain adaptation for infrared ship target segmentation
CN115273081A (zh) 基于自适应特征采样的点云语义分割方法及系统
Yan et al. A hybrid Siamese network with spatiotemporal enhancement and two-level feature fusion for remote sensing image change detection
Yu et al. A Light-Weighted Hypergraph Neural Network for Multimodal Remote Sensing Image Retrieval
BanTeng et al. Channel-wise dense connection graph convolutional network for skeleton-based action recognition
CN114067125A (zh) 基于全推理神经网络的目标检测方法、系统及装置
Song Contextual awareness service of internet of things user interaction mode in intelligent environment
Shi et al. DAHT-Net: Deformable Attention-Guided Hierarchical Transformer Network Based on Remote Sensing Image Change Detection
Xu et al. Scale‐Adaptive Kernel Correlation Filter with Maximum Posterior Probability Estimation and Combined Features for Visual Target Tracking
Li et al. Improve the performance of CenterNet through hybrid attention mechanism CBAM
Lei et al. Audio-Visual Scene Classification Based on Multi-modal Graph Fusion.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19893116

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19893116

Country of ref document: EP

Kind code of ref document: A1