CN106909938B

CN106909938B - Perspective-independent behavior recognition method based on deep learning network

Info

Publication number: CN106909938B
Application number: CN201710082263.5A
Authority: CN
Inventors: 王传旭; 胡国锋; 刘继超; 杨建滨; 孙海峰; 崔雪红; 李辉; 刘云
Original assignee: Qingdao University of Science and Technology
Current assignee: Qingdao Shengruida Technology Co ltd
Priority date: 2017-02-16
Filing date: 2017-02-16
Publication date: 2020-02-21
Anticipated expiration: 2037-02-16
Also published as: CN106909938A

Abstract

本发明提出一种基于深度学习网络的视角无关性行为识别方法，包括以下步骤：将某一视角下的视频帧图像录入，采用深度学习的方式进行底层特征提取和加工；对得到的底层特征进行建模，按时间顺序得到立方体模型；将所有视角的立方体模型转化为一个视角不变的柱体特征空间映射，后将其输入到分类器中进行训练，得到视频行为视角无关性分类器。本发明的技术方案采用深度学习网络对多视角下的人体行为进行分析，提升了分类模型的鲁棒性；尤其适合基于大数据进行训练、学习，能够很好地发挥出其的优点。The present invention provides a method for recognizing perspective-independent behavior based on a deep learning network, which includes the following steps: recording a video frame image from a certain perspective, and using deep learning to extract and process underlying features; Modeling, the cube model is obtained in chronological order; the cube model of all perspectives is transformed into a perspective-invariant cylindrical feature space map, and then it is input into the classifier for training, and the video behavior perspective-independent classifier is obtained. The technical scheme of the present invention adopts a deep learning network to analyze human behavior under multiple perspectives, which improves the robustness of the classification model; it is especially suitable for training and learning based on big data, and can well exert its advantages.

Description

Perspective-independent behavior recognition method based on deep learning network

技术领域technical field

本发明计算机视觉技术领域，特别是指一种基于深度学习网络的视角无关性行为识别方法。The present invention relates to the technical field of computer vision, in particular to a method for recognizing perspective-independent behavior based on a deep learning network.

背景技术Background technique

随着信息技术的飞速发展，计算机视觉伴随着VR、AR以及人工智能等概念的出现迎来了最好的发展时期，作为计算机视觉领域最重要的视频行为分析也越来越受到国内外学者的青睐。视频监控、人机交互、医疗看护、视频检索等一系列的领域中，视频行为分析占据了很大的比重。例如现在比较流行的无人驾驶汽车项目，视频行为分析非常具有挑战性。由于人体动作的复杂性和多样性的特点，再加上多个视角下人体自遮挡、多尺度以及视角旋转、平移等因素的影响，使得视频行为识别的难度非常大。如何能够精确地识别实际生活中多个角度下人体行为，并对人体行为进行分析，一直都是非常重要的研究课题，并且社会对行为分析的要求也越来越高。With the rapid development of information technology, computer vision has ushered in the best period of development with the emergence of concepts such as VR, AR, and artificial intelligence. As the most important video behavior analysis in the field of computer vision, it is increasingly being studied by scholars at home and abroad. favor. In a series of fields such as video surveillance, human-computer interaction, medical care, and video retrieval, video behavior analysis occupies a large proportion. For example, in the popular driverless car project, video behavior analysis is very challenging. Due to the complexity and diversity of human actions, coupled with the influence of human self-occlusion, multi-scale, and perspective rotation and translation from multiple perspectives, it is very difficult to recognize video behaviors. How to accurately identify human behavior from multiple angles in real life and analyze human behavior has always been a very important research topic, and the society has higher and higher requirements for behavior analysis.

传统的研究方法包含以下几种：Traditional research methods include the following:

基于时空特征点：对提取到的视频帧图像提取其中的时空特征点，然后时空特征点建模、分析，最后进行分类。Based on spatiotemporal feature points: extract the spatiotemporal feature points in the extracted video frame images, then model and analyze the spatiotemporal feature points, and finally classify them.

基于人体骨架：通过算法或者深度相机提取到人体骨架信息，然后通过对骨架信息进行描述、建模，继而对视频行为分类。Based on human skeleton: extract human skeleton information through algorithms or depth cameras, and then describe and model the skeleton information, and then classify video behaviors.

基于时空特征点和骨架信息的行为分析方法，在传统单视角下或者单人模式下取得了显著地成果，但是针对现在像大街、机场、车站等行人流量比较大的地区或者人体遮挡、光照变化、视角变换等一系列复杂问题的出现，单纯的使用这两种分析方法在实际生活中效果往往达不到人们的要求，有时算法的鲁棒性也很差。The behavior analysis method based on spatiotemporal feature points and skeleton information has achieved remarkable results in the traditional single-view or single-person mode, but for areas with high pedestrian traffic such as streets, airports, and stations, or human occlusion and illumination changes The appearance of a series of complex problems such as the transformation of perspective, the simple use of these two analysis methods in real life often fails to meet people's requirements, and sometimes the robustness of the algorithm is also very poor.

发明内容SUMMARY OF THE INVENTION

为了解决以上现有技术存在的缺陷，本发明提出一种基于深度学习网络的视角无关性行为识别方法，采用深度学习网络对多视角下的人体行为进行分析，提升分类模型的鲁棒性；尤其深度学习网络适合基于大数据进行训练、学习，能够很好地发挥出其的优点。In order to solve the above-mentioned defects in the prior art, the present invention proposes a method for recognizing perspective-independent behavior based on a deep learning network, which adopts a deep learning network to analyze human behaviors from multiple perspectives, so as to improve the robustness of the classification model; The deep learning network is suitable for training and learning based on big data, and can give full play to its advantages.

本发明的技术方案是这样实现的：The technical scheme of the present invention is realized as follows:

一种基于深度学习网络的视角无关性行为识别方法，包括利用训练样本集获得分类器的训练过程及利用分类器识别测试样本的识别过程；A perspective-independent behavior recognition method based on a deep learning network, comprising a training process of obtaining a classifier by using a training sample set and a recognition process of using the classifier to identify test samples;

所述训练过程包括以下步骤：The training process includes the following steps:

S1)将某一视角下的视频帧图像Image 1到Image i按照时间顺序进行输入；S1) the video frame images Image 1 to Image i under a certain viewing angle are input in chronological order;

S2)对步骤S1)输入的图像采用CNN(Convolutional Neural Network，卷积神经网络)进行底层特征提取并对其进行池化，将池化后的底层特征采用STN(Spatial TransformNetworks，空间转换网络)进行强化；S2) Use CNN (Convolutional Neural Network, convolutional neural network) to extract the underlying features of the image input in step S1) and pool them, and use STN (Spatial Transform Networks, spatial transformation network) for the pooled underlying features. strengthen;

S3)对步骤S2)强化后的特征图像(Feature Map)进行池化并输入RNN(RecurrentNeural Network，递归神经网络层)进行时间建模，获得时序关联的立方体模型；S3) pooling the enhanced feature image (Feature Map) in step S2) and inputting the RNN (Recurrent Neural Network, recurrent neural network layer) for time modeling to obtain a cube model associated with time series;

S4)重复步骤S1)至S3)得到多个视角下同一个行为的空间立方体模型，将各视角的空间立方体模型转化为一个视角不变的柱体特征空间映射，并将其作为该类行为的训练样本输入到分类器中进行训练；S4) Repeat steps S1) to S3) to obtain a space cube model of the same behavior under multiple viewing angles, convert the space cube model of each viewing angle into a cylindrical feature space map with a constant viewing angle, and use it as the model of this type of behavior The training samples are input into the classifier for training;

S5)重复以上各步骤，得到各种行为的视角无关性分类器；S5) Repeat the above steps to obtain perspective-independent classifiers of various behaviors;

所述识别过程包括以下步骤：The identification process includes the following steps:

S6)录入某一视角下的视频帧图像，采用上述步骤S1)至S3)对其进行底层特征提取和建模，得到该视角下的空间立方体模型；S6) input the video frame image under a certain angle of view, adopt above-mentioned steps S1) to S3) to carry out underlying feature extraction and modeling to it, obtain the space cube model under this angle of view;

S7)将步骤S6)得到的空间立方体模型转化为一个视角不变的柱体特征空间映射，并将其输入到分类器中进行识别得到视频行为类别。S7) Convert the space cube model obtained in step S6) into a perspective-invariant cylindrical feature space map, and input it into a classifier for identification to obtain a video behavior category.

上述技术方案中，步骤S2)优选采用三层卷积操作来提取底层特征；步骤S2)和步骤S3)优选采用最大池化方法对特征图像进行降维操作。In the above technical solution, step S2) preferably adopts a three-layer convolution operation to extract underlying features; steps S2) and S3) preferably adopt a maximum pooling method to perform dimension reduction operation on the feature image.

上述技术方案中，步骤S3)得到的是同一个行为某一个视角下的空间立方体模型，反复操作步骤S1)至S3)得到多个视角下同一个行为的空间立方体模型。In the above technical solution, step S3) obtains a space cube model of the same behavior under a certain viewing angle, and repeats steps S1) to S3) to obtain space cube models of the same behavior under multiple viewing angles.

本发明的技术方案中，优选采用LSTM网络(Long-Short Term Memory，简称LSTM)进行时间建模，因为深度学习网络的后向传播过程采用的是随机梯度下降法，采用LSTM中的特殊门操作，可以防止各层的梯度消失问题。In the technical solution of the present invention, the LSTM network (Long-Short Term Memory, LSTM for short) is preferably used for time modeling, because the back-propagation process of the deep learning network adopts the stochastic gradient descent method, and adopts the special gate operation in the LSTM. , which can prevent the gradient disappearance problem of each layer.

上述技术方案中，步骤S4)具体包括：In the above technical solution, step S4) specifically includes:

S41)重复操作步骤S1)至S3)，得到同一个行为各视角的空间立方体模型，并将其整合到以x，y，z为坐标轴的圆柱体空间中，圆柱体空间表示各视角下运动特征的轨迹描述；S41) Repeat the operation steps S1) to S3) to obtain the same spatial cube model of each viewing angle, and integrate it into the cylindrical space with x, y, z as the coordinate axes, and the cylindrical space represents the movement under each viewing angle Trajectory description of the feature;

S42)对步骤S41)得到的模型采用公式：S42) adopt the formula to the model obtained in step S41):

进行极坐标变换，得到角度不变的柱体空间映射。Perform polar coordinate transformation to obtain an angle-invariant cylinder space mapping.

上述技术方案中，还包括：S0)构建数据集，本发明优选采用IXMAS数据集。In the above technical solution, it also includes: S0) constructing a data set, and the present invention preferably adopts the IXMAS data set.

与现有技术相比较，本发明的技术方案有以下不同：Compared with the prior art, the technical scheme of the present invention has the following differences:

1、使用CNN的方法对底层特征进行特征提取，得到全局的特征而不是传统方法所得到的关键点。1. Use the CNN method to perform feature extraction on the underlying features to obtain global features instead of key points obtained by traditional methods.

2、使用STN方法对得到的全局特征进行特征强化，而不是对得到的特征直接进行建模。2. Use the STN method to perform feature enhancement on the obtained global features instead of directly modeling the obtained features.

3、使用LSTM网络对经过强化以及降维操作以后的全局特征进行时间建模，加入重要的时间信息，使其具有时间关联性。3. Use the LSTM network to model the global features after strengthening and dimensionality reduction operations, and add important time information to make it temporally relevant.

4、使用极坐标变换对同一个行为各视角的空间立方体模型进行坐标变换，得到角度不变的柱体空间映射，再由CNN完成训练和分类识别。4. Use polar coordinate transformation to perform coordinate transformation on the space cube model of the same behavior and different viewing angles to obtain a cylindrical space mapping with constant angle, and then complete the training and classification and recognition by CNN.

本发明的优点在于：使用CNN的方法得出的是全局高级特征，经过STN的特征强化，对实际生活中的视频具有很好的鲁棒性，然后使用RNN网络建立时间信息，最后经过极坐标变换对多视角中的不同特征进行融合，使用CNN对得到的角度不变的描述符进行训练与分类，而不用使用传统的骨架和关键点提取操作，全局特征得到的特征更全面；RNN网络获得帧间时间信息，使得行为描述地更加完整，适用性更强。The advantages of the present invention are: the global high-level features are obtained by using the CNN method, which has good robustness to the video in real life after the feature enhancement of STN, and then uses the RNN network to establish time information, and finally passes through the polar coordinates. The transformation fuses different features in multiple perspectives, and uses CNN to train and classify the obtained angle-invariant descriptors without using traditional skeleton and key point extraction operations, and the global features obtain more comprehensive features; RNN network obtains Inter-frame time information makes the behavior description more complete and more applicable.

附图说明Description of drawings

为了更清楚地说明本发明实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本发明的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only These are some embodiments of the present invention, and for those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

图1为本发明训练过程的流程示意图；1 is a schematic flowchart of a training process of the present invention;

图2为本发明识别过程的流程示意图；Fig. 2 is the schematic flow chart of the identification process of the present invention;

图3为一般人体行为识别流程示意图；Figure 3 is a schematic diagram of a general human behavior recognition process;

图4为简化的底层特征的提取与建模流程图；Fig. 4 is a simplified flow chart of extraction and modeling of underlying features;

图5为一般CNN的处理流程图；Fig. 5 is the processing flow chart of general CNN;

图6为一般RNN简化结构图；Figure 6 is a simplified structural diagram of a general RNN;

图7为LSTM框图；Figure 7 is a block diagram of LSTM;

图8为对各个视角进行融合分类的流程图；Fig. 8 is the flow chart of fusion classification to each angle of view;

图9为图8中的Motion History Volume经过极坐标变换以后的模型示意图。FIG. 9 is a schematic diagram of the model of the Motion History Volume in FIG. 8 after polar coordinate transformation.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

如图1及图2所示，本发明的基于深度学习网络的视角无关性行为识别方法，包括利用训练样本集获得分类器的训练过程及利用分类器识别测试样本的识别过程；As shown in FIG. 1 and FIG. 2 , the method for recognizing perspective-independent behavior based on a deep learning network of the present invention includes a training process for obtaining a classifier by using a training sample set and a recognition process for using the classifier to identify test samples;

所述训练过程如图1所示，包括以下步骤：The training process is shown in Figure 1 and includes the following steps:

S2)对步骤S1)输入的图像采用CNN进行底层特征提取并对其进行池化，将池化后的底层特征采用STN进行强化；S2) using CNN to extract and pool the underlying features of the image input in step S1), and using STN to strengthen the underlying features after the pooling;

S3)对步骤S2)强化后的特征图像进行池化并输入RNN进行时间建模，获得时序关联的立方体模型；S3) pooling the enhanced feature image in step S2) and inputting RNN for time modeling to obtain a cube model associated with time sequence;

S5)重复以上各步骤，得到各种行为的视角无关性分类器。S5) Repeat the above steps to obtain perspective-independent classifiers for various behaviors.

所述识别过程如图2所示，包括以下步骤：The identification process is shown in Figure 2 and includes the following steps:

上述技术方案中，还包括：S0)构建数据集。In the above technical solution, it also includes: S0) constructing a data set.

本发明优选采用IXMAS数据集，数据集包含五个不同视角、12个人每人14个动作，每个动作重复三次。使用其中的11个人作为训练数据集，剩余的1人作为测试数据集。The present invention preferably adopts the IXMAS data set, the data set includes five different perspectives, 12 people each with 14 actions, and each action is repeated three times. Use 11 of them as the training dataset and the remaining 1 as the test dataset.

具体的，例如要识别“跑步”这个行为，首先采集五种视角下12个人的跑步视频，其中11个人的跑步视频作为训练数据集，剩余1人作为验证数据集。首先将某一个人的一视角下的跑步视频帧图像按照上述步骤S1)至S3)进行操作，最终得到的是该视角下的“跑步”视频行为的时序关联的立方体模型，即在该视角下“跑步”行为的空间立方体模型；然后重复步骤S1)至S3)依次得到其他四种视角下“跑步”行为的空间立方体模型；将以上五种视角下“跑步”行为的空间立方体模型转化为一个视角不变的柱体特征空间映射，并将其作为这个人的“跑步”这种类别行为的训练样本，输入分类器训练；经过多次不同人的训练样本训练后，得到“跑步”行为的视角无关性分类器。同理，可以构建各种视频行为的视角无关性分类器。Specifically, for example, to identify the behavior of "running", first collect running videos of 12 people from five perspectives, of which 11 people's running videos are used as the training data set, and the remaining 1 person is used as the verification data set. First, operate the running video frame image from a perspective of a certain person according to the above steps S1) to S3), and finally obtain a cube model related to the time series of the "running" video behavior in this perspective, that is, in this perspective The space cube model of the "running" behavior; then repeat steps S1) to S3) to obtain the space cube model of the "running" behavior under the other four perspectives in turn; transform the space cube model of the "running" behavior under the above five perspectives into a The perspective-invariant cylindrical feature space mapping, and use it as the training sample of the person's "running" category of behavior, and input it into classifier training; after training with different people's training samples, the "running" behavior is obtained View-independent classifier. Similarly, view-independent classifiers for various video behaviors can be constructed.

当进行识别时，执行上述步骤S6)和S7)，首先将测试样本中的一个人的某一个视角下的视频帧图像按照上述步骤S1)至S3)进行操作，得到该视角下该行为的空间立方体模型，再经过极坐标变换转化为一个柱体特征空间映射，将其输入分类器中识别出行为类别。其他视角的识别过程与此同。When performing identification, the above steps S6) and S7) are performed, firstly, the video frame image under a certain viewing angle of a person in the test sample is operated according to the above steps S1) to S3), and the space of the behavior under the viewing angle is obtained. The cube model is then transformed into a cylindrical feature space map through polar coordinate transformation, which is input into the classifier to identify the behavior category. The identification process for other perspectives is the same.

为了更好地理解和阐述本发明的技术方案，以下通过对上述技术方案涉及到的有关技术进行详细讲解和分析。In order to better understand and illustrate the technical solutions of the present invention, the following describes and analyzes the related technologies involved in the above-mentioned technical solutions in detail.

本发明的方法模型包含两个主要阶段，一是对底层特征提取、建模；第二是对各个视角进行融合、分类，主要的创新工作如下。The method model of the present invention includes two main stages, one is to extract and model the underlying features; the second is to fuse and classify various perspectives, and the main innovative work is as follows.

人体行为识别一般的流程如图3所示，该图中特征提取与特征表示阶段是行为识别的重点，这一阶段的结果将最终影响识别的精确度，以及算法的鲁棒性，本发明采用了深度学习的方法进行特征提取。The general process of human behavior recognition is shown in Figure 3. In this figure, the feature extraction and feature representation stage is the focus of behavior recognition. The results of this stage will ultimately affect the accuracy of recognition and the robustness of the algorithm. The present invention adopts Feature extraction using deep learning methods.

如图4所示为简化的底层特征提取与建模流程图。Figure 4 shows a simplified low-level feature extraction and modeling flowchart.

本发明的技术方案中，采用的深度学习框架是Caffe，图4中的某一视角下的视频帧Image 1到Image i是按照时间顺序输入到网络中。首先对输入图像使用CNN进行特征提取，然后使用STN对特征进行强化，使其对平移、尺度变化、角度变化具有一定的鲁棒性，然后对特征图像(Feature Map)进行池化操作，这里采用的是最大池化方法，然后将经过池化操作的特征图像输入到RNN层中进行时间建模，最后得到带有帧间时间关联性的特征图像序列(Feature Maps Sequences)。In the technical solution of the present invention, the adopted deep learning framework is Caffe, and the video frames Image 1 to Image i under a certain viewing angle in FIG. 4 are input into the network in chronological order. First, use CNN to extract features from the input image, then use STN to enhance the features to make it robust to translation, scale changes, and angle changes, and then perform the pooling operation on the feature image (Feature Map). The method is the maximum pooling method, and then the feature images that have undergone the pooling operation are input into the RNN layer for temporal modeling, and finally a feature image sequence (Feature Maps Sequences) with temporal correlation between frames is obtained.

本发明的技术方案采用三层卷积操作来提取底层特征，然后通过最大池化方法对特征进行降维操作。将池化以后的特征图像输入到STN层中对特征进行强化操作，STN网络的功能是能够使得到的特征具有对平移、旋转和尺度变化具有鲁棒性。然后将STN输出的特征图像进行最大池化，再次进行降维处理，然后输入到RNN网络中使其置入时间信息，最后按时间顺序，将得到的Feature Maps组合成空间立方体。本发明中用到的RNN网络为LSTM网络，因为深度学习网络的后向传播过程采用的是随机梯度下降法，采用LSTM中的特殊门操作，可以防止各层的梯度消失问题。The technical scheme of the present invention adopts three-layer convolution operation to extract the underlying features, and then performs dimension reduction operation on the features through the maximum pooling method. The feature image after pooling is input into the STN layer to enhance the feature. The function of the STN network is to make the obtained feature robust to translation, rotation and scale changes. Then, the feature images output by STN are subjected to maximum pooling, and the dimensionality reduction process is performed again, and then input to the RNN network to place time information. Finally, the obtained Feature Maps are combined into space cubes in chronological order. The RNN network used in the present invention is the LSTM network, because the back propagation process of the deep learning network adopts the stochastic gradient descent method, and the special gate operation in the LSTM can prevent the gradient disappearance problem of each layer.

上述技术方案中，CNN是近年来发展起来的发展起来、并引起重视的高效识别方法。20世纪60年代，Hubel和Wiesel在研究猫脑皮层中用于局部敏感和方向选择的神经元时发现其独特的网络结构可以有效地降低反馈神经网络的复杂性，继而提出了CNN。现在，CNN已经成为众多科学领域的研究热点之一，特别是在模式分类领域，由于该网络避免了对图像的复杂前期预处理，可以直接输入原始图像，因而得到了更为广泛的应用。Among the above technical solutions, CNN is an efficient identification method that has been developed in recent years and has attracted attention. In the 1960s, Hubel and Wiesel discovered that their unique network structure can effectively reduce the complexity of the feedback neural network when they studied the neurons used for local sensitivity and direction selection in the cat cerebral cortex, and then proposed CNN. Now, CNN has become one of the research hotspots in many scientific fields, especially in the field of pattern classification, because the network avoids the complex pre-processing of the image and can directly input the original image, so it has been more widely used.

一般地，CNN的基本结构包括两层，其一为特征提取层，每个神经元的输入与前一层的局部接受域相连，并提取该局部的特征。一旦该局部特征被提取后，它与其它特征间的位置关系也随之确定下来；其二是特征映射层，网络的每个计算层由多个特征映射组成，每个特征映射是一个平面，平面上所有神经元的权值相等。Generally, the basic structure of CNN includes two layers, one of which is a feature extraction layer, the input of each neuron is connected to the local receptive field of the previous layer, and the local features are extracted. Once the local feature is extracted, the positional relationship between it and other features is also determined; the second is the feature mapping layer, each computing layer of the network consists of multiple feature maps, each feature map is a plane, All neurons in the plane have equal weights.

本发明的技术方案中就是使用特征映射层，提取视频帧图像中的全局底层特征，而后对底层特征进行更深层次的处理。In the technical scheme of the present invention, a feature mapping layer is used to extract the global underlying features in the video frame images, and then the underlying features are processed in a deeper level.

CNN的一般化处理流程如图5所示。The generalized processing flow of CNN is shown in Figure 5.

本发明的技术方案要使用的层就是在经过卷积以后得到的Feature Map，我们忽略后面的池化和全连接层。CNN得到的是单张图像的特征信息，而要处理的是视频信息，因此需要引入时间信息，所以单纯的使用CNN不能达到处理视频行为的要求。The layer to be used in the technical solution of the present invention is the Feature Map obtained after convolution, and we ignore the subsequent pooling and fully connected layers. What CNN obtains is the feature information of a single image, and the video information to be processed, so time information needs to be introduced, so the simple use of CNN cannot meet the requirements of processing video behavior.

上述技术方案中，RNN或者叫做循环神经网络是在前馈神经网络(Feed-forwardNeural Networks，简称FNNs)的基础上发展而来。不同于传统的FNNs，RNN引入了定向循环，能够处理那些输入之间前后关联的问题。RNN包含输入单元(Input units)，输入集标记为{x₀，x₁，…，x_t-1，x_t，x_t+1，…}，而输出单元(Output units)的输出集则被标记为{o₀，o₁，…，o_t-1，o_t，o_t+1，…}。RNN还包含隐含单元(Hidden units)，我们将其输出集标记为{s₀，s₁，…，s_t-1，s_t，S_t+1，…}，这些隐含单元完成了最为主要的工作。In the above technical solutions, RNN or recurrent neural network is developed on the basis of Feed-forward Neural Networks (FNNs for short). Unlike traditional FNNs, RNNs introduce directed loops, which are able to deal with the contextual correlation between those inputs. RNN contains input units (Input units), the input sets are labeled {x ₀ , x ₁ , ..., x _t-1 , x _t , x _t+1 , ...}, and the output units of the output units are Labeled {o ₀ , o ₁ , ..., o _t-1 , o _t , o _t+1 , ... }. The RNN also contains hidden units, and we label its output set as {s ₀ , s ₁ , …, s _t-1 , s _t , S _t+1 , …}, these hidden units complete the most Main job.

如图6所示为一般的RNN简化结构，图6中，有一条单向流动的信息流是从输入单元到达隐含单元的，与此同时另一条单向流动的信息流从隐含单元到达输出单元。在某些情况下，RNN会打破后者的限制，引导信息从输出单元返回隐含单元，这些被称为“BackProjections”，并且隐含层的输入还包括上一隐含层的状态，即隐含层内的节点可以自连也可以互连。因此，在隐含层就实现了时间信息的连接，不需要再额外的考虑时间信息的问题。这也是RNN在处理视频行为特征时的一大优势。因此，一般带有时序信息的处理，在深度学习中都是交给RNN来处理。Figure 6 shows the simplified structure of a general RNN. In Figure 6, there is a one-way flow of information from the input unit to the hidden unit, while another one-way flow of information arrives from the hidden unit output unit. In some cases, RNN will break the latter limitation and guide the information from the output unit back to the hidden unit, these are called "BackProjections", and the input of the hidden layer also includes the state of the previous hidden layer, that is, the hidden layer. Nodes within the containing layer can be self-connected or interconnected. Therefore, the connection of time information is realized in the hidden layer, and there is no need to additionally consider the problem of time information. This is also a big advantage of RNNs when dealing with video behavioral features. Therefore, the processing with time series information is generally handed over to RNN in deep learning.

在RNN的基础上又发展了一个新的处理时间信息的模型：长段时间记忆(Long-Short Term Memory，简称LSTM)。因为深度学习网络后向传播采用的随机梯度下降法，因此，RNN会出现一种梯度消失的问题，也就是后面时间的节点对于前面时间的节点感知力下降。所以LSTM引入一个核心元素就是Cell。LSTM的大致框图如图7所示。On the basis of RNN, a new model for processing time information is developed: Long-Short Term Memory (LSTM). Because of the stochastic gradient descent method used in the back-propagation of the deep learning network, RNN will have a problem of gradient disappearance, that is, the nodes at the later time are less sensitive to the nodes at the previous time. So a core element introduced by LSTM is Cell. The rough block diagram of LSTM is shown in Figure 7.

图8所示为对各个视角进行融合分类的流程图。FIG. 8 is a flow chart showing the fusion and classification of various perspectives.

按照图4的方法得到多个视角下同一个动作的空间立方体模型，然后将各视角的空间立方体模型整合到以x，y，z为坐标轴的圆柱体空间中，圆柱体空间表示各视角下运动特征的轨迹描述，然后使用数学方法进行极坐标变换，将之转化到r，θ，z坐标轴的空间中，公式如下所示：According to the method in Figure 4, the space cube model of the same action under multiple viewing angles is obtained, and then the space cube model of each viewing angle is integrated into the cylindrical space with x, y, and z as the coordinate axes. The trajectory description of the motion feature, and then use the mathematical method to perform polar coordinate transformation, and convert it into the space of r, θ, z coordinate axes, the formula is as follows:

然后得到角度不变的柱体空间映射(Invariant Cylinder Space Map)，最后将得到的柱体空间映射输入到分类器中，得到行为类别，这里使用CNN的方式进行分类，区别于SVM分类器，因为CNN最开始是用来分类使用的。图8中的Motion History Volume(运动历史柱)以及经过极坐标变换以后的模型如图9所示。Then, the Invariant Cylinder Space Map is obtained, and finally the obtained cylinder space map is input into the classifier to obtain the behavior category. Here, the CNN method is used for classification, which is different from the SVM classifier, because CNN was originally used for classification. The Motion History Volume in Figure 8 and the model after polar coordinate transformation are shown in Figure 9.

本发明的技术方案采用深度学习的方法提取到的底层信息比传统方法的时空特征点以及骨架信息更加高级并且骨棒性更好。The underlying information extracted by the deep learning method in the technical solution of the present invention is more advanced than the spatiotemporal feature points and skeleton information of the traditional method, and the bone stickiness is better.

以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included in the scope of the present invention. within the scope of protection.

Claims

1. a method for recognizing behavior that is independent of perspective based on a deep learning network, comprising using a training sample set to obtain a training process of a classifier and using a classifier to identify a test sample identification process; it is characterized in that:

The training process includes the following steps:

S1) the video frame images Image 1 to Image i under a certain viewing angle are input in chronological order;

S2) using CNN to extract and pool the underlying features of the image input in step S1), and using STN to strengthen the underlying features after the pooling;

S3) pooling the enhanced feature image in step S2) and inputting RNN for time modeling to obtain a cube model associated with time sequence;

S4) Repeat steps S1) to S3) to obtain a space cube model of the same behavior under multiple viewing angles, convert the space cube model of each viewing angle into a cylindrical feature space map with a constant viewing angle, and use it as the model of this type of behavior The training samples are input into the classifier for training;

S5) Repeat steps S1)-S4) for various behaviors of other different categories to obtain perspective-independent classifiers corresponding to various behaviors;

The identification process includes the following steps:

S6) input the video frame image under a certain angle of view, adopt above-mentioned steps S1) to S3) to carry out underlying feature extraction and modeling to it, obtain the space cube model under this angle of view;

S7) Convert the space cube model obtained in step S6) into a cylindrical feature space map, and input it into the classifier for identification to obtain the video behavior category.

2. the perspective-independent behavior identification method based on deep learning network according to claim 1, is characterized in that:

Step S2) adopts a three-layer convolution operation to extract the underlying features.

3. the perspective-independent behavior identification method based on deep learning network according to claim 2, is characterized in that:

Steps S2) and S3) use the maximum pooling method to perform dimension reduction operations on the feature images.

4. the perspective-independent behavior identification method based on deep learning network according to claim 1, is characterized in that:

Step S3) adopts LSTM network to perform temporal modeling.

5. the perspective-independent behavior identification method based on deep learning network according to claim 1, is characterized in that, step S4) specifically comprises:

S41) Repeat the operation steps S1) to S3) to obtain the same spatial cube model of each viewing angle, and integrate it into the cylindrical space with x, y, z as the coordinate axes, and the cylindrical space represents the movement under each viewing angle Trajectory description of the feature;

S42) adopt the formula to the model obtained in step S41):

Perform polar coordinate transformation to obtain an angle-invariant cylinder space mapping.

6. The perspective-independent behavior identification method based on a deep learning network according to claim 1, characterized in that, further comprising:

S0) Construct the dataset.