CN112580442B - Behavior identification method based on multi-dimensional pyramid hierarchical model - Google Patents

Behavior identification method based on multi-dimensional pyramid hierarchical model

Info

Publication number
CN112580442B
CN112580442B (application number CN202011398484.1A)
Authority
CN
China
Prior art keywords
pyramid
action
features
scale
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011398484.1A
Other languages
Chinese (zh)
Other versions
CN112580442A (en
Inventor
黄倩
李畅
陈斯斯
李兴
毛莺池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Huiying Electronic Technology Co ltd
Hohai University HHU
Original Assignee
Nanjing Huiying Electronic Technology Co ltd
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Huiying Electronic Technology Co ltd, Hohai University HHU filed Critical Nanjing Huiying Electronic Technology Co ltd
Priority to CN202011398484.1A priority Critical patent/CN112580442B/en
Publication of CN112580442A publication Critical patent/CN112580442A/en
Application granted granted Critical
Publication of CN112580442B publication Critical patent/CN112580442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135: Feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F 18/24: Classification techniques
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/50: Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G06V 10/507: Summing image-intensity values; Histogram projection analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior identification method based on a multi-dimensional pyramid hierarchical model. A multi-dimensional pyramid hierarchical model covering the spatial and temporal dimensions is constructed to model the behaviors in a video and capture structured multi-scale features, and behavior identification is then carried out by a classifier. The invention fully describes behavior characteristics at different scales across multiple dimensions, provides additional, more discriminative information for behavior identification, and effectively improves the accuracy and robustness of behavior identification.

Description

Behavior identification method based on multi-dimensional pyramid hierarchical model
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a behavior recognition method.
Background
Behavior recognition is one of the important research topics in the field of computer vision, with broad application prospects in intelligent security monitoring, novel human-computer interaction, intelligent traffic management, smart cities, smart homes, and the like. Early behavior recognition techniques were based primarily on RGB data acquired by ordinary cameras, and are susceptible to external factors such as the shooting environment, lighting conditions, and clothing texture. With the growing demand for intelligent behavior analysis, a series of behavior recognition technologies based on depth data, skeletal data, and multi-modal fusion have emerged, driven by big data and machine learning algorithms.
To construct a behavior recognition model based on depth video data, an intuitive approach is to extend the feature descriptors commonly used for RGB image videos to depth videos. To this end, many effective motion feature coding techniques have been studied for describing the depth sequence of an action, such as the motion energy map (MEI), the motion history map (MHI), and the depth motion map (DMI). Skeleton-based methods represent human motion through dynamic three-dimensional skeleton sequence data and mainly mine the relative positions of the key skeleton points for recognition. More recently, behavior recognition methods based on multi-modal fusion data have attracted attention; these methods combine two or more kinds of data for behavior recognition, providing more complementary information for action description and improving recognition accuracy.
Although research on behavior recognition has made many advances, many problems remain. Behaviors contain information in different dimensions, such as space and time, and within each dimension they also contain rich multi-scale information. When the dimension and scale at which an observer views a behavior change, the way the action presents itself changes as well. In an unknown scene, however, computer vision cannot perceive scale changes the way the human eye does. Existing behavior recognition methods ignore the multi-scale information of actions, so they lack robustness and are difficult to apply in practical environments.
In summary, the main problem of the existing behavior identification method is that multi-scale motion features under different dimensions cannot be sufficiently extracted to identify similar behaviors. Therefore, designing a behavior model for describing different scale features under multiple dimensions and extracting structured multi-scale features from the behavior model is an urgent problem to be solved.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, the invention provides a behavior identification method based on a multi-dimensional pyramid hierarchical model.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
a behavior identification method based on a multi-dimensional pyramid hierarchical model comprises the following steps:
(1) constructing a multi-dimensional pyramid hierarchical model: projecting a depth video frame obtained by a depth camera onto a coordinate plane to obtain an action characteristic diagram, wherein the action characteristic diagram is used for representing a depth video sequence of each action sample, and generating a Gaussian pyramid as a space dimension pyramid through Gaussian low-pass filtering and downsampling; dividing the depth video sequence of each action sample into a plurality of partitions, and calculating an action characteristic map of each partition to construct a time dimension pyramid; the space dimension pyramid and the time dimension pyramid together form a multi-dimensional pyramid level model;
(2) extracting structured multi-scale features: firstly, sequentially extracting action features from bottom to top according to the hierarchical structure of a space dimension pyramid, then extracting the action features according to the hierarchical structure of a time dimension pyramid, and then cascading the action features extracted twice to generate space-time multi-scale action features;
(3) behavior recognition: and (3) inputting the multi-scale features extracted in the step (2) into a trained classifier or a neural network for classification to obtain a behavior recognition result.
Further, in the step (1), on the basis of generating the gaussian pyramid, a laplacian pyramid is further generated to enhance the multi-scale dynamic information as an optimized spatial dimension pyramid.
Further, in step (1), the process of constructing the spatial dimension pyramid is as follows:
(1a) projecting the depth video frames obtained by a depth camera onto three orthogonal Cartesian planes, taking the minimum value at each pixel position in the depth video sequence as the pixel value of the action feature map; each depth frame generates three action feature maps with different views, corresponding to a front view, a side view and a top view respectively; performing brightness normalization on the generated action feature maps and cropping a region of interest;
(1b) generating a Gaussian pyramid by performing Gaussian low-pass filtering and downsampling on the action characteristic diagram of each view angle;
(1c) and (3) obtaining a prediction pyramid by carrying out interpolation and Gaussian smoothing on each layer of the Gaussian pyramid, and correspondingly subtracting each layer of the prediction pyramid from the Gaussian pyramid to obtain the Laplacian pyramid.
Further, in step (1), the process of constructing the time-dimensional pyramid is as follows:
(1A) dividing the depth video sequence of each action sample into a plurality of partitions, wherein each partition comprises the same or a different number of frames; the divisions form different levels according to the dividing method: the undivided depth video sequence is regarded as level 0, a division into two partitions is regarded as level 1, and so on;
(1B) and respectively calculating the action characteristic graph of each partition as a time dimension pyramid so as to capture sub-actions of different time scales in the depth video sequence.
Further, the specific process of step (2) is as follows:
(2a) normalizing the action characteristic graphs under the same visual angle to be the same size;
(2b) cascading the action features extracted from the action feature maps of the same level in the spatial dimension pyramid to obtain the action features of the three views at that scale;
(2c) sequentially extracting action features of different levels from bottom to top according to the spatial dimension pyramid hierarchical structure and cascading them to generate action features of different scales;
(2d) extracting multi-scale time features according to the levels of the time dimension pyramid, firstly extracting action features of level 0, then sequentially extracting action features of other levels, and cascading the action features in each level;
(2e) and (3) cascading the action characteristics in the steps (2c) and (2d) into structured space-time multi-scale action characteristics, and carrying out normalization and dimension reduction treatment on the structured space-time multi-scale action characteristics.
Further, the motion feature adopts a direction gradient histogram, a local binary pattern or a scale-invariant feature transform.
Further, the action feature map adopts a depth motion map, a motion energy map or a motion history map.
Further, the number of layers of the space dimension pyramid and the time dimension pyramid is determined according to the computing resources and the storage resources, and the CPU utilization rate, the memory occupancy rate, the video card performance and the GPU video memory utilization rate are used as evaluation indexes for measuring the computing resources and the storage resources.
Further, a four-layer space dimension pyramid and a two-layer time dimension pyramid are adopted; in addition, a higher-level pyramid is used when the CPU utilization rate, the memory occupancy rate and the GPU video memory utilization rate are all lower than 30% and the video card performance is better than in the standard state, and a lower-level pyramid is used when these rates are higher than 70% and the video card performance is worse than in the standard state.
Further, in the step (3), the multi-scale features extracted in the step (2) are divided into a training set and a testing set, the classifier is initialized randomly at first, parameters in the classifier are trained according to cross entropy loss by using action samples of the training set, and then the testing set is input into the trained classifier to obtain a final behavior recognition result; the classifiers include, but are not limited to, extreme learning machines, support vector machines, and random forest classifiers.
The above technical scheme brings the following beneficial effects:
1. the invention provides a multi-dimensional pyramid hierarchical model, which is a modeling method for describing structured multi-scale features of an identified object in different dimensions. Firstly, the model can realize dynamic compression and expansion of dimensions and layers to meet the requirements of different application fields, and therefore, the model has wider applicability. Secondly, in each dimension, the model can increase the feature types by expanding the number of the child nodes, and the model can adjust the scale diversity of the features by setting the number of layers of the pyramid in the same dimension, so that the features of the identification object can be more fully mined and described. In addition, the model integrally presents a tree-shaped hierarchical structure, and structured multi-scale features can be effectively extracted.
2. The behavior identification method based on the multi-dimensional pyramid hierarchical model fully extracts the structured multi-scale action features in the space and time dimensions, captures more discriminative space-time information, has an important effect on solving the identification problem of similar behaviors and opposite behaviors, and improves the accuracy and robustness of behavior identification.
Drawings
FIG. 1 is a general framework schematic of the present invention;
FIG. 2 is a schematic diagram of a depth profile DMI in an embodiment;
FIG. 3 is a block diagram of an embodiment of a time dimension pyramid;
FIG. 4 is a diagram of a multi-dimensional pyramid hierarchy model in an embodiment.
Detailed Description
The technical scheme of the invention is explained in detail in the following with the accompanying drawings.
As shown in fig. 1, a depth motion map is used to represent a depth sequence of motion, and a gaussian pyramid (laplacian pyramid) is constructed as a spatial dimension pyramid to capture more discriminative spatial multi-scale motion information. Then, different levels of feature maps are generated as a time-dimensional pyramid by dividing the video sequence into different segments to capture the time-multiscale information of the motion. And calculating the action characteristics of the multi-dimensional pyramid hierarchical model, cascading to obtain the structured multi-scale action characteristics, and inputting into a classifier for behavior recognition. In addition, pyramids with other dimensions can be constructed to jointly form a multi-dimensional pyramid level model.
The multidimensional pyramid hierarchical model provides a modeling method for describing structured multi-scale features of an identified object in different dimensions, and dynamic compression and expansion of dimensions and layer numbers can be realized. The model integrally presents a tree-shaped hierarchical structure, and structured multi-scale features can be effectively extracted. The dimensionality comprises time, space and the like as parent nodes, and the video sequence can be divided according to the time sequence in the time dimensionality to further extract the characteristics of the whole and the local parts and the like as child nodes. Correspondingly, in the spatial dimension, the characteristics of static and dynamic states and the like can be further extracted as child nodes. In each dimension, the model can increase the types of the features by expanding the number of the child nodes, and the model can also set the layer number of the pyramid to adjust the scale diversity of the features in the same dimension. The multi-dimensional pyramid hierarchical model provided by the invention can effectively capture the multi-scale characteristics of the target under different dimensions and is suitable for various identification tasks.
The invention is further described below with reference to a specific embodiment.
1. Generating a spatial dimension pyramid
The depth frames obtained by the depth camera are projected onto three orthogonal Cartesian planes, so that each 3D depth frame generates three 2D action maps, denoted map_v (v ∈ {f, s, t}), corresponding to the front view, side view, and top view respectively, as shown in FIG. 2. The DMI takes the minimum value at each pixel position over the depth map sequence as the pixel value of the feature map. For a depth sequence of N frames it is calculated as:

DMI_v(i, j) = 255 - min_{1 ≤ t ≤ N} map_v(i, j, t)

where map_v(i, j, t) is the pixel value at position (i, j) in the action map of the t-th frame under view v. The resulting image may be intensity-normalized by dividing each pixel value by the maximum over all pixels in the image. In addition, excess black pixels can be excluded by cropping the region of interest of the DMI. This further normalization reduces intra-class differences and lessens the interference of body shape and action amplitude on action recognition.
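The DMI computation above can be sketched in a few lines. This is an illustrative reimplementation, not the patented code; it assumes the per-view projection map_v has already been rendered into an (N, H, W) array of depth values in [0, 255], and the function name is hypothetical.

```python
import numpy as np

def compute_dmi(depth_frames):
    """Depth motion image (DMI): per-pixel minimum over the sequence,
    inverted so that close/moving regions appear bright.

    depth_frames: array of shape (N, H, W) with values in [0, 255].
    """
    min_map = depth_frames.min(axis=0)      # min over the N frames at each pixel
    dmi = 255.0 - min_map                   # DMI(i, j) = 255 - min_t map(i, j, t)
    return dmi / max(dmi.max(), 1e-8)       # intensity-normalize to [0, 1]
```

Cropping the region of interest (to discard the excess black pixels) would follow as a separate step on the returned map.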
Gaussian pyramid decomposition is then performed on the DMI to generate a family of feature images at different scales, simulating the scale changes of human observation of an action. A structured multi-scale DMI image set is obtained by Gaussian filtering and downsampling, and the layers of the spatial pyramid are numbered from bottom to top, as shown in FIG. 3. Let G_l denote the l-th layer of the Gaussian pyramid; the image at layer G_{l+1} is smaller in scale than that at layer G_l. To obtain the pyramid image of layer G_{l+1}, the image of layer G_l is convolved with a Gaussian kernel and downsampled. In general, the gray value at coordinate (i, j) of the l-th layer image is:

G_l(i, j) = sum_{m=-c}^{c} sum_{n=-c}^{c} w(m, n) G_{l-1}(2i + m, 2j + n),   1 ≤ l ≤ L, 0 ≤ i < R_l, 0 ≤ j < C_l

where L is the number of layers of the Gaussian pyramid, and R_l and C_l are the number of rows and columns of the l-th layer feature map. w(m, n) is a Gaussian window of size (2c+1) × (2c+1), which can be expressed as:

w(m, n) = (1 / (2πσ²)) exp(-(m² + n²) / (2σ²))

where m and n index the rows and columns of the window, and σ, called the scale-space factor, is the standard deviation of the Gaussian distribution and reflects how strongly the image is blurred. The original feature map G_1 serves as the lowest layer of the Gaussian pyramid, and G_2, G_3, ..., G_L are computed in turn by the formula above, forming an L-level Gaussian pyramid. The series of images {I_1, I_2, I_3, ..., I_L} generated by the Gaussian convolution and downsampling operations forms the Gaussian pyramid of the DMI, taken as the spatial dimension pyramid and denoted GP-DMI. The pyramid algorithm reduces the filter bandwidth between levels by an octave and reduces the sampling interval by the same factor. The number of downsamplings is related to the size of the original image: for an image of size M × N, the maximum number of layers of the Gaussian pyramid is L_max = log₂ min(M, N).
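The REDUCE step above (Gaussian window w(m, n), then dropping every other row and column) can be sketched in plain numpy. The window size c = 2, σ = 1.0, and the edge padding are illustrative choices not fixed by the text.

```python
import numpy as np

def gaussian_kernel(c=2, sigma=1.0):
    # (2c+1) x (2c+1) Gaussian window w(m, n), normalized to sum to 1
    ax = np.arange(-c, c + 1)
    m, n = np.meshgrid(ax, ax, indexing="ij")
    w = np.exp(-(m**2 + n**2) / (2 * sigma**2))
    return w / w.sum()

def reduce_once(img, w):
    # Gaussian low-pass filter, then keep every other row and column
    c = w.shape[0] // 2
    padded = np.pad(img, c, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + 2*c + 1, j:j + 2*c + 1] * w).sum()
    return out[::2, ::2]

def gaussian_pyramid(img, levels):
    # G_1 is the original map; each further level is REDUCE of the previous
    pyr = [img.astype(float)]
    w = gaussian_kernel()
    for _ in range(levels - 1):
        pyr.append(reduce_once(pyr[-1], w))
    return pyr
```

Each halving of the side length corresponds to one octave of bandwidth reduction, matching the L_max = log₂ min(M, N) bound above.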
Gaussian pyramid decomposition inevitably causes the number of action maps to grow multiplicatively, and the redundant static information generated reduces the accuracy of behavior recognition. To address this, a Laplacian pyramid is further generated to obtain a more compact and more discriminative multi-scale action feature map and reduce the interference of redundant static information on behavior recognition. The l-th layer feature map G_l of the Gaussian pyramid is interpolated (inserting zeros into the even rows and columns) and then filtered with the Gaussian kernel, yielding a feature map G_l* of the same size as the layer below:

G_l*(i, j) = 4 sum_{m=-c}^{c} sum_{n=-c}^{c} w(m, n) G_l((i + m)/2, (j + n)/2)

where only the terms for which (i + m)/2 and (j + n)/2 are integers are included. The generation of the Laplacian pyramid can then be expressed as:

LP_l = G_l - G_{l+1}*,  1 ≤ l < L;   LP_L = G_L

where L is the top layer number of the Laplacian pyramid and LP_l is the l-th layer image of the Laplacian decomposition. It should be noted that, in order to preserve the integrity of the motion information in the feature maps, the top-level image of the Gaussian pyramid is directly taken as the top level of the Laplacian pyramid. The Laplacian pyramid is the optimized spatial dimension pyramid and is denoted LP-DMI. FIG. 4 illustrates the spatial dimension pyramid generated for an example action sample.
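The interpolate-and-subtract construction can be sketched as follows, assuming a Gaussian pyramid whose layers were produced by 2x decimation; the ×4 gain compensates for the zeros inserted during upsampling. A toy decimated pyramid stands in for GP-DMI here, and the helper names are hypothetical.

```python
import numpy as np

def _smooth(img, c=2, sigma=1.0, gain=1.0):
    # Gaussian filtering with edge padding (window and sigma are illustrative)
    ax = np.arange(-c, c + 1)
    m, n = np.meshgrid(ax, ax, indexing="ij")
    w = np.exp(-(m**2 + n**2) / (2 * sigma**2))
    w /= w.sum()
    padded = np.pad(img, c, mode="edge")
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = gain * (padded[i:i + 2*c + 1, j:j + 2*c + 1] * w).sum()
    return out

def expand(img, out_shape):
    """Interpolation step: insert zeros at odd rows/cols, then smooth (x4 gain)."""
    up = np.zeros(out_shape)
    up[::2, ::2] = img
    return _smooth(up, gain=4.0)

def laplacian_pyramid(gauss_pyr):
    """LP_l = G_l - G*_{l+1}; the top Gaussian level is kept unchanged."""
    lap = [g - expand(g_next, g.shape)
           for g, g_next in zip(gauss_pyr[:-1], gauss_pyr[1:])]
    lap.append(gauss_pyr[-1].astype(float))
    return lap
```

By construction, adding the expanded next level back to LP_l recovers G_l exactly, so no motion information is lost by the decomposition.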
2. Generating a time dimension pyramid
The video sequence is first divided into a number of partitions each containing an equal number of frames, with different divisions forming different levels: the undivided sequence is regarded as level 0, an even division into two partitions as level 1, and so on. The action feature map DMI of each partition is then calculated by the method of step 1 to capture the sub-actions at different time scales in the video sequence; the result, denoted HP-DMI, serves as the time dimension pyramid. The structure of the generated time dimension pyramid is shown in fig. 4. In addition, pyramids in other dimensions can be constructed to jointly form the multi-dimensional pyramid hierarchical model.
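The partitioning rule (level 0 = whole sequence, level 1 = two partitions, and so on) reduces to index arithmetic; computing a DMI per returned (start, end) range then yields the HP-DMI. The function name and the near-equal-split policy are illustrative.

```python
import numpy as np

def temporal_partitions(num_frames, levels):
    """Level k splits the frame range into k+1 near-equal partitions.
    Returns {level: [(start, end), ...]} with end exclusive."""
    pyramid = {}
    for k in range(levels):
        parts = k + 1
        bounds = np.linspace(0, num_frames, parts + 1).astype(int)
        pyramid[k] = [(int(bounds[p]), int(bounds[p + 1])) for p in range(parts)]
    return pyramid
```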
In a specific implementation, the number of layers of the multi-dimensional pyramid hierarchical model is selected according to computing resources, storage resources, and the practical application requirements. The CPU utilization rate, memory occupancy rate, video card performance, GPU video memory utilization rate, and the like are used as evaluation indexes for measuring computing resources. A cmd window is first opened and the nvidia-smi command is used to check the use of computing resources; the performance of the current computer is then evaluated against these indexes to determine the number of layers of each dimension pyramid. When the CPU utilization rate, memory occupancy rate and GPU video memory utilization rate are all below 30% and the video card performance is better than P2, a higher-level pyramid is recommended; when they are above 70% and the video card performance is worse than P8, a lower-level pyramid is recommended. In other cases, a four-layer space dimension pyramid and a two-layer time dimension pyramid are recommended, and the setting can be adjusted according to the actual application scene.
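The rule of thumb above can be captured in a small helper. The deeper/shallower layer counts (5/3 and 3/1) are assumptions, since the text only fixes the default of four spatial and two temporal layers, and the boolean `gpu_ok` stands in for the P2/P8 video-card comparison.

```python
def choose_pyramid_levels(cpu_util, mem_util, gpu_mem_util, gpu_ok):
    """Pick pyramid depths from resource utilization rates in [0, 1]."""
    utils = (cpu_util, mem_util, gpu_mem_util)
    if all(u < 0.30 for u in utils) and gpu_ok:
        return {"spatial": 5, "temporal": 3}   # assumed deeper setting
    if all(u > 0.70 for u in utils) and not gpu_ok:
        return {"spatial": 3, "temporal": 1}   # assumed shallower setting
    return {"spatial": 4, "temporal": 2}       # default stated in the text
```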
3. Extracting structured multi-scale features
The HOG describes the distribution of gradient and edge information in a local image; it captures gradient changes well and enhances the contour information of the image, so multi-scale HOG features are selected and extracted for action classification. Alternatively, features such as the local binary pattern (LBP) and the scale-invariant feature transform (SIFT) can be chosen. First, the feature maps of different sizes under the same view are normalized to the same size by replicating adjacent pixels, which avoids over-small images caused by too many decomposition layers. The HOGs extracted from LP-DMIs at the same level of the spatial dimension pyramid are cascaded to obtain the action features of the three views at that scale. The action features of the different levels are then extracted sequentially from bottom to top according to the hierarchical structure of the LP-DMI and cascaded to generate action features of different scales, i.e., the spatial multi-scale features. Multi-scale temporal features are extracted by HP-DMI level: the features of the level-0 HP-DMI are extracted first, followed by those of the other levels in turn. The N-th level comprises N feature maps, and all sub-action features within each level are extracted sequentially and then cascaded. The features extracted from the multi-dimensional pyramid hierarchical model are concatenated into structured multi-scale features, on which further feature processing can be performed: the action features are first normalized by the max-min method, and their dimensionality is then reduced by the principal component analysis (PCA) algorithm. Other normalization and dimension reduction methods may also be used.
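The cascade-normalize-reduce pipeline can be sketched with numpy. The per-view descriptors here are placeholders for real HOG vectors, and the SVD-based PCA is one standard realization of the dimension-reduction step; all function names are illustrative.

```python
import numpy as np

def concat_features(per_level_views):
    """per_level_views: list over pyramid levels; each entry is a list of
    per-view descriptor vectors (e.g., HOG from front/side/top DMIs).
    Cascades within each level, then across levels."""
    return np.concatenate([np.concatenate(views) for views in per_level_views])

def minmax_normalize(X):
    # max-min normalization, per feature column of the (samples, dims) matrix
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi - lo > 0, hi - lo, 1.0)

def pca_reduce(X, k):
    # project centered data onto the top-k principal directions via SVD
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```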
4. Behavior recognition
The processed behavior features are divided into a training set and a test set. The ELM (extreme learning machine) is first randomly initialized, its parameters are trained with the action samples of the training set according to the cross-entropy loss, and the prediction results on the action samples of the test set are then taken as the final recognition results and used to evaluate the effectiveness of the method. The ELM classifier may also be replaced by a support vector machine, a random forest or another classifier, or by other deep networks.
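A minimal ELM sketch in numpy: random hidden weights are frozen and the output weights get a closed-form ridge solution. Note this is the classical ELM recipe rather than the cross-entropy training described above, and all sizes, seeds, and names are illustrative.

```python
import numpy as np

class SimpleELM:
    """Extreme learning machine: random hidden layer, least-squares output."""

    def __init__(self, n_hidden=64, reg=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.reg = reg
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        # random input weights and biases, never updated (the "random init")
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)          # hidden activations
        T = np.eye(n_classes)[y]                  # one-hot targets
        # ridge-regularized least squares for the output weights
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        return np.argmax(np.tanh(X @ self.W + self.b) @ self.beta, axis=1)
```

Swapping in a support vector machine or random forest only changes this last stage; the structured multi-scale features feed in unchanged.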
The embodiments are only for illustrating the technical idea of the present invention, and the technical idea of the present invention is not limited thereto, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the scope of the present invention.

Claims (8)

1. A behavior identification method based on a multi-dimensional pyramid hierarchical model is characterized by comprising the following steps:
(1) constructing a multi-dimensional pyramid hierarchical model: projecting a depth video frame obtained by a depth camera onto a coordinate plane to obtain an action characteristic diagram, wherein the action characteristic diagram is used for representing a depth video sequence of each action sample, and then generating a Gaussian pyramid as a spatial dimension pyramid through Gaussian low-pass filtering and downsampling operation; dividing the depth video sequence of each action sample into a plurality of partitions, and calculating an action characteristic map of each partition to construct a time dimension pyramid; the space dimension pyramid and the time dimension pyramid together form a multi-dimensional pyramid level model; on the basis of generating the Gaussian pyramid, further generating a Laplacian pyramid to enhance multi-scale dynamic information to serve as an optimized space dimension pyramid; the process of constructing the spatial dimension pyramid is as follows:
(1a) projecting the depth video frames obtained by a depth camera onto three orthogonal Cartesian planes, taking the minimum value at each pixel position in the depth video sequence as the pixel value of the action feature map; each depth frame generates three action feature maps with different views, corresponding to a front view, a side view and a top view respectively; performing brightness normalization on the generated action feature maps and cropping a region of interest;
(1b) gaussian low-pass filtering and down-sampling operation are carried out on the action characteristic diagram of each view angle to generate a Gaussian pyramid;
(1c) obtaining a prediction pyramid by carrying out interpolation and Gaussian smoothing on each layer of the Gaussian pyramid, and correspondingly subtracting each layer of the prediction pyramid from the Gaussian pyramid to obtain a Laplacian pyramid;
(2) extracting structured multi-scale features: firstly, sequentially extracting action features from bottom to top according to the hierarchical structure of a space dimension pyramid, then extracting the action features according to the hierarchical structure of a time dimension pyramid, and then cascading the action features extracted twice to generate space-time multi-scale action features;
(3) behavior recognition: inputting the multi-scale features extracted in the step (2) into a trained classifier or a neural network for classification to obtain a behavior recognition result.
2. The behavior recognition method based on the multi-dimensional pyramid hierarchy model of claim 1, wherein in the step (1), the process of constructing the time-dimensional pyramid is as follows:
(1A) dividing the depth video sequence of each action sample into a plurality of partitions, wherein each partition comprises the same or a different number of frames; the divisions form different levels according to the dividing method: the undivided depth video sequence is regarded as level 0, a division into two partitions is regarded as level 1, and so on;
(1B) computing the action feature map of each partition to form the time-dimension pyramid, so as to capture sub-actions at different time scales in the depth video sequence.
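The level scheme of claim 2 (level l splits the sequence into l+1-style finer partitions; this sketch assumes the common dyadic variant with 2^l partitions per level) might look like the following. The per-partition map here is a simple sum of absolute inter-frame differences, a stand-in for the depth motion map / motion energy map named in claim 5:

```python
import numpy as np

def temporal_pyramid_maps(frames, levels=2):
    """frames: array of shape (T, H, W). For each pyramid level l, split the
    sequence into 2**l partitions (level 0 = the whole, un-partitioned sequence)
    and compute one motion feature map per partition."""
    T = len(frames)
    maps = []
    for level in range(levels):
        n_parts = 2 ** level
        bounds = np.linspace(0, T, n_parts + 1).astype(int)  # partition boundaries
        for s, e in zip(bounds[:-1], bounds[1:]):
            seg = frames[s:e]
            # DMM-style stand-in: accumulate absolute inter-frame motion
            maps.append(np.abs(np.diff(seg, axis=0)).sum(axis=0))
    return maps
```

A two-level pyramid over a 16-frame clip thus yields three maps: one for the whole sequence (level 0) and one for each half (level 1), capturing sub-actions at two time scales.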
3. The behavior recognition method based on the multi-dimensional pyramid hierarchy model as claimed in claim 1, wherein the specific process of step (2) is as follows:
(2a) normalizing the action feature maps under the same view angle to the same size;
(2b) cascading the action features extracted from the action feature maps at the same level of the spatial-dimension pyramid to obtain the action features of the three view angles at that scale;
(2c) sequentially extracting action features at different levels from bottom to top according to the spatial-dimension pyramid hierarchy and cascading them to generate action features at different scales;
(2d) extracting multi-scale temporal features according to the levels of the time-dimension pyramid: first extracting the action features of level 0, then sequentially extracting those of the other levels, cascading the action features within each level;
(2e) cascading the action features of steps (2c) and (2d) into structured spatio-temporal multi-scale action features, then normalizing them and reducing their dimensionality.
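Steps (2a)-(2e) amount to a fixed concatenation order over per-map descriptors. In this sketch a plain intensity histogram stands in for the HOG/LBP/SIFT descriptors of claim 4, and the view names are illustrative:

```python
import numpy as np

def map_descriptor(fmap, bins=8):
    """Toy per-map descriptor (normalized intensity histogram) standing in
    for the HOG / LBP / SIFT features of claim 4."""
    h, _ = np.histogram(fmap, bins=bins, range=(0.0, 1.0))
    return h.astype(float) / max(h.sum(), 1)

def spatiotemporal_feature(spatial_pyr_views, temporal_maps):
    """spatial_pyr_views: {view: [level0_map, level1_map, ...]} for the three
    projection views; temporal_maps: per-partition maps of the time pyramid."""
    views = sorted(spatial_pyr_views)              # e.g. front, side, top
    n_levels = len(spatial_pyr_views[views[0]])
    spatial = np.concatenate([
        np.concatenate([map_descriptor(spatial_pyr_views[v][lv])
                        for v in views])           # (2b) cascade the three views
        for lv in range(n_levels)                  # (2c) bottom-up over scales
    ])
    temporal = np.concatenate([map_descriptor(m) for m in temporal_maps])  # (2d)
    feat = np.concatenate([spatial, temporal])     # (2e) spatio-temporal cascade
    return feat / (np.linalg.norm(feat) + 1e-12)   # (2e) L2 normalization
```

Dimensionality reduction (e.g. PCA) would follow the normalization; it is omitted here since the claims do not specify the method.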
4. The behavior recognition method based on the multi-dimensional pyramid hierarchy model according to claim 3, wherein the action features are extracted by a histogram of oriented gradients, a local binary pattern, or a scale-invariant feature transform.
5. The behavior recognition method based on the multi-dimensional pyramid hierarchy model according to claim 1, wherein the action feature map is a depth motion map, a motion energy map or a motion history map.
6. The behavior recognition method based on the multi-dimensional pyramid hierarchy model according to claim 1, wherein the numbers of layers of the spatial-dimension and time-dimension pyramids are determined according to the available computing and storage resources, using the CPU utilization, memory occupancy, graphics card performance, and GPU memory utilization as evaluation indexes for measuring those resources.
7. The behavior recognition method based on the multi-dimensional pyramid hierarchy model of claim 6, characterized in that a four-level spatial-dimension pyramid and a two-level time-dimension pyramid are adopted; in addition, a higher-level pyramid is used when the CPU utilization, memory occupancy and GPU memory utilization are all below 30% and the graphics card outperforms the standard state, and a lower-level pyramid is used when they are all above 70% and the graphics card underperforms the standard state.
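The resource-aware selection of claims 6-7 is a pure decision rule over the measured utilizations. The thresholds (30% / 70%) and default depths (four spatial, two temporal levels) come from claim 7; the one-level adjustment and the fractional encoding of utilization are assumptions of this sketch:

```python
def choose_pyramid_levels(cpu, mem, gpu_mem, gpu_better_than_standard,
                          base_spatial=4, base_temporal=2):
    """cpu, mem, gpu_mem: utilizations in [0, 1]. Returns
    (spatial_levels, temporal_levels) per the rule in claim 7."""
    if max(cpu, mem, gpu_mem) < 0.30 and gpu_better_than_standard:
        return base_spatial + 1, base_temporal + 1   # resources plentiful: go deeper
    if min(cpu, mem, gpu_mem) > 0.70 and not gpu_better_than_standard:
        return base_spatial - 1, base_temporal - 1   # resources scarce: go shallower
    return base_spatial, base_temporal               # default: 4 spatial, 2 temporal
```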
8. The behavior recognition method based on the multi-dimensional pyramid hierarchy model according to claim 1, wherein in step (3) the multi-scale features extracted in step (2) are divided into a training set and a test set; the classifier is first initialized randomly, its parameters are trained on the action samples of the training set according to a cross-entropy loss, and the test set is then input into the trained classifier to obtain the final behavior recognition result; the classifiers include, but are not limited to, extreme learning machines, support vector machines, and random forest classifiers.
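The train/test protocol of claim 8, with a support vector machine as the classifier, can be sketched with scikit-learn as below. Note this is illustrative only: the synthetic features merely stand in for the multi-scale features of step (2), and an SVM is trained with a hinge-type loss rather than the cross-entropy loss the claim mentions for other classifiers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated synthetic "action" classes standing in for real
# 72-dimensional multi-scale features from step (2).
X = np.vstack([rng.normal(0.0, 0.3, (40, 72)),
               rng.normal(3.0, 0.3, (40, 72))])
y = np.array([0] * 40 + [1] * 40)

# Divide the features into a training set and a test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = SVC(kernel="rbf")       # claim 8 also allows ELM or random forest
clf.fit(X_tr, y_tr)           # train the classifier on the training set
acc = clf.score(X_te, y_te)   # recognition accuracy on the test set
```

`clf.predict` on unseen feature vectors then yields the final behavior label, completing step (3).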
CN202011398484.1A 2020-12-02 2020-12-02 Behavior identification method based on multi-dimensional pyramid hierarchical model Active CN112580442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011398484.1A CN112580442B (en) 2020-12-02 2020-12-02 Behavior identification method based on multi-dimensional pyramid hierarchical model


Publications (2)

Publication Number Publication Date
CN112580442A CN112580442A (en) 2021-03-30
CN112580442B true CN112580442B (en) 2022-08-09

Family

ID=75127163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011398484.1A Active CN112580442B (en) 2020-12-02 2020-12-02 Behavior identification method based on multi-dimensional pyramid hierarchical model

Country Status (1)

Country Link
CN (1) CN112580442B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469056A (en) * 2021-07-02 2021-10-01 上海商汤智能科技有限公司 Behavior recognition method and device, electronic equipment and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473530B (en) * 2013-08-30 2016-06-15 天津理工大学 Self adaptation action identification method based on multi views and multi-modal feature
CN110197116B (en) * 2019-04-15 2023-05-23 深圳大学 Human behavior recognition method, device and computer readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant